awk (by default) uses blanks as field-separators, and lynx renders an escaped blank (%20) in a dumped URL as a literal blank. Work around it as I suggested in a bug report:
lynx -listonly -dump http://www.somesite/here/ | \
awk '/\.pdf$/{ sub("^[ ]*[0-9]+\\.[ ]*","",$0); print}' > ~/Desktop/links.txt
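For illustration, here is a self-contained sketch (no network needed; the site and filenames are invented) mimicking the numbered reference list that -listonly -dump produces, so you can see exactly what the sub() strips:

printf '%s\n' \
  '   1. http://www.somesite/here/intro.pdf' \
  '   2. http://www.somesite/here/chapter one.pdf' |
awk '/\.pdf$/{ sub("^[ ]*[0-9]+\\.[ ]*","",$0); print}'
# prints the bare URLs, blanks intact:
# http://www.somesite/here/intro.pdf
# http://www.somesite/here/chapter one.pdf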
If the content happens to be in UTF-8 encoding, lynx unescapes the text (undoes URL-encoding such as %20), showing a blank in this case. That gives awk two or more fields per line, depending on the number of blanks in the name.
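A quick sketch of that failure mode (the URL is invented): with the default field separator, printing a single field drops everything after the first blank:

echo 'http://www.somesite/here/chapter one.pdf' |
awk '{ print NF; print $1 }'
# prints:
# 2
# http://www.somesite/here/chapter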
That unescaping was done for Debian #398274, in 2013 (i.e., you've got that feature with Ubuntu 18.04).
Adding the -listonly option reduces the number of incorrect matches by looking only at the list of URLs.
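To see why: a plain -dump also contains the rendered page text, so any body line that happens to end in ".pdf" matches too. A contrived sketch (sample lines invented):

printf '%s\n' \
  'Download the handbook as handbook.pdf' \
  'References' \
  '   1. http://www.somesite/here/handbook.pdf' |
awk '/\.pdf$/{ sub("^[ ]*[0-9]+\\.[ ]*","",$0); print}'
# matches the prose line as well as the URL:
# Download the handbook as handbook.pdf
# http://www.somesite/here/handbook.pdf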
If you wanted to look for multiple file-types, you could list the suffixes as alternatives in the regular expression, e.g., something like this:
awk '/\.(pdf|odt|doc|docx)$/{ sub("^[ ]*[0-9]+\\.[ ]*","",$0); print}' > ~/Desktop/links.txt
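Put together with the same lynx invocation (same placeholder URL as above):

lynx -listonly -dump http://www.somesite/here/ | \
awk '/\.(pdf|odt|doc|docx)$/{ sub("^[ ]*[0-9]+\\.[ ]*","",$0); print}' > ~/Desktop/links.txt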
Comments:

- Can you show a lynx --dump that demonstrates the filename? I think, rather, that you're seeing awk print only one field of a white-space-delimited line (once you get past the missing closing quote). – Jeff Schaller Aug 28 '18 at 13:07
- What about grep http? – Jeff Schaller Aug 28 '18 at 13:29
- lynx -dump http://somesite | grep pdf > ~/Desktop/links.txt – Linter Aug 28 '18 at 13:34
- Is it the -source option for you? I can repro with lynx -dump -source ... but not with lynx -dump ... – Jeff Schaller Aug 28 '18 at 20:20