I'm using a script to find a list of all the .pdf files at a URL, but lynx seems to have a problem with spaces in filenames. Here's the script:

lynx --dump http://www.somesite/here/ | awk '/http/{print $2} | grep pdf > ~/Desktop/links.txt

This works as expected until there is a .pdf with whitespace in the filename. Lynx seems to truncate the filename at the whitespace. Is there some way to prevent this?

Linter
  • Would be surprising. Do you have a sample or two? – RudiC Aug 28 '18 at 13:05
  • are you sure that it's Lynx that is truncating the filename? Could you provide a sample from that lynx --dump that demonstrates the filename? I think, rather, that you're seeing awk print only one field of a white-space delimited line (once you get past the missing closing quote). – Jeff Schaller Aug 28 '18 at 13:07
  • When I do it, it creates the file links.txt but it is blank. – Linter Aug 28 '18 at 13:14
  • Could you edit into the question the lynx-dump output piped through grep http? – Jeff Schaller Aug 28 '18 at 13:29
  • ok, so this works fine: lynx -dump http://somesite | grep pdf > ~/Desktop/links.txt – Linter Aug 28 '18 at 13:34
  • Do you have a ~/.lynxrc file that is setting the -source option "for" you? I can repro with lynx -dump -source ... but not with lynx -dump ... – Jeff Schaller Aug 28 '18 at 20:20

1 Answer


awk (by default) uses blanks as field separators, and lynx renders a blank inside a dumped URL as a literal blank. Work around it as I suggested in a bug report:

lynx -listonly -dump http://www.somesite/here/ | \
awk '/\.pdf$/{ sub(/^[ ]*[0-9]+\.[ ]*/, ""); print }' > ~/Desktop/links.txt

If the content happens to be in UTF-8 encoding, lynx unescapes the text (undoing URL-encoding such as %20) and shows a literal blank, which gives awk two or more fields, depending on the number of blanks in the name.
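For illustration, here is what happens with a hypothetical file name containing a blank ("annual report.pdf" is invented for this example). A naive { print $2 } keeps only the text up to the first blank, while the sub() above strips just the leading reference number and leaves the rest of the line intact:

echo '   3. http://www.somesite/here/annual report.pdf' | awk '{ print $2 }'
# prints: http://www.somesite/here/annual   (truncated at the blank)

echo '   3. http://www.somesite/here/annual report.pdf' | awk '{ sub(/^[ ]*[0-9]+\.[ ]*/, ""); print }'
# prints: http://www.somesite/here/annual report.pdf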

That unescaping was done for Debian #398274 in 2013 (i.e., you already have that feature with Ubuntu 18.04).

Adding the -listonly option reduces the number of incorrect matches, by looking only at the list of URLs.
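With -listonly, the dump reduces to a numbered list of the page's links rather than the rendered page text, so stray occurrences of "pdf" in the body text cannot match. The output looks something like this (the URLs are invented for the example):

   1. http://www.somesite/here/intro.html
   2. http://www.somesite/here/annual report.pdf
   3. http://www.somesite/here/notes.pdf

which is why the awk script anchors on the trailing suffix and strips the leading " N. " prefix.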

If you wanted to look for multiple file-types, you could list the suffixes as alternatives in the regular expression, e.g., something like this:

awk '/\.(pdf|odt|doc|docx)$/{ sub(/^[ ]*[0-9]+\.[ ]*/, ""); print }' > ~/Desktop/links.txt
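Combined with the same lynx invocation as before, the full pipeline would be:

lynx -listonly -dump http://www.somesite/here/ | \
awk '/\.(pdf|odt|doc|docx)$/{ sub(/^[ ]*[0-9]+\.[ ]*/, ""); print }' > ~/Desktop/links.txt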
Thomas Dickey
  • This does fix my issue; thank you. One question: If I wanted to use awk to read for multiple file types (.pdf, .odt, .doc, docx), how would I alter this code? – Linter Aug 29 '18 at 00:00