I'm using a script to find a list of all the .pdf files at a URL, but lynx seems to have a problem with spaces in filenames. Here's the script:

lynx --dump http://www.somesite/here/ | awk '/http/{print $2} | grep pdf > ~/Desktop/links.txt

This works as expected until there is a .pdf with whitespace in the filename. Lynx seems to truncate the filename at the whitespace. Is there some way to prevent this?

Linter
  • Would be surprising. Do you have a sample or two? – RudiC Aug 28 '18 at 13:05
  • are you sure that it's Lynx that is truncating the filename? Could you provide a sample from that lynx --dump that demonstrates the filename? I think, rather, that you're seeing awk print only one field of a white-space delimited line (once you get past the missing closing quote). – Jeff Schaller Aug 28 '18 at 13:07
  • When I do it, it creates the file links.txt but it is blank. – Linter Aug 28 '18 at 13:14
  • Could you edit into the question the lynx-dump output piped through grep http? – Jeff Schaller Aug 28 '18 at 13:29
  • ok, so this works fine: lynx -dump http://somesite | grep pdf > ~/Desktop/links.txt – Linter Aug 28 '18 at 13:34
  • Do you have a ~/.lynxrc file that is setting the -source option "for" you? I can repro with lynx -dump -source ... but not with lynx -dump ... – Jeff Schaller Aug 28 '18 at 20:20

1 Answer


awk (by default) uses blanks as field separators, and lynx renders a blank inside a dumped URL as a literal blank. Work around it as I suggested in a bug report:

lynx -listonly -dump http://www.somesite/here/ | \
awk '/\.pdf$/{ sub(/^[ ]*[0-9]+\.[ ]*/, ""); print }' > ~/Desktop/links.txt

If the content happens to be in UTF-8 encoding, lynx unescapes the text (undoing URL-encoding such as %20) and shows a literal blank, which gives awk two or more fields, depending on the number of blanks in the name.
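For illustration, here is what happens with a hypothetical file name containing a blank ("annual report.pdf" is invented for this example). A naive { print $2 } keeps only the text up to the first blank, while the sub() above strips just the leading reference number and leaves the rest of the line intact:

echo '   3. http://www.somesite/here/annual report.pdf' | awk '{ print $2 }'
# prints: http://www.somesite/here/annual   (truncated at the blank)

echo '   3. http://www.somesite/here/annual report.pdf' | awk '{ sub(/^[ ]*[0-9]+\.[ ]*/, ""); print }'
# prints: http://www.somesite/here/annual report.pdf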

That unescaping was done for Debian #398274 in 2013 (i.e., you already have that feature with Ubuntu 18.04).

Adding the -listonly option reduces the number of incorrect matches, by looking only at the list of URLs.
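With -listonly, the dump reduces to a numbered list of the page's links rather than the rendered page text, so stray occurrences of "pdf" in the body text cannot match. The output looks something like this (the URLs are invented for the example):

   1. http://www.somesite/here/intro.html
   2. http://www.somesite/here/annual report.pdf
   3. http://www.somesite/here/notes.pdf

which is why the awk script anchors on the trailing suffix and strips the leading " N. " prefix.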

If you wanted to look for multiple file-types, you could list the suffixes as alternatives in the regular expression, e.g., something like this:

awk '/\.(pdf|odt|doc|docx)$/{ sub(/^[ ]*[0-9]+\.[ ]*/, ""); print }' > ~/Desktop/links.txt
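Combined with the same lynx invocation as before, the full pipeline would be:

lynx -listonly -dump http://www.somesite/here/ | \
awk '/\.(pdf|odt|doc|docx)$/{ sub(/^[ ]*[0-9]+\.[ ]*/, ""); print }' > ~/Desktop/links.txt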
Thomas Dickey
  • This does fix my issue; thank you. One question: If I wanted to use awk to read for multiple file types (.pdf, .odt, .doc, docx), how would I alter this code? – Linter Aug 29 '18 at 00:00