Using wget alone won't work because of the way the URLs are handled by JavaScript. You will have to parse the page using xmllint and then process the URLs into a format wget can handle.

Start by extracting and processing the URLs handled by JavaScript, writing them to urls.txt:
wget -O - 'https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647' | \
xmllint --html --xpath "//li[@class='resourceColumn']//a/@href" - 2>/dev/null | \
sed -e 's# href.*Books#https://bcs.wiley.com/he-bcs/Books#' -e 's/amp;//g' -e 's/&newwindow.*$//' > urls.txt
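The xmllint query prints the matched href attributes; the sed pass then makes them fetchable: the first expression replaces everything up to Books with the absolute https://bcs.wiley.com/he-bcs/Books prefix, the second removes the amp; HTML-entity remnants so the & separators are literal, and the third drops the trailing &newwindow parameter. A quick sanity check of the result (an optional step, not part of the original recipe):

head -n 5 urls.txt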
Now download the PDF files found by opening each URL in urls.txt:
wget -O - -i urls.txt | grep -o 'https.*pdf' | wget -i -
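Here the first wget prints each listed page to standard output, grep extracts the embedded https…pdf links, and the second wget reads those links from standard input (-i -) and downloads them. An equivalent long-hand form, as a sketch assuming one URL per line in urls.txt:

while IFS= read -r url; do
  wget -O - "$url"          # fetch each page that links to a pdf
done < urls.txt |
grep -o 'https.*pdf' |      # keep only the pdf links
wget -i -                   # download every pdf found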
The curl alternative:
curl 'https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647' | \
xmllint --html --xpath "//li[@class='resourceColumn']//a/@href" - 2>/dev/null | \
sed -e 's# href.*Books#https://bcs.wiley.com/he-bcs/Books#' -e 's/amp;//g' -e 's/&newwindow.*$//' > urls.txt
curl -s $(cat urls.txt) | grep -o 'https.*pdf' | xargs -l curl -O
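This relies on the shell word-splitting $(cat urls.txt) into one argument per URL, and on xargs -l running one curl -O per extracted link so each pdf is saved under its remote filename. A more defensive equivalent, as a sketch assuming one URL per line:

while IFS= read -r u; do
  curl -s "$u"              # fetch each resource page
done < urls.txt |
grep -o 'https.*pdf' |      # extract the pdf links
xargs -n 1 curl -O          # one download per link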
Because the URL isn't quoted, the shell runs everything up to the first & as one command, everything up to the next & as another command, and everything up to the third & as a third command. Why? Because & terminates a command like ; does, and also causes it to run as a background job. See "Why does my shell script choke on whitespace or other special characters?" for more about the importance of quoting. – cas May 15 '22 at 11:55
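To illustrate cas's point with the URL from the answer above, the unquoted and quoted invocations parse very differently:

# Unquoted, the shell splits at each '&' and backgrounds the pieces:
wget https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647
# is parsed as:
#   wget 'https://bcs.wiley.com/he-bcs/Books?action=resource' &
#   bcsId=10685 &
#   itemId=1119299160 &
#   resourceId=42647
# Quoted, the URL stays one argument:
wget 'https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647'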
The problem was indeed the unquoted URL in the wget command line. With your suggestion I made a step forward. Now, quoting the URL, wget downloads as files the links to the pages that contain the pdfs, but I would like it to download those pages and the pdfs within them as well. – user9952796 May 15 '22 at 13:17
I'd probably use lynx -dump -listonly -nonumbers -hiddenlinks=ignore "$URL" instead of wget, and then extract the URLs I was interested in with grep or awk. And then fetch THEM with wget. – cas May 15 '22 at 13:22
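A sketch of that lynx approach; the grep pattern Books?action is an assumption about which links matter, not something cas specified:

URL='https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647'
lynx -dump -listonly -nonumbers -hiddenlinks=ignore "$URL" |
  grep 'Books?action' > urls.txt   # keep only the resource links
wget -i urls.txt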
Also have a look at wget's mirroring options. The wget man page is pretty comprehensive and takes a bit of reading before it makes sense, but there's a lot you can do with it once you understand its various options. – cas May 15 '22 at 13:25
I tried the -r -l 0 option in wget, but I don't understand why on this particular site it doesn't go ahead to the pages linked in the menus. – user9952796 May 15 '22 at 13:33
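A likely answer to that last question, given the top of this answer: wget's recursive crawler only follows links that appear in the fetched HTML, and this site generates its menu links with JavaScript, which wget does not execute. On a site whose links are plain HTML, a mirroring run along these lines would work; the site URL is a placeholder and the -np/-A flags are illustrative additions, not from the comments:

wget -r -l 0 -np -A '*.pdf' 'https://example.com/plain-html-site/'
# -r -l 0 : recurse with no depth limit (0 means infinite)
# -np     : never ascend to the parent directory
# -A      : accept (keep) only files matching the pattern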