
I'm trying to download the content of this web page: https://bcs.wiley.com/he-bcs/Books?action=index&itemId=1119299160&bcsId=10685. In particular, I'm interested in the PDF files that can be reached from the menus "Browse by Chapter", "Browse by Resource", etc. visible on that page. I tried to download the page with wget, but without success.

I have already used the -r -l 0 options with wget and also quoted the URL (as discussed below in the comments).

Could you help me? Thank you in advance!

user9952796
  • What happens when you try? What exactly are the messages you receive? What's the exact command you're trying? Please add these to your question – Chris Davies May 15 '22 at 11:18
  • and did you quote the URL on the wget command line? Without quoting, your shell will try to run everything before the first & as one command, everything up to the next & as another command, and everything up to the third & as a third command. Why? Because & terminates a command like ; does, and also causes it to run as a background job. See Why does my shell script choke on whitespace or other special characters? for more about the importance of quoting (there is a short illustration after these comments). – cas May 15 '22 at 11:55
  • @cas No, I didn't quote the URL on the wget command line. With your suggestion I made a step forward. Now, quoting the URL, wget downloads as files the links to the pages that contain the PDFs, but I would like it to download those pages and the PDFs within them as well. – user9952796 May 15 '22 at 13:17
  • You'll need to extract the URLs for the links and PDFs from the first URL you downloaded. Personally, I'd probably write a bot in perl using libwww-perl or maybe the Web::Scraper module. Another useful method, a quick hack more suited to a once-off job, is to use lynx -dump -listonly -nonumbers -hiddenlinks=ignore "$URL" instead of wget, extract the URLs I was interested in with grep or awk, and then fetch THEM with wget (a rough sketch follows after these comments). – cas May 15 '22 at 13:22
  • or, depending on how many links you're NOT interested in are linked to from that first URL, just use wget's mirroring options. The wget man page is pretty comprehensive and takes a bit of reading before it makes sense, but there's a lot you can do with it once you understand its various options. – cas May 15 '22 at 13:25
  • I'm using the -r -l 0 options in wget, but I don't understand why, on this particular site, it doesn't follow the pages linked in the menus – user9952796 May 15 '22 at 13:33
  • @user9952796 please update your question with new information. Make it easy for people to help you by keeping everything in the same place – Chris Davies May 15 '22 at 14:25
  • Now that the quoting has been applied properly, it's a duplicate of How to download all PDF files linked from a single page using wget – Chris Davies May 15 '22 at 14:27
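
As a minimal illustration of the quoting point made in the comments (using the URL from the question; any URL containing & behaves the same way):

# Unquoted: the shell splits the line at each '&' and runs the pieces as
# separate background jobs, so wget only sees the URL up to the first '&'.
wget https://bcs.wiley.com/he-bcs/Books?action=index&itemId=1119299160&bcsId=10685

# Quoted: the whole URL, including its query string, reaches wget intact.
wget 'https://bcs.wiley.com/he-bcs/Books?action=index&itemId=1119299160&bcsId=10685'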
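
A rough sketch of the lynx-based quick hack suggested in the comments, assuming the files of interest appear as ordinary links on the page (which, as the answer below explains, is not quite the case here, since the PDF URLs are produced by JavaScript):

# Dump every URL lynx can see on the page, keep only those ending in .pdf,
# and feed them to wget. Assumes the PDFs appear as plain hrefs.
url='https://bcs.wiley.com/he-bcs/Books?action=index&itemId=1119299160&bcsId=10685'
lynx -dump -listonly -nonumbers -hiddenlinks=ignore "$url" | grep -i '\.pdf$' | wget -i -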

1 Answer


Using wget alone won't work because of the way the URLs are handled by JavaScript. You will have to parse the page using xmllint and then process the URLs into a format wget can handle.

Start by extracting and processing the URLs handled by JavaScript and outputting it to urls.txt:

wget -O - 'https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647' | \
xmllint --html --xpath "//li[@class='resourceColumn']//a/@href" - 2>/dev/null | \
sed -e 's# href.*Books#https://bcs.wiley.com/he-bcs/Books#' -e 's/amp;//g' -e 's/&newwindow.*$//' > urls.txt

Now download the PDF files found from opening each URL in urls.txt:

wget -O - -i urls.txt | grep -o 'https.*pdf' | wget -i -

curl alternative:

curl 'https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647' | \
xmllint --html --xpath "//li[@class='resourceColumn']//a/@href" - 2>/dev/null | \
sed -e 's# href.*Books#https://bcs.wiley.com/he-bcs/Books#' -e 's/amp;//g' -e 's/&newwindow.*$//' > urls.txt
curl -s $(cat urls.txt) | grep -o 'https.*pdf' | xargs -l curl -O
Snake
  • Thank you for the explanation! Unfortunately, the command yields --2022-05-17 21:57:19-- https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647 Resolving bcs.wiley.com (bcs.wiley.com)... 104.18.37.205, 172.64.150.51, 2606:4700:4400::ac40:9633, ... Connecting to bcs.wiley.com (bcs.wiley.com)|104.18.37.205|:443... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: ‘STDOUT’ --.-KB/s in 0s Cannot write to ‘-’ (Success). – user9952796 May 17 '22 at 19:59
  • @user9952796 I am unable to reproduce that error. Do you have wget configured using a ~/.wgetrc? If so, you can try adding --no-config to wget. Does it still output urls.txt despite the error? Make sure xmllint is installed. – Snake May 17 '22 at 22:42
  • @user9952796 Give the curl alternative shown above a try if all else fails. – Snake May 17 '22 at 23:21
  • OK, now it's working fine! Does your command download all the files, even types I can't know in advance (as is reasonable to expect for a site you don't know)? – user9952796 May 18 '22 at 11:28
  • @user9952796 It only downloads PDF files, but if you want other files you could change the file extension in the grep command. The paths in the first command will also likely require modifying for different pages. If you found my answer helpful, please consider marking it answered. Thanks. – Snake May 18 '22 at 16:59
  • How could I extend the grep command to cover all types of files? – user9952796 May 18 '22 at 18:35
  • @user9952796 The provided commands were really only meant for this specific case because of how that website is structured. If other file types are handled the same way as PDFs, then you can specify more file types with the -e option, e.g., grep -o -e 'https.*pdf' -e 'https.*doc' (see the sketch below). For other cases, it may be better to simply use wget -r when possible to get all types of files. – Snake May 18 '22 at 20:32
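
For example, a variant of the answer's second command that greps for a few more extensions (the extra extensions here are only placeholders; adjust them to whatever the site actually serves):

# Same pipeline as in the answer, matching several file types instead of one.
wget -O - -i urls.txt | grep -o -e 'https.*pdf' -e 'https.*doc' -e 'https.*ppt' | wget -i -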