
I need to download all PDF files that are linked from this webpage https://www.in.tum.de/i07/lehre/ss22/theo/.

Everything I have tried so far either does not get the files or downloads the entire website recursively.

However, I am only interested in the PDFs linked to directly from this PAGE (not the whole website).

Thanks.

3 Answers


You can use wget's --no-parent (-np) and --level=depth (-l) options to control how much of the site will be mirrored by the -r option. The --no-host-directories (-nH) and --no-directories (-nd) options will also prevent wget from duplicating the remote directory structure. (A double hyphen precedes an option written in full, while a single hyphen precedes its short form: --no-parent is -np.)

e.g. something like this:

wget -r -l 1 -nH -nd -np --ignore-case -A '*.pdf' https://www.in.tum.de/i07/lehre/ss22/theo/

By default, that will save the .pdf files in the current directory. You can use the -P option to specify a different output directory.
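
For example, to collect the PDFs in a separate directory (the pdfs directory name here is just an illustrative choice):

wget -r -l 1 -nH -nd -np --ignore-case -A '*.pdf' -P pdfs https://www.in.tum.de/i07/lehre/ss22/theo/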

wget is very flexible and has a lot of options - so many that the man page can be overwhelming when you first read it, but it is definitely worth putting in some effort to read and experiment with.

cas

You can extract the list of PDF files from the web page using wget or curl to download it, and xmlstarlet to parse the resulting HTML/XML:

curl https://www.in.tum.de/i07/lehre/ss22/theo/ |
    xmlstarlet format -H - 2>/dev/null |
    xmlstarlet select -t -m '//a[contains(@href,"pdf")]' -v '@href' -n

The first xmlstarlet invocation converts the HTML to well-formed XML. The second finds all the a elements and extracts the value of each href attribute that contains pdf.

From there it's straightforward to download each of the extracted links. Pipe the output of the preceding block into a loop:

while IFS= read -r url
do
    file="${url%\?*}"                            # Strip trailing ? parameters
    file="${file##*/}"                           # Strip leading URL path
    printf "Saving %s as %s\n" "$url" "$file"    # Report action
    curl "$url" >"$file"
done
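
Putting the two pieces together, a minimal sketch of the complete pipeline (this assumes the extracted hrefs are absolute URLs; if they are relative to the page, prepend https://www.in.tum.de/i07/lehre/ss22/theo/ before downloading):

curl https://www.in.tum.de/i07/lehre/ss22/theo/ |
    xmlstarlet format -H - 2>/dev/null |
    xmlstarlet select -t -m '//a[contains(@href,"pdf")]' -v '@href' -n |
    while IFS= read -r url
    do
        file="${url%\?*}"                          # Strip trailing ? parameters
        file="${file##*/}"                         # Strip leading URL path
        printf "Saving %s as %s\n" "$url" "$file"  # Report action
        curl "$url" >"$file"                       # Download into the current directory
    done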
Chris Davies

More broadly, you can use wget to download all PDFs from a webpage by using:

wget -r -l1 -H -t1 -nd -N -np -A.pdf -erobots=off --wait=2 --random-wait --limit-rate=20k [URL]
  • -r: Recursive download.
  • -l1: Only one level deep (i.e., only files directly linked from this page).
  • -H: Span hosts (follow links to other hosts).
  • -t1: Number of retries is 1.
  • -nd: Don't create a directory structure, just download all the files into the current directory.
  • -N: Turn on timestamping.
  • -np: Do not follow links to parent directories.
  • -A.pdf: Accept only files that end with .pdf.
  • -erobots=off: Ignore the robots.txt file (use carefully, respecting site's terms and conditions).
  • --wait=2: Wait 2 seconds between each retrieval.
  • --random-wait: Wait from 0.5 to 1.5 * --wait seconds between retrievals.
  • --limit-rate=20k: Limit the download rate to 20 kilobytes per second.

The rate-limiting parameters (--wait, --random-wait and --limit-rate) help avoid a "429: Too Many Requests" error.
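
For the page in the question, that could look like the following (adding -P with a theo-pdfs directory is just an illustrative choice to keep the downloads together):

wget -r -l1 -H -t1 -nd -N -np -A.pdf -erobots=off --wait=2 --random-wait --limit-rate=20k \
     -P theo-pdfs https://www.in.tum.de/i07/lehre/ss22/theo/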