
I need to download all PDF files that are linked from this webpage https://www.in.tum.de/i07/lehre/ss22/theo/.

Everything I have tried so far either does not get the files or downloads the entire website recursively.

However, I am only interested in the PDFs linked to directly from this PAGE (not the whole website).

Thanks.

3 Answers


You can use wget's --no-parent (-np) and --level=depth (-l) options to control how much of the site will be mirrored by the -r option. The --no-host-directories (-nH) and --no-directories (-nd) options will also prevent wget from duplicating the remote directory structure. (A double hyphen precedes an option written in full, while a single hyphen precedes its short form: --no-parent is -np.)

e.g. something like this:

wget -r -l 1 -nH -nd -np --ignore-case -A '*.pdf' https://www.in.tum.de/i07/lehre/ss22/theo/

By default, that will save the .pdf files in the current directory. You can use the -P option to specify a different output directory.
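
For example, to collect the PDFs in a separate directory (the pdfs directory name here is just an illustrative choice):

wget -r -l 1 -nH -nd -np --ignore-case -A '*.pdf' -P pdfs https://www.in.tum.de/i07/lehre/ss22/theo/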

wget is very flexible and has a lot of options - so many that the man page can be overwhelming when you first read it, but it is definitely worth putting in some effort to read and experiment with.

cas

You can extract the list of PDF files from the web page using wget or curl to download it, and xmlstarlet to parse the resulting HTML/XML:

curl https://www.in.tum.de/i07/lehre/ss22/theo/ |
    xmlstarlet format -H - 2>/dev/null |
    xmlstarlet select -t -m '//a[contains(@href,"pdf")]' -v '@href' -n

The first xmlstarlet invocation converts the HTML to well-formed XML. The second finds all the a elements and extracts the value of each href attribute that contains pdf.

From there it's straightforward to download each of the extracted links. Pipe the output of the preceding block into a loop:

while IFS= read -r url
do
    file="${url%\?*}"                            # Strip trailing ? parameters
    file="${file##*/}"                           # Strip leading URL path
    printf "Saving %s as %s\n" "$url" "$file"    # Report action
    curl "$url" >"$file"
done
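
Putting the two pieces together, a minimal sketch of the complete pipeline (this assumes the extracted hrefs are absolute URLs; if they are relative to the page, prepend https://www.in.tum.de/i07/lehre/ss22/theo/ before downloading):

curl https://www.in.tum.de/i07/lehre/ss22/theo/ |
    xmlstarlet format -H - 2>/dev/null |
    xmlstarlet select -t -m '//a[contains(@href,"pdf")]' -v '@href' -n |
    while IFS= read -r url
    do
        file="${url%\?*}"                          # Strip trailing ? parameters
        file="${file##*/}"                         # Strip leading URL path
        printf "Saving %s as %s\n" "$url" "$file"  # Report action
        curl "$url" >"$file"                       # Download into the current directory
    done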
Chris Davies

More broadly, you can use wget to download all PDFs from a webpage by using:

wget -r -l1 -H -t1 -nd -N -np -A.pdf -erobots=off --wait=2 --random-wait --limit-rate=20k [URL]
  • -r: Recursive download.
  • -l1: Only one level deep (i.e., only files directly linked from this page).
  • -H: Span hosts (follow links to other hosts).
  • -t1: Number of retries is 1.
  • -nd: Don't create a directory structure, just download all the files into the current directory.
  • -N: Turn on timestamping.
  • -np: Do not follow links to parent directories.
  • -A.pdf: Accept only files that end with .pdf.
  • -erobots=off: Ignore the robots.txt file (use carefully, respecting site's terms and conditions).
  • --wait=2: Wait 2 seconds between each retrieval.
  • --random-wait: Wait from 0.5 to 1.5 * --wait seconds between retrievals.
  • --limit-rate=20k: Limit the download rate to 20 kilobytes per second.

The rate-limiting parameters (--wait, --random-wait and --limit-rate) help avoid a "429: Too Many Requests" error.
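
For the page in the question, that could look like the following (adding -P with a theo-pdfs directory is just an illustrative choice to keep the downloads together):

wget -r -l1 -H -t1 -nd -N -np -A.pdf -erobots=off --wait=2 --random-wait --limit-rate=20k \
     -P theo-pdfs https://www.in.tum.de/i07/lehre/ss22/theo/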