
Let's say I'm on a webpage that contains several hyperlinks to PDF documents, and I want to download those PDFs. Is there a way to get a list of those documents (in the ls way) so I can better choose which files to download with wget or curl?

ecjb

3 Answers


You can use lynx or links (text mode web browsers) to download and display a list of links in a web page, then pipe that into grep to extract only the PDF links. For example:

URL='https://www.example.com/files/'
lynx -dump -listonly -nonumbers "$URL" | grep -i '\.pdf$'

NOTE: it is important to double-quote "$URL", especially if the URL contains spaces or shell metacharacters (such as ; or &, which are very frequently seen in URLs). Save yourself a headache and always quote URL strings and variables containing URLs when you use them (in fact, it's pretty nearly always a good idea to double-quote variables when you use them, whether they contain URLs or not - see Why does my shell script choke on whitespace or other special characters?).

You could then redirect grep's output to a file, edit it with a text editor to delete the PDF files you're not interested in, and then use wget's -i (--input-file=file) option to download all the URLs in the file. Or you could manually download them one at a time with wget or curl.
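A rough sketch of that workflow, reusing $URL from above (the file name and editor are just examples):

lynx -dump -listonly -nonumbers "$URL" | grep -i '\.pdf$' > pdf-urls.txt
"${EDITOR:-vi}" pdf-urls.txt   # delete the lines you don't want
wget -i pdf-urls.txt           # download every URL left in the file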


BTW, wget also has a -m (--mirror) option for mirroring web sites, and numerous options for controlling exactly what gets downloaded. For example, -A and -R accept or reject files matching suffixes or glob-like patterns (-A pdf or -A '*.pdf'), and --accept-regex & --reject-regex do the same with regular expressions. Other options control whether wget will follow links to other sites (and to which other sites), whether to follow links to parent or child directories (and how many levels deep), and more. There are a lot of options, and even more interactions between combinations of options, so don't expect to master it instantly.
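For example, a minimal sketch of a PDF-only fetch from a single page (the URL is a placeholder; check man wget before pointing this at a real site):

URL='https://www.example.com/files/'
# -r -l 1 : recurse, but only one level deep (i.e. the links on that page)
# -nd     : don't recreate the remote directory structure locally
# -A pdf  : accept only files whose names end in .pdf
wget -r -l 1 -nd -A pdf "$URL"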

cas
  • Many thanks for your answer @cas. It works well and therefore I upvoted it. Another answer below also seems to work well, so I cannot accept either as I think they are equally good – ecjb Jan 02 '22 at 16:45
  • Accept the one you like best. Or wait for a better answer to come along. – cas Jan 03 '22 at 01:32

You didn't specify which webpage you're referring to, but if it provides a listing of files, such as https://ftp.gnu.org/gnu/tar, you can use lftp:

$ lftp https://ftp.gnu.org/gnu/tar/
cd ok, cwd=/gnu/tar
lftp ftp.gnu.org:/gnu/tar> ls
(...)
-rw-r--r--          181  2021-02-13 06:32  tar-latest.tar.bz2.sig
-rw-r--r--   4.2M   2021-02-13 06:32  tar-latest.tar.gz
-rw-r--r--          181  2021-02-13 06:32  tar-latest.tar.gz.sig
-rw-r--r--   2.1M   2021-02-13 06:33  tar-latest.tar.xz
-rw-r--r--          181  2021-02-13 06:33  tar-latest.tar.xz.sig

You can now create a directory on your local filesystem, change to it and download a file:

lftp ftp.gnu.org:/gnu/tar> !mkdir /tmp/download
lftp ftp.gnu.org:/gnu/tar> lcd /tmp/download
lcd ok, local cwd=/tmp/download
lftp ftp.gnu.org:/gnu/tar> get tar-latest.tar.xz
2022-01-02 14:54:21 https://ftp.gnu.org/gnu/tar/tar-latest.tar.xz -> /tmp/download/tar-latest.tar.xz 0-2226068 1.72 MiB/s
2226068 bytes transferred in 1 second (1.72 MiB/s)

Or download multiple files with the mget command, as shown below.
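For example, mget accepts globs, and the same session can be scripted non-interactively with lftp -c (the *.sig pattern is just an illustration):

lftp ftp.gnu.org:/gnu/tar> mget *.sig

# or from the shell, without an interactive session:
lftp -c 'open https://ftp.gnu.org/gnu/tar/; lcd /tmp/download; mget *.sig'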

  • Many thanks for your answer @ArkadiuszDrabczyk. It works well and therefore I upvoted it. Another answer also seems to work well, so I cannot accept either as I think they are equally good – ecjb Jan 02 '22 at 16:45

Open the developer console in your browser with Ctrl+Shift+I and go to the Console tab. Then paste this code and press Enter:

// Collect the href of every <a> element whose target ends in ".pdf"
let allLinks = ""
document.querySelectorAll("a").forEach(item => {
  if (item.href.toLowerCase().endsWith(".pdf")) {
    allLinks += item.href + "\n"
  }
})
console.log(allLinks)

This will print all the PDF links to the console; you can then copy them into a text editor and edit the list further.
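If you save that output to a file (the name links.txt below is arbitrary), you can then feed it to wget, as in the first answer:

wget -i links.txt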