11

I am trying to download all the links from aligajani.com. There are 7 of them, excluding links to the facebook.com domain, which I want to ignore: I don't want any link that starts with facebook.com.

Also, I want them saved in a .txt file, line by line. So there would be 7 lines.

Here's what I've tried so far; it just downloads everything, which is not what I want:

wget -r -l 1 http://aligajani.com
  • Yes, wget downloads whole pages. Why do you think it might only save the links? – michas Feb 26 '14 at 06:44
  • Well, wget has an option that downloads png files from my site. That means there must somehow be a way to get all the URLs from my site too. I just gave an example of what I am currently trying to do. – Ali Gajani Feb 26 '14 at 06:46
  • You're trying to use completely the wrong tool for the job; this is not at all what wget is designed to do. In future, please don't assume a particular tool for a job when asking a question; it's not a good way to ask questions in general, but it's especially poor practice in a technical venue. – Chris Down Feb 26 '14 at 06:52
  • Apologies, I am a noob to wget. I thought wget had powerful functionality built in for tasks like web crawling and more, so I assumed it could do something like this. I did look at the man page for wget and didn't find anything related to what I wanted, so I came here. – Ali Gajani Feb 26 '14 at 07:08

4 Answers

25

wget does not offer such an option. Please read its man page.

You could use lynx for this:

lynx -dump -listonly http://aligajani.com | grep -v facebook.com > file.txt

From its man page:

   -listonly
          for -dump, show only the list of links.
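
If you want file.txt to contain just the bare URLs, one per line, note that lynx's -listonly output is a numbered "References" list, so a little post-processing helps. Here is a minimal, untested sketch that assumes the usual numbered "1. http://..." layout of that list:

# keep only the numbered reference lines, print the URL field, drop facebook.com links
lynx -dump -listonly http://aligajani.com \
  | awk '/^ *[0-9]+\./ { print $2 }' \
  | grep -v facebook.com > file.txt
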
– michas
  • It dumps the links of a single page. Is there a way to do this recursively? – That Brazilian Guy May 27 '16 at 14:28
  • Have a look at the man page of lynx. I'm pretty sure it has no option for this. Maybe you want wget -r. Have a look at man wget. Maybe you want to ask a separate question saying what you really want to do and why. – michas May 27 '16 at 14:33
  • @michas it doesn't work for https://piazza.com/university_of_dhaka/summer2017/cse3205/resources. There are many Dropbox links on that resource page, but lynx returns nothing about Dropbox. This may be because the site is password protected. – alhelal Nov 22 '17 at 12:33
  • For password-protected sites, you can add -auth=ID:PASSWD – markgraf Aug 15 '19 at 07:07
  • This usage did not download anything for me. – SmallChess Apr 30 '21 at 04:55
17

Use the following in the terminal:

wget -r -p -k http://website

or

wget -r -p -k --wait=#SECONDS http://website

Note: the second form is for websites that may flag you if you download too quickly (and heavy crawling can even cause them a loss of service), so use it in most circumstances to be courteous. Everything will be placed in a folder with the same name as the website, created in whatever directory your terminal is in when you run the command.
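
For example, a polite run against the site from the question might look like this (the 2-second wait is just an arbitrary value for illustration):

# recursive download with page requisites and converted links, pausing 2 seconds between requests
wget -r -p -k --wait=2 http://aligajani.com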

2

As others have pointed out, wget is not designed for this. You can however parse its output to get what you want:

$ wget http://aligajani.com -O - 2>/dev/null | 
    grep -oP 'href="\Khttp:.+?"' | sed 's/"//' | grep -v facebook > file.txt

That creates a file called file.txt with the following contents:

http://www.linkedin.com/pub/ali-ayaz-gajani/17/136/799
http://www.quora.com/Ali-Gajani
http://www.mrgeek.me/
http://twitter.com/aligajani
http://www.mrgeek.me
http://aligajani.com
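
A small, untested variation on the same idea: match https links as well as http, and skip the sed pass by excluding the closing quote inside the grep pattern itself:

wget http://aligajani.com -O - 2>/dev/null \
  | grep -oP 'href="\Khttps?://[^"]+' \
  | grep -v facebook > file.txt
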
– terdon
0

You can use wget's -o log option for that, then extract the links from the log file, following this guide: https://www.garron.me/en/bits/wget-download-list-url-file.html
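
A rough, untested sketch of that approach, assuming --spider so wget crawls without keeping the downloaded files, and plain grep to pull the URLs back out of the log:

# crawl one level deep without saving pages, logging wget's output to wget.log
wget -r -l 1 --spider -o wget.log http://aligajani.com
# pull the URLs out of the log, drop facebook.com links, de-duplicate
grep -oP 'https?://[^ ]+' wget.log | grep -v facebook.com | sort -u > file.txt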