11

I am trying to download all the links from aligajani.com. There are 7 of them, excluding links to the facebook.com domain, which I want to ignore: I don't want any link that starts with facebook.com.

Also, I want them saved in a .txt file, line by line. So there would be 7 lines.

Here's what I've tried so far; it just downloads everything, which is not what I want:

wget -r -l 1 http://aligajani.com
  • Yes, wget downloads whole pages. Why do you think it might only save the links? – michas Feb 26 '14 at 06:44
  • Well, wget has an option that downloads png files from my site. That means there must somehow be a way to get all the URLs from my site too. I just gave an example of what I am currently trying to do. – Ali Gajani Feb 26 '14 at 06:46
  • You're trying to use completely the wrong tool for the job; this is not at all what wget is designed to do. In future, please don't assume a particular tool for a job when asking a question; it's not a good way to ask questions in general, but it's especially poor practice in a technical venue. – Chris Down Feb 26 '14 at 06:52
  • Apologies, I am a noob to wget. I thought wget had powerful functionality built in for tasks like web crawling and more, so I assumed it could do something like this. I did look at the man page for wget and didn't find anything related to what I wanted, so I came here. – Ali Gajani Feb 26 '14 at 07:08

4 Answers

25

wget does not offer such an option. Please read its man page.

You could use lynx for this:

lynx -dump -listonly http://aligajani.com | grep -v facebook.com > file.txt

From its man page:

   -listonly
          for -dump, show only the list of links.
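
If you want file.txt to contain just the bare URLs, one per line, note that lynx's -listonly output is a numbered "References" list, so a little post-processing helps. Here is a minimal, untested sketch that assumes the usual numbered "1. http://..." layout of that list:

# keep only the numbered reference lines, print the URL field, drop facebook.com links
lynx -dump -listonly http://aligajani.com \
  | awk '/^ *[0-9]+\./ { print $2 }' \
  | grep -v facebook.com > file.txt
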
– michas
  • It dumps the links of a single page. Is there a way to do this recursively? – That Brazilian Guy May 27 '16 at 14:28
  • Have a look at the man page of lynx. I'm pretty sure it has no option for this. Maybe you want wget -r. Have a look at man wget. Maybe you want to ask a separate question saying what you really want to do and why. – michas May 27 '16 at 14:33
  • @michas it doesn't work for https://piazza.com/university_of_dhaka/summer2017/cse3205/resources. There are many Dropbox links on that resource page, but lynx returns nothing about Dropbox. This may be because the site is password protected. – alhelal Nov 22 '17 at 12:33
  • For password-protected sites, you can add -auth=ID:PASSWD – markgraf Aug 15 '19 at 07:07
  • This usage did not download anything for me. – SmallChess Apr 30 '21 at 04:55
17

Use the following in the terminal:

wget -r -p -k http://website

or

wget -r -p -k --wait=#SECONDS http://website

Note: the second form is for websites that may flag you if you download too quickly (and heavy crawling can even cause them a loss of service), so use it in most circumstances to be courteous. Everything will be placed in a folder with the same name as the website, created in whatever directory your terminal is in when you run the command.
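
For example, a polite run against the site from the question might look like this (the 2-second wait is just an arbitrary value for illustration):

# recursive download with page requisites and converted links, pausing 2 seconds between requests
wget -r -p -k --wait=2 http://aligajani.com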

2

As others have pointed out, wget is not designed for this. You can however parse its output to get what you want:

$ wget http://aligajani.com -O - 2>/dev/null | 
    grep -oP 'href="\Khttp:.+?"' | sed 's/"//' | grep -v facebook > file.txt

That creates a file called file.txt with the following contents:

http://www.linkedin.com/pub/ali-ayaz-gajani/17/136/799
http://www.quora.com/Ali-Gajani
http://www.mrgeek.me/
http://twitter.com/aligajani
http://www.mrgeek.me
http://aligajani.com
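
A small, untested variation on the same idea: match https links as well as http, and skip the sed pass by excluding the closing quote inside the grep pattern itself:

wget http://aligajani.com -O - 2>/dev/null \
  | grep -oP 'href="\Khttps?://[^"]+' \
  | grep -v facebook > file.txt
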
– terdon
0

You can use wget's -o log option for that, then extract the links from the log file, following this guide: https://www.garron.me/en/bits/wget-download-list-url-file.html
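
A rough, untested sketch of that approach, assuming --spider so wget crawls without keeping the downloaded files, and plain grep to pull the URLs back out of the log:

# crawl one level deep without saving pages, logging wget's output to wget.log
wget -r -l 1 --spider -o wget.log http://aligajani.com
# pull the URLs out of the log, drop facebook.com links, de-duplicate
grep -oP 'https?://[^ ]+' wget.log | grep -v facebook.com | sort -u > file.txt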