Download recursively with wget

Question

I have a problem with the following wget command:

wget -nd -r -l 10 http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

It should recursively download all of the linked documents on the original site, but it downloads only two files (index.html and robots.txt).

How can I get a recursive download of this site?

score 47 · Accepted Answer · edited Mar 27 '17 at 04:56

By default, wget honours the robots.txt standard for crawling pages, just like search engines do, and archive.org's robots.txt disallows the entire /web/ subdirectory. To override this, use -e robots=off:

wget -nd -r -l 10 -e robots=off http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

edited Mar 27 '17 at 04:56

rubo77

answered Nov 25 '11 at 17:52

Ulrich Schwarz

Thank you. Is there an option to store every link only once? Maybe I should decrease 10 to a lower number, but it's hard to guess. Now there are files introduction.html, introduction.html.1 and introduction.html.2, so I stopped the process. – xralf Nov 25 '11 at 18:14
Also, the links still point to the web. Is the --mirror option what makes the links point to the local filesystem? – xralf Nov 25 '11 at 18:16

@xralf: well, you are using -nd, so different index.htmls are put in the same directory, and without -k, you'll not get rewriting of the links. – Ulrich Schwarz Nov 25 '11 at 18:35
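A minimal sketch of what this comment suggests, assuming the same URL as in the question: dropping -nd lets wget recreate the remote directory structure, which should stop same-named files from piling up as introduction.html.1 and so on, and adding -k (--convert-links) rewrites the links to point at the local copies. The echo prefix only prints the command for review; it is not part of the original answer.

```shell
#!/bin/sh
# URL taken from the question.
url='http://web.archive.org/web/20110726051510/http://feedparser.org/docs/'

# -r -l 10        recurse, at most 10 levels deep (as in the question)
# -k              convert links in downloaded pages to point at local files
# -e robots=off   ignore robots.txt (as in the accepted answer)
# No -nd here, so wget recreates the remote directory structure locally.
opts='-r -l 10 -k -e robots=off'

# Print the command for review; remove "echo" to actually run it.
# $opts is left unquoted on purpose so it splits into separate options.
echo wget $opts "$url"
```

Once the printed command looks right, removing the echo starts the download.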

Nikhil Mulley · Answer 2 · 2012-01-21T08:36:29.490

17

$ wget --random-wait -r -p -e robots=off -U Mozilla \
    http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

Recursively downloads the content of the URL.

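--random-wait - vary the delay between requests randomly between 0.5 and 1.5 times the --wait value (it is meant to be used together with --wait).
-r - turn on recursive retrieving.
-p - also download the page requisites (images, stylesheets and so on) needed to display each page.
-e robots=off - ignore robots.txt.
-U Mozilla - set the "User-Agent" header to "Mozilla". A better choice, though, is a real User-Agent string like "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)".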

Some other useful options are:

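--limit-rate=20k - limit the download speed to 20 KB/s (the k suffix means kilobytes, not kilobits).
-o logfile.txt - write wget's log messages to logfile.txt instead of the terminal.
-l 0 - remove the recursion depth limit (the default maximum depth is 5).
--wait=1h - be sneaky, wait one hour between downloads.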
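As a rough illustration, the options above can be combined into one command. The one-second --wait value is an assumption chosen for the example, not something from the answer, and the echo prefix only prints the command for review.

```shell
#!/bin/sh
# URL taken from the question.
url='http://web.archive.org/web/20110726051510/http://feedparser.org/docs/'

# -r -l 0             recurse with no depth limit (can fetch far more than expected)
# -p                  also fetch page requisites such as images and stylesheets
# -e robots=off       ignore robots.txt
# --wait=1            base delay of 1 second between requests (assumed value)
# --random-wait       vary that delay between 0.5 and 1.5 times --wait
# --limit-rate=20k    cap the download speed at 20 kilobytes per second
# -o logfile.txt      write wget's messages to logfile.txt
opts='-r -l 0 -p -e robots=off --wait=1 --random-wait --limit-rate=20k -o logfile.txt'

# Print the command for review; remove "echo" to actually run it.
# $opts is left unquoted on purpose so it splits into separate options.
echo wget $opts "$url"
```

A polite delay and rate limit keep the load on the remote server low, at the cost of a slower download.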

edited Jan 21 '12 at 08:36

answered Jan 21 '12 at 08:07

Nikhil Mulley

-l 0 - remove recursion depth (which is 5 by default) +1 – Dani May 13 '19 at 11:30
Specifying -U Mozilla skipped the "Are you a robot?" question. – Vikas Kandari Feb 01 '21 at 16:50
