Why is "wget -r -e robots=off http://ccachicago.org" not acting recursively?

Question

I am trying to recursively download http://ccachicago.org, but I get exactly one file: the root index.html.

I've looked at Download recursively with wget and started using the recommended -e robots=off, but it still behaves the same.

How, with wget or some other tool, can I download a copy of the site?

Possible duplicate of this question? – Geeb, Jan 27 '14 at 16:12

score 8 · Accepted Answer · answered Jan 27 '14 at 16:16

You are asking wget to do a recursive download of http://ccachicago.org, but this URL doesn't serve any content directly. Instead, it is just a redirect to http://www.ccachicago.org (which you haven't told wget to fetch recursively).

If you tell wget to download the correct URL, it will work:

wget -r -e robots=off http://www....
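
To confirm a redirect like this before starting a recursive download, one option is to look at the response headers. The following is only a minimal sketch, assuming curl is installed; location_of is a hypothetical helper name (not part of wget or curl) that prints the Location header from whatever headers are piped into it.

```shell
# location_of: hypothetical helper; reads HTTP response headers on stdin
# and prints the value of the first Location header, if there is one.
location_of() {
    # Strip carriage returns, then match the header name case-insensitively.
    tr -d '\r' | awk '!found && tolower($1) == "location:" { print $2; found = 1 }'
}

# Usage (requires network access and curl; not run here):
#   curl -sI http://ccachicago.org | location_of
```

If this prints a URL on a different hostname than the one you passed in, that hostname is probably the one to hand to wget -r.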

score 7 · Answer 2 · answered Jan 27 '14 at 16:15

This happens because, by default, wget only recurses within the hostname you started with.

http://ccachicago.org issues a redirect to http://www.ccachicago.org. Since all further links are under www.ccachicago.org, wget will consider those links as being off-site and won't follow them.
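
The hostname comparison described above can be illustrated with a rough sketch. This is not wget's actual implementation; host_of and same_host are hypothetical helpers that do a simplified, exact string comparison of the hostname part of two URLs (no handling of user:password@ prefixes or letter case).

```shell
# host_of: hypothetical helper; prints the hostname part of a URL.
host_of() {
    # Strip the scheme, then everything from the first "/" or ":" onward.
    printf '%s\n' "$1" | sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' -e 's|[/:].*$||'
}

# same_host: succeeds only if both URLs have exactly the same hostname.
same_host() {
    [ "$(host_of "$1")" = "$(host_of "$2")" ]
}

if same_host "http://ccachicago.org" "http://www.ccachicago.org/"; then
    echo "same host: links would be followed"
else
    echo "different host: links treated as off-site"
fi
```

Under this exact-match rule, ccachicago.org and www.ccachicago.org count as different hosts, so the script takes the second branch.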

The easiest solution here is of course to start with wget -r http://www.ccachicago.org.

You could also add www.ccachicago.org to the list of domains to follow (per the wget manual, -D does not itself turn on host spanning, so -H may also be needed):

wget -r -D www.ccachicago.org http://ccachicago.org

For the future, you can find this kind of information by adding the debug flag (-d). When I did that, I got:

Deciding whether to enqueue "http://www.ccachicago.org/".
This is not the same hostname as the parent's (www.ccachicago.org and ccachicago.org).
Decided NOT to load it.
Redirection "http://www.ccachicago.org/" failed the test.
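
A recursive run can produce a long debug log, so a small filter may help to pull out just these decisions. This is a minimal sketch, assuming the debug output has the format shown above (it may differ between wget versions, and -d only produces output if wget was built with debug support); rejected_urls is a hypothetical helper name.

```shell
# rejected_urls: hypothetical helper; reads wget debug output on stdin and
# prints each URL that wget considered but decided NOT to load.
rejected_urls() {
    awk '
        /^Deciding whether to enqueue "/ {
            # Remember the URL between the double quotes.
            split($0, parts, "\"")
            url = parts[2]
        }
        /^Decided NOT to load it\./ {
            if (url != "") print url
            url = ""
        }
    '
}

# Usage (requires network access; wget writes its log to stderr):
#   wget -r -d -e robots=off http://ccachicago.org 2>&1 | rejected_urls
```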
