I am trying to recursively download http://ccachicago.org, but only one file, the root index.html, is downloaded.

I've looked at "Download recursively with wget" and started using the recommended -e robots=off, but it still behaves the same.

How, with wget or some other tool, can I download a copy of the site?

2 Answers

You are asking wget to do a recursive download of http://ccachicago.org, but this URL doesn't serve any content directly. Instead, it just redirects to http://www.ccachicago.org (which you haven't told wget to fetch recursively).

If you tell wget to download the correct URL, it will work:

wget -r -e robots=off http://www....
umläute

It's because wget defaults to only doing recursive download within the hostname that you used when you started.

http://ccachicago.org issues a redirect to http://www.ccachicago.org. Since all further links are under www.ccachicago.org, wget will consider those links as being off-site and won't follow them.
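The hostname rule described above can be illustrated with a small shell sketch. It is not wget's actual code: `host_of` and `same_host` are hypothetical helpers that only mimic the comparison, assuming plain http(s) URLs without embedded credentials.

```shell
# Hypothetical helpers, not part of wget: they only mimic the rule.

# Print the lowercased host part of an http(s) URL.
host_of() {
  _rest="${1#*://}"      # drop the scheme
  _rest="${_rest%%/*}"   # drop the path
  _rest="${_rest%%:*}"   # drop an optional port
  printf '%s\n' "$_rest" | tr '[:upper:]' '[:lower:]'
}

# Succeed only if both URLs have exactly the same hostname.
same_host() {
  [ "$(host_of "$1")" = "$(host_of "$2")" ]
}

if same_host "http://ccachicago.org/" "http://www.ccachicago.org/"; then
  echo "same host: would be followed"
else
  echo "different host: treated as off-site"
fi
```

The comparison is on the full hostname, so ccachicago.org and www.ccachicago.org count as different hosts even though they belong to the same site.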

The easiest solution here is of course to start with wget -r http://www.ccachicago.org.

You could also enable host spanning with -H and limit it to the site's domains with -D (on its own, -D does not turn on -H):

wget -r -H -D ccachicago.org,www.ccachicago.org http://ccachicago.org

For the future, you can find this kind of information by adding the debug flag (-d). When I did that, I got:

Deciding whether to enqueue "http://www.ccachicago.org/".
This is not the same hostname as the parent's (www.ccachicago.org and ccachicago.org).
Decided NOT to load it.
Redirection "http://www.ccachicago.org/" failed the test.
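On a large site the debug output gets long, so it may help to filter it. The following is a minimal sketch, assuming the log lines look exactly like the excerpt above; `show_rejections` is a hypothetical helper name, and the message format may differ between wget versions.

```shell
# Hypothetical helper: read wget debug output on stdin and print each
# URL that wget decided not to load, followed by the stated reason.
show_rejections() {
  awk '
    /^Deciding whether to enqueue/ {
      url = $5
      gsub(/^"|"\.$/, "", url)   # strip the quotes and trailing period
      reason = ""
      next
    }
    /^Decided NOT to load it/ {
      if (url != "") print url ": " reason
      url = ""
      next
    }
    # The first line after the "Deciding" line carries the reason.
    url != "" && reason == "" { reason = $0 }
  '
}

# Possible usage (wget writes its log to stderr by default):
#   wget -r -d http://ccachicago.org 2>&1 | show_rejections
```

Each output line pairs a rejected URL with the reason wget logged for it, which makes a hostname mismatch like the one above easy to spot.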
Jenny D