
I have a site on a server that is basically a bunch of HTML pages, pictures, and sounds.

I have lost my password to that server, and I need to grab everything that is stored there. I can go page by page and save everything, but the site has more than 100 pages.

I am using OS X. I have tried to use wget, but I think the server is blocking it.

Is there any alternative I can use to grab that content?

Duck
  • If you have physical access to the server, boot into single user mode and recover your password. http://www.debuntu.org/how-to-recover-root-password-under-linux-with-single-user-mode/ – spuder Aug 17 '13 at 18:12

2 Answers


If the server is blocking wget, it is most likely doing it on the basis of the "User-agent:" field of the HTTP header, since that is the only way for it to know it is wget in the first place. It could also be blocking your IP, in which case switching to different software won't help, or it could be using some scheme that identifies automation by how rapidly a set of requests arrives (since real people don't browse 100 pages in 3.2 seconds). I have not heard of anyone doing that, but it is possible.

wget can also be slowed down if needed (see its --wait and --limit-rate options), and there is a way to spoof the user-agent field:

wget --user-agent=""

According to the man page, that drops "User-agent:" completely, since it is not mandatory. If the server doesn't like that, try --user-agent="Mozilla/5.0", which should be good enough.
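
For what it's worth, a rough sketch of a full mirror run along those lines; http://example.com/ is just a placeholder for the actual site, and the pacing options are optional but keep the crawl from looking like a bot:

# Sketch: mirror the whole site with a browser-like user agent and paced requests.
wget --mirror --page-requisites --convert-links \
     --user-agent="Mozilla/5.0" \
     --wait=1 --random-wait --limit-rate=200k \
     http://example.com/

--mirror turns on recursion with timestamping, --page-requisites grabs the images and sounds each page needs, and --convert-links rewrites the links so the copy browses locally.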

Of course, it would help if you explained better why you "think the server is blocking that". Does wget say anything, or just time out?

goldilocks

I usually use httrack for downloading/mirroring web content from a site.

$ httrack http://2011.example.com -K -w -O . -%v --robots=0 -c1 %e0
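
Roughly what those options do, as I read them from the httrack man page (worth double-checking against the version you have installed):

# -K           keep the original (absolute) links in the saved pages
# -w           mirror mode
# -O .         put the mirror, logs and cache in the current directory
# -%v          show file names on screen as they are downloaded
# --robots=0   ignore robots.txt rules
# -c1          use a single connection, which is gentler on the server
# %e0          do not follow links to external sites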

After it runs you're left with a directory structure that's local and browseable. For example:

$ ls -l
total 304
-rw-r--r--  1 saml saml   4243 Aug 17 10:20 backblue.gif
-rw-r--r--  1 saml saml    828 Aug 17 10:20 fade.gif
drwx------  3 saml saml   4096 Aug 17 10:20 hts-cache
-rw-rw-r--  1 saml saml    233 Aug 17 10:20 hts-in_progress.lock
-rw-rw-r--  1 saml saml   1517 Aug 17 10:20 hts-log.txt
-rw-------  1 saml saml 271920 Aug 17 10:22 hts-nohup.out
-rw-r--r--  1 saml saml   5141 Aug 17 10:20 index.html
drwxr-xr-x 10 saml saml   4096 Aug 17 10:21 2011.example.com
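
To check the result, you can just open the top-level index.html in a browser; on OS X, for example, from inside that directory:

$ open index.html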

As it downloads you'll see the following type of output:

Bytes saved:    21,89KiB           Links scanned:   12/45 (+4)
Time:   2s                         Files written:   4
Transfer rate:  2,65KiB/s (2,65KiB/s)  Files updated:   1
Active connections:     1          Errors:  7

Current job: parsing HTML file (57%)
 request -  2011.example.com/cgi-bin/hostnames.pl   0B /    8,00KiB

It can be backgrounded and/or aborted and later resumed (see the sketch below). This is just the tip of the iceberg in terms of its features. There is also a GUI for both setting up a download and monitoring it as it progresses.
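
For the backgrounding/resuming mentioned above, a rough sketch (same placeholder URL as before; if I remember the shortcut correctly, --continue resumes an interrupted mirror from its project directory):

$ nohup httrack http://2011.example.com -K -w -O . -%v --robots=0 -c1 %e0 &

Later, from the same directory:

$ httrack --continue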

There is extensive documentation on the httrack website, and plenty more can be found by searching online.

slm