I'm unable to get wget to successfully recurse while mirroring a blog

Question

I want to make an offline mirror of this blog: http://blogs.gamefilia.com/lord-areg

I'm using:

wget --recursive --level=0 --span-hosts --no-clobber --page-requisites --html-extension --convert-links --no-parent -e robots=off --wait=4 --random-wait --adjust-extension --no-check-certificate --user-agent=Mozilla http://blogs.gamefilia.com/lord-areg/

But I only get index.html; none of the subdirectories are downloaded, and I need all of them.

For example:

lord-areg/15-01-2012/47781/boveda-de-articulos-de-silent-hill

lord-areg/01-02-2012/48151/eddie-dombrowski-la-pistola-y-la-pizza-misteriosa

etc.

You could start by removing some options; here is a command that worked for someone else. Then once you have it working, add options to adjust it to your preferences. — Wildcard, May 09 '16 at 21:08
Hello @Wildcard. It didn't work (same result, only index.html was downloaded), but thanks for your answer. Anyway, I don't think removing options will help, since I've just tried wget --recursive -e robots=off --no-parent http://blogs.gamefilia.com/lord-areg/ and it doesn't work either. — 830b et, May 10 '16 at 01:46
Try again without the trailing /. The initial redirection makes a difference (for me) but I haven't yet worked out why. — JigglyNaga, May 10 '16 at 20:14
Unfortunately, if I try without the trailing /, wget tries to mirror the entire gamefilia site, and it's huge... I'm sure there is a way to download individual blogs, but I just can't figure it out. Thank you very much anyway. — 830b et, May 11 '16 at 01:07
Use -I lord-areg as well as omitting the trailing /. See the Note under --no-parent in the wget manual. — JigglyNaga, May 11 '16 at 07:17

score 0 · Accepted Answer · answered May 12 '16 at 07:36

Running with -d shows what's going on:

Location: http://blogs.gamefilia.com/lord-areg [following]
    ....
Deciding whether to enqueue "http://blogs.gamefilia.com/lord-areg".
Going to "" would escape "lord-areg" with no_parent on.
Decided NOT to load it.
Redirection "http://blogs.gamefilia.com/lord-areg" failed the test.

The redirected page was outside the area specified, so although it was retrieved, its contents aren't followed when recursing.

Removing the final / means there's no redirection, but as you found, also means wget doesn't treat lord-areg as a directory, and uses the previous /, so the whole site matches:

Note that, for HTTP (and HTTPS), the trailing slash is very important to ‘--no-parent’. HTTP has no concept of a “directory”—Wget relies on you to indicate what’s a directory and what isn’t. In ‘http://foo/bar/’, Wget will consider ‘bar’ to be a directory, while in ‘http://foo/bar’ (no trailing slash), ‘bar’ will be considered a filename (so ‘--no-parent’ would be meaningless, as its parent is ‘/’).

(4.3 Directory Based Limits)
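
To make the quoted rule concrete, here is a rough shell sketch of how the "directory" falls out of the start URL. It only approximates what wget does internally, and the helper name start_dir is made up for illustration.

```shell
# Hypothetical helper: approximate the directory that --no-parent protects
# by cutting off everything after the last "/" in the URL.
start_dir() {
    printf '%s/\n' "${1%/*}"
}

# Trailing slash: lord-areg itself is treated as the directory.
start_dir 'http://blogs.gamefilia.com/lord-areg/'

# No trailing slash: lord-areg is a "file", so the directory is the site root.
start_dir 'http://blogs.gamefilia.com/lord-areg'
```

The second call prints the site root, which is why --no-parent no longer restricts anything once the trailing slash is dropped.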

So you need to restrict the results in some other manner. -I lord-areg nearly works, but will skip pages of the form /lord-areg?page=1. To match those too, describe the required URLs in more detail:

--accept-regex '^http:\/\/blogs\.gamefilia\.com\/lord-areg[?/]'
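
Put together, the command could look something like the sketch below. The option set is an assumption on my part, pared down from the one in the question, and would_accept and mirror_blog are hypothetical helper names. would_accept only uses grep -E to approximate wget's default POSIX regex matching, so the pattern can be checked against sample URLs without touching the network. The slashes are left unescaped here because they have no special meaning in the pattern, and some-other-blog is an invented path.

```shell
# Same pattern as above, with the slashes unescaped.
pattern='^http://blogs\.gamefilia\.com/lord-areg[?/]'

# Hypothetical helper: succeeds if the URL matches the pattern.
would_accept() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

would_accept 'http://blogs.gamefilia.com/lord-areg?page=1' \
    && echo 'accepted: pagination link'
would_accept 'http://blogs.gamefilia.com/lord-areg/15-01-2012/47781/boveda-de-articulos-de-silent-hill' \
    && echo 'accepted: article'
would_accept 'http://blogs.gamefilia.com/some-other-blog/' \
    || echo 'rejected: other blog'

# Hypothetical wrapper; call mirror_blog to start the actual download.
# No trailing slash on the start URL (avoids the redirect) and no --no-parent.
mirror_blog() {
    wget --recursive --level=inf --page-requisites --adjust-extension \
         --convert-links -e robots=off --wait=4 --random-wait \
         --accept-regex "$pattern" \
         http://blogs.gamefilia.com/lord-areg
}
```

One caveat I have not verified on this site: the regex appears to be applied to every candidate URL, so page requisites (images, stylesheets) that live outside that prefix would probably be skipped too.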
