I am using the command below to fetch an iTunes page with wget, together with all of its images and JavaScript files. All I want is this page and all its images and scripts.

 wget -kKErpNF --no-check-certificate --html-extension  -nd -A jpg,jpeg,png,js  -nH https://itunes.apple.com/us/app/megamilhoes-megasena-gerador/id854897303?mt=12

The command almost works, but it does not save the page itself, because the page is dynamic and built in the browser. The URL has no .htm/.html extension. How do I get the page as well?

--html-extension has no effect. I am on OS X Mavericks.

Duck

1 Answer

Apple appears to reject the HTML file download by default. I ran the command you specified on my machine. If you look carefully at the output, you will see something like this:

Loading robots.txt; please ignore errors.
--2014-05-24 10:43:50--  https://itunes.apple.com/robots.txt
Connecting to itunes.apple.com|23.206.210.217|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 234 [text/plain]
Saving to: `robots.txt'

So, as per this answer, we can make wget ignore the robots.txt file by adding -e robots=off to the command.

Wget by default honours the robots.txt standard for crawling pages, just like search engines do, and for archive.org, it disallows the entire /web/ subdirectory. To override, use -e robots=off.
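Applied to the command in the question, that option would presumably be used as in the sketch below. The exact command line the answer ran is not shown, so this is a reconstruction: it keeps the question's flags unchanged and only appends -e robots=off. The helper name fetch_ignoring_robots is hypothetical, and the URL is quoted so that the shell does not try to interpret the ? character.

```shell
# URL from the question, quoted so the shell leaves the '?' alone.
url='https://itunes.apple.com/us/app/megamilhoes-megasena-gerador/id854897303?mt=12'

# Hypothetical helper: the question's flags plus "-e robots=off",
# which tells wget not to honour robots.txt for this run.
fetch_ignoring_robots() {
    wget -kKErpNF --no-check-certificate --html-extension -nd \
         -A jpg,jpeg,png,js -nH -e robots=off "$1"
}
```

Running fetch_ignoring_robots "$url" then starts the download.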

So, I modified your command to add -e robots=off, and when I ran it again, I got the output below.

Connecting to itunes.apple.com|23.204.162.217|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `id854897303?mt=12.html'

    [ <=>                    ] 33,456      --.-K/s   in 0.001s

2014-05-24 10:48:38 (30.1 MB/s) - `id854897303?mt=12.html' saved [33456]

Removing id854897303?mt=12.html since it should be rejected.

As you can see, the HTML file is saved and then removed, so the download appears to be prevented by Apple, and we cannot do anything about it.

EDIT: Even without -e robots=off, we are not able to download the HTML file. The original wget command reports the same "rejected" message. So, I suspect Apple is disallowing wget downloads.
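One thing worth checking before blaming the server: "Removing ... since it should be rejected" is the message wget itself prints when a file fetched during a recursive run does not match its own -A accept list. The list in the question (jpg,jpeg,png,js) has no html entry, which would explain why the page is saved for a moment and then deleted. A minimal sketch under that assumption follows: the question's flags with html added to -A, wrapped in a hypothetical helper named fetch_with_html. It has not been run against the iTunes page, so treat it as something to try rather than a confirmed fix.

```shell
# URL from the question, quoted so the shell leaves the '?' alone.
url='https://itunes.apple.com/us/app/megamilhoes-megasena-gerador/id854897303?mt=12'

# Hypothetical helper: same flags as the question, but "html" is added
# to the -A accept list so the saved page should no longer be rejected.
fetch_with_html() {
    wget -kKErpNF --no-check-certificate --html-extension -nd \
         -A html,jpg,jpeg,png,js -nH "$1"
}
```

Running fetch_with_html "$url" starts the download. Dropping the -A option altogether would be another way to test the same assumption, at the cost of keeping every file type wget retrieves.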

Ramesh
  • Through Finder you can see that wget generates an HTML file for a brief period of time and then deletes it. – Duck May 24 '14 at 15:52
  • Yeah, I see it. I am afraid that you will not be able to download the contents. – Ramesh May 24 '14 at 15:55
  • +1 If Apple is disallowing wget specifically (some sites do), you might get different results by experimenting with the --user-agent switch, which makes the server believe the request comes from a normal web browser. – goldilocks May 24 '14 at 16:30
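The --user-agent suggestion from the last comment could be tried roughly as sketched here. The browser string is only an illustrative value, not one known to work with iTunes, and the helper name fetch_as_browser is hypothetical. The remaining flags are copied from the question.

```shell
# URL from the question, quoted so the shell leaves the '?' alone.
url='https://itunes.apple.com/us/app/megamilhoes-megasena-gerador/id854897303?mt=12'

# Illustrative browser identification string (an assumption, pick any
# string that a real browser would send).
ua='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0 Safari/537.36'

# Hypothetical helper: the question's flags plus a custom User-Agent header.
fetch_as_browser() {
    wget -kKErpNF --no-check-certificate --html-extension -nd \
         -A jpg,jpeg,png,js -nH --user-agent="$ua" "$1"
}
```

Running fetch_as_browser "$url" starts the download. If the log output stays the same with a browser-like string, the server is probably not filtering on the user agent.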