24

I stumbled upon this website, which talks about this.

So what is the right command for downloading a whole website while getting the gzipped version?

I've tested this command, but I don't know if wget is really getting the gzipped version:

wget --header="accept-encoding: gzip" -m -Dlinux.about.com -r -q -R gif,png,jpg,jpeg,GIF,PNG,JPG,JPEG,js,rss,xml,feed,.tar.gz,.zip,rar,.rar,.php,.txt -t 1 http://linux.about.com/
jomnana
  • You say you've tested that command, but @EightBitTony's answer below seems to say that what you would get out of that would be a gzip file of the first hit, without any recursion through the site for more files. Was that the result you got? – Caleb Jun 17 '11 at 14:59
  • linux.about.com is gzip-compressed, and this command recurses the entire site. I've tested this command on other websites and it recurses the entire site too. That's why I'm a bit confused about whether it really downloads the gzipped version or not – jomnana Jun 17 '11 at 15:04

3 Answers

27

If you request gzipped content (using the accept-encoding: gzip header, which is correct), then it's my understanding that wget can't then read that content. So you will end up with a single, gzipped file on disk, for the first page you hit, but no other content.

In other words, you can't use wget to request gzipped content and recurse the entire site at the same time.

I think there's a patch that allows wget to support this function but it's not in the default distribution version.
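For single pages you can still work around this by hand. A minimal sketch (the URL and file names here are just placeholders):

# Fetch one page gzip-encoded; wget stores the compressed bytes as-is.
wget --header="Accept-Encoding: gzip" -O page.html.gz http://linux.about.com/
# Decompress manually, since wget won't parse the gzipped HTML for links.
gunzip -c page.html.gz > page.html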

If you include the -S flag you can tell if the web server is responding with the correct type of content. For example,

wget -S --header="accept-encoding: gzip" wordpress.com
--2011-06-17 16:06:46--  http://wordpress.com/
Resolving wordpress.com (wordpress.com)... 72.233.104.124, 74.200.247.60, 76.74.254.126
Connecting to wordpress.com (wordpress.com)|72.233.104.124|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Server: nginx
  Date: Fri, 17 Jun 2011 15:06:47 GMT
  Content-Type: text/html; charset=UTF-8
  Connection: close
  Vary: Accept-Encoding
  Last-Modified: Fri, 17 Jun 2011 15:04:57 +0000
  Cache-Control: max-age=190, must-revalidate
  Vary: Cookie
  X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
  X-Pingback: http://wordpress.com/xmlrpc.php
  Link: <http://wp.me/1>; rel=shortlink
  X-nananana: Batcache
  Content-Encoding: gzip
Length: unspecified [text/html]

The Content-Encoding header clearly states gzip; however, for linux.about.com (currently),

wget -S --header="accept-encoding: gzip" linux.about.com
--2011-06-17 16:12:55--  http://linux.about.com/
Resolving linux.about.com (linux.about.com)... 207.241.148.80
Connecting to linux.about.com (linux.about.com)|207.241.148.80|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Fri, 17 Jun 2011 15:12:56 GMT
  Server: Apache
  Set-Cookie: TMog=B6HFCs2H20kA1I4N; domain=.about.com; path=/; expires=Sat, 22-Sep-12 14:19:35 GMT
  Set-Cookie: Mint=B6HFCs2H20kA1I4N; domain=.about.com; path=/
  Set-Cookie: zBT=1; domain=.about.com; path=/
  Vary: *
  PRAGMA: no-cache
  P3P: CP="IDC DSP COR DEVa TAIa OUR BUS UNI"
  Cache-Control: max-age=-3600
  Expires: Fri, 17 Jun 2011 14:12:56 GMT
  Connection: close
  Content-Type: text/html
Length: unspecified [text/html]

It's returning plain text/html, with no Content-Encoding: gzip header.
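One way to double-check is to inspect what actually landed on disk with file(1). A small sketch (the output file name is arbitrary):

wget -q --header="Accept-Encoding: gzip" -O test.out http://wordpress.com/
# Prints "gzip compressed data" if the server honoured the header,
# or an HTML/text description otherwise.
file test.out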

Because some older browsers still have issues with gzip-encoded content, many sites only enable it based on browser identification. They often turn it off by default and only turn it on when they know the browser can support it - and they usually don't include wget in that list. This means you may find wget never returns gzip content even if the site appears to do so for your browser.
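If you suspect browser sniffing, one thing worth trying is sending a browser-like User-Agent alongside the header. A sketch (whether it helps depends entirely on the site's configuration):

wget -S --header="Accept-Encoding: gzip" \
     --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
     http://linux.about.com/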

EightBitTony
  • But I got a bunch of files, and not a single gzipped file... or is my wget version different? (using Ubuntu 11.04) – jomnana Jun 17 '11 at 15:07
  • If you use -S you can see the headers returned from the server, and when you do that against linux.about.com you can clearly see it's returning html, not gzip content.

    wget -S --header="accept-encoding: gzip" linux.about.com

    Content-Type: text/html

    – EightBitTony Jun 17 '11 at 15:09
  • Because not all browsers support gzip encoding (IE has major issues), many websites only enable gzip encoding on a per-browser basis and don't bother doing it for wget. That probably explains why linux.about.com doesn't gzip when asked by wget. But it doesn't fix the main issue that (AFAIK) wget can't recurse gzipped content. – EightBitTony Jun 17 '11 at 15:15
  • Just tried this: the wget output is still Content-Type: text/html; charset=UTF-8, but there's also Content-Encoding: gzip. It wouldn't be transparent compression if using it forced the MIME type of everything to gzip... I ran strace -s 128 wget ... to actually see some of the bytes read from the socket / written to disk. They are non-ASCII. So while I think in 2011 your command didn't receive a gzipped version, in 2015 the same command did. (wget 1.15). – Peter Cordes Mar 25 '15 at 10:19
  • I like to use "-O -" to send the page to stdout and then pipe it into gunzip, to check that the raw output is garbled and small, while the output piped through gunzip is big and HTML... – nroose Nov 11 '15 at 17:26
1

A simple command to get the HTML page (or any other file) and compress it:

$ wget -qO - <url> | gzip -c > file_name.gz

For more information about the options, use the man command.
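To read the result back later, a trivial companion sketch using the same placeholder name:

gunzip -c file_name.gz | less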

-1

According to mikeserv et al. in the responses above, for bash (at about version 4.3) the developers adopted an IEEE spec on how to maintain LINENO, so that this value is always set to 1 while the argument of the EXIT trap is evaluated (in fact it's the current line, and it's the first line in that execution context).
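For contrast, a minimal sketch of the behaviour being described (assuming the bash ~4.3 semantics above):

#!/bin/bash
trap 'echo "line=$LINENO"' EXIT   # under this behaviour prints line=1, not the line of the exit
exit 1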

There are various workarounds already listed. Since this one feels quite simple compared to anything else, here is my proof of concept for the problem:

#!/bin/bash
trap 'catch EXIT $? $debug_line_old' EXIT
trap 'debug_line_old=$debug_line;debug_line=$LINENO' DEBUG # note: debug is invoked before line gets executed!
catch() {
  echo "event=$1, rc=$2, line=$3, file=$0"
}
exit 1

When run, you should see this as a result:

event=EXIT, rc=1, line=7, file=./trap_exit_get_lineno.bash

By the way, it's beneficial NOT to mangle every trap event into just a single line; keeping separate lines for separate signals helps a lot when coding it. And second, it seems that a DEBUG trap is quite restricted in terms of calling anything else.