I am trying to fetch a specific webpage with wget, using this command in a bash script:

wget --no-cookies --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" -O "$2/content.html" "$1"

The result is that I get the site's bot-check page instead of the page I asked for, which I think is because wget is reusing the existing connection. The command worked until I ran it many times in quick succession while testing; now my server gets redirected to a bot test by the site, and I can't use that.

--2017-12-12 19:16:42--  https://www.kayak.co.uk/h/bots/human-redirect.vtl?url=%2Fflights%2FDUB-LAX%2F2018-06-04%2F2018-06-25%2F2adults%3Fsort%3Dbestflight_a
Reusing existing connection to [www.kayak.co.uk]:443.
HTTP request sent, awaiting response... 200 OK

My question is: is there any way to stop wget from reusing the existing connection, so that it reconnects to the site for each download?

Jeff Schaller
  • wget is not (and cannot be) reusing an old connection. Your previous attempts probably triggered a system on the website that now thinks you are a bot and redirects you to a page that says so, perhaps with a captcha to resolve the situation. Try opening the page in your browser to read it. It is probably your IP address that is now considered a bot, so no new wget command will change the result (you could try changing the headers, but that could make the situation worse and get you banned for real). – Patrick Mevzek Dec 12 '17 at 19:31
  • I doubt wget will save connections across multiple invocations; it would require some rather interesting acrobatics to save and pass the file descriptor. – ilkkachu Dec 12 '17 at 19:32

1 Answer

I know this is an old issue, but perhaps this will help others who come across it as I have.

To disable the "keep-alive" feature, use the --no-http-keep-alive argument.

From the man page:

Turn off the "keep-alive" feature for HTTP downloads. Normally, Wget asks the server to keep the connection open so that, when you download more than one document from the same server, they get transferred over the same TCP connection. This saves time and at the same time reduces the load on the server.

Using this argument is typically needed in cases where a new, clean request is necessary. Although not strictly related, the --no-cache and --no-cookies arguments might also be relevant in cases where the --no-http-keep-alive argument is used.

So the OP's command would probably be:

wget --no-http-keep-alive --no-cache --no-cookies --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" -O "$2/content.html" "$1"
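
Since the OP runs this from a script, here is a minimal sketch of how the same command might be wrapped, assuming the URL is the first argument and an existing output directory is the second, as in the question. The names fetch_page and WGET_BIN are made up for illustration and are not part of wget; WGET_BIN only exists so the arguments can be printed without contacting the site. Whether this gets past the site's bot detection is a separate matter, as the comments point out.

```shell
#!/bin/sh
# Sketch only: fetch_page and WGET_BIN are hypothetical names, not wget features.

fetch_page() {
    url=$1      # page to download
    outdir=$2   # existing directory that will receive content.html

    # Set WGET_BIN=echo to print the arguments instead of downloading.
    # Quoting "$url" and "$outdir" keeps characters such as ? and spaces intact.
    "${WGET_BIN:-wget}" \
        --no-http-keep-alive \
        --no-cache \
        --no-cookies \
        --header="Accept: text/html" \
        --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" \
        -O "$outdir/content.html" \
        "$url"
}

# Download only when the script is given exactly two arguments: URL and directory.
if [ "$#" -eq 2 ]; then
    fetch_page "$1" "$2"
fi
```

Running it with WGET_BIN=echo in the environment prints the full argument list, which is a cheap way to check the quoting before sending more requests to a site that has already flagged the server.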
shovavnik