
I am trying to download a file through HTTP from a web site using wget.

When I use:

wget http://abc/geo/download/?acc=GSE48191&format=file

I get only a file called index.html?acc=GSE48191.

When I use:

wget http://abc/geo/download/?acc=GSE48191&format=file -o asd.rpm

I get asd.rpm, but I want the download to be saved under its actual name, and I don't want to have to rename the downloaded file manually.

Kusalananda
  • You might want to ask this sort of question on [bioinformatics.se] next time. It's on topic here as well, and welcome to stay, but you might get more help from people who work in the field. – terdon Sep 26 '17 at 12:49
  • @terdon How is asking about wget and *nix shell behavior on topic on [bioinformatics.se]? – user Sep 26 '17 at 13:35
  • @MichaelKjörling extracting information from NCBI would be, that's why I suggested it. An answer there would likely involve a simpler, more direct approach to get at the information the OP is looking for rather than a shell solution. Something like "you can get this information more easily from here" for instance. – terdon Sep 26 '17 at 13:40
  • Look at the --trust-server-names argument to wget. – ivanivan Sep 26 '17 at 17:11
  • It's important to note that there is no such thing as "the actual name" of a resource referenced by a URL. A web server responds to a request with some content, and possibly some headers which describe that content in some way, but there doesn't have to be a file involved at all. – IMSoP Sep 26 '17 at 21:08

2 Answers

wget --content-disposition 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE48191&format=file'

The file you are downloading is a tar archive (a binary file), served through a dynamic link by a web server. By default, wget names the saved file after the last part of the URL. Here that part is just a REST API endpoint (or something similar), so you end up with an unfriendly name like the index.html?acc=GSE48191 seen in the question. That name is still valid, and the file contents would be the same.

However, in this case the server provides a "Content-Disposition" header containing the actual file name, which wget will use if you give it the --content-disposition option. This option is marked "experimental" in my manual for wget.
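To illustrate where that name comes from, here is a minimal sketch that extracts the filename from a header line. The header value is an assumed example of the usual format, not output captured from the NCBI server, and the sed expression only handles the simple quoted filename="..." form.

```shell
# Assumed example of a Content-Disposition header (illustrative, not real server output).
header='Content-Disposition: attachment; filename="GSE48191_RAW.tar"'

# Extract the value between the double quotes after filename=.
name=$(printf '%s\n' "$header" | sed -n 's/.*filename="\([^"]*\)".*/\1/p')

printf '%s\n' "$name"
```

With --content-disposition (wget) or -J (curl) the tool does this parsing for you, so a sketch like this is only useful for seeing what the server suggests.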

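You also need to quote the URL so that the shell does not interpret the & and ? characters in it. Unquoted, the & ends the command and runs it in the background, so wget never sees the &format=file part of the URL.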
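The effect of quoting can be seen without touching the network. This is a small sketch, assuming a POSIX sh or bash script (zsh handles an unmatched ? differently); printf stands in for wget and simply prints the argument it receives.

```shell
# Unquoted: '&' ends the first command and runs it in the background,
# and 'format=file' becomes a separate variable assignment.
printf '%s\n' http://abc/geo/download/?acc=GSE48191&format=file
wait    # let the background printf finish before continuing
printf 'format=%s\n' "$format"

# Quoted: the whole URL reaches the command as a single argument.
url='http://abc/geo/download/?acc=GSE48191&format=file'
printf '%s\n' "$url"
```

The first printf only ever sees the URL up to acc=GSE48191, which matches the truncated file name reported in the question.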


The equivalent thing using curl:

curl -J -O 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE48191&format=file'

Or, using the equivalent long options:

curl --remote-header-name --remote-name 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE48191&format=file'

Once you have downloaded the file, you need to unpack it:

tar -xvf GSE48191_RAW.tar

Due to the way that this particular archive was created, this will unpack the archive's files into the current directory (so creating a new directory, moving the archive there and unpacking it there may be a good idea). The files in this archive are gzip-compressed CEL files.
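One way to keep the current directory clean is sketched below. The helper name unpack_into_dir is made up for this example, and instead of moving the archive it uses tar's -C option to extract into a new directory named after the archive.

```shell
# Hypothetical helper: unpack a .tar archive into a directory named after it.
unpack_into_dir() {
    archive=$1
    dir=${archive%.tar}              # GSE48191_RAW.tar -> GSE48191_RAW
    mkdir -p "$dir" &&
    tar -xvf "$archive" -C "$dir"    # -C: change to "$dir" before extracting
}

# Usage, once the download has finished:
# unpack_into_dir GSE48191_RAW.tar
```

The archive itself stays where it was downloaded, and only the extracted files land in the new directory.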

Kusalananda

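The shell interprets special characters before wget ever sees them: ? is a wildcard (which doesn't matter here) and & means "run the preceding command in the background". You should have noticed the latter, because an interactive shell prints a job notice and returns to the prompt immediately instead of waiting as it does for an ordinary foreground command.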

So you need to quote:

wget 'http://abc/geo/download/?acc=GSE48191&format=file'
dirkt