3

I've always wondered how to download, through the Linux shell (I have wget and curl), files that don't have a direct URL of their own, i.e. where the actual file is only offered when a specific URL is visited in the browser. When I try to download such a file through the shell (with either wget or curl), all I get is an HTML file.

For example, I'm trying to download a several-MB file from here:

http://www.ebi.ac.uk/ena/data/view/U00096.3&display=fasta&download=fasta&filename=entry.fasta

When I paste this into the browser, I get the Save As dialog offering to save a file called 'entry.fasta' rather than another HTML file. I tried curl -O -L -J as suggested in this question, but it didn't work either.
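
Roughly what I tried was something like this (from memory, so the exact flags may have differed slightly):

# note: the URL is unquoted here; all this ever fetched was an HTML page
curl -O -L -J http://www.ebi.ac.uk/ena/data/view/U00096.3&display=fasta&download=fasta&filename=entry.fasta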

IGS
  • What do you mean by without full URL? The link you provided works fine with wget "URL" and curl -O "URL". Complete genome saved to disk. There is no redirect involved. (Did you remember to quote the URL?) – Runium Nov 07 '15 at 21:44
  • Oh damn... I never used the quotes, but yeah you're right. Thanks!

    (I meant without a specific URL to entry.fasta.)

    – IGS Nov 07 '15 at 21:50
  • @Sukminder may as well post that as an answer. – terdon Nov 07 '15 at 22:22

3 Answers

3

The URL you provided downloads fine using, for example:

wget "URL"
curl -O "URL"

As mentioned in comments: quote. Always quote!

Characters like & have a special meaning in shells, and the URL is not going to be interpreted the way you want without quotes.
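
Concretely, with the URL from the question (the quotes are the only thing that matters here):

# quoting keeps the & characters as part of the URL instead of shell control operators
wget "http://www.ebi.ac.uk/ena/data/view/U00096.3&display=fasta&download=fasta&filename=entry.fasta"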


As for downloading without knowing the file name – I'm still not quite sure what you mean, but some notes:

This is site-specific to ebi.ac.uk

The URL provided is a special form of URI. You are most likely interested in the query part, and more specifically the first section: U00096.3.

You can change this to represent other files and ranges. For example, to download U00000 to U00096:

curl -O "http://www.ebi.ac.uk/ena/data/view/U00000-U00096&display=fasta&download=fasta&filename=U00000-U00096.fasta"
                                            ^^^^ data ^^^

The filename part is simply a suggestion for what to name the file. You can change it to anything you want. For example, filename=myown.fasta is not going to change what is downloaded, only the name proposed by the server to the client (web browser, curl, etc.).
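
For instance, something along these lines (myown.fasta is just an arbitrary name for illustration; the -J option is explained further down):

# the same content is downloaded; only the suggested file name changes
curl -OJ "http://www.ebi.ac.uk/ena/data/view/U00096.3&display=fasta&download=fasta&filename=myown.fasta"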


There are a lot of search and listing possibilities on the site and you have to poke around.

More on what is happening

When you click the download link, or use a tool like curl or wget, a request is sent to the server at ebi.ac.uk for a specific file. In your example it likely has a referer set to:

http://www.ebi.ac.uk/ena/data/view/U00096.3

and a GET query reported as:

query['display'] = fasta
query['download'] = fasta
query['filename'] = entry.fasta
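
If you want to mimic the browser's request more closely, curl can send that referer explicitly; this is only a sketch and is not needed for the download to work:

# -e (--referer) sets the Referer header on the request
curl -O -e "http://www.ebi.ac.uk/ena/data/view/U00096.3" \
     "http://www.ebi.ac.uk/ena/data/view/U00096.3&display=fasta&download=fasta&filename=entry.fasta"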

The server responds with, among other headers, something like:

Content-Disposition: attachment; filename=entry.fasta

This is a way for the server to relay a suggested file name back to the client. If you use a curl version that has the -J option, you can use it to save the file under that name, i.e.:

curl -OJ "URL"
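
For example, with the URL from the question you could first look at the header and then let curl honour the suggested name (the grep line is just a quick check; it still transfers the whole body):

# dump the response headers, discard the body, and look for the suggestion
curl -s -D - -o /dev/null "http://www.ebi.ac.uk/ena/data/view/U00096.3&display=fasta&download=fasta&filename=entry.fasta" | grep -i content-disposition

# download and save under the suggested name (entry.fasta here)
curl -OJ "http://www.ebi.ac.uk/ena/data/view/U00096.3&display=fasta&download=fasta&filename=entry.fasta"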

As mentioned

This is quite site specific and the way the URL is interpreted on the server has to do with how the site is set up.

On a different host with another setup, a query part like filename=foo.txt could just as well mean that you are served an actual file named foo.txt from the server.

As for this site, ebi.ac.uk, the file is not a static file but content generated dynamically from database queries. The result of the query is merged into a file and served to the end user.

Runium
  • 28,811
  • Oh, this is an awesome answer, thank you so much!

    Now I understand much better what's happening.

    – IGS Nov 08 '15 at 00:07
2

Without quotes, the shell sees the & and interprets that to mean "run everything in the line up to the & in the background, and then continue interpreting/running the rest of the line". With quotes, the & is just part of the URL string.

There are three &s in your URL, so without quotes it would run four commands, the first three as background jobs:

wget http://www.ebi.ac.uk/ena/data/view/U00096.3 &
display=fasta &
download=fasta &
filename=entry.fasta

The fix is to quote the URL:

wget 'http://www.ebi.ac.uk/ena/data/view/U00096.3&display=fasta&download=fasta&filename=entry.fasta'

Single-quotes are fine here, but if you wanted to embed the value(s) of any variable(s) in the URL, you'd need to use double-quotes.
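
For example, with the accession in a variable (acc is only an illustrative name), double-quotes let the variable expand while still keeping the & characters literal:

acc=U00096.3
wget "http://www.ebi.ac.uk/ena/data/view/${acc}&display=fasta&download=fasta&filename=entry.fasta"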

cas
  • 78,579
  • Thanks very much for the clarification! It totally makes sense when you write it out line by line. I'm learning bash bit by bit along the way too. – IGS Nov 08 '15 at 00:11
1

Perhaps you could carefully use the recursive download facility of wget. So if you

wget -r http://gcc-melt.org/

you'll download "every" reachable file from the gcc-melt.org site

(but read the documentation of wget before trying)
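
A gentler invocation might look something like this (the exact limits are only suggestions; check the wget manual for what each option does):

# -l 1 limits recursion depth, --wait pauses between requests, --limit-rate caps bandwidth
wget -r -l 1 --wait=2 --limit-rate=200k http://gcc-melt.org/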

PS. I am the owner and author of the http://gcc-melt.org/ site so please don't overload it.