
I need to download a file using wget, but I don't know exactly what the file name will be.

https://foo/bar.1234.tar.gz

According to the man page, wget lets you turn globbing on and off when dealing with an FTP site, but I have an HTTP URL.

How can I use a wildcard with wget? I'm using GNU Wget.

Things I've tried:

/usr/local/bin/wget -r "https://foo/bar.*.tar.gz" -P /tmp

Update

Using the -A option causes all files ending in .tar.gz on the server to be downloaded.

/usr/local/bin/wget -r "https://foo/" -P /tmp -A "bar.*.tar.gz"

Update

From the answers, this is the syntax which eventually worked.

/usr/local/bin/wget -r -l1 -np "https://foo" -P /tmp -A "bar*.tar.gz"
spuder
    This is not exactly what you're looking for but it's related: Curl has the ability to use basic wildcards, eg: curl "http://example.com/picture[1-10].jpg" -o "picture#1.jpg" – Hello World Dec 13 '14 at 09:01
  • 2
    One gotcha for me was the -e robots=off parameter to not obey robots.txt: http://stackoverflow.com/a/11124664/1097104 – Juuso Ohtonen Oct 26 '16 at 11:57
  • I found adding the flags -nH and --cut-dirs=<number> was also useful – Randall Mar 12 '18 at 18:12

4 Answers


I think these switches will do what you want with wget:

   -A acclist --accept acclist
   -R rejlist --reject rejlist
       Specify comma-separated lists of file name suffixes or patterns to 
       accept or reject. Note that if any of the wildcard characters, *, ?,
       [ or ], appear in an element of acclist or rejlist, it will be 
       treated as a pattern, rather than a suffix.

   --accept-regex urlregex
   --reject-regex urlregex
       Specify a regular expression to accept or reject the complete URL.

Example

$ wget -r --no-parent -A 'bar.*.tar.gz' http://url/dir/
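
In versions of GNU Wget that support --accept-regex, the same filtering can instead be expressed as a regular expression matched against the complete URL rather than a suffix pattern. A minimal sketch, reusing the hypothetical http://url/dir/ from the example above and assuming the bar.1234.tar.gz naming from the question:

$ wget -r --no-parent --accept-regex 'bar\..*\.tar\.gz$' http://url/dir/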
slm
  • Thanks for your answer. What happens if you have a page with this: <td><a href="10yard.zip">10yard.zip</a> (<a href="10yard.zip/">View Contents</a>)</td>? How can you download only 10yard.zip? I'm using wget -r -np -nc -l 1 -A zip https://archive.org/download/mame-merged/mame-merged/ and it downloads the zip file, deletes it, and creates a folder named 10yard.zip. Thanks. – VansFannel Jan 30 '22 at 09:00
  • @VansFannel - If you have a new Q it's best to ask on the main site. It's OK if the Q extends/builds off of existing ones. – slm Jan 30 '22 at 19:58
  • Like the one I asked yesterday: https://superuser.com/questions/1702157/download-10yard-zip-with-wget-not-10yard-zip – VansFannel Jan 31 '22 at 18:40

There's a good reason this can't work directly with HTTP: a URL is not a file path, although the use of / as a delimiter can make it look like one, and the two do sometimes correspond.1

Conventionally (or, historically), web servers often do mirror directory hierarchies (for some -- e.g., Apache -- this is sort of integral) and even provide directory indexes much like a filesystem. However, nothing about the HTTP protocol requires this.

This is significant, because if you want to apply a glob to, say, everything that is a subpath of http://foo/bar/, then unless the server provides some mechanism to give you such a listing (e.g. the aforementioned index), there's nothing to apply the glob to. There is no filesystem there to search. For example, just because you know there are pages http://foo/bar/one.html and http://foo/bar/two.html does not mean you can get a list of files and subdirectories via http://foo/bar/. It would be completely within protocol for the server to return 404 for that. Or it could return a list of files. Or it could send you a nice jpg picture. Etc.

So there is no standard here that wget can exploit. AFAICT, wget works to mirror a path hierarchy by actively examining links in each page. In other words, if you recursively mirror http://foo/bar/index.html it downloads index.html and then extracts links that are a subpath of that.2 The -A switch is simply a filter that is applied in this process.
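
One way to check what wget actually has to work with is to fetch the would-be index page yourself and list the links in it. A minimal sketch, assuming the hypothetical http://foo/bar/ from above actually returns an HTML index:

$ wget -nv -O- http://foo/bar/ | grep -o 'href="[^"]*"'

If that prints nothing useful, the server is not exposing a listing, and recursive retrieval with -A has nothing to filter.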

In short, if you know these files are indexed somewhere, you can start with that using -A. If not, then you are out of luck.


1. Of course an FTP URL is a URL too. However, while I don't know much about the FTP protocol, I'd guess based on its nature that it may be of a form which allows for transparent globbing.

2. This means that there could be a valid URL http://foo/bar/alt/whatever/stuff/ that will not be included because it is not in any way linked to anything in the set of things linked to http://foo/bar/index.html. Unlike filesystems, web servers are not obliged to make the layout of their content transparent, nor do they need to do it in an intuitively obvious way.

goldilocks

Use the -nd option to save all files to the current directory, without a hierarchy of directories. For example:

wget -r -nd --no-parent -A 'bar.*.tar.gz' http://url/dir/
jasper

The above '-A pattern' solution may not work with some web pages. This is my work-around, with a double wget:

  1. wget the page
  2. grep for pattern
  3. wget the file(s)

Example: suppose it's a news podcast page, and I want 5 MP3 files from the top of the page:

wget -nv -O- https://example/page/ |
 grep -o '[^"[:space:]]*://[^"[:space:]]*pattern[^"[:space:]]*\.mp3' |
  head -n5 | while read x; do
    sleep $(($RANDOM % 5 + 5))  ## to appear gentle and polite
    wget -nv "$x"
  done

The grep looks for double-quoted, whitespace-free links that contain :// and my file name pattern.

nightshift