9

I want to download part of a large (199GB) .tar.gz file from here. To start, I used the following command to list all of the files in the .tar.gz file:

wget -qO- https://www.cs.cornell.edu/projects/megadepth/dataset/Megadepth_v1/MegaDepth_v1.tar.gz | tar -tz

Next, I tried to download the contents of a folder in the .tar.gz using the command:

wget -qO- https://www.cs.cornell.edu/projects/megadepth/dataset/Megadepth_v1/MegaDepth_v1.tar.gz | tar -xz phoenix/S6/zl548/MegaDepth_v1/0000

However, this takes too long because the tar command searches depth-first and recursively through each of the folders below phoenix/S6/zl548/MegaDepth_v1. I am only interested in the contents of the folder phoenix/S6/zl548/MegaDepth_v1/0000. Is there a way to download the contents of this folder without searching through the sub-folders of the other folders, such as

phoenix/S6/zl548/MegaDepth_v1/0162
phoenix/S6/zl548/MegaDepth_v1/0001
phoenix/S6/zl548/MegaDepth_v1/0132

In other words, is there a faster way to download the contents of the folder phoenix/S6/zl548/MegaDepth_v1/0000?


Some references for the above commands:

How to extract specific file(s) from tar.gz

How to download an archive and extract it without saving the archive to disk?

https://stackoverflow.com/q/2700306/13809128

mhdadk
  • 193

4 Answers

19

tar writes a file header, then the file contents, then the next file header, then the next file contents, and so on.

The entries are stored in no particular order, and the only optimization you could come up with is skipping over a file's contents to reach the next header by seeking directly to it. For that you need a seekable file.

But your .gz is compressed, so you have no reliable way to skip ahead to the next entry, meaning that you will have to read (download) the whole file to get at the contents. That's the answer: no, you cannot avoid reading/downloading the whole file.
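
As a hedged illustration of that layout (assuming a small, uncompressed local archive named demo.tar; the offsets follow the classic ustar header format):

# each entry begins with a 512-byte header; the member name fills the
# first 100 bytes and the size (in octal) sits at offset 124
dd if=demo.tar bs=512 count=1 2>/dev/null | head -c 100    # member name
dd if=demo.tar bs=1 skip=124 count=12 2>/dev/null          # member size, octal

With a seekable, uncompressed file you could read a header, seek past the contents (rounded up to the next 512-byte block), and repeat; gzip compression takes that option away.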

So, since you will have to fully download it anyway, you might as well do it once and then solve everything in the local filesystem.

  • +1, this is essentially what I said in my answer. I also gave an alternate suggestion for user @mhdadk to contact the data owners to provide individual directory access via multiple small downloads. – Prem Jun 01 '22 at 04:01
6

because the tar command searches depth-first and recursively...

Well, it doesn't, actually. It doesn't search at all; it just reads through the archive, looking at every file it encounters to see if it matches what it wants. (You do get the depth-first behaviour, since that's the natural order for walking through a directory tree, and hence the order the files were added to the archive.)

This is because tar archives don't have indexes; they're not searchable. The name "tar" stands for "tape archive", and the usual mode for using tapes is to read or write a single stream, without seeking. The format was made for that context, and might not be the best there is for your use case.

I can't find a good citation on that, but it's mentioned in some answers on the site, and on Wikipedia.
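
As a hedged aside (assuming you do end up with a local copy of the archive, as other answers suggest): a single full pass can build a reusable listing, so you never have to rescan 199 GB just to find a path:

tar -tvzf MegaDepth_v1.tar.gz > index.txt    # one full pass, saved
grep 'MegaDepth_v1/0000/' index.txt          # instant lookups afterwards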

ilkkachu
  • 138,973
  • 1
    Depth-first the natural order? Not in the age when tar was developed. File storage was heavily seek-limited; we didn't have SSDs yet. So when you read a directory, you finished reading the whole directory before moving to the next one. When you encountered a subdirectory, you added it either to a queue (FIFO) or a stack (LIFO). This choice determines whether the traversal is breadth-first or depth-first, but there is no natural choice. – MSalters Jun 02 '22 at 16:17
6

Every time you executed wget, you attempted to download the whole tar file! You might already have downloaded the "initial content" multiple times, only to throw it away by sending the output to stdout!

Instead, the "faster" way would be to download it once to ./MegaDepth_v1.tar.gz in your current directory and untar it there.

wget -q -O MegaDepth_v1.tar.gz https://www.cs.cornell.edu/projects/megadepth/dataset/Megadepth_v1/MegaDepth_v1.tar.gz
tar -xz -f MegaDepth_v1.tar.gz phoenix/S6/zl548/MegaDepth_v1/0000

Once you have the necessary files, you can delete the downloaded tar file.

UPDATE: The original file seems to be about 200 GB in size. Downloading it alone will take a huge amount of time and space; extracting will then take additional time. No win in this case!
You might have to contact the MegaDepth team and ask them to provide individual directory access; otherwise, it will always be slow.

Here, wget cannot skip over unwanted content and will always download the whole tar file from beginning to end. Additionally (as mentioned in the answer by user ilkkachu), tar cannot skip over (or seek through) the piped stream.
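
A hedged illustration (the byte range below is arbitrary, chosen only to land mid-stream): even when the server honours range requests, a slice from the middle of the .gz is not independently decompressible, so a partial download buys nothing here:

curl -s -r 100000000-100000999 https://www.cs.cornell.edu/projects/megadepth/dataset/Megadepth_v1/MegaDepth_v1.tar.gz | gunzip > /dev/null
# gunzip rejects the slice: a mid-stream chunk carries no gzip header
# and depends on every byte that came before it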

Prem
  • 3,342
  • 1
    Technically, wget can skip to a specific offset in the file (if the server supports byte ranges), but because of the gzip layer around the file, you need every byte for decoding. – Ferrybig Jun 01 '22 at 15:41
  • I agree with your comment, in general; moreover, wget does not know about the content (tar or gz or whatever) and hence cannot skip over it selectively. Theoretically, we could make a tool w-archive-get-file which downloads the archive headers, picks out the offset for the necessary file, and continues the download from that position. Currently, wget is agnostic! @Ferrybig – Prem Jun 01 '22 at 15:57
  • If not the gzip layer and if the server supported range request, then I guess a FUSE mount (httpfs2?) instead of wget would allow tar (reading a "local" archive then) to skip. – Kamil Maciorowski Jun 02 '22 at 07:14
  • That looks like a nice idea, but tar seems to have no directory listing in the header, so we cannot skip to the necessary file in one shot. We have to check the first file, skip to its end, check the next file, skip to its end, and so on until we reach the necessary file. Even with the ability to skip or seek, it is going to be slow over the 200 GB tar file when the necessary file is at the 150 GB mark! The optimal possibility, I think, is the MegaDepth team giving individual directory access via multiple downloads! @KamilMaciorowski – Prem Jun 02 '22 at 08:03
1

Analysis

I agree with other answers saying there is no way for tar to seek through the compressed archive. To find the file(s) you're after, the tool needs to process the archive from the beginning and cannot skip anything.

However with GNU tar you don't necessarily need to process it to the end. Consider this scenario when creating an archive:

Supposing you change the file blues and then append the changed version to collection.tar. […], the original blues is in the archive collection.tar. If you change the file and append the new version of the file to the archive, there will be two copies in the archive. When you extract the archive, the older version of the file will be extracted first, and then replaced by the newer version when it is extracted.

(source)

This means that when extracting a specific file, tar keeps processing the archive even after it extracts the file, because another copy may appear later in the archive.
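
A quick way to observe this duplication (a hedged sketch with a throwaway, uncompressed archive; the file name blues follows the manual's example, and -r only works on uncompressed archives):

tar -cf collection.tar blues    # create the archive with the original file
tar -rf collection.tar blues    # append a changed copy of the same file
tar -tf collection.tar          # lists blues twice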

But then:

If you wish to extract the first occurrence of the file blues from the archive, use --occurrence option

(ibid.)

If you are sure the file you're after occurs exactly one time in the archive, use tar --occurrence and tar will stop after extracting the file. Then your wget will abort due to SIGPIPE; it won't download the rest of the archive in vain.


Limited usefulness

Note that this is not really useful in your exact case, because phoenix/S6/zl548/MegaDepth_v1/0000 is a directory (right?). While extracting the directory with --occurrence, tar won't stop early unless it encounters another entry for the directory itself. The reason is that there can always be a unique phoenix/S6/zl548/MegaDepth_v1/0000/foo at the very end of the archive. Before tar gets to the end, it cannot be sure the directory with all its content is complete.

Still, if you were after one or a few non-directories, if you knew the path(s), and if you knew there was exactly one instance of each in the archive, then --occurrence would allow you to download as little of the archive as necessary. If you were lucky and the file(s) happened to be near the beginning of the archive, then --occurrence would make a significant difference.

Probably this answer won't help you much. It's for users who can provide a list of non-directories.


Unless…

If you saved the output of wget -qO- … | tar -tz (when you most likely downloaded and processed the whole archive, and threw it away), you would now be able to provide a list of non-directories (possibly using --files-from= or --verbatim-files-from, especially useful if the list is too long for a single command line). In this case --occurrence may work for you. Additionally, the saved output of tar -t would let you confirm that each non-directory you're after occurs exactly once in the archive, so you would know --occurrence won't make you miss an updated version.

The above assumes MegaDepth_v1.tar.gz on the server does not change. In general (if the archive may have changed), your saved output of tar -t may no longer be valid.

Let's assume you can create a list of non-directories to extract. The list must not specify any directory explicitly, or else --occurrence won't help you. Still, tar will create the necessary directories, but only for the purpose of placing non-directories in them, not because it really extracts the directories from the archive. In other words: archive members for the directories themselves won't matter. This means directories will be created, but options like --preserve-permissions won't apply to them.
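
A hedged sketch of that workflow (wanted.txt is a hypothetical file holding one exact member path per line, no directories):

wget -qO- https://www.cs.cornell.edu/projects/megadepth/dataset/Megadepth_v1/MegaDepth_v1.tar.gz \
| tar -xz --occurrence --files-from=wanted.txt

With --occurrence, tar should stop as soon as every listed member has been seen once, and the resulting SIGPIPE ends the wget instead of letting it fetch the rest of the archive.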


Proof of concept

I used your first command (the one with tar -t) and found out that phoenix/S6/zl548/MegaDepth_v1/0162/dense0/depths/16384199365_2b34b42cf4_b.h5 is a non-directory near the beginning of the archive. This pipeline:

wget -qO- https://www.cs.cornell.edu/projects/megadepth/dataset/Megadepth_v1/MegaDepth_v1.tar.gz \
| tar -xvz phoenix/S6/zl548/MegaDepth_v1/0162/dense0/depths/16384199365_2b34b42cf4_b.h5

extracts the file and keeps processing the stream (I can Ctrl+c it); but this one:

wget -qO- https://www.cs.cornell.edu/projects/megadepth/dataset/Megadepth_v1/MegaDepth_v1.tar.gz \
| tar --occurrence -xvz phoenix/S6/zl548/MegaDepth_v1/0162/dense0/depths/16384199365_2b34b42cf4_b.h5

extracts the file and terminates automatically.