Find recursively all archive files of diverse archive formats and search them for file name patterns

Question

Ideally, I would like to be able to make a call like this:

$searchtool /path/to/search/ -contained-file-name "*vacation*jpg"

... so that this tool

does a recursive scan of the given path,
picks up all files in supported archive formats, which should include at least the most common ones such as zip, rar, 7z, tar.bz, tar.gz ...,
and scans the file list of each archive for the name pattern in question (here *vacation*jpg).

I'm aware of how to use find, tar, unzip and the like. I could combine these in a shell script, but I'm looking for a simple solution: either a shell one-liner or a dedicated tool (pointers to GUI tools are welcome, but my solution must be command-line based).

detly · Answer 1 · 2013-07-05T00:19:10.280

13

If you want something simpler than the AVFS solution, I wrote a Python script called arkfind to do it. You can just run

$ arkfind /path/to/search/ -g "*vacation*jpg"

It'll do this recursively, so you can look at archives inside archives to an arbitrary depth.

edited Jul 05 '13 at 00:19

answered Jul 05 '13 at 00:13

Thanks, nice contribution! Especially if AVFS is no option. – mdo Jul 05 '13 at 07:39
It would be great if it supports jar files. – Chemik Oct 09 '13 at 10:52
@Chemik - noted! I'll do a bit more work on it this weekend :) JAR shouldn't be too hard, I believe it's really just a zip file to the outside world. – detly Oct 09 '13 at 11:11
@Chemik - I just tried it, and it should support JAR files in its current form anyway. Can you test it out, and if it doesn't work as you expect, file a bug on the Github page? (I did just fix a bug, so be sure to update your copy.) – detly Oct 12 '13 at 02:10
Yes I see now, it works. You can add "JAR files" to README :) – Chemik Oct 12 '13 at 11:45
It works for me when passing an archive as argument, but not a directory: IOError: [Errno 21] Is a directory: '.' – golimar Jul 27 '17 at 08:24
@golimar That's weird, I thought I tested it on directories. I'll look into it. – detly Jul 27 '17 at 22:47

score 10 · Accepted Answer · edited Apr 13 '17 at 12:36

(Adapted from How do I recursively grep through compressed archives?)

Install AVFS, a filesystem that provides transparent access inside archives. First run this command once to set up a view of your machine's filesystem in which you can access archives as if they were directories:

mountavfs

After this, if /path/to/archive.zip is a recognized archive, then ~/.avfs/path/to/archive.zip# is a directory that appears to contain the contents of the archive.

find ~/.avfs"$PWD" \( -name '*.7z' -o -name '*.zip' -o -name '*.tar.gz' -o -name '*.tgz' \) \
     -exec sh -c '
                  find "$0#" -name "$1"
                 ' {} '*vacation*.jpg' \;

Explanations:

Mount the AVFS filesystem.
Look for archive files in ~/.avfs$PWD, which is the AVFS view of the current directory.
For each archive, execute the specified shell snippet (with $0 = archive name and $1 = pattern to search).
$0# is the directory view of the archive $0.
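The archive name is passed to sh as a separate argument ($0) rather than written as {} inside the shell snippet, because not every find substitutes {} when it is embedded in a longer -exec argument (some do it, some don't).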
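
The steps above can be wrapped in a small reusable function. This is only a sketch, not part of the original answer: the names avfs_view and avfs_search are made up for illustration, and it assumes that mountavfs has already been run and that the AVFS view lives under ~/.avfs.

```shell
# Hypothetical helpers (names are illustrative, not part of AVFS).

# Print the AVFS directory view of an archive given by absolute path:
#   /path/to/archive.zip -> ~/.avfs/path/to/archive.zip#
avfs_view() {
  printf '%s\n' "$HOME/.avfs$1#"
}

# Search every archive below directory $1 (absolute path) for members
# whose file name matches the glob $2. Assumes mountavfs has been run.
avfs_search() {
  find "$HOME/.avfs$1" \( -name '*.7z' -o -name '*.zip' \
         -o -name '*.tar.gz' -o -name '*.tgz' \) \
       -exec sh -c 'find "$0#" -name "$1"' {} "$2" \;
}

# Usage: avfs_search "$PWD" '*vacation*.jpg'
```

Passing the pattern as $2 keeps it out of the quoted shell snippet, so it needs no extra escaping.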

Or in zsh ≥4.3:

mountavfs
ls -l ~/.avfs$PWD/**/*.(7z|tgz|tar.gz|zip)(e\''
     reply=($REPLY\#/**/*vacation*.jpg(.N))
'\')

Explanations:

~/.avfs$PWD/**/*.(7z|tgz|tar.gz|zip) matches archives in the AVFS view of the current directory and its subdirectories.
PATTERN(e\''CODE'\') applies CODE to each match of PATTERN. The name of the matched file is in $REPLY. Setting the reply array turns the match into a list of names.
$REPLY\# is the directory view of the archive.
$REPLY\#/**/*vacation*.jpg matches *vacation*.jpg files in the archive.
The N glob qualifier makes the pattern expand to an empty list if there is no match.

Rodrigo Gurgel · Answer 3 · 2024-01-12T16:00:19.730

My usual solution:

find -iname '*.zip' -exec unzip -l {} \; 2>/dev/null | grep '\.zip\|DESIRED_FILE_TO_SEARCH'

Example:

find -iname '*.zip' -exec unzip -l {} \; 2>/dev/null | grep '\.zip\|characterize.txt'

Results look like:

foozip1.zip:
foozip2.zip:
foozip3.zip:
    DESIRED_FILE_TO_SEARCH
foozip4.zip:
...

If you want only the zip file with hits on it:

find -iname '*.zip' -exec unzip -l {} \; 2>/dev/null | grep '\.zip\|FILENAME' | grep -B1 'FILENAME'

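FILENAME is used twice here, so you may want to put it in a variable.

To search a directory other than the current one, give it to find as its first argument: find PATH/TO/SEARCH -iname '*.zip', followed by the same -exec as above.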
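
As a possible alternative to the grep -B1 step, the combined listing can be post-processed with awk so that each hit is printed next to its archive. This is a sketch under assumptions: zip_hits is a made-up name, and the parsing assumes the usual Info-ZIP unzip -l layout (an "Archive:" line, then length, date, time and name columns).

```shell
# Hypothetical helper: reads concatenated `unzip -l` listings on stdin and
# prints "archive: member" for each member whose path contains the string $1.
zip_hits() {
  WANT="$1" awk '
    # Remember which archive the following member lines belong to.
    /^Archive: / { sub(/^Archive: +/, ""); archive = $0; next }
    # Member lines look like: length, date, time, name.
    NF >= 4 && $1 ~ /^[0-9]+$/ {
      name = $0
      sub(/^ *[0-9]+ +[^ ]+ +[^ ]+ +/, "", name)
      if (index(name, ENVIRON["WANT"])) print archive ": " name
    }'
}

# Usage:
#   find . -iname '*.zip' -exec unzip -l {} \; 2>/dev/null | zip_hits characterize.txt
```

The search string is handed to awk through the environment so that it is compared as a fixed string and is only written once.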

score 3 · Answer 4 · answered Apr 08 '16 at 16:02

Another solution that works is zgrep

zgrep -r filename *.zip

John Oxley

What implementation of zgrep is that? That doesn't work with the one shipped with GNU gzip (/bin/zgrep: -r: option not supported, zgrep (gzip) 1.6) – Stéphane Chazelas Sep 23 '16 at 08:44

Yordan Georgiev · Answer 5 · 2016-09-26T07:04:07.947

3

IMHO user-friendliness should be a thing in bash as well:

 # to_srch holds the string to look for, e.g. to_srch=vacation
 while IFS= read -r zip_file ; do echo "$zip_file" ; unzip -l "$zip_file" | \
 grep -i --color=always -- "$to_srch"; \
 done < <(find . -name '*.zip') | \
 less -R

and for tar (this one is untested ...)

 while IFS= read -r tar_file ; do echo "$tar_file" ; tar -tf "$tar_file" | \
 grep -i --color=always -- "$to_srch"; \
 done < <(find . \( -name '*.tar.gz' -o -name '*.tar' \)) | \
 less -R
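
The tar loop can also be written as a plain POSIX sh function that prefixes each hit with its archive, which avoids the bash-only process substitution. This is a sketch, not the answer author's code: tar_search is a made-up name, and it assumes a tar that detects compression by itself when listing (GNU tar and bsdtar do). File names containing newlines are not handled.

```shell
# Hypothetical helper: list tar members below directory $1 whose path
# contains the string $2 (case-insensitive), prefixed by the archive name.
tar_search() {
  find "$1" -type f \( -name '*.tar' -o -name '*.tar.gz' \) |
  while IFS= read -r tar_file; do
    # List the members, keep the matching ones, and label each hit.
    tar -tf "$tar_file" 2>/dev/null |
      grep -i -F -e "$2" |
      while IFS= read -r member; do
        printf '%s: %s\n' "$tar_file" "$member"
      done
  done
}

# Usage: tar_search . vacation | less
```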

edited Sep 26 '16 at 07:04

answered Sep 23 '16 at 06:28

What unzip implementation can deal with 7z or tar.gz files? – Stéphane Chazelas Sep 23 '16 at 08:42
Yeah, that is a bug ... corrected ... one should definitely use the correct binaries for the correct file types ... I just aimed to demonstrate the one-liner ... gee, this one is almost ready to be a how-to recipe ... – Yordan Georgiev Sep 26 '16 at 07:03

Stéphane Chazelas · Answer 6 · 2017-10-31T19:12:39.957

libarchive's bsdtar can handle most of those file formats, so you could do:

find . \( -name '*.zip' -o     \
          -name '*.tar' -o     \
          -name '*.tar.gz' -o  \
          -name '*.tar.bz2' -o \
          -name '*.tar.xz' -o  \
          -name '*.tgz' -o     \
          -name '*.tbz2' -o    \
          -name '*.7z' -o      \
          -name '*.iso' -o     \
          -name '*.cpio' -o    \
          -name '*.a' -o       \
          -name '*.ar' \)      \
       -type f                 \
       -exec bsdtar tf {} '*vacation*jpg' \; 2> /dev/null

With GNU find, you can simplify this (and make the extension match case-insensitive):

find . -regextype egrep \
       -iregex '.*\.(zip|7z|iso|cpio|ar?|tar(|\.[gx]z|\.bz2)|tgz|tbz2)' \
       -type f \
       -exec bsdtar tf {} '*vacation*jpg' \; 2> /dev/null

That doesn't print the path of the archive where those *vacation*jpg files are found though. To print that name you could replace the last line with:

-exec sh -ac '
   for ARCHIVE do
     bsdtar tf "$ARCHIVE" "*vacation*jpg" |
       awk '\''{print ENVIRON["ARCHIVE"] ": " $0}'\''
   done' sh {} + 2> /dev/null

which gives an output like:

./a.zip: foo/blah_vacation.jpg
./a.zip: bar/blih_vacation.jpg
./a.tar.gz: foo/blah_vacation.jpg
./a.tar.gz: bar/blih_vacation.jpg

Or with zsh:

setopt extendedglob # best in ~/.zshrc
for archive (**/*.(#i)(zip|7z|iso|cpio|a|ar|tar(|.gz|.xz|.bz2)|tgz|tbz2)(.ND)) {
  matches=("${(f@)$(bsdtar tf $archive '*vacation*jpg' 2> /dev/null)}")
  (($#matches)) && printf '%s\n' "$archive: "$^matches
}

Note that there are a number of other file formats that are just zip or tgz files in disguise like .jar or .docx files. You can add those to your find/zsh search pattern, bsdtar doesn't care about the extension (as in, it doesn't rely on the extension to determine the type of the file).
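
To check which names an extended extension list would pick up before running the full search, the pattern can be tried on its own. This is a sketch: is_archive_name is a made-up helper, the regex is the one from the GNU find example with jar and docx added, and the empty alternative after tar is rewritten as an optional group because not every grep accepts empty alternatives.

```shell
# Hypothetical helper: succeed if the path $1 ends in one of the archive
# extensions (case-insensitive, like find's -iregex).
is_archive_name() {
  printf '%s\n' "$1" |
    grep -Eiq '\.(zip|jar|docx|7z|iso|cpio|ar?|tar(\.[gx]z|\.bz2)?|tgz|tbz2)$'
}

# Usage:
#   is_archive_name ./backup/Photos.TAR.GZ && echo "would be searched"
```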

Note that *vacation*jpg above is matched on the full archive member path, not just the file name, so it would match on vacation.jpg but also on vacation/2014/file.jpg.

To match on the file name only, one trick is to use extract mode with -s (substitution), which takes regexps and accepts a p flag to print the names of the matching files, and then make sure no file is actually extracted, like:

bsdtar -'s|.*vacation[^/]*$||p' -'s|.*||' -xf "$archive"

Note that it outputs the list on stderr and appends >> to every line. In any case, bsdtar, like most tar implementations, may mangle file names on display if they contain characters such as newline or backslash (rendered as \n or \\).
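
Another way to match on the file name only is to post-filter the plain listing in the shell. This is a sketch rather than a bsdtar feature: match_basename is a made-up name, and because it works line by line on the tf output it shares the limitation about mangled names mentioned above.

```shell
# Hypothetical filter: read archive member paths on stdin and keep those
# whose last path component matches the glob $1.
match_basename() {
  while IFS= read -r member; do
    # ${member##*/} strips everything up to the last slash.
    case ${member##*/} in
      $1) printf '%s\n' "$member" ;;
    esac
  done
}

# Usage: bsdtar tf "$archive" | match_basename '*vacation*jpg'
```

Directory entries end in a slash, so their last component is empty and they are skipped unless the glob matches the empty string.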
