19

I'm trying to find out which modules on CPAN use Test::Version, so I've used minicpan to mirror it. My problem is that I need to iterate through the downloaded archives and grep the files inside them. Can anyone tell me how I might do this? Preferably in a way that tells me which file in the archive matched and what line the match is on.

(Note: they aren't all tarballs; some are zip files.)

xenoterracide

9 Answers

21

Ok, let's apply the unix philosophy. What are the components of this task?

  • Text search: you need a tool to search text in a file, such as grep.
  • Recursive: you need a tool to go looking for files in a directory tree, such as find.
  • Archives: you need a tool to read them.

Most unix programs operate on files. So to operate easily on archive components, you need to access them as files, in other words you need to access them as directories.

The AVFS filesystem presents a view of the filesystem where every archive file /path/to/foo.zip is accessible as a directory ~/.avfs/path/to/foo.zip#. AVFS provides read-only access to most common archive file formats.

mountavfs
find ~/.avfs"$PWD" \( -name '*.zip' -o -name '*.tar.gz' -o -name '*.tgz' \) \
     -exec sh -c '
                  find "$0#" -name "*.pm" -exec grep "$1" {\} +
                 ' {} 'Test::Version' \;
fusermount -u ~/.avfs   # optional

Explanations:

  • Mount the AVFS filesystem.
  • Look for archive files in ~/.avfs$PWD, which is the AVFS view of the current directory.
  • For each archive, execute the specified shell snippet (with $0 = archive name and $1 = pattern to search).
  • $0# is the directory view of the archive $0.
  • {\} rather than {} is needed in case the outer find substitutes {} inside -exec ; arguments (some do it, some don't).
  • Optional: finally unmount the AVFS filesystem.
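
The path mapping described above can be sketched as a small string helper. The name avfs_view is hypothetical, and the function only builds the path: it assumes AVFS is mounted at ~/.avfs (or at the directory given as the second argument) and does not check that the archive or the mount exists.

```shell
# avfs_view ARCHIVE [AVFS_MOUNT_POINT]
# Hypothetical helper: print the AVFS directory view of ARCHIVE.
avfs_view() {
    case $1 in
        /*) archive=$1 ;;        # already an absolute path
        *)  archive=$PWD/$1 ;;   # resolve relative to the current directory
    esac
    # The view is the mount point, then the absolute path, then a trailing '#'.
    printf '%s%s#\n' "${2:-$HOME/.avfs}" "$archive"
}

# Example (needs mountavfs to have been run first):
#   find "$(avfs_view Foo-1.0.tar.gz)" -name '*.pm'
```

The trailing # is what asks AVFS for the archive's contents instead of the archive file itself.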

Or in zsh ≥4.3:

mountavfs
grep 'Test::Version' ~/.avfs$PWD/**/*.(tgz|tar.gz|zip)(e\''
     reply=($REPLY\#/**/*.pm(.N))
'\')

Explanations:

  • ~/.avfs$PWD/**/*.(tgz|tar.gz|zip) matches archives in the AVFS view of the current directory and its subdirectories.
  • PATTERN(e\''CODE'\') applies CODE to each match of PATTERN. The name of the matched file is in $REPLY. Setting the reply array turns the match into a list of names.
  • $REPLY\# is the directory view of the archive.
  • $REPLY\#/**/*.pm matches .pm files in the archive.
  • The N glob qualifier makes the pattern expand to an empty list if there is no match.
  • This creates the other interesting problem of having to mount and then unmount all of the archives, as part of the problem is that there are 22k archives that need to be searched through – xenoterracide May 28 '11 at 14:24
  • @xenoterracide: How is that a problem? With AVFS, you have a single mount point (~/.avfs), and access to each archive is automatic (~/.avfs/path/to/archive.zip\# is an ordinary directory on the AVFS filesystem, not a mount point). Sure, each archive you access means a little performance hit, but that's intrinsic to the problem. – Gilles 'SO- stop being evil' May 28 '11 at 14:31
  • @gilles only the fact that now I have to go through and figure out how to mount them first, which seems like a bit of a bad idea, better to mount them as I go and unmount after being searched. – xenoterracide May 28 '11 at 15:14
  • @xenoterracide: Again: no, you don't need to mount them individually. The full workflow (apart from installing AVFS if needed) is in my code snippets. – Gilles 'SO- stop being evil' May 28 '11 at 15:18
  • @gilles well I'll have to dig into this a bit... because I get find: missing argument to `-exec' and lots of this from zsh: zsh: Input/output error: Data-Maker-0.27 – xenoterracide May 28 '11 at 15:36
  • @xenoterracide: There was a typo (missing {\}) in the inner find call. Both shell snippets should work now. – Gilles 'SO- stop being evil' May 29 '11 at 18:02
  • @gilles grep: /home/xenoterracide/.avfs/srv/http/cpan/authors/id/J/JI/JINGRAM/Data-Maker-0.27.tar.gz#/Data-Maker-0.27/lib/Data/PaxHeader/Maker.pm: Input/output error – xenoterracide May 29 '11 at 23:48
  • @xenoterracide: This archive was produced with star and includes extensions that aren't supported by AVFS, nor by tar on my system (GNU tar 1.23; tar tzf Data-Maker-0.27.tar.gz doesn't find the PaxHeader directory and prints many errors such as tar: Ignoring unknown extended header keyword `SCHILY.dev'). AVFS can be told to call external programs, so you could install star and have AVFS call it: define an star vfs suffix (see the AVFS README) and use foo.tgz#star/dir instead of foo.tgz#/dir. – Gilles 'SO- stop being evil' May 30 '11 at 00:17
  • cpan's heterogeneousness can be a massive PITA :P – xenoterracide May 30 '11 at 00:26
2

It appears that I can do it this way

find authors/ -type f -exec zgrep "Test::Version" '{}' +  

However, this gives results like:

authors/id/J/JO/JONASBN/Module-Info-File-0.11.tar.gz:Binary file (standard input) matches

which is not very specific about where in the tarball the match is. Hopefully someone can come up with a better answer.

xenoterracide
1

ugrep recursively searches compressed files (gz/Z/bz2/lzma/xz/lz4/zstd) and archives (cpio/tar/pax/zip) with option -z. Options -z --zmax=2 searches compressed files and archives embedded within compressed files and archives (hence zmax=2 levels).
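
A minimal sketch of such an invocation, assuming ugrep is installed. The wrapper name search_with_ugrep is hypothetical; the options are -r (recurse into directories), -z (look inside compressed files and archives), -n (print line numbers), -F (treat the pattern as a fixed string) and -e (the pattern).

```shell
# search_with_ugrep PATTERN DIRECTORY
# Hypothetical wrapper: search DIRECTORY recursively, including inside archives.
search_with_ugrep() {
    ugrep -r -z -n -F -e "$1" "$2"
}

# Example:
#   search_with_ugrep 'Test::Version' authors/
```

Adding --zmax=2 to the command would also cover archives nested one level inside other archives, as described above.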

0

After installing p7zip-* you are able to do this:

ls | xargs -I {} 7z l {} | grep whatever | less

You don't have to use ls before the first pipe; anything that lists the compressed files will work. The final less will show only the path of the listed file inside the compressed archive, not the name of the archive itself.

drs
0

Thanks for the challenge, I came up with:

#!/bin/bash
#

# tarballs to check in
find authors/ -type f | while IFS= read -r tarball; do

    # get list of files in tarball (not dirs ending in /):
    tar tzf "$tarball" | grep -v '/$' | while IFS= read -r file; do

        # get contents of file and look for string
        tar -Ozxf "$tarball" "$file" | grep -q 'Test::Version' && echo "Tar ($tarball) has matching File ($file)"

    done

done
  • Just saw your line number requirement. That can probably work with some combination of grep -n and awk to capture the line number. Can't be as simple as grep -H to list filename since it's always stdin, so might require more lines. – Kyle Smith May 25 '11 at 14:24
  • errors out when run on my system, infinite repeated : tar (child): conform.tar.gz: Cannot open: No such file or directory tar (child): Error is not recoverable: exiting now tar: Child returned status 2 tar: Error is not recoverable: exiting now – xenoterracide May 25 '11 at 14:43
  • also I didn't realize when I first posted this that some of the archives on cpan are zip files. – xenoterracide May 25 '11 at 14:45
  • Hm, I tested with a structure of only .tar.gz files -- it could be made more robust to take appropriate actions based on file type, but this should give a decent starting point. – Kyle Smith May 25 '11 at 18:35
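
A rough sketch of how the line-number and zip points raised in these comments could be handled, assuming a POSIX sh plus tar, gzip, grep and (for zip files) unzip. The function names list_members, cat_member and grep_archive are made up for this illustration, and only names ending in .tar.gz, .tgz or .zip are handled. Each member is streamed to grep -n, and every hit is prefixed with the archive and member name:

```shell
# list_members ARCHIVE: print the names of the non-directory members.
list_members() {
    case $1 in
        *.tar.gz|*.tgz) tar -tzf "$1" ;;
        *.zip)          unzip -Z1 "$1" ;;
    esac 2>/dev/null | grep -v '/$'
}

# cat_member ARCHIVE MEMBER: write one member's contents to stdout.
cat_member() {
    case $1 in
        *.tar.gz|*.tgz) tar -xzOf "$1" "$2" ;;
        *.zip)          unzip -p "$1" "$2" ;;
    esac 2>/dev/null
}

# grep_archive PATTERN ARCHIVE: print hits as ARCHIVE:MEMBER:LINE:TEXT.
grep_archive() {
    list_members "$2" | while IFS= read -r member; do
        cat_member "$2" "$member" | grep -n -F -e "$1" |
        while IFS= read -r hit; do
            printf '%s:%s:%s\n' "$2" "$member" "$hit"
        done
    done
}

# Example:
#   find authors/ -type f | while IFS= read -r a; do
#       grep_archive 'Test::Version' "$a"
#   done
```

Like the script above, this re-reads the archive once per member, so it is slow on large archives, and unzip treats member names containing wildcard characters as patterns.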
0

Use find to locate all necessary files, and then zgrep to look into compressed files:

find <folder> -type f -name "<search criteria[*gz,*bz...]>" -execdir zgrep -in "<grep expression>" '{}' ';'

Didn't test this on tarballs though
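
For a concrete instance of the template above, here is a small sketch, assuming zgrep (shipped with gzip) is available; the function name zgrep_tree is hypothetical. It works on individually gzip-compressed files. On a .tar.gz, zgrep greps the decompressed tar stream as a whole, so it cannot name the member and may only report that a binary file matches.

```shell
# zgrep_tree DIRECTORY PATTERN
# Hypothetical helper: search every *.gz file under DIRECTORY for a fixed string,
# printing FILE:LINE:TEXT for each hit.
zgrep_tree() {
    find "$1" -type f -name '*.gz' -exec zgrep -n -H -F -e "$2" {} +
}

# Example:
#   zgrep_tree authors/ 'Test::Version'
```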

0

Maybe my answer will be helpful to someone:

#!/bin/bash
# Arguments: $1 = directory to search, $2 = pattern, $3 = optional grep option(s)

findpath=$(echo "$1" | sed -r 's|(.*[^/]$)|\1/|')

# archives to check in
find "$findpath" -type f | while IFS= read -r tarball; do

    if [ -n "$(file --mime-type "$tarball" | grep -e "application/jar")" ]; then

        # jar: get list of files in the archive (not dirs ending in /):
        jar tf "$tarball" | grep -v '/$' | while IFS= read -r file; do
            # get contents of file and look for string
            grepout=$(unzip -q -c "$tarball" "$file" | grep $3 -e "$2")

            if [ -n "$grepout" ]; then
                echo "*** $tarball has matching file ($file):"
                echo "$grepout"
            fi

        done

    elif tar -tf "$tarball" >/dev/null 2>&1; then

        # tarball: get list of files in the archive (not dirs ending in /):
        tar -tf "$tarball" | grep -v '/$' | while IFS= read -r file; do
            # get contents of file and look for string
            grepout=$(tar -xOf "$tarball" "$file" | grep $3 -e "$2")

            if [ -n "$grepout" ]; then
                echo "*** $tarball has matching file ($file):"
                echo "$grepout"
            fi

        done

    else
        # not an archive: grep the file itself
        file=""
        grepout=$(grep $3 -e "$2" "$tarball")

        if [ -n "$grepout" ]; then
            echo "*** $tarball has matching:"
            echo "$grepout"
        fi

    fi

done
0

Here's a dash script (also tested on bash, zsh and ksh) that can search inside .zip, .bz2, .xz, .tar.*, .tgz, .tar and .gz archives.

Run it with no parameters and it will ask for the necessary information (it reads input from the keyboard):

#!/bin/dash

PrintInTitle () {
    printf "\033]0;%s\007" "$1"
}

PrintJustInTitle () {
    PrintInTitle "$1">"$print_to_screen"
}

CleanUp () {
    trap - INT
    trap - TSTP
    if [ -n "$TEP" ] && [ -n "$TEF" ]; then
        rm -R -f "$output_dir/"*
    fi
    unset IFS
    PrintJustInTitle ""
    if [ "$1" = "1" ]; then
        printf "Aborted\n">"$print_to_screen"
        kill -s PIPE -- -$$ 2>/dev/null
    fi
}

StoreArchiveFilePath () {
    eval k=$(($k + 1))
    eval archive_files_$k=\"\$archive_file\"
}

PrintMatch () {
    printf '\n%s\n\n' "$search_path$current_archive_file/$inside_current_archive_file"|grep --color -F "$search_path$current_archive_file/$inside_current_archive_file"
    for i in $(seq 1 $search_strings_0); do
        eval current_search_string=\"\$search_strings_$i\"
        printf '\n%s\n\n' "$current_search_string:"|grep --color -F "$current_search_string:"
        cat "$inside_current_archive_file"|grep -i -n -F "$current_search_string" 2>/dev/null
    done
}

GetCurrentContent () {
    cd "$TEP" && {
        eval current_archive_file=\"\$$2\"
        case "$current_archive_file" in
            *'.zip' )
                unzip "$full_search_path/""$current_archive_file" -d "$TEF" >/dev/null 2>/dev/null
            ;;
            *'.bz2' )
                bzip2 "$full_search_path/""$current_archive_file" -d "$TEF" >/dev/null 2>/dev/null
            ;;
            *'.xz' )
                xz "$full_search_path/""$current_archive_file" -d "$TEF" >/dev/null 2>/dev/null
            ;;
            *'.tar.'* | *'.tgz' | *'.tar' )
                tar -xvf "$full_search_path/""$current_archive_file" -C "$TEF" >/dev/null 2>/dev/null
            ;;
            *'.gz' )
                cp "$full_search_path/""$current_archive_file" "./$TEF"; gzip -d "./$TEF/""$current_archive_file" >/dev/null 2>/dev/null
            ;;
        esac

        cd "$output_dir"
        for inside_current_archive_file in $(for t in $(seq 1 $inside_archive_file_path_filters_0); do eval current_inside_archive_file_path_filter=\"\$inside_archive_file_path_filters_$t\"; eval find . -type f -path "$current_inside_archive_file_path_filter"|sort --numeric-sort; done;); do
            gcc_found="true"
            for i in $(seq 1 $search_strings_0); do
                eval current_search_string=\"\$search_strings_$i\"
                gcc_stored_content="$(cat "$inside_current_archive_file"|grep -i -n -F "$current_search_string" 2>/dev/null;)";
                if [ -z "$gcc_stored_content" ]; then
                    gcc_found="false"
                    break
                fi
            done
            if [ "$gcc_found" = "true" ]; then
                if [ "$1" = "StoreArchiveFilePath" ]; then
                    StoreArchiveFilePath
                    break
                elif [ "$1" = "PrintMatch" ]; then
                    PrintMatch
                fi
            fi
        done
        if [ -n "$TEP" ] && [ -n "$TEF" ]; then rm -R -f "$output_dir/"*; fi
        cd "$full_search_path"
    }
}

set +f #Enable globbing (POSIX compliant)
setopt no_nomatch 2>/dev/null #Enable globbing (zsh)

IFS='
'
print_to_screen='/dev/tty'
initial_dir="$PWD"

case "$(uname -s)" in
    "Linux" )
        TEP='/dev/shm' #TEMPORARY_EXTRACT_PATH
    ;;
    "Darwin" | "BSD" | * )
        TEP="$HOME" #TEMPORARY_EXTRACT_PATH
    ;;
esac
TEF='TEMP_EXTRACT_FOLDER' #TEMP_EXTRACT_FOLDER

output_dir=""

error="false"
{
    cd "$TEP" && {
        if [ ! -e "$TEF" ]; then
            printf '%s\n' "The specified temporary directory: \"$TEF\" - does not exist in the specified location: \"$TEP\" - do you want to create it? [ Yes / No ] (default=Enter=No): ">"$print_to_screen"
            read answer
            if [ "$answer" = "Yes" ] || [ "$answer" = "yes" ] || [ "$answer" = "Y" ] || [ "$answer" = "y" ]; then
                mkdir "$TEF" || error="true"
            fi
        fi
        cd "$TEF" && output_dir="$PWD" || error="true"
    } || error="true"
} 2>/dev/null
if [ "$error" = "true" ]; then
    printf '%s\n' "Error: Could not access temporary folder \"$TEF\" in the extract location: \"$TEP\"!">&2
    read temp
    exit 1
fi

trap 'CleanUp 1' INT
trap 'CleanUp 1' TSTP

cd "$HOME"
printf '%s\n' "Search Path (blank=default=current folder=$PWD): "
read search_path
if [ -z "$search_path" ]; then
    search_path="."
fi

printf '\n%s\n' "Inside archive file path filters: (what file path to lookup inside the archive) (concatenated internaly with logical OR) (default=Enter='*'): ">"$print_to_screen"
i=0
while [ "1" = "1" ]; do
    printf '%s' ">> inside archive path filter: >> "
    IFS= read -r current_inside_archive_file_path_filter
    unset IFS
    if [ -z "$current_inside_archive_file_path_filter" ]; then
        break
    fi
    i=$(($i + 1))
    eval inside_archive_file_path_filters_$i=\"\$current_inside_archive_file_path_filter\"
done
inside_archive_file_path_filters_0=$i
unset IFS #Reset IFS
if [ "$inside_archive_file_path_filters_0" = "0" ]; then
    inside_archive_file_path_filters_1="'"'*'"'"
    inside_archive_file_path_filters_0="1"
fi

printf '\n%s\n' "Search strings (concatenated internaly with logical AND):">"$print_to_screen";
i=0
while [ "1" = "1" ]; do
    printf '%s' ">> add search string: >> "
    IFS= read -r current_search_string
    unset IFS
    if [ -z "$current_search_string" ]; then
        break;
    fi;
    i=$(($i + 1))
    eval search_strings_$i=\"\$current_search_string\"
done
search_strings_0=$i
if [ "$search_strings_0" = "0" ]; then
    search_strings_1=""
    search_strings_0=1
fi

i=0
PrintJustInTitle "Loading list of archive files to analyze..."

IFS='
'
cd "$search_path" && {
    full_search_path="$PWD"
    {
        find . \( -type f -path '*.zip' -o -path '*.bz2' -o -path '*.xz' -o -path '*.tar.*' -o -path '*.tgz' -o -path '*.tar' -o -path '*.gz' \) -exec printf "%s\n" "{}" \;|sort --numeric-sort;
        printf '%s\n' "..."; index=defined; while [ -n "$index" ]; do read index; printf '%s\n' $index; done;
    }|{
        j=0; k=0
        while read -r line; do
            j=$(($j + 1))
            if [ "$line" = "..." ]; then
                break
            else
                PrintJustInTitle "Analyzing archive file $j..."
                archive_file="$line"
            fi
            GetCurrentContent StoreArchiveFilePath archive_file
        done
        archive_files_0="$k"
        PrintJustInTitle ""

        if [ ! "$k" = "0" ]; then
            index="defined"
            count="$archive_files_0"
            while [ -n "$index" ]; do
                for i in $(seq 1 $count); do
                    eval current_archive_file=\"\$archive_files_$i\"
                    printf '\033[0;31m%s\033[0m\n' "$i = ""$current_archive_file"
                done
                printf '\n%s\n' "Print results (one at a time - blank = exit)?: [ 1 - $count ]: "
                read index
                if [ -z "$index" ]; then
                    break
                fi
                printf '%s\n' '---------------------------------------------------------------'
                GetCurrentContent PrintMatch archive_files_$index
                printf '%s\n\n' '---------------------------------------------------------------'
            done
        fi
    }
}

PrintJustInTitle ""
CleanUp

It consumes a lot of resources but the output is detailed.

0

In zsh and with bsdtar + GNU tar + GNU grep, that could be:

set -o extendedglob
for f (**/*.(#i)(zip|tar|(t|tar.)(xz|gz|bz2))(N.))
  bsdtar cf - @$f | ARCHIVE=$f tar -xf - --to-command='
    if [ "$TAR_FILETYPE" = f ]; then
      grep -H --label="$ARCHIVE[$TAR_FILENAME]" Test::Version
    fi
    true'

Where

  • the zsh glob looks for regular (. glob qualifier) unhidden files whose name ends in .zip, .tar, .tar.gz, .tgz... (case insensitively).
  • bsdtar converts the file to a ustar archive format that GNU tar supports
  • we use GNU tar's --to-command to pipe the contents of each file to grep
  • grep finds the matches, labelling them with file.gz[file/in/archive].
  • we terminate the --to-command script with true to avoid the warnings by tar when that command returns with a non-zero exit status.

To restrict the search to archive members whose name ends in .pm, you can change the script for --to-command to:

    case $TAR_FILETYPE$TAR_FILENAME in
      (f*.pm) grep -H --label="$ARCHIVE[$TAR_FILENAME]" Test::Version
    esac
    true