4

I have a folder with several files, each either a .xml or a .zip file. The .zip files contain .xml and/or .zip files, which may in turn contain .xml or .zip files, and so on, until we finally reach .xml files.

In other words, there can be several "levels" of zip before reaching my .xml files (see the example below).

My requirement is to detect which root ZIP files contain at least one XML file that is bigger than 100Mb. Such a ZIP file should be moved to another directory (let's say ~/big-files). Likewise, a non-zipped .xml file bigger than 100Mb should be moved to this directory.

For example:

foo1.xml
foo2.xml
baz.xml [MORE THAN 100Mb]
one.zip
  +- foo.xml
  +- bar.xml [MORE THAN 100Mb]
  +- foo.xml
two.zip
  +- foo.xml
  +- zip-inside1.zip
  |   +- bar.xml [MORE THAN 100Mb]
  +- foo.xml
three.zip
  +- foo.xml
  +- zip-inside2.zip
  |   +- zip-inside3.zip
  |       +- foo.xml
  |       +- bar.xml [MORE THAN 100Mb]
  +- foo.xml
four.zip
  +- foo.xml
  +- zip-inside1.zip
      +- foo.xml

In this example, baz.xml, one.zip, two.zip and three.zip should be moved to ~/big-files as they host at least one XML file bigger than 100Mb, but not four.zip.

How can I achieve that in bash shell?

Thanks.

George M
  • 13,959
  • I don't think you can. Certainly find does not look inside zip files. You will need to write a script to do this in a more powerful language (Python, Ruby, Perl, etc.) – James Youngman Jul 06 '12 at 14:47
  • @JamesYoungman - Find can look into zip files using -exec unzip -l. – jordanm Jul 06 '12 at 17:29
  • So you would need to use something like -exec sh -c "unzip -l $@ ... | grep" xxx. That's a (small) shell script. – James Youngman Jul 07 '12 at 17:09
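
The unzip -l idea from these comments can be sketched as a small recursive shell function. This is only an illustration, not code from the thread: the names has_big_xml and LIMIT are made up, and it assumes Info-ZIP's unzip (four-column listing: length, date, time, name), mktemp -d, and entry names without spaces or newlines.

```shell
#!/bin/sh
# Threshold in bytes; 100 MiB unless LIMIT is already set (assumed name).
LIMIT=${LIMIT:-$((100 * 1024 * 1024))}

# has_big_xml ZIP (hypothetical helper): exit status 0 if ZIP holds, at any
# nesting depth, an .xml entry whose uncompressed size exceeds LIMIT bytes.
# The ( ... ) body runs in a subshell, so variables stay local to each call.
has_big_xml() (
    zip=$1
    # "unzip -l" lists length, date, time, name. Compare the uncompressed
    # length of each .xml entry without extracting it.
    if unzip -l "$zip" 2>/dev/null | awk -v limit="$LIMIT" '
            $4 ~ /\.xml$/ && $1 + 0 > limit + 0 { found = 1 }
            END { exit !found }'; then
        exit 0
    fi
    # Extract only the nested zips to a scratch directory and recurse.
    tmp=$(mktemp -d) || exit 2
    unzip -q "$zip" '*.zip' -d "$tmp" >/dev/null 2>&1
    status=1
    while IFS= read -r inner; do
        [ -n "$inner" ] || continue
        if has_big_xml "$inner"; then status=0; break; fi
    done <<EOF
$(find "$tmp" -type f -name '*.zip')
EOF
    rm -rf "$tmp"
    exit "$status"
)
```

A possible top-level loop would be: for z in ./*.zip; do has_big_xml "$z" && mv "$z" ~/big-files/; done. Only nested zips are extracted, so large XML files never have to be written to disk.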

2 Answers

2

First, install AVFS, a filesystem that provides transparent access inside archives, and run the command mountavfs. See How do I recursively grep through compressed archives? for background.

After this, if /path/to/archive.zip is a recognized archive, then ~/.avfs/path/to/archive.zip# is a directory that appears to contain the contents of the archive.
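
To make the path mapping concrete, here is a tiny sketch. The helper name avfs_view is invented for illustration, and it assumes the default ~/.avfs mount point created by mountavfs and an absolute path as its argument.

```shell
# avfs_view ABSOLUTE_PATH (hypothetical helper): print the directory under
# ~/.avfs that exposes the contents of the archive at ABSOLUTE_PATH.
avfs_view() {
    printf '%s\n' "$HOME/.avfs$1#"
}
```

For example, ls "$(avfs_view /path/to/archive.zip)" would list the archive's members once AVFS is mounted. Appending # again to a zip seen inside that view is how the recursive script in this answer reaches nested archives.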

Write an auxiliary script called has_large_file_rec that looks for a large XML file in the zip file passed as argument and calls itself recursively on every embedded zip file. This script produces some output if it finds a large XML file inside. The output is truncated for efficiency, since once we've found one large XML file we might as well stop searching.

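#!/bin/sh
## auxiliary script has_large_file_rec (must be executable and on $PATH)
## Prints the first XML file over 100 MiB (102400 KiB) found in the archive "$1",
## descending into nested zip files through their AVFS "#" directories.
find "$1#" -name '*.zip' -type f -exec has_large_file_rec {} \; \
        -o -name '*.xml' -type f -size +102400k -print | head -n 1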

At the top level, if you find a large file, move it to the big file directory.

find ~/.avfs"$PWD" \
  -name '*.zip' -exec sh -c '
      a=$(has_large_file_rec "$0")
      if [ -n "$a" ]; then mv "$0" ~/big-files/; fi
                       ' {} \; -o \
  -name '*.xml' -type f -size +102400k -exec mv {} ~/big-files/ \;
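
A note on the -size unit, with a dry-run sketch: find's k suffix means KiB, so a 100 MiB threshold is 100 * 1024 = 102400k. The function below is illustrative only (list_big_xml and THRESHOLD_KIB are made-up names) and assumes a find that supports -maxdepth and the k suffix, as GNU and BSD find do. It prints candidates instead of moving them.

```shell
# 100 MiB expressed in find's "k" (KiB) units: 100 * 1024 = 102400.
THRESHOLD_KIB=$((100 * 1024))

# list_big_xml DIR (hypothetical helper): print, without moving anything,
# the .xml files directly inside DIR that are larger than the threshold.
list_big_xml() {
    find "$1" -maxdepth 1 -type f -name '*.xml' -size "+${THRESHOLD_KIB}k" -print
}
```

Running list_big_xml . before the real command is a cheap way to check which plain XML files would be moved.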
1

One way using perl.

Content of script.pl:

use warnings;
use strict;
use Archive::Extract;
use List::Util qw|first|;
use File::Copy qw|move|;
use File::Spec;
use File::Path qw|remove_tree|;

## Path to save 'xml' and 'zip' files.
my $big_files_dir = qq|$ENV{HOME}/big_files/|;

## Temp dir to extract files of 'zips'.
my $zips_path = qq|/tmp/zips$$/|;

## Size in bytes to check 'xml' files.
my $file_max_size_bytes = 100 * 1024 * 1024;

my (@zips_to_move, $orig_zip);

## Get files to process.
my @files = <*.xml *.zip>;

## From previous list, move 'xml' files bigger than size limit.
for my $file ( @files ) {
        if ( substr( $file, -4 ) eq q|.xml| and -s $file > $file_max_size_bytes ) {
                move $file, $big_files_dir;
        }
}

## Process now 'zip' files. For each one remove temp dir to avoid mixing files
## from different 'zip' files.
for ( grep { m/\.zip\Z/ } @files ) {
        remove_tree $zips_path;
        $orig_zip = $_;
        handle_zip_file( $orig_zip );
}

## Move the 'zip' files collected so far.
for my $zip_file ( @zips_to_move ) {
        move $zip_file, $big_files_dir;
}

## Traverse recursively each 'zip' file. It will look for a 'zip' file in the
## subtree and will extract all 'xml' files to a temp dir. Base case is when
## a 'zip' file only contains 'xml' files, then I will read the size of all 'xmls'
## and will mark the 'zip' to be moved if at least one of them is bigger than
## the size limit.
## To avoid an infinite loop searching into 'zip' files, I delete them just after
## the extraction of their content.
sub handle_zip_file {
        my ($file) = @_;

        my $ae = Archive::Extract->new(
                archive => $file,
                type => q|zip|,
        );

        $ae->extract(
                to => $zips_path,
        );

        ## Failures are not checked here; perhaps they should be.
        unlink( File::Spec->catfile( 
                                (File::Spec->splitpath( $zips_path ))[1], 
                                (File::Spec->splitpath( $file ))[2],
                        )
        );

        my $zip = first { substr( $_, -4 ) eq q|.zip| } <$zips_path/*>;
        if ( ! $zip ) {
                for my $f ( <$zips_path/*.xml> ) {
                        if ( substr( $f, -4 ) eq q|.xml| and -s $f > $file_max_size_bytes ) {
                                push @zips_to_move, $orig_zip;
                                last;
                        }
                }
                return;
        }

        handle_zip_file( $zip );
}

Some issues:

  • xml files with the same name in the subtree of a zip file will be overwritten when extracted to the temp dir.
  • This program extracts the content of all zip files of the same tree and only then checks for an xml bigger than 100MB. It would be faster to check each time a zip file is extracted. It can be improved.
  • It doesn't cache zip files processed more than once.
  • ~/big_files must exist and be writable.
  • The script doesn't accept arguments. You must run it in the same directory as the zip and xml files.

It's not perfect, as you can see from the points above, but it worked in my test. I hope it can be useful for you.

Run it like:

perl script.pl
Birei
  • 8,124