1

I have a folder with 77K PDFs (~500 GB in total). I want to zip them into 77 archives, each containing 1000 PDFs, so that they are easier to upload and share with peers. I don't know how to write for loops in bash or how to use the zip command there, but I saw some examples in this question. Can someone help me out here?

The filenames go like this:

FinalRoll_MR_ACNo_@PartNo_%%.pdf

The @ and %% are numbers. If I can zip the first 1000 files into a file like archive_1.tar.gz and so on, that'd be great! If the files can be kept in alphabetical order, that'll be even better!

I am using an AWS ec2 instance that's running Ubuntu.

Thanks in advance!

Vibhu
  • 111
  • You say "zip" (which is an available utility), but then show a sample output filename of .tar.gz, which implies a gzipped tar file. Which would you like? – Jeff Schaller Sep 13 '19 at 01:12
  • Hey! Thanks for pointing that out! I didn’t have any particular zipping format in mind while writing the question and any format will work for me. – Vibhu Sep 13 '19 at 05:48

4 Answers

1
#!/usr/bin/perl

use strict;
use warnings;
use List::MoreUtils qw(natatime);
use Sort::Naturally;

# specify directory on command line, or default to .
my $dir = shift || '.';

# Find all the PDF files. 
#
# NOTE: you could use perl's `File::Find` module instead of
# readdir() to do a recursive search like `find`.
opendir(DIR, $dir) || die "Can't open $dir: $!\n";
my @pdfs = nsort grep { /\.pdf$/i && -f "$dir/$_" } readdir(DIR);
closedir(DIR);

my $size=1000;

my $i=1;
my $iter = natatime $size, @pdfs;
while( my @tmp = $iter->() ){
  my $tarfile="archive_" . sprintf('%02i',$i++) . ".tar.gz";
  #print join(" ", ('tar','cfz',$tarfile, @tmp)),"\n";
  system('echo','tar','cfz',$tarfile, @tmp);
}

This uses the natatime() ("n-at-a-time") function in perl's List::MoreUtils library module to iterate over the list of PDF files 1000 at a time.

It also uses the Sort::Naturally module to natural-sort the PDF filenames. Drop that (and the call to nsort on the my @pdfs = ... line) if you don't need or want that.
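To see why natural sorting matters here, a quick shell illustration (made-up filenames; GNU sort's -V option approximates what Sort::Naturally does in Perl):

$ printf '%s\n' file2.pdf file10.pdf file1.pdf | sort
file1.pdf
file10.pdf
file2.pdf
$ printf '%s\n' file2.pdf file10.pdf file1.pdf | sort -V
file1.pdf
file2.pdf
file10.pdf

Plain lexical sorting puts file10.pdf before file2.pdf because it compares character by character; natural sorting compares the embedded numbers numerically, which is almost certainly what you want for PartNo-style names.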

The tar filenames have 2-digit zero-padded numbers in them so that they sort correctly. Change it to 3 or more digits if you have enough PDF files to fill more than 99 tar archives.

The code, as written, is a dry-run. Delete the 'echo', from the system() function call to make it actually tar up the batches of PDF files.

For verbose output while running without the echo, uncomment the print statement. It would also be easy to make it print a timestamp, e.g. seconds since the epoch with the Perl built-in time(), or nicely formatted with the Date::Format module, e.g.:

print join(" ", (time(),'tar','cfz',$tarfile, @tmp)),"\n";

Save as, e.g., vibhu.pl, make it executable with chmod +x vibhu.pl. Here's a sample run (in a directory with only 10 ".pdf" files):

$ touch {1..10}.pdf
$ ./vibhu.pl 
tar cfz archive_01.tar.gz 1.pdf 2.pdf 3.pdf 4.pdf 5.pdf 6.pdf 7.pdf 8.pdf 9.pdf 10.pdf

If you change $size=1000 to, e.g., $size=3, you can see that it is actually doing N at a time pdf files:

$ ./vibhu.pl 
tar cfz archive_01.tar.gz 1.pdf 2.pdf 3.pdf
tar cfz archive_02.tar.gz 4.pdf 5.pdf 6.pdf
tar cfz archive_03.tar.gz 7.pdf 8.pdf 9.pdf
tar cfz archive_04.tar.gz 10.pdf

The List::MoreUtils and Sort::Naturally modules are available from CPAN. They may already be packaged for your distribution, e.g. on Debian or Ubuntu:

sudo apt-get install liblist-moreutils-perl libsort-naturally-perl
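If they aren't packaged for your system, a sketch of installing them straight from CPAN instead (assuming the cpan client is set up):

sudo cpan List::MoreUtils Sort::Naturally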
cas
  • 78,579
1

Using the bash shell, you could put the filenames into an array (the wildcard expansion sorts them alphabetically, in your locale's collation order), then slice out 1000 at a time in an indexed loop:

#!/bin/bash

filenames=( *.pdf )
# number of archives = number of files divided by 1000, rounded up
for (( index=1; index <= (${#filenames[@]} + 999) / 1000; index++ ))
do
  start=$(( (index-1) * 1000 ))
  tar czf archive"${index}".tar.gz "${filenames[@]:start:1000}"
done

The for loop runs as many times as needed to cover all of the files, 1000 per iteration. The start variable indicates where each array slice should begin. The tar command creates a gzipped tar file of the 1000 files in the array starting at start (or however many are left, on the last iteration).
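If the ${array[@]:offset:length} slice syntax is unfamiliar, here's a small illustration with made-up filenames:

$ filenames=( a.pdf b.pdf c.pdf d.pdf e.pdf )
$ echo "${filenames[@]:0:3}"
a.pdf b.pdf c.pdf
$ echo "${filenames[@]:3:3}"
d.pdf e.pdf

The slice simply stops at the end of the array, so the final batch contains whatever is left over.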

Jeff Schaller
  • 67,283
0

You can use this awk script to generate a shell script. Review compress.sh and then execute it:

ls *.pdf | awk 'BEGIN {ORS=""; print "#!/bin/sh"; } NR%1000 == 1 {  print "\nzip Archive_" NR ".zip"; } { print " \\\n" $0; }' > compress.sh
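To see the shape of the script it generates, here is what compress.sh would look like with five made-up PDFs and the batch size lowered from NR%1000 to NR%3. Note that each archive is numbered by the record number of the first file in its batch, so at full size the names run Archive_1.zip, Archive_1001.zip, Archive_2001.zip, and so on: increasing, but not consecutive.

#!/bin/sh
zip Archive_1.zip \
a.pdf \
b.pdf \
c.pdf
zip Archive_4.zip \
d.pdf \
e.pdf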
WhiteWind
  • 344
0

An alternative with find and xargs, because you shouldn't parse ls:

export numfile="$(mktemp)"
echo 0 > "$numfile"

find lots_of_files/ -name '*.pdf' -print0 \
| sort -V -z \
| xargs -0r -L 1000  \
bash -c 'NUM=$(cat "$numfile") ; ((NUM++)); echo "$NUM" > "$numfile"; \
  tar -czf archive_$(printf '%03d' "$NUM" ).tar.gz "$@"' tar_in_batches

rm "$numfile"
unset numfile

You'll get your archives nicely numbered with leading zeroes and the files within the archives will be in the right order, too.

This version won't break if there are spaces or newlines in your filenames.
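Once it has run, you can spot-check a batch with tar's listing mode, e.g.:

$ tar -tzf archive_001.tar.gz | head -n 3
$ tar -tzf archive_001.tar.gz | wc -l

The first command shows the leading filenames in the archive; the second counts the members, which should be 1000 for every full batch.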

markgraf
  • 2,860
  • 1
    nice idea. the find -print0 and xargs -0 are good, but piping the output of xargs into while read like that undoes the good work of NUL-separating the filenames. maybe run a script or sh -c '...' from xargs instead (script would have to use a tmpfile or something to maintain state for the NUM counter). – cas Sep 13 '19 at 08:36
  • @cas Good catch! I'll have to think about that... – markgraf Sep 13 '19 at 08:43
  • 1
    something like numfile="$(mktemp)"; echo 0 > "$numfile"; find ... -print0 | sort -V -z | xargs -0r -L 1000 sh -c 'NUM=$(cat "$1") ; ((NUM++)); echo "$NUM" > "$1"; shift; tar ... "$@"' sh "$numfile"; rm "$numfile" would do it. – cas Sep 13 '19 at 08:48
  • 1
    @cas Updated as per your suggestion. We need bash, though, sh didn't want to do ((NUM++)). Thanks a lot! – markgraf Sep 13 '19 at 09:51
  • 1
    +1 but that won't work quite right. you'll be losing the first filename argument because you haven't provided a name as the zeroth argument to bash -c. should be bash -c '...' arbitrary_name "$@". also, if you're going to export numfile, you should unset it when you no longer need it....that's why I had it as $1 for the script, to avoid the need for export & unset. – cas Sep 13 '19 at 10:07
  • i meant bash -c '...' arbitrary_name, without the "$@" on the end - didn't notice the error in time. – cas Sep 13 '19 at 10:14
  • e.g. running your script (edited to put echo before the tar) results in tar -czf archive_001.tar.gz ./2.pdf ./3.pdf ./4.pdf ./5.pdf ./6.pdf ./7.pdf ./8.pdf ./9.pdf ./10.pdf. Notice the missing 1.pdf. Running it with any word after the final ' of the bash script results in tar -czf archive_001.tar.gz ./1.pdf ./2.pdf ./3.pdf ./4.pdf ./5.pdf ./6.pdf ./7.pdf ./8.pdf ./9.pdf ./10.pdf – cas Sep 13 '19 at 11:06
  • I used The_Tarinator as the arbitrary process name, and that's how it would show up in ps, pgrep, etc. e.g. bash -c '....... tar -czf archive_$(printf '%03d' "$NUM" ).tar.gz "$@"' The_Tarinator – cas Sep 13 '19 at 11:13
  • And I was wondering... Note to self: give bash -c a name. Updated. Thanks again! – markgraf Sep 13 '19 at 11:46