Putting m files in a directory into n tar files by size?

Question

Does anyone know if it's possible to create n tar files (of roughly equal size) out of a larger collection of files in a directory, in such a way that they can be extracted individually?

I was looking at tar's --multi-volume option, but unfortunately it looks like ALL of the resulting tar files are needed to extract the original files. The same is true, even more so, of tar-ing and then split-ing the result.

If they didn't have to be roughly the same size, I'd say do ls | wc -l to get the number of files in the directory, then split the filenames into equal-sized sets (something like: ls | tail -n 900 | head -n 100), and pass those to tar. But you may end up with pretty large size variations.
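For reference, the count-based split described above could be sketched roughly like this. The function name split_names_by_count and the names.list. prefix are made up for illustration, and it assumes a find that supports -maxdepth and filenames without newlines. Because the per-list count is rounded up, it can produce fewer than N lists.

```shell
# split_names_by_count DIR N
# Writes the regular files directly inside DIR into at most N lists
# (names.list.aa, names.list.ab, ...) in the current directory,
# each holding roughly the same NUMBER of files (sizes are ignored).
split_names_by_count() {
    dir=$1
    n=$2
    total=$(find "$dir" -maxdepth 1 -type f | wc -l)
    [ "$total" -gt 0 ] || return 0
    # files per list, rounded up
    per=$(( (total + n - 1) / n ))
    find "$dir" -maxdepth 1 -type f | sort | split -l "$per" - names.list.
}
```

Each resulting list could then be handed to tar, but as said, the archive sizes may vary a lot.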

Any ideas?

Somewhat related: Tar directory into independent archive files no larger than a certain size. — Stephen Kitt, Sep 02 '16 at 21:05
This is a variation of the Knapsack problem. You won't find a simple solution to it. — Satō Katsura, Sep 02 '16 at 22:16

ilkkachu · Accepted Answer · 2016-09-11T07:55:13.123

You could write a script that looks at the sizes of the files and distributes them into bins, taking care not to exceed the maximum size. The optimal solution may not be simple, but a greedy algorithm should do.

A minor problem would be taking into account the bookkeeping space taken by tar in addition to the file contents. (Also, how to deal with directories and special files?)
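To give a rough idea of that bookkeeping space: in the classic tar/ustar layout, each member gets a 512-byte header and its data is padded up to a multiple of 512 bytes. A minimal sketch of that estimate follows. The helper name tar_entry_size is made up, and it ignores the extra records tar writes for long names or pax headers, as well as the padding at the end of the archive.

```shell
# tar_entry_size BYTES
# Prints the approximate space a regular file of BYTES bytes occupies
# inside a tar archive: one 512-byte header plus the file data rounded
# up to the next multiple of 512 bytes.
# Assumption: short file names, no long-name or pax extension records.
tar_entry_size() {
    echo $(( 512 + ( ($1 + 511) / 512 ) * 512 ))
}

tar_entry_size 1      # prints 1024
tar_entry_size 512    # prints 1024
tar_entry_size 513    # prints 1536
```

Using these padded sizes instead of the raw file sizes when filling the bins would keep the archives closer to the intended maximum, especially with many small files.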

A bigger problem will appear if you want to compress the archives. Since the usual idiom is to put the files together with tar and compress the tar file with a separate utility, it's not that simple to split the resulting archive along file boundaries. You'd pretty much need to know the compressed sizes of the files in advance. If you compress the files before tarring them together, you know the sizes, but lose the space advantage of compressing all the files in one go.

Actually, I made a simple awk script to do just that at some point. Code below, use with

find dir/ -printf "%s\t%p\n" | sort -n | awk -v max="$maxsizeinbytes" -f pack.awk

(Output goes to bins.list.NNN. No warranty, won't work with filenames containing whitespace, probably other bugs too etc.)

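#!/usr/bin/awk -f
# pack.awk: pack "size<TAB>path" lines into bins of at most max bytes each
{
    # a file larger than the bin size can never fit anywhere
    if ($1 > max) {
        printf "too big (%d, max %d): %s\n", $1, max, $2 > "/dev/stderr";
        exit 1;
    }
    # put the file into an existing bin that still has enough room
    for (x in bins) {
        if (free[x] >= $1) {
            bins[x] = bins[x] "\n" $2;
            count[x]++; free[x] -= $1;
            next
        }
    }
    # no existing bin had room: open a new one
    bins[++i] = $2; free[i] = max - $1; count[i] = 1;
}
END {
    for (i in bins) {
        printf "bin %d: entries: %d size: %d\n", i, count[i], max - free[i];
        # parentheses keep the redirection target portable across awks
        print bins[i] > ("bins.list." i)
    }
}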
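To turn the bins.list.N files into archives, something along these lines might do. This is a sketch assuming GNU tar: -T reads the member names from a file, and --no-recursion (placed before -T) keeps tar from descending into any directories that ended up in the lists. The function name make_archives and the archive.N.tar naming are made up for illustration.

```shell
# make_archives
# For every bins.list.N in the current directory, create archive.N.tar
# containing exactly the paths listed in it (one path per line).
make_archives() {
    for list in bins.list.*; do
        # skip the unexpanded pattern when no list files exist
        [ -e "$list" ] || continue
        n=${list##*.}
        tar -cf "archive.$n.tar" --no-recursion -T "$list"
    done
}
```

Run make_archives from the same directory the find command was run in, so that the paths in the lists resolve. Each archive can then be extracted on its own.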
