
Example: Say I have 20 000 images and I need to group them into folders in order to burn them to CDs (max 700 MB per folder).

General: I have N files and I need to group them into M groups, so that all groups are about the same size (as close as possible)

Splitting into M groups or splitting into groups of size M ... either would be fine.

It seems like such an easy task ... but how can I do it?

  • Possible duplicate of http://unix.stackexchange.com/questions/10158/splitting-large-directory-tree-into-specified-size-chunks – steve Dec 04 '15 at 12:49

1 Answer

Assumptions: you wish to split a folder containing thousands of files, totaling more than 700 MB, into individual directories of at most 700 MB each, ready for burning onto multiple CDs.

On Linux, you can use a tool like dsplit (a Python script) or dirsplit, which is part of the genisoimage package (on Debian / Ubuntu). If you prefer Windows / Wine, you can use an application like Folder Axe.

Examples

Test scenario

# Create 2000 sparse files of 1 MB each.
mkdir allimages && cd $_
for i in {1..2000}; do
    dd if=/dev/zero of=image$i.jpg bs=1 count=0 seek=1M
done

I now have 2000 files (2GB) that I want to split across 3 directories.

$ ls -la | tail
-rw-rw-r--  1 cmihai cmihai 1048576 Dec  4 12:54 image992.jpg
-rw-rw-r--  1 cmihai cmihai 1048576 Dec  4 12:54 image993.jpg

Install dirsplit. On Ubuntu, this is included in the genisoimage package.

$ apt-cache search dirsplit
genisoimage - Creates ISO-9660 CD-ROM filesystem images

$ sudo apt-get install genisoimage

dirsplit

# Get usage / help
dirsplit -H

# Dry run (get list of changes):
dirsplit --no-act --size 700M --expmode 1 allimages/

# Actual run:
$ dirsplit --size 700M --expmode 1 allimages/
Building file list, please wait...
Calculating, please wait...
....................
Calculated, using 3 volumes.
Wasted: 105254 Byte (estimated, check mkisofs -print-size ...)

# The list of files per volume is written to catalog files, which you can use with mkisofs.
$ ls
allimages  vol_1.list  vol_2.list  vol_3.list
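
To turn a catalog into a burnable image, you can feed it to genisoimage / mkisofs. This is a minimal sketch; the exact options depend on the --expmode used, so check man dirsplit and man genisoimage, and vol_1.iso is just an example output name:

# Build one ISO per catalog. The catalogs are written in mkisofs path-list /
# graft-point syntax, hence -graft-points (verify this matches your expmode).
$ genisoimage -r -J -graft-points -path-list vol_1.list -o vol_1.iso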

dsplit

Note: by default, the files are hard-linked to the source.

$ wget https://raw.githubusercontent.com/mayanez/dsplit/master/dsplit.py

$ python dsplit.py -s 700 -v allimages/ out/
Volume 01:
  allimages/: 700 files (700.00 MB).
Total: 700.00 MB (700 files, 1 dirs)
Volume 02:
  allimages/: 700 files (700.00 MB).
Total: 700.00 MB (700 files, 1 dirs)
Volume 03:
  allimages/: 600 files (600.00 MB).
Total: 600.00 MB (600 files, 1 dirs)
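
As a quick sanity check (assuming out/ is the destination directory passed to dsplit above), you can compare the per-volume sizes. Since the test files are sparse, --apparent-size reports the 1 MB logical size rather than the blocks actually allocated:

# Summarize the apparent size of each volume directory dsplit created under out/.
$ du -sh --apparent-size out/*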

Gotchas:

  • I've used sparse files in my test - you'll want to check how dsplit / dirsplit handle sparse files, hardlinks and softlinks.
  • Hello, Mihai, I'm running Debian ... so (Linux / UNIX) ... your dirsplit solution seems to use a mathematical approach, and that sounds good. It states that it splits them "into approximately uniformly distributed subsets". Will this be the most uniform distribution? Is there an algorithm that could give us "the most uniform distribution"? – Tancredi-Paul Grozav Dec 04 '15 at 13:02
  • Using multi-volume tar / dar / whatever archives would probably be the most 'efficient' way to archive this, but that's not as flexible...

    dirsplit -H also provides an example of how to 'compare the required size of the created catalogs'; take a look at that.

    – Criveti Mihai Dec 04 '15 at 13:17
  • Wow, thank you for taking the time to post these examples and test scenarios. I'll need some time to process this ^.^ – Tancredi-Paul Grozav Dec 04 '15 at 13:30
  • OK, so dsplit is clearly not "the best" choice, because the "most uniform" distribution in this case would be 666+667+667, not 700+700+600. – Tancredi-Paul Grozav Dec 04 '15 at 13:57
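
Regarding the "most uniform distribution" question in the comments: when all files are the same size, the optimum is just an even split of the counts (666+667+667 here). When file sizes differ, balancing M groups exactly is the (NP-hard) multiway partitioning problem, but a simple greedy heuristic gets close: sort the files by size, largest first, and always put the next file into the group whose total is currently the smallest. Below is a rough sketch of that idea, not something dirsplit or dsplit do; the group_N directories and the use of symlinks are my own choices:

# Greedy split of allimages/ into N groups of roughly equal total size.
N=3
declare -a total
for ((g=1; g<=N; g++)); do
    mkdir -p "group_$g"
    total[g]=0
done

# GNU stat prints "size name" for each file; sort numerically, largest first.
stat -c '%s %n' allimages/* | sort -rn | while read -r size file; do
    # pick the group whose running total is currently the smallest
    min=1
    for ((g=2; g<=N; g++)); do
        (( total[g] < total[min] )) && min=$g
    done
    ln -s "$(readlink -f "$file")" "group_$min/"   # symlink; use mv or cp instead to move/copy
    (( total[min] += size ))
done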