13

I have a directory tree that I would like to back up to optical disks. Unfortunately, it exceeds the size of any one disk (it's about 60GB). I am looking for a script that would split this tree into appropriately sized chunks with hard links or whatnot (leaving the original untouched). I could then feed these bite-size trees into the backup process (add PAR2 redundancy, etc.).

It's not a fancy script, but it seems like it might have already been done. Suggestions?

(Spanning and writing in one step is a no-go because I want to do more stuff before the files get burned.)

Reid

7 Answers

8

There exists an application designed for this: dirsplit

It usually lives in cdrkit or dirsplit packages.

It can create ready-to-use folders with links to easily create DVDs with K3b or other GUI software

6

You can also try fpart, a tool I've written (BSD-licensed): https://sourceforge.net/projects/fpart/

Martymac
2

I once made an ugly script for a similar purpose. It is just a kludge, and when I wrote it I didn't care about execution time or prettiness. I'm sure there are more "productified" versions of the same concept around, but if you want some ideas or something to start hacking on, here goes (I did it in 2008, so use at your own risk!) :-)

#!/bin/sh -
# Split the top-level entries of $REPO into DVD-sized directories of symlinks.
REPO=/export/foton/PictureStore
LINKS=/export/foton/links
SPLITTIX=`date '+%y%m%d-%H%M'`

# kilobytes
DVDSIZE=4400000
PARTPREFIX="DVD-"
# informational only; not used below
REPOSIZE=`du -sk -- "${REPO}" | awk '{print $1}'`
NUMPARTS=`expr $REPOSIZE / $DVDSIZE`
SPLITDIR=${LINKS}/splits/${SPLITTIX}
mkdir -p -- "$SPLITDIR"

# PARTCOUNT is the numeric counter, PARTNUM its zero-padded decimal label.
# (Formatting the label as hex and feeding it back to expr breaks at part 10.)
PARTCOUNT=1
PARTSIZ=0
DONESIZ=0
PARTNUM=`printf '%03d' "$PARTCOUNT"`
mkdir -p -- "${SPLITDIR}/${PARTPREFIX}${PARTNUM}"
for D in "${REPO}"/..?* "${REPO}"/.[!.]* "${REPO}"/*
do
  if [ ! -e "$D" ]; then continue; fi  # skip ..?*, .[!.]* and * if there are no matching files
  D=${D#"$REPO"/}
  D_SIZ=`du -sk -- "${REPO}/$D" | awk '{print $1}'`
  if test `expr $D_SIZ + $PARTSIZ` -le $DVDSIZE
  then
    # link to D in this part
    ln -s -- "$REPO/$D" "${SPLITDIR}/${PARTPREFIX}${PARTNUM}/$D"
    # adjust counters
    PARTSIZ=`expr $PARTSIZ + $D_SIZ`
    DONESIZ=`expr $DONESIZ + $D_SIZ`
  else
    # next part and link to D in that
    echo "PART $PARTNUM: $PARTSIZ kb (target $DVDSIZE kb)"
    PARTCOUNT=`expr $PARTCOUNT + 1`
    PARTNUM=`printf '%03d' "$PARTCOUNT"`
    PARTSIZ=$D_SIZ
    DONESIZ=`expr $DONESIZ + $D_SIZ`
    mkdir -p -- "${SPLITDIR}/${PARTPREFIX}${PARTNUM}"
    ln -s -- "$REPO/$D" "${SPLITDIR}/${PARTPREFIX}${PARTNUM}/$D"
  fi
done
# report the last part too
echo "PART $PARTNUM: $PARTSIZ kb (target $DVDSIZE kb)"
echo "wrote $DONESIZ kb in $PARTCOUNT parts in $SPLITDIR"

I think I had the result shared through Samba to a Windows host that burned discs from it. If you use the above unaltered, you may wish to use mkisofs or another archiver that resolves symlinks.
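Since the question mentions hard links, here is a minimal sketch of how a part could be populated with hard links instead of symlinks, so that the burning tool does not need to resolve anything. The function name hardlink_into is made up for illustration; the sketch assumes GNU cp (for -a and -l) and that the source and the destination live on the same filesystem.

```shell
#!/bin/sh
# hardlink_into SRC DESTDIR   (hypothetical helper, not part of the script above)
# Recreate SRC (a file or a whole directory tree) inside DESTDIR, using hard
# links for the files instead of copying their data.
# Assumptions: GNU cp, and SRC and DESTDIR are on the same filesystem.
hardlink_into() {
  mkdir -p -- "$2" && cp -al -- "$1" "$2"/
}

# Example (paths are placeholders):
#   hardlink_into /export/foton/PictureStore/2008 /export/foton/links/DVD-001
```

In the script above, such a call would stand in for the two ln -s lines. Keep in mind that hard-linked files share their data with the originals, so editing a file inside a part also changes the original.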

MattBianco
  • I've made a few changes to your script to cope with special characters in file names (whitespace, initial dashes and dots, \[?*). Suggested reading: don't parse the output of ls, $VAR vs ${VAR} and to quote or not to quote. Note that I haven't tested the resulting script. If you don't understand one of my changes, feel free to ask. – Gilles 'SO- stop being evil' Mar 28 '11 at 19:03
  • @Gilles: I've done plenty of reading since 2008 ;-) Changes to make the script more generic are good. (I dislike the introduction of [ as opposed to test, though)... – MattBianco Mar 29 '11 at 14:54
  • You should lower case most of those variables. By convention, we capitalize environment variables (PAGER, EDITOR, SHELL, ...) and internal shell variables. All other variable names should contain at least one lowercase letter. This convention avoids accidentally overriding environmental and internal variables. – Chris Down Sep 18 '11 at 21:56
2

I once wrote a script to solve a similar problem -- I called it "distribute" (you can read the main code of the script or the file with the help message, or download it as a package); from its description:

distribute -- Distribute a collection of packages on multiple CDs (especially good for future use with APT)

Description: `distribute' program makes doing the tasks related to creating a CD set for distribution of a collection of packages easier. The tasks include: laying out the CDs filesystem (splitting the large amount of packages into several discs etc.), preparing the collection for use by APT (indexing), creating ISO images and recording the discs.

Periodical updates to the initially distributed collection can be issued with help of `distribute'.

It does the whole process in several stages: at one stage, it creates the future disk "layouts" by using symlinks to the original files, so you can intervene and change the future disk trees.

The details about its usage can be read in the help message printed by the script (or by looking into the source code).

It was written with a trickier use case in mind (issuing updates as a "diff", i.e. the set of newly added files, to the originally recorded collection of files), so it includes one extra initial stage, namely "fixing" the current state of the collection of files. For simplicity, it does this by replicating the original collection by means of symlinks in a special working place for saving the states of the collection; some time in the future, it can then create a diff between the then-current state of the collection and this saved state. So, although you might not need this feature, you can't skip this initial stage, AFAIR.

Also, I'm not sure now (I wrote it quite a few years ago) whether it treats complex trees well, or whether it is supposed to split only plain (one-level) directories of files. (Please look into the help message or the source code to be sure; I'll look this up too, a bit later, when I have some time.)

The APT-related stuff is optional, so you can ignore its ability to prepare package collections for use by APT if you don't need it.

If you get interested, of course, feel free to rewrite it to your needs or suggest improvements.

(Please note that the package includes additional useful patches not applied in the presented code listing at the Git repo linked above!)

2

We shouldn't forget that the essence of the task is quite simple; as a Haskell tutorial (which is built around incrementally refining a solution to this very task) puts it:

Now let's think for a moment about how our program will operate and express it in pseudocode:

main = Read list of directories and their sizes.
       Decide how to fit them on CD-Rs.
       Print solution.

Sounds reasonable? I thought so.

Let's simplify our life a little and assume for now that we will compute directory sizes somewhere outside our program (for example, with "du -sb *") and read this information from stdin.

(from Hitchhikers guide to Haskell, Chapter 1)
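The same three steps can be sketched in plain shell and awk. This is not the tutorial's code, only a minimal illustration of the pseudocode: the function name pack_in_order is made up, sizes are taken in kilobytes from du -sk rather than du -sb, and entries are packed strictly in the order they arrive.

```shell
#!/bin/sh
# pack_in_order CAPACITY_KB   (hypothetical helper)
# Reads "size<TAB>name" lines, as printed by du -sk, on stdin and prints
# "disk<TAB>size<TAB>name". A new disk is started whenever the next entry
# would overflow the current one; entries keep their input order.
pack_in_order() {
  awk -F '\t' -v cap="$1" '
    BEGIN { disk = 1; used = 0 }
    {
      size = $1 + 0
      # Everything after the first tab is the name (it may contain spaces).
      name = substr($0, index($0, "\t") + 1)
      # Open a new disk only if the current one already holds something,
      # so an oversized entry still ends up on a disk of its own.
      if (used > 0 && used + size > cap) { disk++; used = 0 }
      used += size
      printf "%d\t%d\t%s\n", disk, size, name
    }'
}

# Example (path is a placeholder):
#   (cd /path/to/tree && du -sk -- *) | pack_in_order 4400000
```

The output is only a plan; creating the per-disk directories and links from it is a separate step, which leaves room to edit the plan by hand first. Names containing newlines are not handled.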

(Additionally, in your question, you'd like to be able to tweak (edit) the resulting disk layouts, and then use a tool to burn them.)

You could re-use (adapt and re-use) a simple variant of the program from that Haskell tutorial for splitting your file collection.

Unfortunately, the distribute tool that I've mentioned in another answer here does not match the simplicity of the essential splitting task: its user interface is complex and bloated, because it was written to combine several tasks (performed in stages, but still not combined in the cleanest way I can think of now).

To help you make some use of its code, here's an excerpt from the bash-code of distribute (at line 380) that serves to do this "essential" task of splitting a collection of files:

# Splitting:

function splitMirrorDir() {
  if [[ ! -d "$THIS_BASES_DIR/$BASE/$type" ]]; then
    echo $"No base fixed for $type" >&2
    exit 1
  fi

  # Getting the list of all suitable files:
  local -a allFiles
  let 'no = 0' ||:
  allFiles=()
  # no points to the next free position in allFiles
  # allFiles contains the constructed list
  for p in "$THIS_BASES_DIR/$BASE/$type"/*.rpm; do
    if [[ ! -e "$p" ]]; then
      # fail on non-existent files
      echo $"Package file doesn't exist: " "$p" >&2
      return 1
    fi
    if [[ "$ONLY_REAL_FILES" == "yes" && ! -f "$p" ]]; then
      continue
    fi
    if [[ "$DIFF_TO_BASE" ]]; then
      older_copy="$DIFF_TO_BASE/$type/${p##*/}" # using shell param expansion instead of `basename' to speed up
      if [[ -h "$older_copy" || -a "$older_copy" ]]; then
        continue
      fi
    fi
    allFiles[$(( no++ ))]="$p"
  done
  readonly -a allFiles

  # Splitting the list of all files into future disks:
  # 
  local -a filesToEat allSizes
  let 'no = 0' ||:
  filesToEat=()
  allSizes=($(getSize "${allFiles[@]}"))
  readonly -a allSizes
  # allSizes contains the sizes corresponding to allFiles
  # filesToEat holds the constructed list of files to put on the current disk
  # no is the index of the current file in allFiles/allSizes; it also serves as
  #  the next position in filesToEat (a sparse array, so gaps are harmless)
  # totalsize should hold the sum of the sizes
  #  of the files already put into filesToEat;
  #  it is set and reset externally.
  for p in "${allFiles[@]}"; do
    if (( totalsize + ${allSizes[$(( no ))]} > CDVOLUME )); then
      eatFiles "${filesToEat[@]}"
      filesToEat=()
      finishCD
      startTypedCD
    fi
    let "totalsize += ${allSizes[$(( no ))]}" ||:
    filesToEat[$(( no++ ))]="$p"
  done
  eatFiles "${filesToEat[@]}"
}

function eatFiles() {
    #{ oldIFS="$IFS"; IFS=$'\n'; echo "$FUNCNAME: args: " "$*" | head >&2;  IFS="$oldIFS"; }
    zeroDelimited "$@" | xargs -0 --no-run-if-empty \
    cp -s \
    --target-dir="$THIS_LAYOUTS_DIR/cd$(( cdN ))/$PREFIX/$type$DOT_SUFFIX"/ \
    --
}

function startTypedCD() {
#  set -x
  mkdir -p "$THIS_LAYOUTS_DIR/cd$(( cdN ))/$PREFIX/$type$DOT_SUFFIX"
  start_action $" %s with %s" "$(( cdN ))" "$type"
#  set +x
}

function finishCD() {

(read more after line 454)

Note that the eatFiles function prepares the layouts of the future disks as trees where the leaves are symlinks to the real files. So it meets your requirement that you should be able to edit the layouts before burning. The mkisofs utility has an option to follow symlinks, which is indeed employed in the code of my mkiso function.

The presented script (which you can take and rewrite to your needs, of course!) follows the simplest idea: it sums the sizes of files (or, more precisely, packages in the case of distribute) just in the order they were listed, without any rearrangement.

The "Hitchhikers guide to Haskell" takes the optimization problem more seriously and suggests program variants that try to rearrange the files smartly, so that they fit better on disks (and require fewer disks):

Enough preliminaries already. let's go pack some CDs.

As you might already have recognized, our problem is a classical one. It is called a "knapsack problem" (google it up, if you don't know already what it is. There are more than 100000 links).

let's start from the greedy solution...

(read more in Chapter 3 and further.)
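To give a flavour of what rearranging can look like, here is a minimal shell and awk sketch of first-fit decreasing, a common heuristic for this kind of packing. It is not the tutorial's Haskell code and it does not promise an optimal layout; the function name pack_first_fit is made up, and the input format ("size<TAB>name" lines, as printed by du -sk) is an assumption.

```shell
#!/bin/sh
# pack_first_fit CAPACITY_KB   (hypothetical helper)
# Reads "size<TAB>name" lines on stdin, sorts them by size (largest first)
# and puts each entry on the first disk that still has room for it.
# Prints "disk<TAB>size<TAB>name".
pack_first_fit() {
  sort -rn | awk -F '\t' -v cap="$1" '
    {
      size = $1 + 0
      name = substr($0, index($0, "\t") + 1)
      # First fit: take the lowest-numbered disk with enough room left.
      for (d = 1; d <= ndisks; d++)
        if (used[d] + size <= cap) break
      # No existing disk fits (or there are none yet): open a new one.
      if (d > ndisks) ndisks = d
      used[d] += size
      printf "%d\t%d\t%s\n", d, size, name
    }'
}

# Example (path is a placeholder):
#   (cd /path/to/tree && du -sk -- *) | pack_first_fit 4400000
```

Sorting first means the big entries claim disks early and the small ones fill the leftover gaps, which is why this tends to need fewer disks than packing in listing order. An entry larger than the capacity still gets a disk of its own, so the plan should be checked before burning.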

Other smart tools

I've been told also that Debian uses a tool to make its distro CDs that is smarter than my distribute w.r.t. collections of packages: its results are nicer because it cares about inter-package dependencies and would try to make the collection of packages that gets on the first disk closed under dependencies, i.e., no package from the 1st disk should require a package from another disk (or at least, I'd say, the number of such dependencies should be minimized).

1

backup2l can do a lot of this work. Even if you don't use the package directly, you might get some script ideas from it.

0

The rar archiver can be instructed to automatically split the archive it creates into chunks of a specific size with the -v<size> switch.

To archive a directory tree named foo into chunks of, say, 500 megabytes apiece, you'd specify
rar a backup.rar -v500m foo/

  • 2
    Then why rar? tar (+bz2) + split is a more native approach for *nix. – rvs Mar 28 '11 at 09:55
  • "bite-size trees" doesn't quite sound like rar, unless you unpack each "part" again into its own directory, which of course won't work, since the parts are not designed like that, and not split on file boundaries. – MattBianco Mar 28 '11 at 11:16
  • 1
    If talking about tools that give tar+split-like results, then there's also dar; here's the note about its relevant feature: "(SLICES) it was designed to be able to split an archive over several removable media whatever their number is and whatever their size is". Compared to tar+split, I assume, it allows some easier ways to access the archived files. (BTW, it has also a feature resembling distribute: "DIFFERENTIAL BACKUP" & "DIRECTORY TREE SNAPSHOT", but one may not like that the result is a special format, not an ISO with a dir tree.) – imz -- Ivan Zakharyaschev Mar 29 '11 at 23:43