
I am hoping to achieve some cleanup functionality on about 20 TB on my NAS, using rsync on Linux, by excluding whole directories (and their contents) whenever a directory contains a ".protect" file.

I generate really large caches in subfolders like

cache/simulation_v001/reallybigfiles_*.bgeo

cache/simulation_v002/reallybigfiles_*.bgeo

cache/simulation_v003/reallybigfiles_*.bgeo

and if a file exists like this: cache/simulation_v002/.protect

then I'd like to build an rsync operation to move all folders to a temporary /recycle location, excluding cache/simulation_v002/ and all of its contents.

I've done something like this before with python, but I'm curious to see if the operation can be simplified with rsync or another method.

  • rsync alone can't do this - but you could use find to construct an exclude file for rsync. e.g. starting with something like find . -name .protect -printf '%h/***\n' – cas Jul 31 '19 at 03:32
  • thanks I will try and experiment with that! – openCivilisation Aug 01 '19 at 05:28
  • this doesn't seem to work. Using the find command generates items in the list like ./simulation_v002/***, but rsync -a -m --remove-source-files --exclude-from='cache/exclude_list.txt' cache/ cache_trash then still ends up including files it shouldn't. Is it possible for find to generate simulation_v002/*** instead? – openCivilisation Aug 02 '19 at 13:31
  • use sed or something to edit find's output before saving to a file, e.g. sed -e 's=^\./=='. Don't expect one tool to do everything - it's normal to combine multiple small tools to achieve a desired result, each tool being good at its own job: find to get the list of files, sed to transform it into the required format, rsync to do the copy (see the sketch after these comments). – cas Aug 02 '19 at 14:28
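
A rough sketch of the find + sed + rsync combination described in these comments (the exclude_list.txt and cache_trash names are only examples, not anything fixed by the question):

# run from inside the cache directory
# 1. list every directory that contains a .protect file as an rsync exclude
#    pattern, stripping the leading "./" so it matches rsync's relative paths
find . -name .protect -printf '%h/***\n' | sed -e 's=^\./==' > exclude_list.txt

# 2. move everything that is not excluded to a trash location; review the
#    dry run first, then drop --dry-run to actually move the files
rsync -a -m --dry-run --remove-source-files \
      --exclude='exclude_list.txt' --exclude-from='exclude_list.txt' \
      ./ ../cache_trash/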

1 Answer


Thanks to tips from cas I was able to create this workflow to solve the problem with a bash script. It's not ideal, because it would be better if it did a true move for a faster operation (I wish rsync had this ability). The script searches below the current folder for .protect files with find, creates an exclusion list, then uses rsync from the base volume to move all other folders to a trash folder, retaining the full path underneath so that any mistakes can be restored non-destructively.

Link to the current state of this solution in the git dev branch - https://github.com/firehawkvfx/openfirehawk-houdini-tools/blob/dev/scripts/modules/trashcan.sh

#!/bin/bash

# trash everything below the current path that does not have a .protect file in
# the folder. It should normally only be run from a folder such as
# 'job/seq/shot/cache' to trash all data below this path.

# see opmenu and firehawk_submit.py for tools to add protect files based on
# a top net tree for any given hip file.

argument="$1"

echo ""

if [[ -z $argument ]] ; then
  echo "DRY RUN. To move files to trash, use the -m argument after reviewing exclude_list.txt and confirming it lists everything you wish to protect from being moved to the trash."
  echo ""
  ARGS1='--remove-source-files'
  ARGS2='--dry-run'
else
  case $argument in
    -m|--move)
      echo "MOVING FILES TO TRASH."
      echo ""
      ARGS1='--remove-source-files'
      ARGS2=''
      ;;
    *)
      echo "ERROR: Unknown argument: ${argument}" >&2
      exit 1
      ;;
  esac
fi

current_dir=$(pwd)
echo "current dir $current_dir"
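# take the first path component as the base volume, e.g. /prod/job/seq/shot/cache -> /prod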
base_dir=$(pwd | cut -d/ -f1-2)
echo "base_dir $base_dir"


source=$(realpath --relative-to=$base_dir $current_dir)/
echo "source $source"
target=trash/
echo "target $target"

# ensure trash exists at base dir.
mkdir -p $base_dir/$target
echo ""
echo "Build exclude_list.txt contents with directories containing .protect files"
find . -name .protect -print0 |
    while IFS= read -r -d '' line; do
        path=$(realpath --relative-to=. "$line")
        dirname $path
    done > exclude_list.txt

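# also exclude the exclude list itself so rsync does not move it to the trash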
path_to_list=$(realpath --relative-to=. exclude_list.txt)
echo $path_to_list >> exclude_list.txt

cat exclude_list.txt

cd $base_dir

# run this command from the drive root, eg /prod.
rsync -a $ARGS1 --prune-empty-dirs --inplace --relative --exclude-from="$current_dir/exclude_list.txt" --include='*' --include='*/' $source $target $ARGS2 -v
cd $current_dir
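
For reference, a typical session with the script above might look like this (the /prod path is only illustrative, assuming the script is saved as trashcan.sh somewhere on the PATH):

cd /prod/job/seq/shot/cache   # the level whose subfolders should be culled
trashcan.sh                   # dry run: builds and prints exclude_list.txt
# review exclude_list.txt, then move everything unprotected to /prod/trash/
trashcan.sh -m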
  • +1. you should use double quotes around your variables except where you actually want the shell's white space argument expansion to occur. and, since you're using bash, you should probably use an array for the rsync args rather than $ARGS1, $ARGS2 (see the sketch after these comments). – cas Aug 07 '19 at 03:31
  • Thanks cas, can you explain what you mean by 'except where you actually want the shell's white space argument expansion to occur' ? – openCivilisation Aug 12 '19 at 04:23
  • if you have a variable, say $ARGS, that contains word1 word2 word3, then wrapping it in double-quotes when you use it (e.g. newvar="$VAR") will prevent the shell from expanding it into 3 separate arguments. 99.999999% of the time, this is what you want. If you don't wrap it in double-quotes, then the shell will expand it into three arguments - this is useful in cases where you want to use that variable to hold one or more arguments to another command (e.g. rsync $ARGS ...). And in most of those cases, you're better off using an array (e.g. rsync "${ARGS[@]}" ...). – cas Aug 13 '19 at 08:43
  • more on "most of the time, this is what you want" - unquoted variables tend to result in unfortunate (possibly disastrous) consequences if variables contain unexpected spaces or other shell meta-characters. A (contrived) example: say DIR='foo bar', and you have a directory containing sub-directories foo, bar, and foo bar. rm -rf $DIR will delete directories foo and bar. rm -rf "$DIR" will delete the directory 'foo bar'. Spaces are important. So are other meta-characters like & or ;. Proper quoting is extremely important. – cas Aug 13 '19 at 08:49
  • There are lots of good questions and answers on this site about quoting variables (search for "quote variable"), but this explains it extremely well: https://unix.stackexchange.com/questions/131766/why-does-my-shell-script-choke-on-whitespace-or-other-special-characters/131767#131767 – cas Aug 13 '19 at 08:56
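
Following up on the array suggestion, a minimal sketch of how the script's rsync options could be held in a bash array instead of $ARGS1/$ARGS2 (the flag values are simply the ones the script above already uses):

# collect the rsync options in an array; each element remains a single word
RSYNC_ARGS=( -a --remove-source-files --prune-empty-dirs --inplace --relative -v )
if [[ -z "$argument" ]]; then
  RSYNC_ARGS+=( --dry-run )   # no argument given: keep it a dry run
fi

# "${RSYNC_ARGS[@]}" expands to exactly one word per element, so the options
# reach rsync intact without relying on unquoted expansion
rsync "${RSYNC_ARGS[@]}" --exclude-from="$current_dir/exclude_list.txt" "$source" "$target"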