Delete all but the most recent n files for each group of files that share the same prefix in a directory

Question

My question is a bit different from some older questions simply asking for "deleting all but the most recent n files in a directory".

I have a directory that contains different 'groups' of files where each group of files share some arbitrary prefix and each group has at least one file. I do not know these prefixes in advance and I do not know how many groups there are.

EDIT: actually, I do know something about the file names: they all follow the pattern prefix-some_digits-some_digits.tar.bz2. The only part that matters here is the prefix, and we can assume that no prefix contains a digit or a dash.

I want to do the following in a bash script:

Go through the given directory, identify all existing 'groups', and for each group of files, delete all but the most recent n files of the group only.
If there are fewer than n files in a group, do nothing for that group, i.e. do not delete any of its files.

What is a robust and safe way of doing the above in bash? Could you please explain the commands step-by-step?

Can we assume that the prefix doesn't end with digits or a dash? For example, could there be foo-1-1.tar.bz2 foo-1-1-1.tar.bz2 foo-1-2-3.tar.bz2 foo-1-2.tar.bz2 which mixes the prefixes foo and foo-1? If we can assume that there are no such cases, it's easier because each group is consecutive in lexical order. — Gilles 'SO- stop being evil', Nov 03 '15 at 22:08
@Gilles, yes, we can assume that no prefix ends with digits or a dash. — skyork, Nov 03 '15 at 22:17
Please show a sample directory listing (e.g. ls -1 | head -30) or upload a full directory listing to a pastebin site (e.g. pastebin.com) and post the link. — cas, Nov 03 '15 at 22:30

ferdy · Answer 1 · 2015-11-03T22:59:38.127

The script:

#!/bin/bash

# Get Prefixes

PREFIXES=$(ls | grep -E '^[^-]+-[0-9]+-[0-9]+\.tar\.bz2$' | awk -F'-' '{print $1}' | sort -u)

if [ -z "$1" ]; then
  echo need a number of keep files.
  exit 1
else
  NUMKEEP=$1
fi

for PREFIX in ${PREFIXES}; do

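  # Match "PREFIX-..." only, so that prefix foo does not also pick up foobar-*
  ALL_FILES=$(ls -t "${PREFIX}"-*-*.tar.bz2)

  if [ $(echo ${ALL_FILES} | wc -w) -lt $NUMKEEP ]; then
    echo Not enough files to be kept. Quit.
    continue
  fi

  KEEP=$(ls -t "${PREFIX}"-*-*.tar.bz2 | head -n${NUMKEEP})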

  for file in $ALL_FILES ; do
    if [[ "$KEEP" =~ "$file" ]]; then
      echo keeping $file
    else
      echo RM $file
    fi
  done
done

Explanation:

Calculate the prefixes:
- Look for all files matching the something-something-something.tar.bz2 regex, cut off everything after the first dash, and remove duplicates.
- The result is a normalized list, PREFIXES.
Iterate through all PREFIXES:
- Calculate ALL_FILES with PREFIX.
- Check if the number of ALL_FILES is less than the number of files to be kept. If so, we can stop here for this prefix: there is nothing to remove.
- Calculate the KEEP files, which are the most recent NUMKEEP files.
- Iterate through ALL_FILES and check whether each file is in the KEEP list. If it is not, remove it (the script as shown only prints RM; replace that echo with rm to actually delete).
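
The prefix step can also be sketched without parsing ls output, using a glob and shell parameter expansion. This is only a minimal sketch, not part of the original script: list_prefixes is a hypothetical helper name, and it assumes the names follow the prefix-digits-digits.tar.bz2 pattern from the question and contain no newlines.

```shell
#!/bin/bash
# Hypothetical helper: print each distinct prefix in the current directory once.
# Assumes names like prefix-digits-digits.tar.bz2 (the prefix contains no dash).
list_prefixes() {
  for f in *-*-*.tar.bz2; do
    [ -e "$f" ] || continue        # the glob matched nothing
    printf '%s\n' "${f%%-*}"       # strip everything from the first dash on
  done | sort -u                   # one line per prefix
}
```

Its output could stand in for the ls | grep | awk pipeline, for example PREFIXES=$(list_prefixes).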

Example result when running it:

$ ./remove-old.sh 2
keeping bar-01-01.tar.bz2
keeping bar-01-02.tar.bz2
RM bar-01-03.tar.bz2
RM bar-01-04.tar.bz2
RM bar-01-05.tar.bz2
RM bar-01-06.tar.bz2
keeping foo-01-06.tar.bz2
keeping foo-01-05.tar.bz2
RM foo-01-04.tar.bz2
RM foo-01-03.tar.bz2
RM foo-01-02.tar.bz2

$ ./remove-old.sh 8
Not enough files to be kept. Quit.
Not enough files to be kept. Quit.

Well, my answer is not correct since you want to delete nothing if len(KEEP) < 3 for this example. I'll look it up. — ferdy, Nov 03 '15 at 21:01
@ferdy, thanks to David's comments, I have clarified on the prefix part of my question, please see updated question above. — skyork, Nov 03 '15 at 21:09
@skyork You need a complete script for this, or are my cmdline snippets enough for you? — ferdy, Nov 03 '15 at 21:13
@ferdy, the way your solution works is for me to specify the prefix for each group. However, I do not know (at least not without manually noting down all existing prefixes) these prefixes in advance, and I'd like to run a script (or command) for the given directory and the script will figure out the rest. — skyork, Nov 03 '15 at 22:21

RobertL · Answer 2 · 2015-11-06T22:50:20.177

As requested, this answer tends towards "robust and safe", as opposed to quick & dirty.

Portability: This answer works on any system which contains sh, find, sed, sort, ls, grep, xargs, and rm.

The script should never choke on a large directory. No shell filename expansion is performed (expansion could fail if there are too many files, though that would take a huge number).

This answer assumes that the prefix will not contain any dash (-).

Note that, by design, the script only lists the files that would be removed. To make it actually remove them, pipe the output of the while loop to xargs -d '\n' rm --, which is commented out in the script. This way you can easily test the script before enabling the removal.

#!/bin/sh -e

NUM_TO_KEEP=$(( 0 + ${1:-64000} )) || exit 1

find . -maxdepth 1 -regex '[^-][^-]*-[0-9][0-9]*-[0-9][0-9]*.tar.bz2' |
sed 's/-.*//; s,^\./,,' |
sort -u |
while read prefix
do
    ls -t | grep  "^$prefix-.*-.*\.tar\.bz2$" | sed "1,$NUM_TO_KEEP d"
done # | xargs -d '\n' rm --

The N parameter (number of files to keep) defaults to 64000 (i.e. in practice all the files are kept).

Annotated Code

Get the command line argument and check that it is an integer by using it in an addition. If it is not given, the parameter defaults to 64000 (effectively all):

NUM_TO_KEEP=$(( 0 + ${1:-64000} )) || exit 1

Find all files in the current directory which match the filename format:

find . -maxdepth 1 -regex '[^-][^-]*-[0-9][0-9]*-[0-9][0-9]*.tar.bz2' |
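
To check which names such a pattern accepts, here is a small sketch using a roughly equivalent extended regular expression with grep -E. matches_format is a hypothetical helper, not part of the script; unlike the find pattern it tests the bare file name (no leading ./) and escapes the dots.

```shell
# Hypothetical helper: succeed if $1 looks like prefix-digits-digits.tar.bz2.
matches_format() {
  printf '%s\n' "$1" | grep -Eq '^[^-]+-[0-9]+-[0-9]+\.tar\.bz2$'
}

# Try a few sample names; only the first one follows the format.
for name in foo-01-02.tar.bz2 foo-bar-02.tar.bz2 foo-01.tar.bz2; do
  if matches_format "$name"; then
    echo "match:    $name"
  else
    echo "no match: $name"
  fi
done
```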

Get prefix: remove everything after the prefix and remove the "./" at beginning:

sed 's/-.*//; s,^\./,,' |

Sort the prefixes and remove duplicates (-u -- unique):

sort -u |

Read each prefix and process:

while read prefix
do

List all the files in directory sorted by time, select the files for the current prefix, and delete all lines beyond the files we want to keep:

    ls -t | grep  "^$prefix-.*-.*\.tar\.bz2$" | sed "1,$NUM_TO_KEEP d"
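
A quick way to see the effect of the sed step on its own: the sketch below feeds it a hand-written, newest-first list instead of real ls -t output. drop_first is a hypothetical name used only for illustration, and the file names are made up.

```shell
# Hypothetical helper: print stdin minus its first $1 lines,
# using the same sed expression the script builds from NUM_TO_KEEP.
drop_first() {
  sed "1,$1 d"
}

# With 2 files to keep, only the two oldest names are printed (for removal).
printf '%s\n' foo-01-06.tar.bz2 foo-01-05.tar.bz2 foo-01-04.tar.bz2 foo-01-03.tar.bz2 |
  drop_first 2
```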

For testing, leave the code that removes the files commented out. xargs is used to avoid any problems with command line length or with spaces in filenames, if any. If you want the script to produce a log, add -v to rm, e.g. rm -v --. Remove the # to enable the remove code:

done # | xargs -d '\n' rm --

If this works for you, please accept this answer and vote up. Thanks.

@don_crissti Thanks! I've updated the answer appropriately! — RobertL, Nov 06 '15 at 22:51

score 2 · Answer 3 · edited Apr 13 '17 at 12:36

I'll assume that the files are grouped together by prefix when listed in lexical order. This means that no group's prefix is another group's prefix followed by more dashes and digits, e.g. no foo-1-2-3.tar.bz2 (prefix foo-1) that would get in between foo-1-1.tar.bz2 and foo-1-2.tar.bz2 (prefix foo). Under this assumption, we can list all the files, and when we detect a change of prefix (or for the very first file), we have a new group.

#!/bin/bash
n=$1; shift   # number of files to keep in each group
shopt -s extglob   # enable +(...) patterns
previous_prefix=-
for x in *-+([0-9])-+([0-9]).tar.bz2; do
  # Step 1: skip the file if its prefix has already been processed
  this_prefix=${x%-+([0-9])-+([0-9]).tar.bz2}
  if [[ "$this_prefix" == "$previous_prefix" ]]; then
    continue
  fi
  previous_prefix=$this_prefix
  # Step 2: process all the files with the current prefix
  keep_latest "$n" "$this_prefix"-+([0-9])-+([0-9]).tar.bz2
done

Now we're down to the problem of determining the oldest files among an explicit list.

Assuming that the file names don't contain newlines or characters that ls doesn't display literally, this can be implemented with ls:

keep_latest () (
  n=$1; shift
  if [ "$#" -le "$n" ]; then return; fi
  unset IFS; set -f
  # Sort only the files passed as arguments, newest first
  set -- $(ls -td -- "$@")
  shift "$n"
  rm -- "$@"
)
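
Before letting a function like this call rm, it may help to try a print-only variant in a scratch directory. The following is a sketch under the same file-name assumptions; list_stale is a hypothetical name, and it passes the file list to ls explicitly so that only the given files are considered.

```shell
# Hypothetical dry-run variant: print the files that would be removed.
# Usage: list_stale N FILE...
list_stale() {
  local n=$1; shift
  # N files or fewer: nothing to report.
  if [ "$#" -le "$n" ]; then return 0; fi
  # ls -t lists its arguments newest first; tail skips the first N names.
  ls -td -- "$@" | tail -n "+$((n + 1))"
}
```

For example, list_stale 2 foo-*-*.tar.bz2 would print every foo file except the two most recently modified ones.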

score 1 · Answer 4 · answered Nov 04 '15 at 00:55

I know this is tagged bash but I think this would be easier with zsh:

#!/usr/bin/env zsh

N=$(($1 + 1))                         # calculate Nth to last
typeset -U prefixes                   # declare array with unique elements
prefixes=(*.tar.bz2(:s,-,/,:h))       # save prefixes in the array
for p in $prefixes                    # for each prefix
do
arr=(${p}*.tar.bz2)                   # save filenames starting with prefix in arr
if [[ ${#arr} -gt $1 ]]               # if number of elements is greater than $1
then
print -rl -- ${p}*.tar.bz2(Om[1,-$N]) # print all filenames but the most recent N 
fi
done

the script accepts one argument: n (the number of files)
(:s,-,/,:h) are glob modifiers, :s replaces the first - with / and :h extracts the head (the part up to the last slash which in this case is also the first slash as there's only one)
(Om[1,-$N]) are glob qualifiers, Om sorts the files starting with the oldest one and [1,-$N] selects from the first up to the Nth to last one
If you're happy with the result replace print -rl with rm to actually delete the files e.g.:

#!/usr/bin/env zsh

typeset -U prefixes
prefixes=(*.tar.bz2(:s,-,/,:h))
for p in $prefixes
arr=(${p}*.tar.bz2) && [[ ${#arr} -gt $1 ]] && rm -- ${p}*.tar.bz2(Om[1,-$(($1+1))])
