The task is to transform this text file, the output of the utility fslint, into a bash script containing rm command lines for the duplicate file(s) to delete and commented-out lines for the file(s) to keep, according to a set of rules. The rules basically say: delete duplicate files only in specific directories.
The goal is to clean up about 1 TB of duplicates accumulated over the years on several OSes (Mac OS X, Windows, Linux). All the data has been copied onto a single Linux drive.
#3 x 697,612,024 (1,395,236,864) bytes wasted
/path/to/backup-100425/file_a.dat
/another/path/to/backup-disk-name/171023/file_a.dat
/yet/another/path/to/labs data/some/path/file_a.dat
#4 x 97,874,344 (293,634,048) bytes wasted
/path/to/backup-100425/file b.mov
/another/path/to/backup-140102/file b.mov
/backup-120708/Library/some/path/file b.mov
/some/other/path/to/backup-current/file b.mov
#2 x 198,315,112 (198,316,032) bytes wasted
/path/to/backup-100425/file_c.out
/another/path/to/backup-disk-name/171023/file_c.out
The first line says there are 3 identical copies of file_a.dat, and the following 3 lines list their paths; ideally, 2 of the copies should be deleted here. Directories named with a 6-digit number (a date in YYMMDD format, e.g. backup-100425 or 171023) are what I call historical backup directories.
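For what it's worth, such historical backup directories can be picked out with an extended regular expression; the file name below is just a placeholder:
# Show the lines whose path goes through a historical backup directory,
# i.e. a path component that is six digits, or "backup-" followed by six digits.
grep -E '/(backup-)?[0-9]{6}/' fslint-output.txt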
The rules, to be applied in this order to each group of identical files, are:
- If a file is in a path that includes a Library directory, keep it.
- If a file is in labs data or backup-current, keep it, and delete all duplicates located in historical backup directories.
- If a file is in a historical backup directory, keep the copy in the newest backup directory and delete the older duplicates.
- Otherwise, keep the file(s).
Here is the desired output:
#!/bin/bash
#3 x 697,612,024 (1,395,236,864) bytes wasted
rm '/path/to/backup-100425/file_a.dat'
rm '/another/path/to/backup-disk-name/171023/file_a.dat'
#/yet/another/path/to/labs data/some/path/file_a.dat
#4 x 97,874,344 (293,634,048) bytes wasted
rm '/path/to/backup-100425/file b.mov'
rm '/another/path/to/backup-140102/file b.mov'
#/backup-120708/Library/some/path/file b.mov
#/some/other/path/to/backup-current/file b.mov
#2 x 198,315,112 (198,316,032) bytes wasted
rm '/path/to/backup-100425/file_c.out'
#/another/path/to/backup-disk-name/171023/file_c.out
I'm not very familiar with the shell tools awk, grep, and sed, and after reading this thread I realized my first draft was conceptually wrong, "a naive translation of what [I] would do in an imperative language like C".
And in fact, we're not dealing with files here, but with the content of a single file.
Is it appropriate to use a shell script for this scenario?
If so, what would an efficient script look like?
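For illustration only, here is the kind of awk approach I had in mind. It is a rough, untested sketch, not the accepted answer; it assumes GNU awk (for the {6} interval regexes), that all YYMMDD dates fall in the 2000s (so plain string comparison orders them correctly), and that no path contains a single quote:

#!/usr/bin/awk -f
# Rough sketch: read fslint groups, print rm lines for duplicates to delete
# and commented lines for copies to keep, per the rules above.

# Return the YYMMDD date of a historical backup directory in the path, or "".
function backup_date(path,    n, parts, i) {
    n = split(path, parts, "/")
    for (i = n - 1; i >= 1; i--) {           # skip the filename itself
        if (parts[i] ~ /^[0-9]{6}$/)         return parts[i]
        if (parts[i] ~ /^backup-[0-9]{6}$/)  return substr(parts[i], 8)
    }
    return ""
}

# Rules 1 and 2: copies under Library, labs data or backup-current are kept.
function is_kept(path) {
    return path ~ /\/Library\// || path ~ /\/labs data\// || path ~ /\/backup-current\//
}

# Emit the rm/comment lines for the group collected so far.
function flush_group(    i, purge, newest, d) {
    if (np == 0) return
    print header
    purge = 0
    newest = ""
    for (i = 1; i <= np; i++) {
        # Rule 2: a labs data or backup-current copy purges all historical copies.
        if (paths[i] ~ /\/labs data\// || paths[i] ~ /\/backup-current\//) purge = 1
        d = backup_date(paths[i])
        if (d != "" && d > newest) newest = d
    }
    for (i = 1; i <= np; i++) {
        d = backup_date(paths[i])
        if (is_kept(paths[i]))
            print "#" paths[i]                 # rules 1 and 2: keep
        else if (d != "" && (purge || d < newest))
            print "rm '" paths[i] "'"          # historical duplicate: delete
        else
            print "#" paths[i]                 # newest backup, or rule 4: keep
    }
    np = 0
}

BEGIN { print "#!/bin/bash" }
/^#/  { flush_group(); header = $0; next }     # "#N x ... bytes wasted" line
/^\// { paths[++np] = $0 }                     # one absolute path per line
END   { flush_group() }

It would be run as something like awk -f dedup.awk fslint-output.txt > cleanup.sh, where dedup.awk and the file names are placeholders.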
Edited: I tried to clarify the task and requirements after reading the answer and code from @Ed, which perfectly solves the question.
I did have an awk script in mind, but I needed confirmation that I was going in the right direction. – macxpat Dec 03 '19 at 02:41