
The task is to transform this text file (output from the fslint utility) into a bash script containing rm commands for the duplicate file(s) to delete and commented lines for the file(s) to keep, according to a set of rules.

The rules basically say: delete duplicate files only in specific directories.

The goal is to clean up about 1 TB of duplicates accumulated over the years across several operating systems (Mac OS X, Windows, Linux). All the data has been copied to a Linux drive.

#3 x 697,612,024        (1,395,236,864) bytes wasted
/path/to/backup-100425/file_a.dat
/another/path/to/backup-disk-name/171023/file_a.dat
/yet/another/path/to/labs data/some/path/file_a.dat
#4 x 97,874,344 (293,634,048)   bytes wasted
/path/to/backup-100425/file b.mov
/another/path/to/backup-140102/file b.mov
/backup-120708/Library/some/path/file b.mov
/some/other/path/to/backup-current/file b.mov
#2 x 198,315,112        (198,316,032)   bytes wasted
/path/to/backup-100425/file_c.out
/another/path/to/backup-disk-name/171023/file_c.out

The first line says there are 3 identical copies of file_a.dat and the following 3 lines list the paths to them. Ideally 2 copies should be deleted here. Directories with 6-digit numbers (dates in YYMMDD format) are what I call historical backup directories.

The rules, to be applied in this order to each group of identical files, are:

  1. If a file is in a path including a directory Library, keep it.
  2. If a file is in labs data or backup-current, keep it, and delete all duplicates in historical backup directories.
  3. If a file is in a historical backup directory, keep the file in the newest backup directory, and delete older duplicates.
  4. Otherwise keep the file(s).
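
One detail worth noting for rule 3: because the backup dates are fixed-width, zero-padded YYMMDD strings, a plain lexicographic string comparison is enough to find the newest directory; no date parsing is needed. A quick sketch, using the date values from the sample above:

```shell
# Find the newest of a set of YYMMDD backup dates by string comparison.
latest="000000"
for d in 100425 171023 140102; do
    # fixed-width, zero-padded dates sort correctly as plain strings
    if [ "$d" \> "$latest" ]; then
        latest="$d"
    fi
done
echo "$latest"   # prints 171023
```

This is the same idea the scripts below use when they track the latest backup date.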

Here is the desired output:

#!/bin/bash
#3 x 697,612,024        (1,395,236,864) bytes wasted
rm '/path/to/backup-100425/file_a.dat'
rm '/another/path/to/backup-disk-name/171023/file_a.dat'
#/yet/another/path/to/labs data/some/path/file_a.dat
#4 x 97,874,344 (293,634,048)   bytes wasted
rm '/path/to/backup-100425/file b.mov'
rm '/another/path/to/backup-140102/file b.mov'
#/backup-120708/Library/some/path/file b.mov
#/some/other/path/to/backup-current/file b.mov
#2 x 198,315,112        (198,316,032)   bytes wasted
rm '/path/to/backup-100425/file_c.out'
#/another/path/to/backup-disk-name/171023/file_c.out
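
Once a script in this form has been generated (cleanup.sh is a hypothetical name), it can be sanity-checked before anything is actually deleted:

```shell
# Hypothetical safety checks on the generated script before running it.
bash -n cleanup.sh            # syntax check only; executes nothing
grep -c '^rm ' cleanup.sh     # how many files would be deleted
grep -c '^#/' cleanup.sh      # how many files would be kept
# after a manual review of the rm lines:
# bash cleanup.sh
```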

I'm not very familiar with the shell tools awk, grep and sed, and after reading this thread I realized my first draft was conceptually wrong, "a naive translation of what [I] would do in an imperative language like C".

And in fact, we're not dealing here with files, but with one file's content.

Is it appropriate to use a shell script for this scenario?
If yes, what would an efficient script look like?

Edited: I tried to clarify the task and requirements after reading the answer and code from @Ed, which perfectly solve the problem.

macxpat
    No, you don't use shell to manipulate text. It'd be a single awk script. – Ed Morton Nov 30 '19 at 20:58
    Thank you very much, this is exactly the answer I was looking for! After reading more here and there, I was heading to a self-contained awk script, but I needed a confirmation I was in the right direction. – macxpat Dec 03 '19 at 02:41

2 Answers


I don't understand your list of requirements (given how much time I'm willing to put into trying to), but here's a script that just categorizes and prints the types of files you seem to be interested in; hopefully you can figure out the rest:

$ cat tst.awk
/^#/ { prt(); print; next }
{ files[$0] }
END { prt() }

function prt(   file, isLibrary, isLabsBack, isNothing) {
    for (file in files) {
        if ( file ~ /(^|\/)Library(\/|$)/ ) {
            isLibrary[file]
        }
        else if ( file ~ /(^|\/)(labs data|backup-current)(\/|$)/ ) {
            isLabsBack[file]
        }
        else {
            isNothing[file]
        }
    }
    for (file in isLibrary) {
        print "Library", file
    }
    for (file in isLabsBack) {
        print "LabsBack", file
    }
    for (file in isNothing) {
        print "Nothing", file
    }
    delete files
}


$ awk -f tst.awk file
#3 x 697,612,024        (1,395,236,864) bytes wasted
LabsBack /yet/another/path/to/labs data/some/path/file_a.dat
Nothing /another/path/to/backup-disk-name/171023/file_a.dat
Nothing /path/to/backup-100425/file_a.dat
#4 x 97,874,344 (293,634,048)   bytes wasted
Library /backup-120708/Library/some/path/file b.mov
LabsBack /some/other/path/to/backup-current/file b.mov
Nothing /path/to/backup-100425/file b.mov
Nothing /another/path/to/backup-140102/file b.mov
#2 x 198,315,112        (198,316,032)   bytes wasted
Nothing /path/to/backup-100425/file_c.out
Nothing /another/path/to/backup-disk-name/171023/file_c.out
Ed Morton
    Wow! Thank you Ed. This is really smart code! This perfectly answers my question. Thank you for your time and effort, and sorry for the confusing rules. I have modified the description of what I wanted to obtain and the rules for filtering the files to be deleted. Hope it makes more sense. I am adapting your (unexpectedly) great code and will post the result later. – macxpat Dec 03 '19 at 02:58
    Yeah, you're right @Ed. After starting to adapt your code I realized my requirements made no sense. I edited the question again. – macxpat Dec 05 '19 at 16:30

Here's the code that gives the desired output mentioned in the question, for anyone interested. It's just a tiny adaptation of @Ed's really smart code. Note that it relies on GNU awk extensions: the three-argument match() and gensub().

BEGIN { print "#!/bin/bash" }
/^#/ { prt(); print; next }
{ files[$0] }
END { prt() }

function prt(   file, isDate, isKeep, isDelete, backup, latest, pats) {
    # file exists in a current backup directory (yes|no)
    backup = "no"
    # latest historical backup date
    latest = "000000"
    for (file in files) {
        if ( file ~ /\/Library\// ) {
            # files to check manually
            isKeep[file]
        }
        else if ( file ~ /\/(labs data|backup-current)\// ) {
            # backup files to keep
            isKeep[file]
            backup = "yes"
        }
        else if ( match(file, /\/(backup-disk-name\/|backup-)([0-2][0-9][0-1][0-9][0-3][0-9])\//, pats) != 0 ) {
            # files in historical backup directories
            if ( pats[2] > latest ) {
                latest = pats[2]
            }
            isDate[file] = pats[2]
        }
        else {
            # unclassified files to check manually
            isKeep[file]
        }
    }
    for (file in isDate) {
        if ( isDate[file] == latest && backup == "no") {
            isKeep[file]
        }
        else {
            isDelete[file]
        }
    }
    for (file in isKeep) {
        print "#", file
    }
    for (file in isDelete) {
        # use single quotes to escape special characters in file
        # use gensub() to escape single quotes in file
        print "rm", "'" gensub(/'/,"'\\\\''", "g", file) "'"
    }
    delete files
}
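
As a side note, the gensub() quoting trick above can be checked in isolation: inside a single-quoted string, each embedded single quote must become the four characters '\''. The demo below uses sed instead of gawk, and the file name is made up:

```shell
# Demonstrate the quoting rule used by the gensub() call above:
# each single quote in the name becomes '\'' inside the quoted string.
f="it's a file.mov"
quoted="'$(printf '%s' "$f" | sed "s/'/'\\\\''/g")'"
printf '%s\n' "rm $quoted"    # prints: rm 'it'\''s a file.mov'

# round-trip check: the shell parses the quoted form back to the original
eval "g=$quoted"
[ "$g" = "$f" ] && echo OK    # prints: OK
```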

Finally, I would like to share some thoughts. I hope I'm not digressing too much.
A few weeks ago I resolved to finally clean up that monstrous backup data (some files have more than 10 duplicates). But I couldn't find a tool to automate the task. I didn't want to fire up a C program for it, and didn't want to go the Perl way either. So I knew I had to (and wanted to) go the shell way. But I didn't know where to start and got stuck on the first lines.

After reading a lot, I was still very confused. So I decided to post my question on SE.
When I first read @Ed's code I thought "What the hell!". Then, when I got it, I realized it's a brilliant piece of code, highly efficient and clear.

So here we are. About one week ago, I knew nothing about awk and very little about regular expressions. Now, thanks to @Ed's contribution, I've been able to write "my" first awk script, better understand the RegExp world, and complete the task at hand. More importantly, I'm now confident enough to dive deeper by myself into RegExp, awk and other text-processing shell tools. It also motivates me to contribute more to SE.
I just wanted to share my personal experience, and give hope to others who, like me, may be stuck on a problem, as if facing a mountain.

macxpat