
I have a folder of images that contains quite a few duplicates. I'd like to remove all duplicates except for one copy of each.

Upon Googling I found this clever script from this post that succinctly does almost what I want it to do:

#!/bin/sh -eu
find "${1:-.}" -type f ! -empty -print0 | xargs -0 md5 -r | \
    awk '$1 in a{sub("^.{33}","");printf "%s\0",$0}a[$1]+=1{}' | \
    xargs -0 rm -v --

Unfortunately I am still fairly green when it comes to UNIX shell scripting. I'm not sure what the commands and flags in each piece are doing, so I am unable to modify the script for my specific needs.

From my understanding:

find "${1:-.}" -type f ! -empty -print0 - searches the current directory for non-empty files and prints the file names. (not sure what the piece "${1:-.}" means though)

| xargs -0 md5 -r - Pipes the results above (via the xargs -0 command?) into the md5 command to get the md5 hash signature of each file (-r reverses the output to make it a single line?)

awk '$1 in a{sub("^.{33}","");printf "%s\0",$0}a[$1]+=1{}' - This is where I get lost..

  • $1 in a{sub("^.{33}","") - takes the input up until the first whitespace character and replaces the first 33 characters from the start of the string with nothing (sub("^.{33}",""))
  • printf "%s\0" - format prints the entire string
  • a{...,$0} - I'm not sure what this does
  • a[$1]+=1{} - Not sure either

xargs -0 rm -v -- - Pipes the results to the rm command, printing each file name via -v, but I'm not sure what the syntax -- is for.

When I run this, I get output like ./test3.jpg./test2.jpg./test.jpg: No such file or directory, so there must be a formatting issue.

My question is:

  1. Can this be modified to remove all files except 1?
  2. Can someone help explain the gaps in what the commands/syntax means as I've outlined above?

I'm sure this is probably easy for someone who knows UNIX well but unfortunately that person is not me. Thank you in advance!

For context: I'm running this in zsh on macOS Big Sur 11.

  • This question is very broad. I'd suggest you have a quick look at a few man pages for man xargs, man md5, and man find (for -print0). ${1:-.} is explained here (bash and zsh use it the same way), -- here. Then try to make the question a bit more narrow. – FelixJN Dec 12 '21 at 22:00

1 Answer

I'll focus on the awk part here:

md5 -r prints the 32-character MD5 sum followed by a space and the file name. The MD5 sum is therefore the first field ($1) in awk.

$1 in a{...}

means "if $1 (here: the MD5 sum) exists as an index in array a, then run the commands {...}". So a is used as an array whose indices are the MD5 sums that have already been seen. Note that the in operator only tests whether the index exists; the value stored under it does not matter. The first time an MD5 sum appears it is not yet an index of a, so the condition is false and the file name is not printed. For every later file with the same sum, the condition is true and the commands are executed.
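To make the membership test concrete, here is a minimal sketch, separate from the script itself. The keys "x" and "y" and the variable name result are made up for illustration. It shows that in looks at the index only, so a key whose value is 0 still counts as present.

```shell
# `key in arr` is true once the key exists, whatever value is stored under it.
result=$(awk 'BEGIN {
    a["x"] = 0                              # key exists, value is 0
    if ("x" in a)    print "x: present"     # true: only the index matters
    if (!("y" in a)) print "y: absent"      # "y" was never assigned
}')
printf '%s\n' "$result"
```

Testing with in does not create the element, which is why the script can check $1 in a before it records the sum.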

sub("^.{33}","");printf "%s\0",$0

removes the first 33 characters, i.e. the MD5 sum and the following space, then prints the rest (the original file name) followed by a NUL delimiter. NUL-delimiting is important for file names containing e.g. spaces. See -print0 in man find or -0 in man xargs. Note that this command only runs if the MD5 sum is already in array a, so the first file with a given sum is not printed (i.e. only duplicates are listed and later removed).
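As a small illustration of the prefix stripping, the sketch below feeds awk one hand-written line in the shape described above (32 characters, a space, a file name). The hash and the file name are placeholders, not real md5 output. It uses substr($0, 34) instead of the script's sub() call, on the assumption that not every awk implementation accepts the {33} interval syntax, and it prints a newline rather than NUL so the result is readable.

```shell
# Placeholder line: 32-character "hash", one space, then a file name with a space in it.
line='0123456789abcdef0123456789abcdef ./my photo.jpg'

# Keep everything from character 34 onwards, i.e. drop the hash and the space.
name=$(printf '%s\n' "$line" | awk '{ print substr($0, 34) }')
printf '%s\n' "$name"
```

Working on $0 rather than $2 is what keeps a file name with spaces in one piece.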

a[$1]+=1{}

"Add 1 to element $1 of array a", where $1 is the MD5 sum. This is what records a sum in a once it has been seen; the value is the counter of how often that sum has occurred. Because the result of the addition is always at least 1, this pattern is always true. '{}' is the empty action. It is necessary because awk by default prints the full record if a pattern matches and no action is given.

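Putting the two rules together, the following sketch runs the same logic on made-up input so it can be tried without deleting anything. The short hashes h1 and h2, the file names, and the array name seen are placeholders. Output is newline-delimited for readability, and substr($0, length($1) + 2) stands in for the fixed 33-character cut because the fake hashes are short.

```shell
# Fake "hash name" lines; h1 occurs three times, h2 once.
dupes=$(printf '%s\n' \
    'h1 ./test.jpg' \
    'h2 ./other.jpg' \
    'h1 ./test2.jpg' \
    'h1 ./test3.jpg' |
  awk '
    $1 in seen { print substr($0, length($1) + 2) }  # hash seen before: a duplicate
    { seen[$1]++ }                                   # record the hash, print nothing
  ')
printf '%s\n' "$dupes"
```

Only ./test2.jpg and ./test3.jpg are listed. The first file of each group is never printed, so it is the one copy that survives when the list is handed to rm.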

Warning

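As far as I can see, the script works fine for file names with spaces, but I think it will fail for file names containing newlines, as awk has not been told to use NUL as the record separator and defaults to newlines. Putting BEGIN {RS="\0"} first in the awk program sets it (in an awk that supports a NUL record separator, such as GNU awk), but this only helps if awk's input is NUL-delimited as well, and md5 -r prints one newline-terminated line per file.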
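The newline problem can be seen without touching any real files. In this sketch the single input entry is a placeholder hash h1 followed by a file name that contains a newline. With the default record separator, awk treats it as two records.

```shell
# One logical entry whose file name spans two lines.
records=$(printf '%s\n' 'h1 ./bad
name.jpg' | awk '{ print NR ": hash field = " $1 }')
printf '%s\n' "$records"
```

The second half of the name becomes a record of its own, and its first word is mistaken for a hash, so the name that eventually reaches rm would be wrong.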

FelixJN
  • Thank you for the breakdown @FelixJN, I found it very helpful. I didn't know that's how arrays worked in shell, so I should be able to expand off that logic to fit my specific needs. Thank you for the links in your comment by the way, I'll be reading up on those. – cdouble.bhuck Dec 14 '21 at 05:25
  • @cdouble.bhuck - This is awk and not shell, so this is how awk-arrays work. Please be careful not to mix those up. – FelixJN Dec 14 '21 at 08:16