Solution in TXR Lisp
I made several duplicate files in the subdirectory linenoise:
$ txr findup.tl linenoise/
---
969d22f22e167313 1c11d2 (linenoise/history.txt.link linenoise/history.txt)
---
c3211c8f2a6ac412 1c1e0d (linenoise/example.c)
c3211c8f2a6ac412 1cd21f (linenoise/example.c.dup)
---
e4cd0181a0e73fda 1cd20a (linenoise/LICENSE.lnk linenoise/LICENSE.dup)
e4cd0181a0e73fda 1c11d4 (linenoise/LICENSE)
When I run the program, it reports duplicate groups, each preceded by ---. Within each group, it prints a portion of the SHA256 hash, followed by the inode number, followed by the path(s).
The program identifies files that are hard links to each other (same inode number) as well as duplicates by hash, or combinations of the two. Various cases are shown above.
The files history.txt.link and history.txt are links to each other. No other duplicate exists, so only one line is shown for them.
The files example.c and example.c.dup are identical, but distinct objects.
Then we have a mixed situation: LICENSE.lnk and LICENSE.dup are links to the same object, which is a duplicate of LICENSE.
Code:
(let ((dir "."))
(match-case *args*
((@where) (set dir where))
(())
(@else (put-line "bad arguments" *stderr*)
(exit nil)))
(flow (build (ftw dir (lambda (path type stat . rest)
(if (eql type ftw-f)
(add stat)))))
(group-by .size)
hash-values
(keep-if cdr)
(each ((group @1))
(flow group
(group-by .ino)
hash-values
(collect-each ((group @1))
(let ((hash (with-stream (s (open-file (car group).path))
(sha256-stream s))))
(cons hash group)))
(sort-group @1 car)
(each ((subgr @1))
(when-match @(require ((@hash @stat . @more-stats) . @other-members)
(or other-members more-stats))
subgr
(put-line "---")
(each-match ((@nil . @stats) subgr)
(format t "~x ~08x ~a\n"
[hash 0..8] (car stats).ino
(mapcar .path stats)))))))))
The ftw function is a wrapper for the POSIX nftw function. It gives our callback lambda information about each visited object, including a stat structure (the Lisp version of struct stat). The stat structure has slots like ino (inode number), size, and path (the full relative path). We can do everything we need with these stat objects.
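The traversal step can be sketched in Python as a rough, hypothetical translation (os.walk and os.lstat standing in for ftw; the tuple fields mirror the stat slots the program uses):

```python
import os
import stat

def collect_stats(top):
    """Walk top recursively, returning (size, inode, path) tuples
    for regular files -- roughly the information the ftw callback
    gathers from each visited object's stat structure."""
    out = []
    for root, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)           # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode):  # regular files only, like ftw-f
                out.append((st.st_size, st.st_ino, path))
    return out
```

This only gathers the raw material; the grouping logic operates on these tuples afterwards.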
We first group the objects by size, and keep only groups that have two or more members; files with unique sizes cannot have duplicates, as the question notes.
We iterate over these same-size groups, splitting each one into subgroups by inode number. We then hash the leader of each subgroup (each subgroup being a Lisp list of stat structures), and tack the hash onto the subgroup as a head item.
Lastly, we do a sort-group operation on these subgroups, by hash: the subgroups are sorted by hash, and runs of duplicate hashes are grouped together.
Then we just have to walk over these equal-hash groups and dump them, taking care to report only groups that either have two or more members (duplicate objects) or else have multiple paths for one inode: (or other-members more-stats).
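The whole size -> inode -> hash pipeline can be sketched in Python (a hypothetical reimplementation, not the original program; find_duplicates is an invented name):

```python
import hashlib
from collections import defaultdict

def find_duplicates(entries):
    """entries: iterable of (size, inode, path) tuples.
    Returns a list of duplicate groups; each group is a list of
    (hexhash, inode, [paths]) rows, mirroring the program's grouping."""
    by_size = defaultdict(list)
    for size, ino, path in entries:
        by_size[size].append((ino, path))

    report = []
    for size, members in by_size.items():
        if len(members) < 2:             # unique size: no duplicate possible
            continue
        by_ino = defaultdict(list)       # hard links share an inode
        for ino, path in members:
            by_ino[ino].append(path)
        by_hash = defaultdict(list)
        for ino, paths in by_ino.items():
            with open(paths[0], "rb") as f:   # hash one leader per inode
                h = hashlib.sha256(f.read()).hexdigest()
            by_hash[h].append((ino, paths))
        for h, groups in by_hash.items():
            # report if duplicated content (two inodes) or hard links (two paths)
            if len(groups) > 1 or len(groups[0][1]) > 1:
                report.append([(h, ino, paths) for ino, paths in groups])
    return report
```

Reading each file whole is fine for a sketch; a real tool would hash in chunks, as sha256-stream does.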
An improvement could be made to the code. In situations when all the files that have a given size are links to the same object (same inode), we do not need to calculate a hash for them; we know they are the same, and no copy exists in that subtree. We could substitute some fake value for the hash, like 0, and just send them through the sort-group.
Also, the program neglects to do a full byte-for-byte comparison to eliminate false positives: it reports files that have the same SHA256 hash, not files verified to be identical.
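Such a verification step could be bolted on after the hash grouping. A hypothetical Python helper (not part of the original program) that confirms two same-hash files really are byte-identical might look like:

```python
def same_content(path_a, path_b, bufsize=65536):
    """Compare two files chunk by chunk; True iff byte-identical.
    A guard against (astronomically unlikely) SHA256 collisions."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(bufsize)
            b = fb.read(bufsize)
            if a != b:
                return False
            if not a:          # both streams hit EOF together
                return True
```

Files already known to share an inode would not need this check at all.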
Here is a possible way to elide the hash for pure hard link duplicates:
(collect-each ((group @1))
  (let ((hash (if (cdr @1)
                (with-stream (s (open-file (car group).path))
                  (sha256-stream s))
                (load-time (make-buf 32)))))
    (cons hash group)))
The output is then:
---
0000000000000000 1c11d2 (linenoise/history.txt.link linenoise/history.txt)
---
c3211c8f2a6ac412 1c1e0d (linenoise/example.c)
c3211c8f2a6ac412 1cd21f (linenoise/example.c.dup)
---
e4cd0181a0e73fda 1cd20a (linenoise/LICENSE.lnk linenoise/LICENSE.dup)
e4cd0181a0e73fda 1c11d4 (linenoise/LICENSE)
I substituted a full all-zero hash for that case, which is clearly visible. (load-time (make-buf 32)) makes a 32-byte all-zero buffer, the same length as a SHA256 hash. load-time ensures that in compiled code, the calculation is carried out once when the code is loaded, not each time it is executed. The cdr function is a Lisp idiom for "does this list have more than one item?"; it retrieves the rest of the list, excluding the first item. If that is empty, it returns nil, which is also Boolean false.