I have two directories with hundreds of thousands of files in each. Most of the files are duplicated between the two locations, but some aren't, and I'd like to find out which ones aren't duplicated (meaning, for example, that they aren't backed up, so I could then choose to back them up or delete them). The key point is that the path to each file may be completely different relative to the parent directories. Some files may be identical but have different names; the tool ought to compare checksums to eliminate those from the output.
2 Answers
0
There's a wonderful small program called fdupes that can help with this - just be careful as you can set it to delete all the duplicates too, as well as other fun things.

djsmiley2kStaysInside
@LucasW No, you have to list all the files and subtract the duplicates as a further processing step. Also, this will consider a file as a duplicate if it appears twice on the same side. – Gilles 'SO- stop being evil' Jan 06 '17 at 23:53
0
I haven't tested this much, but this is the fdupes solution so far:
#!/usr/bin/env python3
# Load a list of files in dir1, a list of files in dir2,
# and the duplicate sets reported by fdupes,
# and print the files that have no copy on the other side.
#
# find dir1 -type f > dir1.txt
# find dir2 -type f > dir2.txt
# fdupes -r -1 dir1 dir2 > dupes.txt
#
# Caveat: with -1, fdupes puts each duplicate set on one line separated
# by spaces, so this simple split mishandles paths containing whitespace.
import sys


def read_paths(list_file):
    """Return the set of paths listed one per line in list_file."""
    with open(list_file) as f:
        return {line.rstrip("\n") for line in f if line.strip()}


dir1 = read_paths(sys.argv[1])
dir2 = read_paths(sys.argv[2])

backed_up = set()
with open(sys.argv[3]) as f:
    for line in f:
        group = set(line.split())
        # A set only counts if it has a member on each side; two copies
        # inside the same directory are not a backup of each other.
        if group & dir1 and group & dir2:
            backed_up |= group

# Everything that is left has no duplicate in the other directory.
for path in sorted((dir1 | dir2) - backed_up):
    print(path)
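
For comparison, here is a minimal sketch of the same idea without fdupes, hashing the files directly with Python's standard library. It is not something I have run on a large tree: the function names (`file_digest`, `index_tree`, `unmatched`) and the choice of SHA-256 are my own assumptions, and hashing every file will be slow with hundreds of thousands of files. The upside is that paths are never parsed from text, so whitespace in names is not a problem, and a file only counts as duplicated when its content exists under both roots.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def index_tree(root):
    """Map each content digest to the paths under root with that content."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                index[file_digest(path)].append(path)
    return index


def unmatched(root1, root2):
    """Return (only1, only2): sorted paths whose content exists under one root only."""
    idx1, idx2 = index_tree(root1), index_tree(root2)
    only1 = [p for d, paths in idx1.items() if d not in idx2 for p in paths]
    only2 = [p for d, paths in idx2.items() if d not in idx1 for p in paths]
    return sorted(only1), sorted(only2)
```

Calling `unmatched("dir1", "dir2")` returns two lists, one per side, which can then be printed or fed to a backup step. Grouping files by size first and hashing only sizes that occur on both sides would probably cut the work considerably, but that is left out here to keep the sketch short.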

Lucas Walter
hardlink -nv location1 location2
will give you the list of files which are the same, if that helps a bit (while also listing the ones which are the same only in location1). – Jaleks Jan 06 '17 at 19:40