After moving files back and forth between my Linux PC and a Mac, I now have a few duplicated documents. Their names look identical, but apparently they are encoded slightly differently, like in this question.

For instance, ls in a certain directory reports, among other things:

'Voisin - Géométrie algébrique et espaces de modules.pdf'
'Voisin - Géométrie algébrique et espaces de modules.pdf'

These really look the same, but using the command ls | LC_ALL=C sed -n l as suggested in the above question, I get

Voisin - Ge\314\201ome\314\201trie alge\314\201brique et espaces de m\
odules.pdf$
Voisin - G\303\251om\303\251trie alg\303\251brique et espaces de modu\
les.pdf$
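
The two lines of output differ only in Unicode normalization: e\314\201 is the letter e followed by the combining acute accent U+0301 (the decomposed form, NFD), while \303\251 is the single precomposed character U+00E9 (NFC). A minimal Python sketch that reproduces both byte sequences, assuming the names are UTF-8 and using a shortened sample string rather than the real file name:

```python
import unicodedata

# Shortened sample taken from the file name above (an illustration only).
name = "Géométrie"

nfc = unicodedata.normalize("NFC", name)  # precomposed: é is one code point, U+00E9
nfd = unicodedata.normalize("NFD", name)  # decomposed: e followed by U+0301

# The same text, but two different byte sequences on disk.
print(nfc.encode("utf-8"))  # contains \xc3\xa9, i.e. octal \303\251
print(nfd.encode("utf-8"))  # contains e\xcc\x81, i.e. octal e\314\201

# The strings compare unequal until both are brought to the same form.
print(nfc == nfd)
print(unicodedata.normalize("NFC", nfd) == nfc)
```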

Now, I have a directory tree full of such "duplicates". Is there a way to

  • find them all
  • for each duplicate pair, move one of them to an external directory? (I don't want to delete them right now, just in case I mess something up)

I think that the contents are also identical, so diff should report nothing, but I am not sure, since I don't know how to make sure that I am running diff on the two copies, as the paths look identical.
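
One possible way to do both steps, sketched in Python rather than taken from any answer: group the files in each directory by the NFC form of their name, compare the contents byte by byte, and move the extra copy to a directory outside the tree. The function names, the choice to keep the NFC-named file, and the `dry_run` flag are all assumptions made for illustration, and the sketch assumes it runs on a filesystem (such as a typical Linux one) that treats the two names as distinct.

```python
import filecmp
import shutil
import unicodedata
from collections import defaultdict
from pathlib import Path


def find_normalization_duplicates(root):
    """Return lists of files in the same directory whose names are equal after NFC."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            key = (path.parent, unicodedata.normalize("NFC", path.name))
            groups[key].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


def move_extra_copies(root, backup_dir, dry_run=True):
    """Keep one file per group and move byte-identical extras under backup_dir.

    backup_dir is assumed to lie outside root. Returns the (source, target)
    pairs that were moved, or that would be moved when dry_run is True.
    """
    root = Path(root)
    backup_dir = Path(backup_dir)
    plan = []
    for paths in find_normalization_duplicates(root):
        # Assumption: prefer to keep the copy whose name is already in NFC.
        keep = next(
            (p for p in paths if unicodedata.normalize("NFC", p.name) == p.name),
            paths[0],
        )
        for extra in paths:
            if extra == keep:
                continue
            # Compare actual contents, not just size and modification time.
            if not filecmp.cmp(keep, extra, shallow=False):
                print(f"contents differ, leaving in place: {extra}")
                continue
            target = backup_dir / extra.relative_to(root)
            plan.append((extra, target))
            if not dry_run:
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.move(str(extra), str(target))
    return plan
```

Calling `move_extra_copies("some/tree", "/some/backup")` with the default `dry_run=True` only returns the planned moves, so the list can be reviewed before anything is touched; both paths here are placeholders.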

  • Have you tried any of the approaches described in Find duplicate files? – terdon May 18 '22 at 16:34
  • 1
    I believe MacOS has (or had?) a different way than other *nixes to encode diacritic characters and that might have led to this situation. See for example: https://gist.github.com/JamesChevalier/8448512 ( -> https://manpages.debian.org/convmv/convmv.1 ). – A.B May 18 '22 at 17:02
  • You could try normalising the names and moving them to a different directory if they don't match the original filename. – l0b0 May 19 '22 at 02:41
  • Thank you all, I was able to do the task with fdupes, whose --delete option lets you review the duplicate pairs and keep a single copy from each pair. I was able to distinguish the file names because the terminal renders the accents slightly differently. I made a backup before deleting, just in case. – Andrea Ferretti May 19 '22 at 08:59
