20

Is there a way to find all files in a directory with duplicate filenames, regardless of the casing (upper-case and/or lower-case)?

Jeff Schaller
lamcro

8 Answers

16

If you have GNU utilities (or at least a set that can deal with zero-terminated lines) available, another answer has a great method:

find . -maxdepth 1 -print0 | sort -fz | uniq -diz

Note: the output will have zero-terminated strings; the tool you use to further process it should be able to handle that.
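
For example, a minimal sketch of consuming that NUL-separated output (assuming bash or zsh, whose read builtin accepts -d ''):

find . -maxdepth 1 -print0 | sort -fz | uniq -diz |
while IFS= read -r -d '' name; do
  # uniq -d emits one representative spelling per duplicate group
  printf 'duplicate name: %s\n' "$name"
done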

In the absence of tools that deal with zero-terminated lines, or if you want to make sure your code works in environments where such tools are not available, you need a small script:

#!/bin/sh
# For every name in the current directory, count how many entries match it
# case-insensitively (echoing one blank line per match keeps wc -l honest even
# for names containing newlines), and print the name when there is more than one.
for f in *; do
  find . -maxdepth 1 -iname "$f" -exec echo \; | wc -l | while read count; do
    [ "$count" -gt 1 ] && printf '%s\n' "$f"
  done
done

What is this madness? See this answer for an explanation of the techniques that make this safe for crazy filenames.

Shawn J. Goff
15

There are many complicated answers above; this seems simpler and quicker than all of them:

find . -maxdepth 1 | sort -f | uniq -di

If you want to find duplicate file names in subdirectories then you need to compare just the file name, not the whole path:

find . -maxdepth 2 -printf "%f\n" | sort -f | uniq -di
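
Since %f drops the directory part, the output above is just the bare names. As a rough follow-up sketch (assuming names without newlines or glob metacharacters, since -iname treats its argument as a pattern), you can then look up where each duplicated name actually lives:

find . -maxdepth 2 -printf "%f\n" | sort -f | uniq -di |
while IFS= read -r name; do
  find . -maxdepth 2 -iname "$name"   # print every path carrying that name, in any case
done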

Edit: Shawn J. Goff has pointed out that this will fail if you have filenames with newline characters. If you're using GNU utilities, you can make these work too:

find . -maxdepth 1 -print0 | sort -fz | uniq -diz

The -print0 option (for find) and -z option (for sort and uniq) cause them to work on NUL-terminated strings instead of newline-terminated strings. Since file names cannot contain NUL, this works for all file names.
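
If you just want to eyeball the result in a terminal, one option is to translate the NULs back into newlines for display only (names containing newlines become ambiguous again, so keep the NUL separators for any further processing):

find . -maxdepth 1 -print0 | sort -fz | uniq -diz | tr '\0' '\n'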

derobert
Jamie Kitson
  • But see my comment on Shawn J. Goff's answer: you can add the -print0 option to find, and the -z option to uniq and sort. Also, you want -f on sort as well. Then it works. (I'm going to edit this into your answer, feel free to revert if you don't approve) – derobert Oct 26 '12 at 17:41
  • The last command is giving me output without carriage returns (result is all in one line). I'm using Red Hat Linux to run the command. The first command line works best for me. – Sun Aug 26 '15 at 16:42
3

Sort the list of file names in a case-insensitive way and print duplicates. sort has an option for case-insensitive sorting. So does GNU uniq, but not other implementations, and with uniq -d all you get is one element (the first encountered) from each set of duplicates. With GNU tools, assuming that no file name contains a newline, there's an easy way to print one name from each set of duplicates:

for x in *; do printf "%s\n" "$x"; done |
sort -f |
uniq -id

Portably, to print all elements in each set of duplicates, assuming that no file name contains a newline:

for x in *; do printf "%s\n" "$x"; done |
sort -f |
awk '
    tolower($0) == tolower(prev) {
        # current line matches the previous one case-insensitively:
        # print the whole run of matching lines
        print prev
        print
        while ((getline) > 0 && tolower($0) == tolower(prev)) print
    }
    { prev = $0 }'

If you need to accommodate file names containing newlines, go for Perl or Python. Note that you may need to tweak the output, or better do your further processing in the same language, as the sample code below uses newlines to separate names in its own output.

perl -e '
    foreach (glob("*")) {push @{$f{lc($_)}}, $_}
    foreach (keys %f) {@names = @{$f{$_}}; if (@names > 1) {print "$_\n" foreach @names}}
'
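
If the output itself needs to be newline-safe, a possible tweak (a sketch, not part of the original one-liner) is to separate the printed names with NUL bytes instead and hand the result to NUL-aware tools:

perl -e '
    # same grouping as above, but emit NUL-separated names
    foreach (glob("*")) {push @{$f{lc($_)}}, $_}
    foreach (keys %f) {@names = @{$f{$_}}; if (@names > 1) {print "$_\0" foreach @names}}
'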

Here's a pure zsh solution. It's a bit verbose, as there's no built-in way to keep the duplicate elements in an array or glob result.

a=(*(N)); a=("${(@io)a}")
[[ $#a -le 1 ]] ||
for i in {2..$#a}; do
  if [[ ${(L)a[$i]} == "${(L)a[$((i-1))]}" ]]; then
    [[ ${(L)a[$i-2]} == "${(L)a[$((i-1))]}" ]] || print -r -- $a[$((i-1))]
    print -r -- $a[$i]
  fi
done
2

I finally managed it this way:

find . | tr '[:upper:]' '[:lower:]' | sort | uniq -d

I used find instead of ls because I needed the full path (a lot of subdirectories) included. I did not find a way to do this with ls.
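
Note that this prints the lowercased spelling, not the names as they exist on disk. A rough way to map each reported duplicate back to its original-case paths (assuming a find that supports -ipath, as GNU and BSD find do, and paths free of glob metacharacters):

find . | tr '[:upper:]' '[:lower:]' | sort | uniq -d |
while IFS= read -r dup; do
  find . -ipath "$dup"   # list the real paths that match case-insensitively
done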

lamcro
1

Without GNU find:

LANG=en_US ls | tr '[A-Z]' '[a-z]' | uniq -c | awk '$1 >= 2 {print $2}'

  • tr is very likely to wreak havoc on any character set which uses more than a single byte per character. Only the first 256 characters of UTF-8 are safe when using tr. From Wikipedia, tr (Unix): most versions of tr, including GNU tr and classic Unix tr, operate on single bytes and are not Unicode compliant. – Peter.O Oct 19 '11 at 15:24
  • Update to my previous comment: only the first 128 characters of UTF-8 are safe. All UTF-8 characters above the ordinal range 0..127 are multi-byte and can have individual byte values that also appear in other characters. Only the bytes in the range 0..127 have a one-to-one association with a unique character. – Peter.O Aug 28 '12 at 00:08
  • Plus uniq has a case-insensitive flag, -i. – Jamie Kitson Oct 26 '12 at 12:06
0

The Question:

Is there a way to find all files in a directory with duplicate filenames, regardless of the casing (upper-case and/or lower-case)?

An Answer:

I found this works for me on Ubuntu 20.04. I tested this in my home directory with a contrived duplication of filenames; i.e.:

$ touch filename.txt
$ touch FiLeNaMe.TxT

And then:

$ find . -maxdepth 1 -type f | sort -f | uniq -Di 
./filename.txt
./FiLeNaMe.TxT
  • find . : search begins in pwd - ~/ in this case
  • -maxdepth 1 : find defaults to full recursion; this limits that to pwd only
  • -type f : "regular" files only - no directories, links, etc
  • sort -f : sorting is required because uniq only detects adjacent duplicates; -f ignores case
  • uniq -Di : -D prints all dupes; -i ignores case
Seamus
0

If I'm understanding the question correctly, lamcro wanted to be certain of finding every one of a suspected small number of such duplicates. I have found that using FileZilla to transfer an entire directory to a Windows-based file system causes each and every instance to be caught by a dialog that asks what action to take. Each dialog contains enough information to be certain where the duplicates are, though admittedly it is clumsy compared with a true search, which would generate a list.

rich
-1

For anyone else who then wants to rename (or otherwise deal with) one of the files:

find . -maxdepth 1 | sort -f | uniq -di | while IFS= read -r f; do echo mv "$f" "${f/.txt/_.txt}"; done
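
A sketch of the same idea that tolerates arbitrary file names (assuming bash and GNU tools; ${f/.txt/_.txt} is a bash substitution, and the echo is left in as a dry run):

find . -maxdepth 1 -print0 | sort -fz | uniq -diz |
while IFS= read -r -d '' f; do
  echo mv -- "$f" "${f/.txt/_.txt}"   # drop the echo once the proposed renames look right
done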
JohnFlux