6

In a directory, how can we find all the files whose base names are the same but whose extensions differ? E.g. 0001.jpg, 0001.png and 0001.tiff, and 0002.jpg and 0002.png.

Tim
  • 101,790

6 Answers

7

If you want all the unique filenames, here you go:

ls -1 | sed 's/\([^.]*\).*/\1/' | uniq

If instead you want only the base names that appear more than once (with their counts), use:

ls -1 | sed 's/\([^.]*\).*/\1/' | uniq -c | sort -n | egrep -v "^ *\<1\>"

If filenames may contain multiple periods and only the final extension should be stripped, use:

ls -1 | sed 's/\(.*\)\..*/\1/' | uniq -c | sort -n | egrep -v "^ *\<1\>"
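As a quick check of the first sed variant, here's a throwaway session in a temp directory (the sample file names are made up; `uniq -d`, used in later answers, keeps just the repeated base names):

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch 0001.jpg 0001.png 0001.tiff 0002.jpg 0002.png 0003.gif
# ls output is already sorted, so repeated base names are adjacent
# and uniq -d can pick out the ones occurring more than once.
dupes=$(ls -1 | sed 's/\([^.]*\).*/\1/' | uniq -d)
printf '%s\n' "$dupes"
```

This prints `0001` and `0002`, one per line; `0003` appears only once and is dropped.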
unxnut
  • 6,008
4

A solution using perl (I avoid parsing ls output; it's not designed for this task and can cause bugs):

perl -E '
    chdir(shift // ".") or die "chdir: $!";
    while (<*>){
        ($full, $short) = (m/^((.*?)\..*)$/);
        next unless $short;
        push @{ $h->{$short} }, $full;
    }
    for $key (keys %$h) {
        say join " ", @{ $h->{$key} } if @{ $h->{$key} } > 1;
    }
' /home/sputnick

Replace /home/sputnick with . or any directory you like ;)

3

Since the other answers all use sed or perl with regular expressions, I thought I'd be different and post something arguably simpler.

for file in /path/to/your/files/*; do echo "${file%%.*}"; done | uniq -d

Here, ${file%%.*} expands to the file path up to (but not including) the first period (.), so 0001.tar.gz is treated as 0001.

The output would look like this:

/path/to/your/files/0001
/path/to/your/files/0002

If you don't want the full path in the output, simply cd into the directory first and then run the command with just an asterisk (*) for the path.

cd /path/to/your/files
for file in *; do echo "${file%%.*}"; done | uniq -d

Then the output would look like this:

0001
0002
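The ${file%%.*} expansion used above can be checked on its own; a minimal sketch comparing the longest-match (%%) and shortest-match (%) forms of suffix removal:

```shell
f=0001.tar.gz
echo "${f%%.*}"   # %% removes the longest suffix matching ".*"  -> 0001
echo "${f%.*}"    # %  removes the shortest such suffix          -> 0001.tar
```

So %% strips from the first period onward, while % strips only the final extension.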
2

If you have a GNU environment, here's a robust solution which prints out the common base names, using gawk (just to mix it up):

find . -maxdepth 1 -type f -printf "%f\0" | 
  gawk 'BEGIN{RS="\0"} {sub(/\.[^.]+$/,""); if (length($0))printf("%s\0",$0)}' | 
  sort -z | uniq -zd | 
  tr '\000' '\n'

This uses find with \0 (nul) delimited filenames, gawk with RS (record separator) set to \0 to match the input, and a sub(/regex/) to strip an extension.

The final tr command undoes the nul delimiting for printing to the screen; omit it if you plan further (nul-safe) processing of the filenames.

(Normally I would do something like whatever | rev | cut -d. -f2- | rev | sort, but rev doesn't do nul-delimited input.)

If you want to limit it to only files with a .ext or more specific pattern you can use:

find . -maxdepth 1 -type f -name "*.*" -printf "%f\0" | ...

The first pipeline above prints only the common base names; if you want to print the actual filenames:

find . -maxdepth 1 -type f -name "*.*" -printf "%f\0" |        
  gawk 'BEGIN{ RS="\0" } 
             { base=$0;sub(/\.[^.]+$/,"",base);seen[base][FNR]=$0} 
        END  { for (bb in seen) 
                 if (length(seen[bb])>1) 
                    for (ff in seen[bb]) printf("%s\0",seen[bb][ff])
              }' |    
  tr '\000' '\n'

(gawk v4.0 minimum required for multi-dimensional arrays!)

This uses an array (hash) seen[] to cache seen file names keyed by the base name, then at the end it iterates over the base names in seen[] and prints the filenames of those with more than one match (length(seen[bb])>1).

mr.spuratic
  • 9,901
1

If you aren't afraid to parse ls:

/bin/ls --color=no -1 | sed 's/\.[^.]*$//' | uniq -d

That will fail if the file names contain newlines.

jimmij
  • 47,140
1
ls -1 | awk -F'.' '{print $1}'|uniq -cd

awk prints the first field ($1) of each filename, using . as the field separator.

uniq -d prints only the duplicated lines, and with the -c option also prints the number of occurrences.

$ ls -1
0001.jpg
0001.tar.gz
0001.tiff
0002.png
0002.tar.bz2
001.zip
$ ls -1 | awk -F'.' '{print $1}'|uniq -cd
 3 0001
 2 0002
αғsнιη
  • 41,407