6

In a directory, how can we find all the files whose base names are the same but whose extensions differ? E.g. 0001.jpg, 0001.png and 0001.tiff, and 0002.jpg and 0002.png.

Tim
  • 101,790

6 Answers

7

If you want all the unique filenames, here you go:

ls -1 | sed 's/\([^.]*\).*/\1/' | uniq

If instead you want only the base names that appear more than once (with their counts), use:

ls -1 | sed 's/\([^.]*\).*/\1/' | uniq -c | sort -n | egrep -v "^ *\<1\>"

If filenames may contain multiple periods and only the final extension should be stripped, use:

ls -1 | sed 's/\(.*\)\..*/\1/' | uniq -c | sort -n | egrep -v "^ *\<1\>"
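As a quick check of the first sed variant, here's a throwaway session in a temp directory (the sample file names are made up; `uniq -d`, used in later answers, keeps just the repeated base names):

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch 0001.jpg 0001.png 0001.tiff 0002.jpg 0002.png 0003.gif
# ls output is already sorted, so repeated base names are adjacent
# and uniq -d can pick out the ones occurring more than once.
dupes=$(ls -1 | sed 's/\([^.]*\).*/\1/' | uniq -d)
printf '%s\n' "$dupes"
```

This prints `0001` and `0002`, one per line; `0003` appears only once and is dropped.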
unxnut
  • 6,008
4

A solution using perl (I avoid parsing ls output; it's not designed for this task and can cause bugs):

perl -E '
    chdir(shift // ".") or die "chdir: $!";
    while (<*>){
        ($full, $short) = (m/^((.*?)\..*)$/);
        next unless $short;
        push @{ $h->{$short} }, $full;
    }
    for $key (keys %$h) {
        say join " ", @{ $h->{$key} } if @{ $h->{$key} } > 1;
    }
' /home/sputnick

Replace /home/sputnick with . or any directory you like ;)

3

Since the other answers all use sed or perl with regular expressions, I thought I'd be different and post something arguably simpler.

for file in /path/to/your/files/*; do echo "${file%%.*}"; done | uniq -d

Here, ${file%%.*} expands to the file path up to (but not including) the first period (.), so 0001.tar.gz is treated as 0001.

The output would look like this:

/path/to/your/files/0001
/path/to/your/files/0002

If you don't want the full path in the output, simply cd into the directory first and then run the command with just an asterisk (*) for the path.

cd /path/to/your/files
for file in *; do echo "${file%%.*}"; done | uniq -d

Then the output would look like this:

0001
0002
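The ${file%%.*} expansion used above can be checked on its own; a minimal sketch comparing the longest-match (%%) and shortest-match (%) forms of suffix removal:

```shell
f=0001.tar.gz
echo "${f%%.*}"   # %% removes the longest suffix matching ".*"  -> 0001
echo "${f%.*}"    # %  removes the shortest such suffix          -> 0001.tar
```

So %% strips from the first period onward, while % strips only the final extension.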
2

If you have a GNU environment, here's a robust solution which prints out the common base names, using gawk (just to mix it up):

find . -maxdepth 1 -type f -printf "%f\0" | 
  gawk 'BEGIN{RS="\0"} {sub(/\.[^.]+$/,""); if (length($0))printf("%s\0",$0)}' | 
  sort -z | uniq -zd | 
  tr '\000' '\n'

This uses find with \0 (nul) delimited filenames, gawk with RS (record separator) set to \0 to match the input, and a sub(/regex/) to strip an extension.

The final tr command undoes the nul delimiting for printing to the screen; omit it if you plan further (nul-safe) processing of the filenames.

(Normally I would do something like whatever | rev | cut -d. -f2- | rev | sort, but rev doesn't do nul-delimited input.)

If you want to limit it to only files with a .ext or more specific pattern you can use:

find . -maxdepth 1 -type f -name "*.*" -printf "%f\0" | ...

The first pipeline above prints only the common base names; if you want to print the actual filenames:

find . -maxdepth 1 -type f -name "*.*" -printf "%f\0" |        
  gawk 'BEGIN{ RS="\0" } 
             { base=$0;sub(/\.[^.]+$/,"",base);seen[base][FNR]=$0} 
        END  { for (bb in seen) 
                 if (length(seen[bb])>1) 
                    for (ff in seen[bb]) printf("%s\0",seen[bb][ff])
              }' |    
  tr '\000' '\n'

(gawk v4.0 minimum required for multi-dimensional arrays!)

This uses an array (hash) seen[] to cache seen file names keyed by the base name, then at the end it iterates over the base names in seen[] and prints the filenames of those with more than one match (length(seen[bb])>1).

mr.spuratic
  • 9,901
1

If you aren't afraid to parse ls:

/bin/ls --color=no -1 | sed 's/\.[^.]*$//' | uniq -d

That will fail if the file names contain newlines.

jimmij
  • 47,140
1
ls -1 | awk -F'.' '{print $1}'|uniq -cd

awk prints the first field ($1) of each filename, using . as the field separator.

uniq -d prints only the duplicated lines, and with the -c option also prints the number of occurrences.

$ ls -1
0001.jpg
0001.tar.gz
0001.tiff
0002.png
0002.tar.bz2
001.zip
$ ls -1 | awk -F'.' '{print $1}'|uniq -cd
 3 0001
 2 0002
αғsнιη
  • 41,407