In a directory, how can we find all the files whose base names are the same but whose extensions differ? E.g. 0001.jpg, 0001.png, and 0001.tiff, and 0002.jpg and 0002.png.
6 Answers
If you want all the unique base names, here you go:
ls -1 | sed 's/\([^.]*\).*/\1/' | uniq
If you want only the base names that occur more than once, use:
ls -1 | sed 's/\([^.]*\).*/\1/' | uniq -c | sort -n | egrep -v "^ *\<1\>"
For filenames with multiple periods, use the following instead; it treats only the part after the last period as the extension, so 0001.tar.gz becomes 0001.tar rather than 0001:
ls -1 | sed 's/\(.*\)\..*/\1/' | uniq -c | sort -n | egrep -v "^ *\<1\>"
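For example, with the five files from the question, the second command should print something like:
$ ls -1 | sed 's/\([^.]*\).*/\1/' | uniq -c | sort -n | egrep -v "^ *\<1\>"
      2 0002
      3 0001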

Thanks @jimmij. I have just added a third solution to fix that. – unxnut Nov 14 '14 at 01:40
A solution using perl (I avoid parsing ls output; it's not designed for this task and can cause bugs):
perl -E '
chdir shift or die "chdir: $!";           # work in the directory given as argument
while (<*>) {                             # glob every file name there
    ($full, $short) = (m/^((.*?)\..*)$/); # capture full name and base name
    next unless $short;                   # skip names without an extension
    push @{ $h->{$short} }, $full;        # group full names by base name
}
for $key (keys %$h) {
    say join " ", @{ $h->{$key} } if @{ $h->{$key} } > 1;
}
' /home/sputnick
Replace /home/sputnick with . or any directory you'd like ;)
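With the question's files, this would print each group of names on one line, something like (group order is not guaranteed, since keys %$h is unordered):
0001.jpg 0001.png 0001.tiff
0002.jpg 0002.png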

@sputnick good plan, parsing printf * suffers from at least one of the problems that parsing ls does. – mr.spuratic Nov 14 '14 at 12:14
@mr.spuratic: Well, printf '%s\0' * is fine, and there's a very small chance that even '%s\n' would break (I mean, who uses newlines in file names?) But it's still unnecessary. – u1686_grawity Nov 14 '14 at 13:32
Since the only answers here use either sed or perl and regular expressions, I thought I'd be different and post something arguably simpler.
for file in /path/to/your/files/*; do echo "${file%%.*}"; done | uniq -d
In this example, ${file%%.*} expands to the file path up to (but not including) the first period (.). So 0001.tar.gz would be treated as 0001.
The output would look like this:
/path/to/your/files/0001
/path/to/your/files/0002
If you don't want the full path in the output, simply cd into the directory first and then run the command with just an asterisk (*) for the path.
cd /path/to/your/files
for file in *; do echo ${file%%.*}; done | uniq -d
Then the output would look like this:
0001
0002

If you have a GNU environment, here's a robust solution which prints out the common base names, using gawk (just to mix it up):
find . -maxdepth 1 -type f -printf "%f\0" |
gawk 'BEGIN{RS="\0"} {sub(/\.[^.]+$/,""); if (length($0))printf("%s\0",$0)}' |
sort -z | uniq -zd |
tr '\000' '\n'
This uses find with \0 (nul) delimited filenames, gawk with RS (the record separator) set to \0 to match the input, and a sub(/regex/) to strip an extension.
The final tr command undoes the nul delimiting for printing to the screen; omit it for further (safe) processing of filenames.
(Normally I would do something like whatever | rev | cut -d. -f2- | rev | sort, but rev doesn't do nul-delimited input.)
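For what it's worth, a newline-delimited sketch of that rev trick (my own variant, only safe when no file name contains a newline) could look like:
# strip the last extension, then keep only duplicated base names
find . -maxdepth 1 -type f -printf "%f\n" |
rev | cut -d. -f2- | rev |
sort | uniq -d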
If you want to limit it to only files with a .ext or a more specific pattern, you can use:
find . -maxdepth 1 -type f -name "*.*" -printf "%f\0" | ...
The first option above prints only the common base; if you want to print out the actual filenames:
find . -maxdepth 1 -type f -name "*.*" -printf "%f\0" |
gawk 'BEGIN{ RS="\0" }
      { base=$0; sub(/\.[^.]+$/,"",base); seen[base][FNR]=$0 }
      END { for (bb in seen)
              if (length(seen[bb])>1)
                for (ff in seen[bb]) printf("%s\0",seen[bb][ff])
          }' |
tr '\000' '\n'
(gawk v4.0 minimum required for arrays of arrays!)
This uses an array (hash) seen[] to cache seen file names keyed by the base name; at the end it iterates over the base names in seen[bb] and prints out those with more than one match (length(seen[bb])>1).
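With the question's files, the output should be something like this (in arbitrary order, since for (bb in seen) makes no ordering promises):
0001.jpg
0001.png
0001.tiff
0002.jpg
0002.png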

If you aren't afraid to parse ls:
/bin/ls --color=no -1 | sed 's/\.[^.]*$//' | sort | uniq -d
(The extra sort is needed because stripping extensions can separate duplicates that were adjacent in the ls output.) That will fail if the file names contain newlines.
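A quick demonstration of that failure mode, with a hypothetical file name containing an embedded newline:
$ touch 0003.png $'0003\n.jpg'
$ /bin/ls --color=no -1 | sed 's/\.[^.]*$//' | sort | uniq -d
0003
Here 0003 is falsely reported as a duplicated base name, because ls prints the two halves of the odd name on separate lines.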

ls -1 | awk -F'.' '{print $1}' | uniq -cd
awk prints the first field ($1) of each file name, using . as the field separator. uniq -d prints only the duplicated lines, and with the -c option it also prints the number of occurrences.
$ ls -1
0001.jpg
0001.tar.gz
0001.tiff
0002.png
0002.tar.bz2
001.zip
$ ls -1 | awk -F'.' '{print $1}' | uniq -cd
      3 0001
      2 0002
