If your file names don't contain newline characters, you can do:
(export LC_ALL=C; comm -23 <(ls -A dir1) <(ls -A dir2))
to find out which of the files in dir1 are not found in dir2.
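As a quick illustration, here is a sketch of that command run on two throw-away directories. The scratch directory and the sample file names (a, b, c, .hidden) are made up for the example, and it assumes a shell with process substitution such as bash.

```shell
# Build a scratch area with two sample directories (all names are hypothetical).
tmp=$(mktemp -d) || exit
mkdir -- "$tmp/dir1" "$tmp/dir2"
touch -- "$tmp/dir1/a" "$tmp/dir1/b" "$tmp/dir1/.hidden" "$tmp/dir2/b" "$tmp/dir2/c"

# comm -23 keeps the lines unique to the first listing,
# that is, the entries of dir1 that are missing from dir2.
only_in_dir1=$(
  cd -- "$tmp" || exit
  export LC_ALL=C
  comm -23 <(ls -A dir1) <(ls -A dir2)
)
printf '%s\n' "$only_in_dir1"   # prints .hidden, then a
rm -rf -- "$tmp"
```

Here b is in both directories and c is only in dir2, so neither is reported.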
For arbitrary file names, you can use the array subtraction feature of zsh:
dir1_files=(dir1/*(DN:t)) dir2_files=(dir2/*(DN:t))
dir1_and_not_dir2_files=(${dir1_files:|dir2_files})
(change * to **/* for a recursive list of files)
Or with bash 4.4+ and recent versions of GNU utilities:
readarray -td '' dir1_and_not_dir2_files < <(
  export LC_ALL=C
  shopt -s nullglob dotglob
  comm -z23 <(printf '%s\0' dir1/* | cut -zd/ -f2-) \
            <(printf '%s\0' dir2/* | cut -zd/ -f2-)
)
(use the globstar option and replace * with ** for a recursive list).
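The bash variant can also be wrapped in a small helper that takes the two directories as arguments. This is only a sketch: the function name only_in_first and the array name only_in_first_result are invented here, and it assumes bash 4.4+ and a GNU comm that supports -z. It strips the directory part with a parameter expansion rather than cut, so it also works when the directory paths contain slashes, and it prints nothing for an empty directory, since printf '%s\0' without arguments would still emit one empty record.

```shell
# Hypothetical helper (not part of the commands above): store in the array
# only_in_first_result the names found in directory $1 but not in directory $2.
# Assumes bash 4.4+ (readarray -d) and GNU comm with -z.
only_in_first() {
  readarray -td '' only_in_first_result < <(
    export LC_ALL=C
    shopt -s nullglob dotglob
    a=("$1"/*) b=("$2"/*)
    # ${a[@]##*/} keeps only the part after the last slash of each path.
    # The (( ... )) guard skips printf when the directory is empty.
    comm -z23 <( ((${#a[@]})) && printf '%s\0' "${a[@]##*/}" ) \
              <( ((${#b[@]})) && printf '%s\0' "${b[@]##*/}" )
  )
}

# Demonstration in a scratch directory (sample names are made up).
tmp=$(mktemp -d) || exit
mkdir -- "$tmp/dir1" "$tmp/dir2"
touch -- "$tmp/dir1/common" "$tmp/dir1/.hidden" \
         "$tmp/dir1/"$'with\nnewline' "$tmp/dir2/common"
only_in_first "$tmp/dir1" "$tmp/dir2"
printf '<%q>\n' "${only_in_first_result[@]}"
rm -rf -- "$tmp"
```

The result array should hold two elements here: .hidden and the name containing a newline, which stays a single element because the records are NUL-delimited.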
The LC_ALL=C is needed for at least two reasons:
File names can contain any sequence of bytes (except 0 or the value for / (0x2F on ASCII-based systems)), while comm is a text utility, so its behaviour is unspecified for sequences of bytes that don't form valid characters. In the C locale, all characters are single-byte and all bytes are valid characters (though possibly undefined), so any file name is valid text (also considering that the maximum file name length is generally significantly smaller than the maximum text line length).
More importantly, comm expects sorted input, but in some locales, some characters have an undefined sort order or sort the same as others, which would confuse comm. For instance, on a GNU system in an en_GB.UTF-8 locale:
$ ls dir1 dir2
dir1:
dir2:
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
[...]
$ comm -23 <(ls -A dir1) <(ls -A dir2)
$ (export LC_ALL=C; comm -23 <(ls -A dir1) <(ls -A dir2))
Those are mathematical letters whose order is not defined in GNU locales, so in the en_GB.UTF-8 locale, as far as comm is concerned (or the sorting done by ls, which may therefore list them in any order), those letters are all the same, so dir1 and dir2 appear to contain the same files.
While in the C locale, comm and ls see one character for each of the bytes of the encoding of those UTF-8 characters, and the sorting is based on byte value, so all the file names are different (except for the one named U+1D402, actually seen as four undefined characters with byte values 0xf0 0x9d 0x90 0x82, seen in both directories).
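The byte-wise behaviour of the C locale can be checked without relying on any particular UTF-8 locale being installed. In this sketch, the two names are arbitrary examples (the UTF-8 encodings of U+1D400 and U+1D401), not necessarily the ones used above, and a shell with process substitution such as bash is assumed.

```shell
# Two example names given as raw UTF-8 bytes (octal escapes), chosen for illustration.
a=$(printf '\360\235\220\200')   # bytes 0xf0 0x9d 0x90 0x80
b=$(printf '\360\235\220\201')   # bytes 0xf0 0x9d 0x90 0x81

# In the C locale, comparison is done byte by byte, so the two names differ
# and comm reports the first one as unique to the first input.
unique=$(LC_ALL=C comm -23 <(printf '%s\n' "$a") <(printf '%s\n' "$b"))
[ "$unique" = "$a" ] && echo "distinct in the C locale"
```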
Your approach, with basename $i fixed to basename -- "$i" and echo $f fixed to printf '%s\n' "$f", would work except for file names that end in newline characters, but there are subtle differences in semantics:
Globs expand to directory entries. The shell doesn't need search access to directories to expand globs in them.
"$1"/* will expand to all the directory entries in "$1" that don't start with . (the hidden ones).
While [ -e "dir2/$f" ] doesn't care at all about directory entries (it will succeed even if you don't have read access to dir2, as long as you have search access), it attempts a stat() system call on the file. You will see a difference if you don't have search permission on dir2, or, for instance, for symlinks whose target doesn't exist or is not reachable. For instance:
ln -s /var/spool/cron/crontab/root dir2/root-crontab
[ -e dir2/root-crontab ]
would return false if run as a mere user (as you don't have search access to /var/spool/cron/crontab), but may return true if run as root, if root has a crontab.
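The difference between what a glob sees and what stat() sees can be reproduced with a dangling symlink. This is a sketch in a scratch directory. The names dangling and no-such-target are made up for the demonstration.

```shell
tmp=$(mktemp -d) || exit
mkdir -- "$tmp/dir2"
# A symlink whose target does not exist (hypothetical names).
ln -s no-such-target "$tmp/dir2/dangling"

# The glob expands to the directory entry, whatever it points to.
for entry in "$tmp"/dir2/*; do
  printf 'glob matched: %s\n' "${entry##*/}"
done

# [ -e ] follows the symlink, and stat() on the missing target fails.
if [ -e "$tmp/dir2/dangling" ]; then e_result=exists; else e_result=missing; fi
# [ -L ] looks at the link itself, so it succeeds.
if [ -L "$tmp/dir2/dangling" ]; then l_result=symlink; else l_result=not-symlink; fi
echo "-e: $e_result, -L: $l_result"
rm -rf -- "$tmp"
```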
See also:
[ -e "dir2/$f" ] || [ -L "dir2/$f" ]
to test whether dir2/$f exists or is a (broken/currently unresolvable by the current user at that path) symlink.
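Putting those fixes together, the loop could look something like the sketch below. I'm reconstructing its shape from the fragments discussed here, so the function name missing_in_second and its overall structure are assumptions rather than your exact code. It still skips hidden files and still mishandles names that end in newline characters, since command substitution strips trailing newlines.

```shell
# Hypothetical reconstruction: print the names of the non-hidden entries of
# directory $1 that have no counterpart in directory $2.
missing_in_second() {
  for i in "$1"/*; do
    # An unmatched glob is left as a literal pattern; skip it.
    [ -e "$i" ] || [ -L "$i" ] || continue
    f=$(basename -- "$i")
    # Count dangling or unreachable symlinks in $2 as present.
    [ -e "$2/$f" ] || [ -L "$2/$f" ] || printf '%s\n' "$f"
  done
}
```

It would be called as missing_in_second dir1 dir2.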