2

I would like to write a bash script to find the files that are in one directory but not in another.

Does the following script work? When does it not?

for i in "$1"/*; do
    f=$(basename $i);
    if [ ! -e "$2"/"$f" ]
    then
        echo $f
    fi
done

I heard diff can find differences between the contents of two directories too. Can it also solve the same task?

Or some other tool?

Thanks.

Jeff Schaller
Tim
  • More about diff for symmetric task: https://serverfault.com/questions/177001/find-files-in-one-directory-not-in-another – Tim Nov 05 '18 at 21:46
  • Also: https://unix.stackexchange.com/questions/9042/compare-files-that-are-in-directory-1-but-not-directory-2 – muru Nov 05 '18 at 23:05

2 Answers

9

Yes, you can use diff for that purpose. It's as simple as:

diff -rq dir1 dir2

The -r option tells diff to recurse into sub-directories as well. The -q option tells diff to report only whether files differ, not the details of the differences.

These are the two options I typically use when I want to find out which files are in dir1, but not in dir2, and vice versa. (You can also drop the -r parameter if you do not want to recurse into sub-directories, but only consider the immediate contents of the two directories.)

Note that this will show you both files that exist in dir1 but not in dir2, as well as files that exist in dir2, but not in dir1, e.g.:

$ diff -rq /tmp/dir1/ /tmp/dir2/
Only in /tmp/dir1/: file1
Only in /tmp/dir2/: file2
Only in /tmp/dir2/: file3

If you need only one of the directions (e.g., files that are in dir1 but not in dir2) and want a list of the filenames only (without the "Only in ..." clutter), you could certainly attempt to massage diff's output using grep, sed, awk and the like, but in that case you're better off not using diff in the first place and using Stéphane Chazelas's solution instead.

  • Thanks. It is more convenient in a single direction. But it is nice to know what diff can do at the best. – Tim Nov 05 '18 at 17:53
  • I assumed so, I only wanted to add an answer to the specific question: "Can diff be used for this task?". :-) Do note, however, that Stéphane's solution will not work recursively. So if you want to compare directories recursively, you'd need a more involved solution. – Malte Skoruppa Nov 05 '18 at 17:55
  • Note that -q is not a standard option. – Stéphane Chazelas Nov 05 '18 at 20:56
  • Note that some diff implementations will follow symlinks to directories when recursing (note that the recursion was not asked for). – Stéphane Chazelas Nov 05 '18 at 20:57
  • Note that it will also look for (and report) differences in the content of the files, so it does a lot more than comparing file names. If you run it after truncate -s1T dir{1,2}/file, it will take a while to complete. – Stéphane Chazelas Nov 05 '18 at 20:59
  • @StéphaneChazelas can diff be used on two directories? I am only interested in the contents of two directories, not the contents of the files or directories (no recursion) under the directories. – Tim Nov 05 '18 at 21:39
  • @Tim, not portably, nor reliably, I wouldn't go there, it's not been designed for that. comm was designed for that. – Stéphane Chazelas Nov 05 '18 at 21:44
5

If your file names don't contain newline characters, you can do:

(export LC_ALL=C; comm -23 <(ls -A dir1) <(ls -A dir2))

to find out which of the files in dir1 are not found in dir2.
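As a rough illustration of what that command reports, here is a small sketch that builds two throwaway directories with mktemp. The directory layout and the file names (a, b, c, .hidden) are made up for the demo and are not from the question.

```shell
#!/usr/bin/env bash
# Hypothetical demo directories; all names below are illustrative only.
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/a" "$tmp/dir1/b" "$tmp/dir1/.hidden" "$tmp/dir2/b" "$tmp/dir2/c"

# comm prints three columns: only in input 1, only in input 2, in both.
# -23 suppresses columns 2 and 3, leaving the names found only in dir1.
only_in_dir1=$(export LC_ALL=C; comm -23 <(ls -A "$tmp/dir1") <(ls -A "$tmp/dir2"))
printf '%s\n' "$only_in_dir1"

rm -rf "$tmp"
```

Because ls -A is used, the hidden file is part of the comparison, so this prints .hidden and a.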

For arbitrary file names, you can use the array subtraction feature of zsh:

dir1_files=(dir1/*(DN:t)) dir2_files=(dir2/*(DN:t))
dir1_and_not_dir2_files=(${dir1_files:|dir2_files})

(change * to **/* for recursive list of files)

Or with bash 4.4+ and recent versions of GNU utilities:

readarray -td '' dir1_and_not_dir2_files < <(
  export LC_ALL=C
  shopt -s nullglob  dotglob
  comm -z23 <(printf '%s\0' dir1/* | cut -zd/ -f2-) \
            <(printf '%s\0' dir2/* | cut -zd/ -f2-)
)

(use the globstar option and replace * with ** for a recursive list).
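To see how the resulting array can be consumed, here is a hedged sketch that runs the same pipeline against a scratch directory containing a file name with an embedded newline. It assumes bash 4.4+ and GNU comm and cut with -z support; the file names (plain, shared, and the newline one) are invented for the demo.

```shell
#!/usr/bin/env bash
# Assumes bash 4.4+ (readarray -d) and GNU comm/cut with -z support.
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
# Illustrative names only; one of them contains a newline.
touch "$tmp/dir1/plain" "$tmp/dir1/"$'with\nnewline' "$tmp/dir1/shared" "$tmp/dir2/shared"
cd "$tmp" || exit 1

readarray -td '' dir1_and_not_dir2_files < <(
  export LC_ALL=C
  shopt -s nullglob dotglob
  comm -z23 <(printf '%s\0' dir1/* | cut -zd/ -f2-) \
            <(printf '%s\0' dir2/* | cut -zd/ -f2-)
)

echo "count: ${#dir1_and_not_dir2_files[@]}"
# %q quotes each name so that the embedded newline stays visible.
for f in "${dir1_and_not_dir2_files[@]}"; do
  printf '<%q>\n' "$f"
done

cd / && rm -rf "$tmp"
```

Each array element is one complete file name, so the name with the newline is kept as a single entry rather than being split in two.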

The LC_ALL=C is needed for at least two reasons:

  • filenames can contain any sequence of bytes (except 0 or the value for / (0x2F on ASCII-based systems)), while comm is a text utility, so its behaviour is unspecified for sequences of bytes that don't form valid characters. In the C locale, all characters are single-byte and all bytes are valid characters (though possibly undefined), so any file name is valid text (also considering that the maximum file name length is generally significantly smaller than the maximum text line length).

  • more importantly, comm expects sorted input, but in some locales, some characters have an undefined sort order or sort the same as others, which would confuse comm. For instance, on a GNU system in an en_GB.UTF-8 locale:

    $ ls dir1 dir2
    dir1:

    dir2:

    $ locale
    LANG=en_GB.UTF-8
    LC_CTYPE="en_GB.UTF-8"
    LC_COLLATE="en_GB.UTF-8"
    [...]
    $ comm -23 <(ls -A dir1) <(ls -A dir2)
    $ (export LC_ALL=C; comm -23 <(ls -A dir1) <(ls -A dir2))

Those are mathematical letters whose order is not defined in GNU locales, so in the en_GB.UTF-8 locale, as far as comm is concerned (or the sorting done by ls, which could just as well have listed them in a different order), those letters are the same, so dir1 and dir2 appear to contain the same files.

In the C locale, by contrast, comm and ls see one character for each byte of the UTF-8 encoding of those characters, and the sorting is based on byte value, so all file names are different (except for 𝐂 (U+1D402), actually seen as four undefined characters with byte values 0xf0 0x9d 0x90 0x82, seen in both directories).
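The byte-level view can be checked with a short sketch. The byte sequence 0xf0 0x9d 0x90 0x82 comes from the text above; the two neighbouring names ending in 0x80 and 0x81 are assumptions added only to have something to compare against.

```shell
#!/usr/bin/env bash
# The bytes 0xf0 0x9d 0x90 0x82 written as octal escapes (UTF-8 for U+1D402).
name=$'\360\235\220\202'
bytes=$(printf '%s' "$name" | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')
echo "bytes: $bytes"

# Three illustrative names that differ only in their last byte (0x80, 0x81, 0x82).
# In the C locale sort compares byte values, so -u keeps all three.
distinct=$(printf '%s\n' $'\360\235\220\200' $'\360\235\220\201' "$name" |
  LC_ALL=C sort -u | wc -l | tr -d ' ')
echo "distinct names in C locale: $distinct"
```

What the same sort -u does in a UTF-8 locale depends on the collation data shipped with the system, so the sketch only asserts the C-locale behaviour.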


Your approach, with basename $i fixed to basename -- "$i" and echo $f fixed to printf '%s\n' "$f", would work except for file names that end in newline characters, but there are subtle differences in semantics:

Globs expand to directory entries. The shell doesn't need search access to directories to expand globs in them.

"$1"/* will expand to all the directory entries in "$1" that don't start with . (the hidden ones).

By contrast, [ -e "dir2/$f" ] doesn't care at all about directory entries (it will succeed even if you don't have read access to dir2, as long as you have search access); it attempts a stat() system call on the file. You will see a difference if you don't have search permissions on dir2 or, for instance, for symlink files whose target doesn't exist or is not reachable. For instance:

 ln -s /var/spool/cron/crontab/root dir2/root-crontab
 [ -e dir2/root-crontab ]

Would return false if run as a mere user (as you don't have read access to /var/spool/cron/crontab), but may return true if run as root if root has a crontab.
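The same mismatch between directory entries and [ -e ] can be reproduced without root by using a dangling symlink. This is a minimal sketch; the paths under the mktemp directory and the name no-such-target are made up.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
mkdir "$tmp/dir2"
# Hypothetical target that does not exist, so the symlink is dangling.
ln -s "$tmp/no-such-target" "$tmp/dir2/dangling"

# The glob only reads directory entries, so it lists the symlink.
entries=("$tmp"/dir2/*)
entry_name=${entries[0]##*/}

# [ -e ] follows the symlink with stat(), which fails on the missing target.
if [ -e "$tmp/dir2/dangling" ]; then e_result=true; else e_result=false; fi

echo "glob sees: $entry_name, [ -e ] says: $e_result"
rm -rf "$tmp"
```

So a name can show up in a glob expansion of dir2 and still be reported as missing by [ -e ].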

See also:

[ -e "dir2/$f" ] || [ -L "dir2/$f" ]

to test whether dir2/$f exists or is a (broken/currently unresolvable by the current user at that path) symlink.
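Putting the fixes above together, the loop from the question might look like the following sketch. The function name only_in_first and the demo files are hypothetical; like the original loop, it skips hidden files and does not handle names ending in newline characters.

```shell
#!/usr/bin/env bash
# only_in_first is a hypothetical helper name, not part of the original script.
only_in_first() {
  local i f
  for i in "$1"/*; do
    # Skip the literal pattern when the glob matched nothing.
    [ -e "$i" ] || [ -L "$i" ] || continue
    f=$(basename -- "$i")
    # Treat a broken symlink in the second directory as an existing entry.
    if ! [ -e "$2/$f" ] && ! [ -L "$2/$f" ]; then
      printf '%s\n' "$f"
    fi
  done
}

# Demo with made-up names: "link" exists in dir2 only as a dangling symlink.
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/a" "$tmp/dir1/b" "$tmp/dir1/link" "$tmp/dir2/b"
ln -s "$tmp/no-such-target" "$tmp/dir2/link"

result=$(only_in_first "$tmp/dir1" "$tmp/dir2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With the extra -L test, link is not reported as missing from dir2 even though its target cannot be resolved, so only a is printed.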