Over in an answer to a different question, I wanted to use a structure much like this to find files that appear in list2 that do not appear in list1:

( cd dir1 && find . -type f -print0 ) | sort -z > list1
( cd dir2 && find . -type f -print0 ) | sort -z > list2
comm -13 list1 list2

However, I hit a brick wall because my version of comm cannot handle NUL-terminated records. (Some background: I'm passing a computed list to rm, so I particularly want to be able to handle file names that could contain an embedded newline.)

If you want an easy worked example, try this:

mkdir dir1 dir2
touch dir1/{a,b,c} dir2/{a,c,d}
( cd dir1 && find . -type f ) | sort > list1
( cd dir2 && find . -type f ) | sort > list2
comm -13 list1 list2

Without NUL-terminated records, the output here is the single element ./d, which appears only in list2.

I'd like to be able to use find ... -print0 | sort -z to generate the lists.

How can I best reimplement an equivalent to comm that outputs the NUL-terminated records that appear in list2 but do not appear in list1?

Chris Davies
  • Do you really have filenames with NEW-lines inside? – schily May 30 '18 at 16:06
  • 1
    @schily it's not difficult to create them, and seeing as I'm passing a computed list to rm I'd like to be sure and code defensively. I'll clarify in the question. – Chris Davies May 30 '18 at 16:07

1 Answer

GNU comm (as of GNU coreutils 8.25) now has a -z/--zero-terminated option for that.
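
With a new enough comm, the pipeline from the question should work almost unchanged. Here's a minimal sketch, assuming GNU find, sort and comm (coreutils 8.25 or later); the scratch directory and the names workdir, only2 and shown are made up for illustration:

```shell
# Assumes GNU find, sort and comm >= 8.25; all names here are illustrative.
workdir=$(mktemp -d)
mkdir "$workdir/dir1" "$workdir/dir2"
nl='
'
touch "$workdir/dir1/a" "$workdir/dir1/b" "$workdir/dir1/c"
touch "$workdir/dir2/a" "$workdir/dir2/c" "$workdir/dir2/d" "$workdir/dir2/new${nl}line"

# NUL-terminated, byte-sorted lists (LC_ALL=C so sort and comm agree on the order)
( cd "$workdir/dir1" && find . -type f -print0 ) | LC_ALL=C sort -z > "$workdir/list1"
( cd "$workdir/dir2" && find . -type f -print0 ) | LC_ALL=C sort -z > "$workdir/list2"

# Records that appear only in list2, still NUL-terminated
LC_ALL=C comm -z -13 "$workdir/list1" "$workdir/list2" > "$workdir/only2"

# For display only: NUL becomes newline, an embedded newline becomes "?"
shown=$(tr '\0\n' '\n?' < "$workdir/only2")
printf '%s\n' "$shown"
```

The NUL-terminated only2 could then be fed to something like xargs -0, run from inside dir2 since the names are relative to it.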

For older versions of GNU comm, you should be able to swap NUL and NL:

comm -13 <(cd dir1 && find . -type f -print0 | tr '\n\0' '\0\n' | sort) \
         <(cd dir2 && find . -type f -print0 | tr '\n\0' '\0\n' | sort) |
  tr '\n\0' '\0\n'

That way comm still works with newline-delimited records, but with actual newlines in the input encoded as NULs, so we're still safe with filenames containing newlines.
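
To see the swap at work on a file name that really does contain a newline, here's a small sketch. It assumes tr, sort and comm all cope with NUL bytes in their input (the GNU ones do); the helper name swap_nul_nl and the scratch directory are hypothetical, and temporary files stand in for the process substitutions above:

```shell
# Works without comm -z; assumes tr, sort and comm tolerate NUL bytes (GNU ones do).
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
nl='
'
touch "$tmp/dir1/a" "$tmp/dir1/b${nl}c"
touch "$tmp/dir2/a" "$tmp/dir2/b${nl}c" "$tmp/dir2/d${nl}e"

# Hypothetical helper: exchange NUL and newline in a stream
swap_nul_nl() { tr '\n\0' '\0\n'; }

( cd "$tmp/dir1" && find . -type f -print0 ) | swap_nul_nl | LC_ALL=C sort > "$tmp/list1"
( cd "$tmp/dir2" && find . -type f -print0 ) | swap_nul_nl | LC_ALL=C sort > "$tmp/list2"

# comm sees one record per line; swapping back restores NUL-terminated records
LC_ALL=C comm -13 "$tmp/list1" "$tmp/list2" | swap_nul_nl > "$tmp/only2"

# For display only: NUL becomes newline, an embedded newline becomes "?"
shown=$(tr '\0\n' '\n?' < "$tmp/only2")
printf '%s\n' "$shown"
```

Only the file whose name is d, newline, e should come out, as a single NUL-terminated record.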

You may also want to set the locale to C because, on GNU systems in most UTF-8 locales at least, some different strings sort the same, which would cause problems here¹.

Swapping NUL and newline is a very common trick (see "Invert matching lines, NUL-separated" for another example with comm), but it needs utilities that support NUL in their input, which is relatively rare outside of GNU systems.


¹ Example:

$ touch dir1/{①,②} dir2/{②,③}
$ comm -12 <(cd dir1 && find . -type f -print0 | tr '\n\0' '\0\n' | sort) \
           <(cd dir2 && find . -type f -print0 | tr '\n\0' '\0\n' | sort)  
./③
./②
$ (export LC_ALL=C
    comm -12 <(cd dir1 && find . -type f -print0 | tr '\n\0' '\0\n' | sort) \
             <(cd dir2 && find . -type f -print0 | tr '\n\0' '\0\n' | sort))
./②

(2019 edit: the relative order of ①②③ has been fixed in newer versions of the GNU libc, but newer versions (2.30 at least) still have the problem for something like 95% of Unicode code points, so other characters can be used instead to reproduce it.)

  • What problems would strings that sort the same cause? Examples? –  May 30 '18 at 17:04
  • 1
    @issac, see edit. There's also a related discussion around 2015-03-30 on the austin group mailing list (you can use the gmane NNTP interface) you may be interested in reading on the subject. – Stéphane Chazelas May 30 '18 at 18:31