8

I was trying to find the intersection of two plain-text data files, and found in a previous post that it can be done with

comm -12 <(sort test1.list) <(sort test2.list)

It seems to me that sort test1.list simply puts the lines of test1.list in order. To understand how sort works, I ran sort test1.list > test2.list on the following test1.list:

100
-200
300
2
92
15
340

However, it turns out that test2.list is

100
15
2
-200
300
340
92

This reordered list leaves me quite confused about how sort works, and about how sort and comm work together.

Kevin

3 Answers

16

Per the comm manual, "Before `comm' can be used, the input files must be sorted using the collating sequence specified by the `LC_COLLATE' locale."

And the sort manual: "Unless otherwise specified, all comparisons use the character collating sequence specified by the `LC_COLLATE' locale."

Therefore, as a quick test confirms, the LC_COLLATE order that comm expects is exactly what sort produces by default, which in many locales resembles dictionary order.
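As a rough sketch of how the two commands fit together, the snippet below sorts two small files under one pinned locale and then intersects them. The contents of test2.list and the scratch file names are made up for illustration, and temporary files stand in for the <( ) process substitution so that it also runs in a plain POSIX shell.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
printf '%s\n' 100 -200 300 2 92 15 340 > "$tmp/test1.list"
printf '%s\n' 15 7 300 -200 41 > "$tmp/test2.list"   # hypothetical second file

# Use the same locale for sort and comm so both agree on the collating sequence.
LC_ALL=C sort "$tmp/test1.list" > "$tmp/a.sorted"
LC_ALL=C sort "$tmp/test2.list" > "$tmp/b.sorted"

# -12 suppresses columns 1 and 2, leaving only the lines common to both files.
common=$(LC_ALL=C comm -12 "$tmp/a.sorted" "$tmp/b.sorted")
printf '%s\n' "$common"

rm -r "$tmp"
```

With these inputs the common lines are -200, 15 and 300, printed in the C locale's byte order.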

sort can sort files in a variety of manners:

  • -d: Dictionary order - ignores anything but whitespace and alphanumerics.
  • -g: General numeric - alpha, then negative numbers, then positive.
  • -h: Human-readable - negative, alpha, positive. n < nk = nK < nM < nG
  • -n: Numeric - negative, alpha, positive. k,M,G, etc. are not special.
  • -V: Version - positive, caps, lower, negative. 1 < 1.2 < 1.10
  • -f: Case-insensitive.
  • -R: Random - shuffle the input.
  • -r: Reverse - usually used with one of dghnV

There are other options, of course, but these are the ones you're likely to see or need.
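To make the difference between two of these options concrete, here is a small sketch that runs -n and -d on the numbers from the question. The locale is pinned to C only to keep the result reproducible; the variable names are illustrative.

```shell
# The seven lines from the question's test1.list.
tmp=$(mktemp -d)
printf '%s\n' 100 -200 300 2 92 15 340 > "$tmp/test1.list"

# -n compares by numeric value, so the negative number comes first.
numeric=$(LC_ALL=C sort -n "$tmp/test1.list")

# -d compares character by character, looking only at blanks and alphanumerics,
# so the leading "-" of -200 is skipped and the line is ordered as if it were 200.
dictionary=$(LC_ALL=C sort -d "$tmp/test1.list")

printf 'numeric:\n%s\n\ndictionary:\n%s\n' "$numeric" "$dictionary"
rm -r "$tmp"
```

The -n run yields -200, 2, 15, 92, 100, 300, 340, while the -d run yields 100, 15, 2, -200, 300, 340, 92, which is the same order the question observed in test2.list.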

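Your test suggests that the default sort order behaves like -d, dictionary order: the minus sign in -200 is ignored and the lines are compared character by character, so -200 lands between 2 and 300.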

  d   |   g   |   h   |   n   |   V 
------+-------+-------+-------+-------
  1   |  a    | -1G   | -10   |  1
 -1   |  A    | -1k   | -5    |  1G
  10  |  z    | -10   | -1    |  1g
 -10  |  Z    | -5    | -1g   |  1k
  1.10| -10   | -1    | -1G   |  1.2
  1.2 | -5    | -1g   | -1k   |  1.10
  1g  | -1    |  a    |  a    |  5
  1G  | -1g   |  A    |  A    |  10
 -1g  | -1G   |  z    |  z    |  A
 -1G  | -1k   |  Z    |  Z    |  Z
  1k  |  1    |  1    |  1    |  a
 -1k  |  1g   |  1g   |  1g   |  z
  5   |  1G   |  1.10 |  1G   | -1
 -5   |  1k   |  1.2  |  1k   | -1G
  a   |  1.10 |  5    |  1.10 | -1g
  A   |  1.2  |  10   |  1.2  | -1k
  z   |  5    |  1k   |  5    | -5
  Z   |  10   |  1G   |  10   | -10
Kevin
1

Just to flesh out Kevin's comprehensive and illustrative answer: if you run comm with the case-insensitive flag, comm -i (available in BSD comm, though not in GNU coreutils), you must also sort case-insensitively, e.g., with sort -f.

Full example:

comm -i <(sort -f test1.list) <(sort -f test2.list)

Otherwise, the default sort (no flags) works.
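Where comm has no case-insensitive flag, one possible workaround (a different technique from comm -i, shown only as a sketch) is to fold both inputs to lowercase with tr before sorting, so that the comparison is case-insensitive by construction. The sample words and file names below are invented for the example.

```shell
tmp=$(mktemp -d)
printf '%s\n' apple Banana cherry > "$tmp/test1.list"   # hypothetical data
printf '%s\n' APPLE banana date > "$tmp/test2.list"     # hypothetical data

# Fold to lowercase first, then sort, so both inputs share one ordering.
tr '[:upper:]' '[:lower:]' < "$tmp/test1.list" | LC_ALL=C sort > "$tmp/a.folded"
tr '[:upper:]' '[:lower:]' < "$tmp/test2.list" | LC_ALL=C sort > "$tmp/b.folded"

# The lines common to both files, ignoring the original case.
folded_common=$(LC_ALL=C comm -12 "$tmp/a.folded" "$tmp/b.folded")
printf '%s\n' "$folded_common"

rm -r "$tmp"
```

The trade-off is that the output shows the lowercased lines (apple and banana here), not the original spelling from either file.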

chbrown
0

You can sort the files with sort -n filename and then use the comm command.

Kevin
  • I tried comm -12 <(sort test1.list) <(sort test2.list) and comm -12 <(sort -n test1.list) <(sort test2.list), and the results look the same. Does that mean that the -n and -d options of sort do not affect the result of comm? – user288609 Jan 22 '12 at 12:56
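One way to probe this question is to ask sort -c whether a numerically sorted file still counts as sorted under the collating sequence that comm relies on. The sketch below does that for the question's data, with the locale pinned to C for reproducibility; the file and variable names are illustrative.

```shell
tmp=$(mktemp -d)
printf '%s\n' 100 -200 300 2 92 15 340 > "$tmp/test1.list"

LC_ALL=C sort -n "$tmp/test1.list" > "$tmp/numeric.sorted"
LC_ALL=C sort "$tmp/test1.list" > "$tmp/default.sorted"

# sort -c only checks the order: it exits 0 if the file is already sorted
# according to the collating sequence, and non-zero otherwise.
if LC_ALL=C sort -c "$tmp/numeric.sorted" 2>/dev/null; then numeric_ok=yes; else numeric_ok=no; fi
if LC_ALL=C sort -c "$tmp/default.sorted" 2>/dev/null; then default_ok=yes; else default_ok=no; fi

echo "numeric order accepted: $numeric_ok"
echo "default order accepted: $default_ok"
rm -r "$tmp"
```

Here the numerically sorted file is rejected (2 precedes 15 numerically but follows it character by character), so identical comm results on one pair of inputs do not show that the sort options are interchangeable in general.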