
I have trouble understanding Unix sort. Consider the following file (tab-separated):

aa  ~ a1
aa  B
b   A
b   ~ e
bb  B
bb  ~ B

When calling:

cat tmp2 | sort -t $'\t' -k1,2

I get

aa  ~ a1
aa  B
b   A
bb  B
bb  ~ B
b   ~ e

As far as I understand, -t $'\t' says to use a tab as the field separator instead of whitespace, and -k1,2 says to sort by the first column and, if two rows have the same first column, then by the second one. But in that case, shouldn't my last 'b' appear in the fourth row?

giulio
  • 163

1 Answer


No, -k1,2 says to sort on the portion of the line that starts at the beginning of the first field and ends at the end of the second field.

To sort on the first field and then on the second, it's:

sort -k1,1 -k2,2
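As a rough illustration, here is the question's sample data sorted with two separate keys. This is only a sketch: it forces LC_ALL=C so that the result does not depend on the installed locales, it builds the tab with printf rather than bash's $'\t' so it also runs under plain sh, and the variable names (tab, data, sorted) are made up for the example.

```shell
# Tab character, obtained portably (the question uses bash's $'\t').
tab=$(printf '\t')

# The six sample rows from the question, tab-separated, one per line.
data=$(printf '%s\t%s\n' 'aa' '~ a1' 'aa' 'B' 'b' 'A' 'b' '~ e' 'bb' 'B' 'bb' '~ B')

# Sort on field 1 alone, then on field 2 alone, comparing raw byte values.
sorted=$(printf '%s\n' "$data" | LC_ALL=C sort -t "$tab" -k1,1 -k2,2)
printf '%s\n' "$sorted"
```

With byte-wise comparison both 'b' rows come before the 'bb' rows, and within each group 'A' or 'B' comes before '~'. In another locale the order within a group may differ, because the second field is then compared with that locale's collation rules.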
  • (Edit: I could not put each string on its own line in a comment.) So why then are 'a~', 'a', 'a2', 'aa', 'aa3', 'aa~' sorted by 'cat tmp2 | sort -k1,1' as 'a', 'a~', 'a2', 'aa', 'aa~', 'aa3'? Since '~' comes after every other character in the ASCII table, shouldn't 'aa~' be sorted after 'aa3', for example?

    – giulio Feb 07 '15 at 21:00
  • @giulio. The ASCII table is only relevant for sorting in the C/POSIX locale. String comparison in other locales is a lot more complicated and tries to work the same way as in natural languages. For instance, spaces are ignored at first in most locales, which is why "b c" sorts after "bb". – Stéphane Chazelas Feb 07 '15 at 21:05
  • Are you sure there is a difference between sort -k1,2 and sort -k1,1 -k2,2? At least from your description, I don't understand the difference between the two commands. diff <(sort -t $'\t' -k1,2 <<<"$content") <(sort -t $'\t' -k1,2 -k2,2 <<<"$content") and diff <(LC_ALL=C sort -t $'\t' -k1,2 <<<"$content") <(LC_ALL=C sort -t $'\t' -k1,2 -k2,2 <<<"$content") produced no output. – Six Sep 16 '15 at 02:39
  • @Six, in the C locale, comparison is done byte by byte on the byte value, that is, strcoll() (the comparison function used by sort) works the same as strcmp() there (and there only). So for sort -k1,1 -k2,2 and sort -k1,2 to produce different results, you need an input that contains characters that sort before the separator. In the case of TAB (byte value 9), that's going to be byte values 0 to 8, which are unlikely to be found in text, but you can compare on the output of printf 'a\1\tb\na\tc\n'. – Stéphane Chazelas Sep 16 '15 at 08:53
  • @Six, see that other answer for more detailed explanations. – Stéphane Chazelas Sep 16 '15 at 08:57
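The printf input suggested in the comments can be tried with a sketch like the following. It assumes LC_ALL=C (byte-wise comparison) and gets the tab from printf so that it runs under plain sh; the variable names by_span and by_field are invented for the illustration.

```shell
tab=$(printf '\t')

# One key spanning fields 1-2: the keys are "a\001<TAB>b" and "a<TAB>c".
# Byte \001 is smaller than TAB (9), so the "a\001" line sorts first.
by_span=$(printf 'a\1\tb\na\tc\n' | LC_ALL=C sort -t "$tab" -k1,2)

# Two separate keys: field 1 is "a\001" versus "a".
# "a" is a prefix of "a\001", so the "a" line sorts first.
by_field=$(printf 'a\1\tb\na\tc\n' | LC_ALL=C sort -t "$tab" -k1,1 -k2,2)

if [ "$by_span" != "$by_field" ]; then
    echo "the two key specifications give different orders"
fi
```

The difference only appears because the first field contains a byte below the separator; on ordinary text in the C locale the two commands agree, which matches what the diff in the comments showed.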