20

I've got two files _jeter3.txt and _jeter1.txt

I've checked they are both sorted on the 20th column using sort -c

sort -t '     ' -c -k20,20 _jeter3.txt
sort -t '     ' -c -k20,20 _jeter1.txt
#no errors

but there is an error when I want to join both files it says that the second file is not sorted:

join -t '   ' -1 20 -2 20 _jeter1.txt _jeter3.txt > /dev/null
join: File 2 is not in sorted order

I don't understand why.

cat /etc/*-release #FYI
openSUSE 11.0 (i586)
VERSION = 11.0

UPDATE: using 'sort -f' and join -i (both case insensitive) fixes the problem. But it doesn't explain my initial problem.

UPDATE: versions of sort & join:

> join --version
join (GNU coreutils) 6.11
Copyright (C) 2008 Free Software Foundation, Inc.
(...)

> sort --version
sort (GNU coreutils) 6.11
Copyright (C) 2008 Free Software Foundation, Inc.
(...)
Pierre
  • 1,793
  • Can you give us the output of "join --version" and "sort --version" just for completeness sake? I can't get some older versions of gnu join to give me that error message under any circumstance. –  May 10 '11 at 18:44
  • 3
    Please post some sample data that exhibits the problem, and the output of locale. – Gilles 'SO- stop being evil' May 10 '11 at 22:37

7 Answers

33

I got the same error with Ubuntu 11.04, with sort and join both in version (GNU coreutils) 8.5.

They are clearly incompatible. In fact the sort command seems bugged: the -f (--ignore-case) option makes no difference. When sorting, aaB always comes before aBa. Non-alphanumeric characters also seem to be ignored every time (abc sorts before ab-x).

Join seems to expect the opposite... But I have a solution

In fact, this is linked to the collation sequence: running LANG=en_EN sort -k 1,1 <myfile> ... and then LANG=en_EN join ..., so that both commands use the same locale, eliminates the message.

Internationalisation is the root of evil... (nobody documents it clearly).
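A minimal sketch of that fix, assuming GNU-style sort and join and made-up sample data. Both tools run under the same locale (LC_ALL=C here, chosen because it is always available and overrides LANG and LC_COLLATE), so they agree on the collation order. The file names and contents are hypothetical.

```shell
# Hypothetical sample data: two tab-separated files keyed on column 1.
tab=$(printf '\t')
tmp=$(mktemp -d)
printf 'aBa\t1\naaB\t2\nab-x\t3\nabc\t4\n' > "$tmp/left.tsv"
printf 'abc\tw\naaB\tx\nab-x\ty\naBa\tz\n' > "$tmp/right.tsv"

# Sort both files on the join field under the same collation (plain byte order).
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/left.tsv"  > "$tmp/left.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/right.tsv" > "$tmp/right.sorted"

# Join under that same collation, so join's order check matches what sort produced.
joined=$(LC_ALL=C join -t "$tab" -1 1 -2 1 "$tmp/left.sorted" "$tmp/right.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

In byte order, uppercase letters and punctuation sort before lowercase letters, so aBa comes before aaB and ab-x before abc. What matters is that sort and join see the same rule, not which rule it is.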

Michael
  • 346
  • 3
  • 4
  • 2
    So, if both use LANG=en_EN, then it will definitely work? Will it work for any locale, as long as both use the same locale? Can we say that the difference between sort and join is that they use a different locale by default? – Aaron McDaid Aug 25 '14 at 15:19
  • 1
    Is the -k option the answer here, or is it the LANG=en_EN? It's unclear of what the exact solution is here. – User Jul 20 '16 at 23:12
  • 1
    LANG=en_EN is the solution – Kirill Mikhailov Nov 01 '21 at 11:09
8

sort by default uses the entire line as the key

join uses only the specified field as the key.

You must correct this incompatibility by restricting sort to use only the key you want to join on.

The Join man page states:

Important: FILE1 and FILE2 must be sorted on the join fields. E.g., use 'sort -k 1b,1' if 'join' has no options. Note, comparisons honor the rules specified by 'LC_COLLATE'. If the input is not sorted and some lines cannot be joined, a warning message will be given.

jasonwryan
  • 73,126
  • I think this is incorrect. Sorting by a full line doesn't produce the partial line in the wrong order such that join should complain about it. – Sam Liddicott Feb 03 '21 at 09:33
6

If you are sure you properly sorted your input files and their lines can be paired, you can avoid the above error by running join --nocheck-order file1.txt file2.txt
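As a hedged sketch (assuming GNU join, where --nocheck-order is available, and hypothetical file contents), one way to be "sure" is to run sort -c with the same key, separator, and locale first, and only skip join's check if that passes. The option suppresses the check; it does not repair a wrong order.

```shell
# Hypothetical tab-separated inputs, already in byte order on column 1.
tab=$(printf '\t')
tmp=$(mktemp -d)
printf 'a\t1\nb\t2\n' > "$tmp/file1.txt"
printf 'a\tx\nb\ty\n' > "$tmp/file2.txt"

# Confirm the order ourselves before telling join not to check it.
if LC_ALL=C sort -c -t "$tab" -k1,1 "$tmp/file1.txt" &&
   LC_ALL=C sort -c -t "$tab" -k1,1 "$tmp/file2.txt"; then
    unchecked=$(LC_ALL=C join --nocheck-order -t "$tab" "$tmp/file1.txt" "$tmp/file2.txt")
    printf '%s\n' "$unchecked"
fi

rm -rf "$tmp"
```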

6

If you see this error even though you have already sorted on a specific column (e.g. with sort -k4,4) and are beating your head against the wall, you may also need to set the field separator for the sort command.

Apparently OP already did this with -t ' ' but for a normal tab separated text I'd recommend

sort -t $'\t' ...

By default, sort splits fields on blanks, which means spaces as well as tabs. So even on something that looks like a tab-separated file, the field it sorts on may not be the column you expect (especially if there are spaces inside the columns).

Then if you passed that sorted data to join, and you have

join -t $'\t' ...

then join sees different fields than sort did, and this ends up causing the error message about the file being unsorted. As noted above, join may not accept -t ' ' though.
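A small sketch of that mismatch, using made-up data in which the first column contains a space. With the default separators, sort's "field 2" begins at the first blank, so it is not the second tab-separated column that join -t would use.

```shell
tab=$(printf '\t')
tmp=$(mktemp -d)
# Hypothetical data: column 1 contains a space, column 2 is the intended key.
printf 'id 2\tapple\nid 1\tzebra\n' > "$tmp/data.tsv"

# Default separators: the key for -k2,2 is " 2" and " 1", not the tab column.
by_blank=$(LC_ALL=C sort -k2,2 "$tmp/data.tsv")
# Explicit tab separator: the key for -k2,2 is "apple" and "zebra".
by_tab=$(LC_ALL=C sort -t "$tab" -k2,2 "$tmp/data.tsv")

printf 'default separators:\n%s\n' "$by_blank"
printf 'tab separator:\n%s\n' "$by_tab"

rm -rf "$tmp"
```

Only the second result is ordered on the column that a tab-delimited join would compare.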

Colin D
  • 430
5

Were you sorting with numbers? I found that zero-padding the column that I was joining on solved this issue for me.

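# Assumes tab-separated input; OFS keeps the tabs when awk rebuilds the line.
# %06d pads the numeric key with zeros so that text order matches numeric order.
awk -F'\t' -v OFS='\t' '{ $20 = sprintf("%06d", $20); print }' file.txt \
     | sort -t "$(printf '\t')" -k20,20 > readytojoin.txt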
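To illustrate why padding helps: join compares keys as text, not as numbers, so 10 sorts before 9 unless the keys have equal width. The sketch below uses hypothetical two-column files keyed on column 1 (rather than column 20) and a helper function named pad that is made up for this example.

```shell
tab=$(printf '\t')
tmp=$(mktemp -d)
# Hypothetical inputs with numeric keys of different widths.
printf '9\tnine\n10\tten\n100\thundred\n' > "$tmp/a.tsv"
printf '100\tC\n9\tA\n10\tB\n' > "$tmp/b.tsv"

# pad FILE: zero-pad column 1 to six digits, then sort on it as text.
pad() {
    awk -F'\t' -v OFS='\t' '{ $1 = sprintf("%06d", $1); print }' "$1" |
        LC_ALL=C sort -t "$tab" -k1,1
}

pad "$tmp/a.tsv" > "$tmp/a.pad"
pad "$tmp/b.tsv" > "$tmp/b.pad"

# With equal-width keys, text order and numeric order agree.
padded=$(LC_ALL=C join -t "$tab" "$tmp/a.pad" "$tmp/b.pad")
printf '%s\n' "$padded"

rm -rf "$tmp"
```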
Mat
  • 52,586
Conor
  • 151
5
LC_ALL=C sort ...
LC_ALL=C join ...

This will solve your problem. The issue, as pointed out by @Michael, is the collation sequence, which depends on your locale settings (LC_ALL, LC_COLLATE, LANG).

0

For join, the argument after -t is a single character. For sort you can supply a longer separator. I think that you may be joining the files on a different field than you want to, and ignoring the case solves the problem by coincidence.

And I agree with Gilles that sample data would be helpful.