
I would like to merge a variable from one file into another on Linux. The first column of each file contains the name I want to join the files on.

I have sorted both files with the -f and -k options: sort -f -k 1,1 SCZ.N.tmp > SCZ.N.tmp.sorted and sort -f -k 1,1 1kg.tmp > 1kG.ref_file.sorted

However, when I join both files with this command: join -1 1 -2 1 SCZ.N.tmp.sorted 1kG.ref_file.sorted> SCZ.freq.joined

I keep getting the error 'join: SCZ.N.tmp.sorted:112855: is not sorted: chr1_100002155_D D I6 0.995112 0.0184 0.7897 87016'. Nevertheless, the join continues and most lines are merged. However, I am not sure whether I am losing a small proportion of records because of genuine mismatches between the files, or because something is going wrong with the sorting.

Does anybody know what I am doing wrong, and what I can do to avoid this error? Thank you!

I have also tried: LANG=en_EN sort -f -k 1,1 SCZ.N.tmp > SCZ.N.tmp.sorted2 and LANG=en_EN sort -f -k 1,1 1kg.tmp > 1kg.tmp.sorted2, and then joining with: LANG=en_EN join -1 1 -2 1 SCZ.N.tmp.sorted2 1kg.tmp.sorted2 > SCZ.freq.joined. But that did not solve it.

Jeff Schaller
LauraW
  • I would try LC_COLLATE=en_EN. I suspect LANG only affects presentation, not sequencing. Failing that, try LC_ALL=C which is the ultimate sanction. – Paul_Pedant Aug 07 '20 at 14:24
  • Thank you. I tried both LC_COLLATE=en_EN and LC_ALL=C, which made the error change to join: SCZ.N.tmp.sorted2:317251: is not sorted: MERGED_DEL_2_4660 D I5 0.98738 0.0113 0.2611 87016, but it comes down to the same final joined sample size... – LauraW Aug 07 '20 at 14:38
  • It complained about the second file now. Did you change all three instances of the LANG to LC_ALL=C? This assignment only applies to the command that follows directly on the same line with no intervening semicolon. – Paul_Pedant Aug 07 '20 at 14:55
  • Yes I did apply it to all 3. I just tried LOCALE=C and that gave the earlier error again with join: SCZ.N.tmp.sorted2:112855: is not sorted: – LauraW Aug 07 '20 at 15:00
  • Sorry, LC_ALL has always worked for me. I suspect the things in the messages like 112855 and 317251 are line numbers in the sorted files, so you could look at sections of the files to see if there are crazy characters. LOCALE is not meaningful: run the locale command to see valid names and current settings. Run LC_ALL=C locale to see what it changes. – Paul_Pedant Aug 07 '20 at 15:40
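The suggestion in the last comment (look at the lines around the reported line number) can be sketched roughly as follows. This is only an illustration under stated assumptions: the file and its contents are hypothetical stand-ins, and the line numbers 2 and 3 play the role of the number reported by join. sed's l command prints a line in an unambiguous form, so stray control characters become visible, and sort -c reports whether a file is already in order for a given key and locale.

```shell
# Hypothetical three-line file, created only for this illustration.
f=$(mktemp)
printf 'chr1_1 a\nchr1_3 b\nchr1_2 c\n' > "$f"

# Print lines 2-3 in unambiguous form: non-printing characters are shown
# as escapes and each line end is marked with "$".
sed -n '2,3l' "$f"

# Check (without sorting) whether the file is ordered on field 1 in the
# C locale, i.e. the same key and locale that join would be run with.
if LC_ALL=C sort -c -k 1,1 "$f" 2>/dev/null; then
    status=sorted
else
    status=unsorted
fi
echo "$status"
```

On the real data, the range given to sed would be a few lines either side of the number in the error message (for example around 112855).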

1 Answer


You are sorting the files with the -f option, which folds case so that the keys are ordered case-insensitively.

However, join by default expects its inputs to be in ordinary, case-sensitive sorted order on the join field.

You should add the -i option to the command-line for join, to have it ignore case differences.

Alternatively, omit the -f option from both sorts.
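As a minimal sketch of the two options above, assuming GNU coreutils (where join accepts -i): the file names and contents below are hypothetical and chosen so that case-folded order and byte order differ. Either pairing keeps sort and join in agreement about the ordering; mixing them (sort -f followed by a plain join) is what can trigger the "is not sorted" message.

```shell
# Hypothetical sample data, created only for this illustration.
tmpdir=$(mktemp -d)
printf 'rs1 A\nRs2 B\nrs3 C\n' > "$tmpdir/a.txt"
printf 'rs3 x\nRs2 y\nrs1 z\n' > "$tmpdir/b.txt"

# Option 1: case-insensitive sort, paired with case-insensitive join (-i).
LC_ALL=C sort -f -k 1,1 "$tmpdir/a.txt" > "$tmpdir/a.f.sorted"
LC_ALL=C sort -f -k 1,1 "$tmpdir/b.txt" > "$tmpdir/b.f.sorted"
LC_ALL=C join -i -1 1 -2 1 "$tmpdir/a.f.sorted" "$tmpdir/b.f.sorted" \
    > "$tmpdir/joined.i"

# Option 2: plain byte-order sort (no -f), paired with plain join.
LC_ALL=C sort -k 1,1 "$tmpdir/a.txt" > "$tmpdir/a.sorted"
LC_ALL=C sort -k 1,1 "$tmpdir/b.txt" > "$tmpdir/b.sorted"
LC_ALL=C join -1 1 -2 1 "$tmpdir/a.sorted" "$tmpdir/b.sorted" \
    > "$tmpdir/joined.plain"

cat "$tmpdir/joined.i" "$tmpdir/joined.plain"
```

Note that the two options are not equivalent when keys differ only in case: with -i, join treats rs1 and RS1 as the same key, whereas the plain variant does not.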

Edit: I also found another possibility. The field separators need to be identical for the sort and the join. The defaults for sort and join both appear to be whitespace, but this may be the next hurdle.
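If the separator does turn out to matter, one way to keep sort and join consistent is to pass the same -t value to both. The sketch below assumes tab-separated input whose keys contain spaces; the files are hypothetical and not taken from the question.

```shell
# Hypothetical tab-separated files whose keys contain a space.
tmpdir=$(mktemp -d)
tab=$(printf '\t')
printf 'id 2\tA\nid 1\tB\n' > "$tmpdir/a.tsv"
printf 'id 1\tx\nid 2\ty\n' > "$tmpdir/b.tsv"

# Use the same explicit separator for sorting and joining, so that
# "field 1" means the same thing to both tools.
LC_ALL=C sort -t "$tab" -k 1,1 "$tmpdir/a.tsv" > "$tmpdir/a.sorted"
LC_ALL=C sort -t "$tab" -k 1,1 "$tmpdir/b.tsv" > "$tmpdir/b.sorted"
LC_ALL=C join -t "$tab" -1 1 -2 1 "$tmpdir/a.sorted" "$tmpdir/b.sorted" \
    > "$tmpdir/joined.tsv"

cat "$tmpdir/joined.tsv"
```

With the default whitespace splitting, the key in these files would be read as just "id", which is why the explicit -t is used here.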

Paul_Pedant
  • This has solved my problem, thank you very much! In the end I used: LC_ALL=C sort -k 1 SCZ.N.tmp > SCZ.N.tmp.sorted2 and LC_ALL=C sort -k 1 1kg.tmp > 1kg.tmp.sorted2, and then LC_ALL=C join -1 1 -2 1 SCZ.N.tmp.sorted2 1kg.tmp.sorted2 > SCZ.freq.joined2. Thanks again! – LauraW Aug 10 '20 at 08:17