0

I have an issue joining the two files linked below using join -t $'\t' -a1 file_1 file_2. I'm aware that one has to use specific flags when sorting in order to make join work, as explained in this and this post. Specifically, I have sorted both file_1 and file_2 using the following syntax: cat file_1 | LANG=en_EN sort k1b,1.

The error I get is this:

https://drive.google.com/open?id=1vlh9NqD1Nlm6dQi33Qt0gevsLfYxRURN

This is problematic. For example, entry "S2_005_008G1__bin.1" fails to be joined despite being present in both file_1 and file_2.

Sorry that I cannot give a simple toy example, but I've been unable to reproduce this error using hand-made files. I'm really at a loss as to what might cause this issue.

file_1, file_2

PejoPhylo
  • 1
    This is a bad question because you didn't copy the actual commands that you ran (for example you obviously ran sort -k1b,1, not sort k1b,1) or the error messages. Always copy-paste, never copy things manually or post a screenshot of text. I answered the question because I can probably guess what you did wrong and it might help others, but don't expect this to happen often when you don't even bother to copy-paste your code and data. – Gilles 'SO- stop being evil' Apr 20 '20 at 08:09
  • Thanks for the advice, I'll keep it in mind – PejoPhylo Apr 20 '20 at 08:12

1 Answer

2

You need to use matching options and locale settings for join and sort. If you tell join to use tab-separated fields, you also need to tell sort to use tab-separated fields (the default is whitespace-separated).

Setting the LANG environment variable may or may not set the LC_COLLATE locale setting: if the environment variable LC_COLLATE is set, it takes precedence over LANG, and if the environment variable LC_ALL is set, it overrides all other locale settings. See "set LC_* but not LC_ALL".
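The precedence described above can be checked with a small sketch. The two-line input and the variable names with_lc_all and with_lc_collate are made up for illustration, and en_US.UTF-8 is only an assumed example value for LANG; it does not need to be installed for the result to hold.

```shell
# LC_ALL overrides everything, including LANG, so sort uses byte order:
# uppercase "B" (0x42) sorts before lowercase "a" (0x61).
with_lc_all=$(printf 'a\nB\n' | LANG=en_US.UTF-8 LC_ALL=C sort)

# With LC_ALL unset, LC_COLLATE still takes precedence over LANG for collation.
# The unset only affects the subshell created by the command substitution.
with_lc_collate=$(unset LC_ALL; printf 'a\nB\n' | LANG=en_US.UTF-8 LC_COLLATE=C sort)

printf '%s\n' "$with_lc_all" "$with_lc_collate"
```

In both cases LANG is ignored for collation. Running locale with no arguments prints the settings that a command would actually inherit, which helps to spot an LC_ALL or LC_COLLATE that is set somewhere upstream.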

Unless you need to have the files sorted in a particular “human-friendly” way, use the C collation locale, which simply uses byte order. And unless you need some other locale setting to be different, set LC_ALL to be sure to override any setting inherited from the rest of the script or from a parent process. In any case, make sure to use the same locale settings for join and sort.

LC_ALL=C sort -t $'\t' -k 1b,1 file_1 >file_1.sorted
LC_ALL=C sort -t $'\t' -k 1b,1 file_2 >file_2.sorted
LC_ALL=C join -t $'\t' -a1 file_1.sorted file_2.sorted
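As a minimal sketch of the same pipeline on toy data: the keys, values, and temporary directory below are hypothetical and not taken from the asker's files. It uses tab=$(printf '\t') instead of $'\t' so that it also runs in a plain POSIX sh.

```shell
# Hypothetical toy input; keys mix case, punctuation and a space on purpose.
tmp=$(mktemp -d)
tab=$(printf '\t')

printf 'b_2\tx\nB.1\ty\na 1\tz\n' > "$tmp/file_1"
printf 'a 1\t10\nB.1\t20\n' > "$tmp/file_2"

# Same field separator and same locale for sort and join.
LC_ALL=C sort -t "$tab" -k 1b,1 "$tmp/file_1" > "$tmp/file_1.sorted"
LC_ALL=C sort -t "$tab" -k 1b,1 "$tmp/file_2" > "$tmp/file_2.sorted"

# -a1 also prints lines of file_1 that have no match in file_2.
joined=$(LC_ALL=C join -t "$tab" -a1 "$tmp/file_1.sorted" "$tmp/file_2.sorted")
printf '%s\n' "$joined"

rm -r "$tmp"
```

Here the keys B.1 and a 1 are joined with their values from file_2, and b_2 is printed unpaired because of -a1. Without -t on sort, a key containing a space would be split at that space, so sort and join would disagree about what the first field is.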
  • This solved my problem, thank you! How come the join/sort statement doesn't work without setting LC_ALL=C? Shouldn't this be set to the same value by default when invoking join/sort? – PejoPhylo Apr 20 '20 at 08:14