6

I am trying to find a solution where I can sort my first file on its first column, using the order of the keys in the second file.

First file example (file1.csv)

COLUMN1 COlUMN2
apple fruit
dog animal
cat animal
cow animal

Second file example (sort_keys.txt)

cat
dog
apple
cow

Expected output (sorted.txt)

COLUMN1 COlUMN2
cat animal
dog animal
apple fruit
cow animal

So far I have found that the sort and awk commands might be able to help, but I do not have any working code yet.

$> awk 'NR==FNR{o[FNR]=$1; next} {t[$1]=$0} END{for(x=1; x<=FNR; x++){y=o[x]; print t[y]}}' sort_keys.txt file1.csv

However, this command does not work as expected, and I would appreciate any expert advice. P.S. I do have some Linux command-line knowledge, but this task is very specific and I have no idea how to achieve it.

Any help or hint is highly appreciated.

    You mention "CSV" but I don't see any commas separating your values; are there? – Jeff Schaller Sep 20 '20 at 12:41
  • @JeffSchaller the C in CSV stands for Character. It is often : a tab, a ;, a comma – Olivier Dulac Sep 20 '20 at 12:47
  • No, the C in CSV stands for "Comma" (see https://www.google.com/search?q=csv, https://tools.ietf.org/html/rfc4180, and "CSV" at https://en.wikipedia.org/wiki/List_of_filename_extensions_(A%E2%80%93E)#C for example); the acronym has just recently started to be abused/recycled on some forums to mean "Character", which just adds to the confusion around what people mean when they say they have a CSV file. A tab-separated file is TSV (see https://www.google.com/search?q=tsv and "TSV" at https://en.wikipedia.org/wiki/List_of_filename_extensions_(S%E2%80%93Z)#T). – Ed Morton Sep 20 '20 at 14:01
  • The Wikipedia article does say though (not cited): "Many applications that accept CSV files have options to select the delimiter character and the quotation character.... Because of that, the term character-separated values is suggested as a wider definition of this file format." – marsnebulasoup Sep 20 '20 at 14:24
  • But it does get confusing when CSV can mean character-separated values. I believe the correct general term is delimiter-separated values (DSV). – marsnebulasoup Sep 20 '20 at 14:29
  • I could buy into that term. It's not immediately clear that a "delimiter" in that term couldn't be a multi-character string, but at least it's not misappropriating an existing acronym definition. Surprised we're not calling them "XSV" files since everyone seems fond of sticking "X" at the beginning of terms these days. Oops - I spoke too soon: https://mj.ucw.cz/sw/xsv/xsv.1.html. I actually wouldn't have a problem with the term "XSV" the more I think about it. – Ed Morton Sep 20 '20 at 14:35

5 Answers

8
$ awk 'NR==1; NR==FNR{a[$1]=$2; next} {print $1, a[$1]}' file1 sort_keys.txt
COLUMN1 COlUMN2
cat animal
dog animal
apple fruit
cow animal
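
Spelled out with comments, the one-liner works like this (a self-contained sketch: the question's sample files are recreated first, assuming space-separated fields as shown):

```shell
# Recreate the question's sample files in a scratch directory
# (assumption: fields are space-separated, as shown in the question).
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' 'COLUMN1 COlUMN2' 'apple fruit' 'dog animal' 'cat animal' 'cow animal' > file1.csv
printf '%s\n' cat dog apple cow > sort_keys.txt

awk '
    NR == 1   { print }             # very first line = header of file1.csv
    NR == FNR { a[$1] = $2; next }  # 1st file: map key -> value
              { print $1, a[$1] }   # 2nd file: emit keys in the given order
' file1.csv sort_keys.txt > sorted.txt

cat sorted.txt
```

The header is printed first because NR == 1 is only true on the first line of the first file; every other line of file1.csv just populates the lookup table, and each line of sort_keys.txt then prints its key together with the looked-up value.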
Ed Morton
5

If you have GNU awk (aka gawk) you can define and use your own custom sort function.

For example, assuming GNU awk > 4.0 for the PROCINFO array traversal feature:

$ gawk '
    function mysort(ia, va, ib, vb) { return o[ia] - o[ib] }

    NR==FNR { o[$1] = FNR; next }   # map keys to numerical order
    FNR==1  { print; next }         # print the header line, then skip it
            { a[$1] = $0 }          # store each data line under its key
    END {
        PROCINFO["sorted_in"] = "mysort"
        for (i in a) print a[i]
    }
' sort_keys.txt file1.csv
COLUMN1 COlUMN2
cat animal
dog animal
apple fruit
cow animal

(With older GNU awks, you should be able to achieve the same using asorti.)

steeldriver
3

If your data isn't exceedingly large, this is a simple solution with quadratic complexity:

cat sort_keys.txt | while read key ; do egrep "^$key " file1.csv ; done

For adding / removing the header, add head and tail commands as needed.
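
Still quadratic, but a variant that sidesteps a few common shell pitfalls (the useless cat, read without -r, and the key being treated as a regular expression) might look like this (a sketch, using the question's sample data, recreated here so it is self-contained):

```shell
# Recreate the question's sample files (assumption: space-separated fields).
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' 'COLUMN1 COlUMN2' 'apple fruit' 'dog animal' 'cat animal' 'cow animal' > file1.csv
printf '%s\n' cat dog apple cow > sort_keys.txt

{
    head -n 1 file1.csv                       # keep the header
    while IFS= read -r key; do                # -r: no backslash mangling
        awk -v k="$key" '$1 == k' file1.csv   # literal comparison, not a regex
    done < sort_keys.txt                      # redirection instead of cat | while
} > sorted.txt

cat sorted.txt
```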

Alex O
  • Thanks for the hint. I did emphasize readability - you don't need to be proficient in awk here. When it comes to performance, runtime will be governed by grep anyway and the loop overhead is negligible. So while this solution is not particularly fast nor covering all border cases, it may fit the particular use case here. – Alex O Sep 20 '20 at 14:41
  • You do need to be proficient in shell, though, to understand that the cat is doing nothing useful (see http://porkmail.org/era/unix/award.html), that use of a pipe to a while loop is an anti-pattern (see https://mywiki.wooledge.org/BashFAQ/001), the contents of the loop will run in a subshell, the read is slowly reading 1 byte at a time, not reading a line at a time, and will convert any escape sequences so, e.g. foo\tbar will become foo<tab>bar, the grep will generate false matches when key contains regexp metacharacters, and that egrep is deprecated in favor of grep -E. – Ed Morton Sep 20 '20 at 14:58
  • The point is that while the command looks simple, it isn't functionally simple and it's not doing what someone who isn't proficient in shell would intuitively assume it's doing, i.e. reading a line at a time from sort_keys.txt and looking for a line starting with that literal string in file1.csv. – Ed Morton Sep 20 '20 at 15:02
3

Just for the sake of adding a variant: one can start down steeldriver's route but simplify the logic, stopping short of using either a custom comparison function or gawk's PROCINFO["sorted_in"] mechanism. This works only if the sort keys are not repeated.

$ gawk '
        NR==FNR {o[$1]=FNR; next}  # map keys to numerical order in 1st input file `sort_keys.txt`
        FNR==1 {print; next}       # print header of 2nd input file `file1.csv`; go to next record
        {a[$1]=$0}                 # after the header, store each record of `file1.csv` in array `a`
        END {
            for (i in o) b[o[i]]=i # make a new array, `b`, with the keys and values of `o` swapped
            n=length(o)
            for (j=1;j<=n;j++) print a[b[j]]
        }
       ' sort_keys.txt file1.csv

COLUMN1 COlUMN2
cat animal
dog animal
apple fruit
cow animal

Cbhihe
1

You can use join for that. From man join:

For each pair of input lines with identical join fields, write a line to standard output. The default join field is the first, delimited by blanks.

Note that the first line (the header) must be kept out of the sorting and the join.

TL;DR:

head -n 1 file1.csv; join -1 2 <(cat -n sort_keys.txt | sort -k 2) <(tail -n +2 file1.csv | sort) | sort -n -k 2 | awk '{ print $1, $3 }'

will do the job.

Explanations

We'll basically :

  • extract the first line of file1.csv
  • join the remainder of file1.csv with sort_keys.txt, on the first field
  • sort the result on the order of sort_keys

Additionally, join needs its inputs to be sorted on the join field.

That leads to the following steps:

  • as first input, number the lines of sort_keys.txt (cat -n prepends a field we can use to restore this original order at the end), then sort on the 2nd field

cat -n sort_keys.txt | sort -k 2

  3 apple
  1 cat
  4 cow
  2 dog
  • as second input, we take the csv file, skipping the first line, and sort it on the first field.

tail -n +2 file1.csv | sort

 apple fruit
 cat animal
 cow animal
 dog animal
  • we can then join it all together, using the second field of the first input (-1 2):

join -1 2 <(cat -n sort_keys.txt | sort -k 2) <(tail -n +2 file1.csv | sort)

 apple 3 fruit
 cat 1 animal
 cow 4 animal
 dog 2 animal
  • the join result can now be sorted on the second field, numerically (needed in case sort_keys has more than 9 entries), keeping only the 1st and 3rd fields

 ... | sort -n -k 2 | awk '{ print $1, $3 }'

 cat animal
 dog animal
 apple fruit
 cow animal
  • finally, prepend that with the first line of file1.csv

head -n 1 file1.csv; join -1 2 <(cat -n sort_keys.txt | sort -k 2) <(tail -n +2 file1.csv | sort) | sort -n -k 2 | awk '{ print $1, $3 }'

 COLUMN1 COlUMN2
 cat animal
 dog animal
 apple fruit
 cow animal

Going further

Depending on your real data, you may have to adjust the field numbers and the field separator.

You may also want to keep data lines whose key is not in sort_keys, and/or keep lines of sort_keys that have no corresponding data line (see the -a option of join).
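
As an illustration of -a, here is a self-contained sketch: the sample data is recreated with an extra key, horse, that has no data line, and temporary files replace the process substitutions so it runs in a plain POSIX shell:

```shell
# Recreate the question's sample files, plus one unmatched key.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' 'COLUMN1 COlUMN2' 'apple fruit' 'dog animal' 'cat animal' 'cow animal' > file1.csv
printf '%s\n' cat dog apple cow horse > sort_keys.txt   # 'horse' has no data line

cat -n sort_keys.txt | sort -k 2 > keys.num   # numbered keys, sorted on the key
tail -n +2 file1.csv | sort > data.srt        # data without header, sorted on the key

# -a 1: also print unpairable lines from the first file, so 'horse'
# survives the join even though file1.csv has no row for it.
join -a 1 -1 2 keys.num data.srt | sort -n -k 2 | awk '{ print $1, $3 }' > joined.txt

cat joined.txt
```

The unmatched key comes out last (in its original position), with an empty value field.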

Enjoy using join!