4

I want to sort a tab-delimited file by a specific field while preserving the header. I'm using awk as described here sort and uniq in awk, but I can't figure out who to tell sort that the field separator is a tab.

Toy data:

$ echo -e "head_1\thead_2\thead_3" > file.tsv
$ echo -e "aaa zzz\tc\t300" >> file.tsv
$ echo -e "bbb yyy ooo\ta\t100" >> file.tsv
$ echo -e "ccc xxx nnn\tb\t200" >> file.tsv
$ column -ts $'\t' file.tsv
head_1       head_2  head_3
aaa zzz      c       300
bbb yyy ooo  a       100
ccc xxx nnn  b       200

$ awk -F'\t' 'NR==1; NR>1 { print | "sort -k2" }' file.tsv | column -ts $'\t' 
head_1       head_2  head_3
ccc xxx nnn  b       200           ## note these data are sorted 
bbb yyy ooo  a       100           ## based on the xxx/yyy/zzz 
aaa zzz      c       300           ## not the a/b/c

When I try to explicitly tell sort the the field separator is a tab, I get this error, which I believe is related to quoting issues:

$ awk -F'\t' 'NR==1; NR>1 { print | "sort -k2 -t $'\t'" }' file.tsv | column -ts $'\t'
sort: option requires an argument -- 't'
Try 'sort --help' for more information.
head_1  head_2  head_3

How do I specify the column separator for sort inside `awk? Thanks

SE's web interface is doing a better job of syntax highlighting than Notepad++; here are a couple of things I've tried:

$ awk -F'\t' 'NR==1; NR>1 { print | "sort -k2 -t $'$'\t''" }' file.tsv | column -ts $'\t'
head_1       head_2  head_3
aaa zzz      c       300
bbb yyy ooo  a       100
ccc xxx nnn  b       200


$ awk -F'\t' 'NR==1; NR>1 { print | "sort -k2 -t $'\t'" }' file.tsv | column -ts $'\t'
sort: option requires an argument -- 't'
Try 'sort --help' for more information.
head_1  head_2  head_3

$ awk -F'\t' 'NR==1; NR>1 { print | "sort -k2 -t "'$'\t''"" }' file.tsv | column -ts $'\t'
sort: option requires an argument -- 't'
Try 'sort --help' for more information.
head_1  head_2  head_3

$ awk -F'\t' 'NR==1; NR>1 { print | "sort -k2 -t "'$'\t'' }' file.tsv | column -ts $'\t'
sort: option requires an argument -- 't'
Try 'sort --help' for more information.
head_1  head_2  head_3
Josh
  • 303
  • 1
    Not a direct answer, but perhaps easier -- https://unix.stackexchange.com/questions/11856/sort-but-keep-header-line-at-the-top – Jeff Schaller Apr 07 '20 at 16:17
  • 1
    Did you try "sort -t \\\t -k2" ? – steeldriver Apr 07 '20 at 16:50
  • @steeldriver, that worked! What the heck is going on with \\\t? – Josh Apr 07 '20 at 16:52
  • t's not quoted properly and so being interepreted by everyone and their grandma before sort sees it. Are you using GNU awk by chance (awk --version will tell you)? If so it has builtin sorting functions. – Ed Morton Apr 07 '20 at 17:28

3 Answers3

5

chose one of these options:

... | "sort -k2 -t \\\t "
... | "sort -k2 -t \"\t\" "
... | "sort -k2 -t'\''\t'\'' "
... | "sort -k2 -t \047\011\047" ## preferred 

\011 is the Octet ASCII code for Tab character/ \047 for single quote '

awk -v q="'" ... { print | "sort -k2 -t " q "\t" q }'
awk -v tb="'\t'" ... { print | "sort -k2 -t " tb }'
awk -v tb=$'\t' ... { print | "sort -k2 -t \"" tb "\"" }'
awk -v tb=$'\t' -v q="'" ... { print | "sort -k2 -t " q tb q }'

and many more …; read Shell Quoting Issues in awk; see also Escape Sequences in awk

αғsнιη
  • 41,407
  • FWIW if I was going to call sort from awk that way then I'd use awk '{ print | "sort -k2 -t $\047\t\047" }' - that'll work whether you're calling it from a command line or stored in a script file and it doesn't require any separate variables or rely on awk calling a specific shell. You can't go wrong simply using \047 anywhere you want a ' in awk. Not all sorts will accept -t '\t' as an argument btw, OSX/BSD will fail with sort: multi-character tab ‘\\t’ given that. -t $'\t' should work with all sorts, assuming your shell expands $'\t' to a tab char. – Ed Morton Apr 07 '20 at 17:53
  • it should be \047\t\047 I think not with $ prefixing to it. – αғsнιη Apr 07 '20 at 17:55
  • Without the $ prefix the shell won't convert the string \t to a tab character and then sort can fail depending on which version of sort you're running per my previous comment. – Ed Morton Apr 07 '20 at 18:06
  • I'm pointing to used $ prefix in "sort -k2 -t $\047\t\047". it will not work like that, at least it should be "sort -k2 -t '$'\047\t\047" or remove it completely as it makes no sense at all and just use "sort -k2 -t \047\t\047" and even better "sort -k2 -t \047\011\047" – αғsнιη Apr 07 '20 at 18:15
  • I know what you're refering to and I'm sorry but you're wrong, "sort -k2 -t $\047\t\047" is exactly the string we need to pass to the shell since some sorts (e.g. OSX/BSD sort) will fail given sort -t '\t' but all will work given sort -t $'\t' assuming the shell expands $'\t' to be a tab character. Try it. – Ed Morton Apr 07 '20 at 18:24
  • OK, at least that "sort -k2 -t $\047\t\047" doesn't for me when $ prefixed. btw, I'm on WSL, if that's probably related to the issue. – αғsнιη Apr 07 '20 at 18:26
  • I've no idea what WSL is, sorry. OK, like I say it'll only work if your shell expands $'\t' to be a tab character, sounds like the shell that awk calls on your system doesn't do that but your sort does accept -t '\t' alone. – Ed Morton Apr 07 '20 at 18:28
  • I'm in bash ans sh, both does expanding $'...' syntax, but sort doest accept -t '\t' as you asked – αғsнιη Apr 07 '20 at 18:31
  • Ugh, turns out I was ssh-ed onto Linux so everything I said above about MacOS actually applies to Linux instead. Although you're running bash when you call awk, that doesn't mean it's bash that awk calls - it can be calling a different shell. But while > echo 5 | sort -t '\t' fails with sort: multi-character tab ‘\\t’, when you call echo 5 | awk '{print | "sort -t \047\t\047"}' it actually succeeds so the \t is being converted SOMEWHERE before sort gets called. – Ed Morton Apr 07 '20 at 18:33
  • awk executes in sh shell for running external command AFAIK. – αғsнιη Apr 07 '20 at 18:35
  • I think you can over-ride that and even if not I expect that produces different behavior depending on which actual shell sh is aliased to and in either case that might not be the shell you use on your login. Seems like you're right though and you don't need the $ prefix based on the couple of tests I just ran. – Ed Morton Apr 07 '20 at 18:37
  • I recently discovered that an alias I created in my .bashrc (alias awkj="awk -F'\t' -v OFS='\t'") is displayed as follows when I call alias from the command line: alias awkj='awk -F'\''\t'\'' -v OFS='\''\t'\'''. Does this provide any insight into the best or most appropriate nomenclature for a tab character in my shell (i.e. '\t' vs. '\''\t'\'' vs. $'\t')? If this is too far off topic I can delete and post as a separate question. – Josh Jun 27 '20 at 16:13
1

This may not be a proper one liner way , but simple and no need to play with tab character... :D

var=$(head -1 file.tsv);perl -ne '{ print $_ if $. > 1; }'  file.csv| sort -k2 | sed "1 i $var" | column -ts "\t"
  • The major drawback to that approach is that you have to read the input twice (once by head and then a 2nd time by perl) so you can't use it for output being piped from a command. It'd also require GNU sed and perl, either/both of which may not be available on a given box. You could simplify it and make it portable with { head -1 file.tsv; tail +2 file.tsv | sort -k2; } | column -ts "\t" but it still requires 2 reads of the input. – Ed Morton May 21 '20 at 21:10
0

FWIW here's how I'd solve your actual problem of sorting the data but preserving the header at the top:

awk -v OFS='\t' '{print (NR>1), $0}' file.tsv | sort -t$'\t' -k1,1n -k3 | cut -f2-

The above works by prepending a 0 or 1 to the input data (0 for the first line, 1 for all other lines) so you can sort on that indicator first and then your real key you care about and then just remove that added field again.

Here's it working in stages:

$ awk -v OFS='\t' '{print (NR>1), $0}' file.tsv
0   head_1  head_2  head_3
1   aaa zzz c   300
1   bbb yyy ooo a   100
1   ccc xxx nnn b   200

$ awk -v OFS='\t' '{print (NR>1), $0}' file.tsv | sort -t$'\t' -k1,1n -k3
0   head_1  head_2  head_3
1   bbb yyy ooo a   100
1   ccc xxx nnn b   200
1   aaa zzz c   300

$ awk -v OFS='\t' '{print (NR>1), $0}' file.tsv | sort -t$'\t' -k1,1n -k3 | cut -f2-
head_1  head_2  head_3
bbb yyy ooo a   100
ccc xxx nnn b   200
aaa zzz c   300
Ed Morton
  • 31,617
  • Dear drive-by downvoter- this is the correct way to do what the OP is trying to do. If you don't understand it please feel free to ask questions. – Ed Morton Apr 12 '20 at 11:20