
I have a file of data that looks like the sample below. How can I sort it by the numbers in the 3rd column? The first two columns are separated NOT by tabs but by some number of spaces. The gap between the second and third columns varies with the size of the number. Also note that some values in the second column contain spaces (like lp25( plasmid, between ( and p), while others do not (like chromosome).

HELIX       lp25(plasmid           24437 bp    RNA     linear       29-AUG-2011
HELIX       cp9(plasmid             9586 bp    DNA     helix       29-AUG-2011
HELIX       lp28-1(plasmid         25455 bp    DNA     linear       29-AUG-2011
HELIX       chromosome            911724 bp    DNA     plasmid       29-AUG-2011
Jeff Schaller

2 Answers


Try this:

sort -n -k3 <file>

For example:

$ sort -n -k3 test
HELIX       cp9(plasmid             9586 bp    DNA     helix       29-AUG-2011
HELIX       lp25(plasmid           24437 bp    RNA     linear       29-AUG-2011
HELIX       lp28-1(plasmid         25455 bp    DNA     linear       29-AUG-2011
HELIX       chromosome            911724 bp    DNA     plasmid       29-AUG-2011

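-n compares by numeric value, and -k3 starts the sort key at column 3. By default, sort splits columns at runs of blanks, so the varying number of spaces between columns does not matter.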
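Because -k3 with no end field lets the key run from column 3 to the end of the line, a slightly stricter variant is -k3,3n, which limits the key to column 3 alone. The following is a minimal sketch, assuming the sample rows from the question and a scratch file created with mktemp; the variable names tmp and sorted are only for illustration.

```shell
# Recreate the sample rows from the question in a scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
HELIX       lp25(plasmid           24437 bp    RNA     linear       29-AUG-2011
HELIX       cp9(plasmid             9586 bp    DNA     helix       29-AUG-2011
HELIX       lp28-1(plasmid         25455 bp    DNA     linear       29-AUG-2011
HELIX       chromosome            911724 bp    DNA     plasmid       29-AUG-2011
EOF

# -k3,3n: the key starts and ends at column 3 and is compared numerically.
sorted=$(sort -k3,3n "$tmp")
printf '%s\n' "$sorted"

rm -f "$tmp"
```

This only works while the second column has no embedded spaces; if it does, the number is no longer in column 3 on those lines.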

Will
sed $'s/\t/ /g' my_file | tr -s " " | sort -t" " -k 3 

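The sed command replaces every tab character with a single space (the $'...' quoting that produces the tab needs a shell such as bash, ksh, or zsh). tr -s " " then squeezes each run of consecutive spaces into one space, so sort -t" " sees exactly one delimiter between columns.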
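To see what this normalization does to a single line, here is a small sketch. It swaps the sed step for tr '\t' ' ', which has the same effect and does not depend on $'...' quoting; the sample line, with a tab after HELIX, is made up for illustration.

```shell
# Hypothetical input line: a tab after HELIX, then a run of spaces.
line=$(printf 'HELIX\tlp25(plasmid     24437 bp')

# Turn tabs into spaces, then squeeze runs of spaces down to one.
squeezed=$(printf '%s\n' "$line" | tr '\t' ' ' | tr -s ' ')

printf '%s\n' "$squeezed"
# prints: HELIX lp25(plasmid 24437 bp
```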

If a numeric sort is needed, you can use

sed $'s/\t/ /g' my_file | tr -s " " | sort -t" " -n -k 3 

Of course, as I just noticed, this does not address the irregularity in the 2nd column, hence this edit. That leaves me with one question. In the line below,

HELIX       lp28-1(plasmid         25455 bp    DNA     linear       29-AUG-2011
        ^                     ^
        1                     2

are delimiters 1 and 2 tabs or spaces?
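One way to answer that question for a real file is to make the whitespace visible. This is a rough sketch, assuming a made-up sample line with a tab at position 1 and spaces at position 2; in practice the same commands would read my_file instead of the sample variable.

```shell
# Hypothetical line: a tab after HELIX, spaces before the number.
sample=$(printf 'HELIX\tlp28-1(plasmid         25455 bp')

# od -c shows each tab as \t, while spaces stay blank.
printf '%s\n' "$sample" | od -c

# Count the lines that contain at least one tab character.
tab=$(printf '\t')
tab_lines=$(printf '%s\n' "$sample" | grep -c "$tab")
printf 'lines with tabs: %s\n' "$tab_lines"
```

A count of 0 on the real file would mean every delimiter is a space.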

MelBurslan