sort the whole .csv based on the value in a certain column

Question

I have a csv file with different revenues. I want to sort the csv file on revenue from high to low value. I can't find how to do this in terminal without the use of python.

I don't want to use python.

I want to use something simple like mlr/sed/awk.

Input:

name,location,capital,profit-lost,revenue,employees,year
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021

Output:

name,location,capital,profit-lost,revenue,employees,year
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019

The revenue is empty to billions.

Hopefully someone can help me out with this one too

One thing to consider is whether you may have escaped values with commas or space, for example: "ACME, Inc.","Coyote Road, LA",6161574.62,906591.96,124804038.64,51.30,2021. If so, you may find that using Python's csv module (or other similar language) might be simpler than sed or awk — Bruno, Jul 04 '23 at 08:36

annahri · Answer 1 · 2023-08-11T06:47:26.997

Using sort:

cat input.csv | (sed -u 1q; sort -t, -r -n -k5)

The sed -u 1q is required to make the sort ignore the header. It basically means, process the 1st line and quit, then pass the remaining to sort. -u is short for --unbuffered, to tell sed not to buffer the lines.

Sort's flags:

-t, to specify the delimiter to a comma.
-r to make the sorted output descending. It's ascending by default.
-n to sort numerically.
-k5 to sort the 5th key/column.

Demo:

$ cat input.csv | (sed -u 1q; sort -t, -r -n -k5)
name,location,capital,profit-lost,revenue,employees,year
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company5,location5,368506.18,11997.04,,,2019
company4,location4,1050987.60,426317.61,,24.90,2021
company2,location2,-155921.70,-146.03,,,2020
company1,location1,35527.19,-33226.25,,0.70,2020

With GNU sort, you can get the exact order shown by the OP by stopping the sort at key 5 (-k5,5) and adding the -s or --stable option — steeldriver, Jul 02 '23 at 15:32

score 12 · Accepted Answer · answered Jul 02 '23 at 01:02

So, you want to sort (stably) on revenue in numerically descending order, which sounds like it should be easy in Miller except that its rules for null handling say:

Records with one or more empty sort-field values sort after records with all sort-field values present

which means that they sort first in a descending sort:

$ mlr --csv sort -nr revenue file.csv
name,location,capital,profit-lost,revenue,employees,year
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020

However using then chaining it's straightforward to decorate-sort-undecorate with a key that assigns numerical 0 to empty revenues:

$ mlr --csv put '$key = is_empty($revenue) ? 0 : $revenue' \
    then sort -nr key then cut -x -f key file.csv
name,location,capital,profit-lost,revenue,employees,year
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019

With Miller 6.8.0, you can use sort -tr to do "natural" sorting in descending order. I don't really know what "natural" means, but it seems to be doing what the user wants. See https://miller.readthedocs.io/en/latest/reference-verbs/#sort — Kusalananda, Aug 06 '23 at 08:32
@Kusalananda, about natural sort, see https://miller.readthedocs.io/en/latest/sorting/ which leads to https://github.com/facette/natsort which leads to http://davekoelle.com/alphanum.html (latter a dead link for me; it's mentioned at https://en.wikipedia.org/wiki/Natural_sort_order). I sounds similar to zsh's numeric sort. — Stéphane Chazelas, Aug 11 '23 at 08:39

Ed Morton · Answer 3 · 2023-07-03T17:53:00.620

8

Using mandatory POSIX tools available on all Unix systems:

$ { head -n 1; sort -t, -k5,5rn; } < file
name,location,capital,profit-lost,revenue,employees,year
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019

See the comments below and Can head read more input lines than it outputs? for additional important information about the above script.

edited Jul 03 '23 at 17:53

answered Jul 02 '23 at 11:29

Ed Morton

31,617

3

Is it really guaranteed that head won't read ahead beyond the specified number of lines? As far as I can see, only the output is specified, and head may use buffered I/O if it wishes, and consume more input. – Toby Speight Jul 03 '23 at 09:33
2

I can see this issue. ( echo a; echo b; echo c ) | ( head -n1; cat ) returns just a single line a here. Termux on my phone, and also a Pi. FWIW – Chris Davies Jul 03 '23 at 14:25
1

Well, looks this is a real problem as { head -n 1; cat; } < <(printf 'a\nb\nc\n') (removing the subshells and pipe from @roaima's suggestion to narrow it down) only outputs a in cygwin. printf 'a\nb\nc\n' | { head -n 1; cat; } behaves the same way, but then printf 'a\nb\nc\n' > file; { head -n 1; cat; } < file produces the desired output. TIL... So, I THINK what I show in my answer is safe, but some other ways of reading the input may not be. – Ed Morton Jul 03 '23 at 14:51
3

@EdMorton, POSIX requires head to leave the position within the file right after the last line it has output if the input is seekable, but leaves it unspecified when not (like for pipes). To be able to honour it for pipes, it would need to read one byte at a time (like read does) or peek at the contents of pipes without reading it to know where to stop reading and then read. IIRC, that's what ksh93's head builtin does. Few systems have peekable pipes these days. I've seen head implementations that don't honour the POSIX requirement for seekable files. – Stéphane Chazelas Jul 03 '23 at 15:47
1

@EdMorton strikes me this is worth posting as a question – Chris Davies Jul 03 '23 at 17:00
3

@roaima done, see https://unix.stackexchange.com/q/750523/133219 – Ed Morton Jul 03 '23 at 17:39
1

@StéphaneChazelas - thanks! roaima requested I post this as a question so I did, https://unix.stackexchange.com/q/750523/133219, would you mind posting your comment as an answer there? – Ed Morton Jul 03 '23 at 17:40

score 3 · Answer 4 · answered Jul 02 '23 at 21:51

The duckdb db cli is really a great tool, and it's simply SQL

duckdb -csv -c "select * from read_csv_auto('input.csv') order by revenue desc"

And you get

name,location,capital,profit-lost,revenue,employees,year
company8,location8,6161574.62,906591.96,124804038.64,51.3,2021
company6,location6,7965648.89,369947.14,64413602.44,103.3,2019
company3,location3,1873134.74,778424.56,13320152.32,16.9,2020
company7,location7,1531534.27,125750.94,3054307.36,12.1,2020
company1,location1,35527.19,-33226.25,,0.7,2020
company2,location2,-155921.7,-146.03,,,2020
company4,location4,1050987.6,426317.61,,24.9,2021
company5,location5,368506.18,11997.04,,,2019

jubilatious1 · Answer 5 · 2023-10-25T04:37:04.113

Using Raku (formerly known as Perl_6)

~$ raku -e 'lines.head.put;  my @a = lines(); .put for @a.sort(-*.split(",")[4]);'  file
#OR
~$ raku -e 'lines.head.put; .put for lines.sort(-*.split(",")[4]);'  file

Briefly, the first line (header row) is read and immediately output. Then the remainder of the lines are read. In the first example the value rows are stored in @a array. In the second example lines are sorted directly. The sort function takes a mapper, here sorting on lines that have been split on commas, and then taking the 5th column (zero-index = 4). Sorting is alphabetic so prepending + or - to the sort criteria forces numeric comparison (e.g. .sort(+*.split(",")[4]). The -*. negative sign is used to invert the sort order (descending instead of ascending).

Sample Input:

name,location,capital,profit-lost,revenue,employees,year
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021

Sample Output:

name,location,capital,profit-lost,revenue,employees,year
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019

For more complex CSV files:

~$ raku -MText::CSV -e 'my @a = csv(in => $*IN); @a[1..*] = @a[1..*].sort(-*.[4]); csv(in => @a, out => $*OUT);'  <  file

https://docs.raku.org/routine/lines
https://docs.raku.org/routine/split
https://github.com/Tux/CSV
https://raku.org

drewk · Answer 6 · 2023-07-03T12:23:33.713

MacOS has Ruby as a default and Ruby has a csv parser built in.

Here is a ruby:

ruby -r csv -e 'tbl=CSV.parse($<.read, **{:headers=>true, :converters=>:all}).to_a
    header=tbl.shift
    puts header.to_csv
    idx=header.index("revenue")
    puts tbl.sort_by.with_index{|row, i| [row[idx].nil? ? 0 : -row[idx] ,i] }.
        map{|row| row.to_csv}
' file

Or you can use GNU awk (not the MacOS native awk) this way:

gawk -F, '
FNR==1{print; next}
{if (-$5 in a) {a[-$5 "_" FNR]=$0} else {a[-$5]=$0}}
END{PROCINFO["sorted_in"]="@ind_num_asc"
    for(i in a) print a[i] 
}
' file

Or regular MacOS awk and creating a pipe to sort (which is the fastest of these three):

awk -F, 'FNR==1{print; next}
{print|"sort -t, -nk5,5rn"}' file

Any of those prints:

name,location,capital,profit-lost,revenue,employees,year
company8,location8,6161574.62,906591.96,124804038.64,51.3,2021
company6,location6,7965648.89,369947.14,64413602.44,103.3,2019
company3,location3,1873134.74,778424.56,13320152.32,16.9,2020
company7,location7,1531534.27,125750.94,3054307.36,12.1,2020
company1,location1,35527.19,-33226.25,,0.7,2020
company2,location2,-155921.7,-146.03,,,2020
company4,location4,1050987.6,426317.61,,24.9,2021
company5,location5,368506.18,11997.04,,,2019

sort the whole .csv based on the value in a certain column

6 Answers6

Linked