SQL operation on csv file using bash or shell

Question

This is my input file

0164318,001449,001452,001922  
0164318,001456,001457,001922  
0842179,002115,002118,001485  
0846354,001512,001513,001590  
0841422,001221,001224,001860  
0841422,001227,001228,001860

I want my result as

0164318,001449,001457,001922  
0842179,002115,002118,001485  
0846354,001512,001513,001590  
0841422,001221,001228,001860

That is, I want to group by col1 and find min(col2) and max(col3) for each group, using a shell script.

From U&L Help: "Have you thoroughly searched for an answer before asking your question? Sharing your research helps everyone. Tell us what you found and why it didn’t meet your needs. This demonstrates that you’ve taken the time to try to help yourself" — pLumo, Jan 17 '19 at 10:01
In other words: What did you try? People are not here to do your work but to help you do your work. — pLumo, Jan 17 '19 at 10:02
I am trying to create a validation script that performs the same operation I already run against an SQL table. I am new to shell scripting, so I am asking whether this is possible in the shell or not. — Aditya, Jan 17 '19 at 10:06
Please [edit] your question and clarify what you are trying to do. What if the min value for column 2 associated with 0164318 in column 1, and the max value for column 3 are on different lines? Also, does it need to be a shell script? The shell is a very bad tool for text processing. — terdon, Jan 17 '19 at 10:14
$ awk -F, '{if (a[$1] < $3)a[$1]=$3;}END{for(i in a){print i,a[i];}}' OFS=, file ------ Through this command I am able to find the max in col3 for every group of col1 , — Aditya, Jan 17 '19 at 10:31
There is no need to worry about column 4: in this case the col4 value is the same on every row that shares a given col1 value. — Aditya, Jan 17 '19 at 10:38
It is still really unclear if you want to 1) issue SQL commands against your file using a tool available in the shell and able to behave similarly to an RDBMS (this will lead, for instance, to the csvkit based answers you have) OR 2) manipulate your file using something that is not SQL, nor an RDBMS (e.g. shell constructs, awk, ...) in a way equivalent to the SQL command you provided. — fra-san, Jan 17 '19 at 13:53
@Aditya, mm, why the edit to remove the example data? The question was a lot more clear and useful with it. — ilkkachu, Mar 13 '19 at 18:18

Kusalananda · Answer 1 · 2019-01-17T10:42:35.287

Using csvkit,

$ csvsql -H --query 'SELECT a,min(b),max(c),d FROM file GROUP BY a' file.csv
a,min(b),max(c),d
164318,1449,1457,1922
841422,1221,1228,1860
842179,2115,2118,1485
846354,1512,1513,1590

This would load the CSV data into a temporary database (SQLite by default I believe), and then apply the given SQL query to it. The table will by default have the same name as the input file (sans suffix) and since the data lacks column headers, the default field names will be alphabetical.

The -H option tells csvsql that the data has no column headers.

To delete the generated header in the output, pipe the result through something like sed '1d'.

To get zero-filled integers:

$ csvsql -H --query 'SELECT printf("%07d,%06d,%06d,%06d",a,min(b),max(c),d) FROM file GROUP BY a' file.csv
"printf(""%07d,%06d,%06d,%06d"",a,min(b),max(c),d)"
"0164318,001449,001457,001922"
"0841422,001221,001228,001860"
"0842179,002115,002118,001485"
"0846354,001512,001513,001590"

Here, the lines get quoted since we're actually only requesting a single output field for each result record (and it contains commas). Another way to do it, which involves a bit more typing but does not generate the extra double quotes around the data rows:

$ csvsql -H --query 'SELECT printf("%07d",a),printf("%06d",min(b)),printf("%06d",max(c)),printf("%06d",d) FROM file GROUP BY a' file.csv
"printf(""%07d"",a)","printf(""%06d"",min(b))","printf(""%06d"",max(c))","printf(""%06d"",d)"
0164318,001449,001457,001922
0841422,001221,001228,001860
0842179,002115,002118,001485
0846354,001512,001513,001590

Again, the output header can be removed by piping the result through sed '1d'.
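
As a small illustration of the header removal, the sketch below applies sed '1d' to a copy of the output shown earlier, so it runs even without csvkit installed. The function name strip_header is hypothetical and only there for readability; in practice you would append `| sed '1d'` directly to the csvsql command.

```shell
# Hypothetical helper: drop the first line (the generated header) of stdin.
strip_header() {
    sed '1d'
}

# Demonstration on two lines copied from the csvsql output above.
# With csvkit installed, the equivalent would be:
#   csvsql -H --query '...' file.csv | strip_header
printf '%s\n' 'a,min(b),max(c),d' '164318,1449,1457,1922' | strip_header
```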

score 6 · Answer 2 · answered Jan 17 '19 at 10:13

Using csvkit:

csvsql -H --query "select a,min(b),max(c),d from file group by a,d" file.csv

Note that this will strip the leading zeros, since the values are treated as integers.

Output:

a,min(b),max(c),d
164318,1449,1457,1922
841422,1221,1228,1860
842179,2115,2118,1485
846354,1512,1513,1590
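
If the leading zeros matter, one possible workaround is to re-pad the fields after the query. This is only a sketch: the function name pad_fields is hypothetical, and the widths (7, 6, 6, 6) are an assumption taken from the sample data in the question.

```shell
# Hypothetical helper: re-pad csvsql output read from stdin.
# Assumes a one-line header (skipped via NR > 1) and purely numeric
# fields with the widths seen in the sample data: 7, 6, 6, 6.
pad_fields() {
    awk -F, 'NR > 1 { printf "%07d,%06d,%06d,%06d\n", $1, $2, $3, $4 }'
}

# Demonstration on lines copied from the output above, so that it runs
# without csvkit. With csvkit installed you would pipe csvsql into it:
#   csvsql -H --query "..." file.csv | pad_fields
printf '%s\n' 'a,min(b),max(c),d' '164318,1449,1457,1922' | pad_fields
```

Because the header line is skipped, no separate step is needed to remove it.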

aborruso · Answer 3 · 2019-01-17T11:35:17.180

With Miller (http://johnkerl.org/miller/doc), using

mlr --ocsv --quote-all --inidx --ifs , cat inputFile | \
mlr --ocsv --quote-none  --icsvlite stats1 -g '"1"' -a min,max,min -f '"2","3","4"' \
then cut -f '"1","2"_min,"3"_max,"4"_min' \
then label id,col2,col3,col4 | sed 's/"//g'

you have

id,col2,col3,col4
0164318,001449,001457,001922
0842179,002115,002118,001485
0846354,001512,001513,001590
0841422,001221,001228,001860

fra-san · Answer 4 · 2019-03-08T11:51:55.533

You can break down your SQL into basic procedural operations and replicate them in a shell script.

This is of course not a great idea, since one of the advantages of declarative languages (such as SQL) is that they hide the verbosity and complexity of the procedural implementation from developers, allowing them to concentrate on the data. (Optimization is a second great advantage of declarative languages, and it is lost if you replicate them with a procedural program.)
Also, this approach is problematic because processing text in shell loops is usually considered bad practice.

However, here is an example of a shell script that relies on standard utilities that you will find pre-installed on many systems (the one exception is the array construct, which is not specified in POSIX but is widely available, and surely available to you since you are asking about bash):

#!/bin/bash

# The input file will be passed as the first argument
file="$1"

# For each distinct key:
# We take only the values of the first field, sort them, remove duplicates
for i in $(cut -d ',' -f 1 "$file" | sort -n -u); do

    # Resetting the array is not really needed; we do it for safety
    out=()

    # The first field of the output row is the key of the loop
    out[0]="$i"

    # We only consider the rows whose first field is equal
    # to the current key (grep) and...
    # (the trailing comma in the pattern keeps a key from also
    # matching longer keys that merely start with it)

    # ... we sort the values of the second field
    # in ascending order and take only the first one
    out[1]="$(grep "^${out[0]}," "$file" | cut -d ',' -f 2 | sort -n | head -n 1)"

    # ... we sort the values of the third field in
    # ascending order and take only the last one
    out[2]="$(grep "^${out[0]}," "$file" | cut -d ',' -f 3 | sort -n | tail -n 1)"

    # ... we sort the values of the fourth field in
    # ascending order and take only the first one
    out[3]="$(grep "^${out[0]}," "$file" | cut -d ',' -f 4 | sort -n | head -n 1)"

    # Finally we print out the output, separating fields with ','
    printf '%s,%s,%s,%s\n' "${out[@]}"

done

It is meant to be invoked as

./script file

This script is equivalent to

SELECT col1, MIN(col2), MAX(col3), MIN(col4)
FROM text
GROUP BY col1
ORDER BY col1
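
For comparison, the same grouping can be sketched as a single pass in awk, which avoids re-reading the file once per key. This is an illustrative sketch, not part of the script above. The function name group_min_max is hypothetical, and it assumes plain comma-separated input with no header and no quoted fields. It also differs from the query in two ways: it takes col4 from the first row of each group (the asker stated that col4 is constant per col1), and it prints groups in first-seen order instead of sorting them.

```shell
#!/bin/bash
# Hypothetical helper: group_min_max FILE
# Prints col1, min(col2), max(col3), col4 for each distinct col1.
group_min_max() {
    awk -F, -v OFS=, '
        /^[ \t\r]*$/ { next }                 # skip blank lines
        { sub(/[ \t\r]+$/, "", $NF) }         # trim trailing blanks/CR on the last field
        !($1 in min) {                        # first row seen for this key
            order[++n] = $1
            min[$1] = $2; max[$1] = $3; fourth[$1] = $4
            next
        }
        {
            # Compare numerically (+ 0) but store the original text,
            # so leading zeros survive in the output.
            if ($2 + 0 < min[$1] + 0) min[$1] = $2
            if ($3 + 0 > max[$1] + 0) max[$1] = $3
        }
        END {
            for (i = 1; i <= n; i++) {
                k = order[i]
                print k, min[k], max[k], fourth[k]
            }
        }
    ' "$1"
}
```

It would be called as `group_min_max file`; pipe the result through `sort -n` if the ORDER BY col1 behaviour is wanted.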

@Aditya What's the point in removing all the code while leaving the first part as it is? You are right, this is not a good approach. If you confirm that your aim is not to perform SQL-like operations with shell tools I guess that the best thing to do is for me to just delete my whole answer. — fra-san, Mar 13 '19 at 18:16
