Nice way to SQL query multiple CSV files?

Question

I was looking for a way to traverse CSV files as relational database tables.

I did some research as nothing I found fit my bill of requirements in its entirety. I found several partially good options, namely:

termsql - which takes stdin or a file and allows some sql on it - but sets up only one "table"
csv2sqlite - which is very promising given it allows for potentially more sql goodness than termsql - but still only one "table"
this ULSE question - which describes how one would implement set operations with unix file traverse commands - promising and a possible starting point

It is possible, and very straightforward to traverse and perform some database-like operations on a single csv/text file (column sums, averages, min, max, subsets, etc), but not on two files, with some connection between them. It is also possible to import the files to a temp DB for querying, and I have done this, although not as practical as I would like.

TL;DR - I basically would like a convenient way to do quick and dirty sql joins on csv files. Not looking for a full fledged text based RDBMS, but just a nicer way to do some analysis on csv RDBMS extracts.

example:

sqlthingy -i tbl1.csv tbl2.csv -o 'select 1,2,3 from tbl1, tbl2 where tbl1.1 = tbl2.1'

This seems like an interesting enough problem that I could devote some time on, but I'd like to know if it exists already.

score 3 · Accepted Answer · answered Oct 26 '16 at 13:59

3

Take a look at fsql (Perl), or csvkit (Python). They both have various problems and limitations, but they are often fine for "small" data. And, of course, you can always fallback to a proper database when they aren't enough.

answered Oct 26 '16 at 13:59

Satō Katsura

13,368
2
31
50

Wildcard · Answer 2 · 2016-11-02T03:22:03.487

What you want is the join command, which is specified by POSIX.

Here is your example pseudocode command:

sqlthingy -i tbl1.csv tbl2.csv -o 'select 1,2,3 from tbl1, tbl2 where tbl1.1 = tbl2.1'

Here is an actual working command using join that is equivalent:

join -t, tbl1.csv tbl2.csv

If both files have only two fields, comma separated, this join command is exactly what you represent in pseudocode.

If they have more fields but you only want up to the second field from each file, still joining on the first field, you would use:

join -t, -o 0,1.2,2.2 tbl1.csv tbl2.csv

If you want to join on a different field there are flags for that, as well.

It isn't a full-fledged RDBMS; for instance you are limited to just two files and a single join field. But for what you requested:

TL;DR - I basically would like a convenient way to do quick and dirty sql joins on csv files. Not looking for a full fledged text based RDBMS, but just a nicer way to do some analysis on csv RDBMS extracts.

It fits the bill perfectly.

You should also check out comm, also specified by POSIX, which is for printing lines common to two files (or only present in one or the other of them, or similar things).

Also note that both join and comm can operate on standard input by using - as a file name.

If you want the equivalent of an SQL "count()" command with a "group by" clause, just get the column you want (which join will sort using the join field, or you can sort yourself if it's directly from a file) and pipe it through uniq -c.

Between Awk, join, uniq, comm, and sort, you can do some pretty fancy stuff with CSVs. And all of it POSIX compliant.

Mathieu Rey · Answer 3 · 2017-11-07T11:21:02.507

Apache Drill can directly query CSV and JSON files, and join across them.

You only need to define the files' location and adjust settings based on the file extension (e.g. whether to use the first row as header or not) the first time.

Then it's as if you were using the mysql client, but the tables are the actual files on disk

$ ./bin/drill-embedded 
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Nov 07, 2017 7:05:52 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.11.0 
"drill baby drill"
0: jdbc:drill:zk=local> SELECT ix.field1, o.field2, o.field3 
. . . . . . . . . . . > FROM dfs.myfolder.`file1.tsv` ix
. . . . . . . . . . . > LEFT JOIN dfs.myfolder.`file2.tsv` o ON (o.field=ix.field)
. . . . . . . . . . . > LIMIT 10;
+-------------+-------------+---------------+
| field1      | field2      | field3        |
+-------------+-------------+---------------+
...redacted...
+-------------+-------------+---------------+
10 rows selected (0.656 seconds)
0: jdbc:drill:zk=local>

Nice way to SQL query multiple CSV files?

3 Answers3