I use a lot of grep, awk, and sort in my Unix shell to work with medium-sized (around 10M-100M lines) tab-separated column text files. In this respect the Unix shell is my spreadsheet.

But I have one huge problem: selecting records given a list of IDs.
Given a `table.csv` file with the format `id\tfoo\tbar...` and an `ids.csv` file containing a list of ids, select only those records from `table.csv` whose id is present in `ids.csv`.
Kind of like https://stackoverflow.com/questions/13732295/extract-all-lines-from-text-file-based-on-a-given-list-of-ids, but with shell, not perl.
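For concreteness, a toy example (the data is made up for illustration; columns are tab-separated):

```
$ cat ids.csv
26
27
$ cat table.csv
1	foo1	bar1
26	foo26	bar26
261	foo261	bar261
$ magic-filter ids.csv table.csv    # the command I am looking for
26	foo26	bar26
```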
`grep -F` obviously produces false positives if ids are variable width.
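With the toy files above: the ids are matched as plain substrings anywhere in the line, so id 26 also hits 261 (and would just as happily hit a 26 hiding in some other column):

```
$ grep -Ff ids.csv table.csv
26	foo26	bar26
261	foo261	bar261
```

The second line is a false positive.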
`join` is a utility I could never figure out. First of all, it requires alphabetic sorting (my files are usually numerically sorted), but even then I can't get it to work without it complaining about incorrect order and skipping some records. So I don't like it.
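For the record, what usually makes `join` stop complaining is sorting both inputs with the same byte-wise collation immediately before joining. A sketch, assuming bash (for process substitution) and GNU coreutils:

```
# join wants both inputs sorted lexicographically on the join field,
# and in the same locale; LC_ALL=C pins both sorts to plain byte order
LC_ALL=C join -t $'\t' \
    <(LC_ALL=C sort -t $'\t' -k1,1 table.csv) \
    <(LC_ALL=C sort ids.csv)
```

The price: the output comes back in lexicographic id order, not in the original order of table.csv.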
`grep -f` against a file of `^id\t` patterns is very slow when the number of ids is large.
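In case it is unclear, this is the variant I mean; a sketch assuming bash and GNU sed (which understands `\t` in the replacement):

```
# turn each id into an anchored pattern "^id<TAB>" (with a literal tab),
# then hand the pattern file to grep; still slow, since each of the M
# patterns is tried against each of the N lines
grep -f <(sed 's/.*/^&\t/' ids.csv) table.csv
```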
`awk` is cumbersome.
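For completeness, this is the kind of awk incantation I mean — the standard two-file idiom, assuming the id is the first field of both files:

```
# first pass (NR==FNR): load every id from ids.csv into a hash;
# second pass: print a table.csv line iff its first field is a known id
awk -F'\t' 'NR==FNR { ids[$1]; next } $1 in ids' ids.csv table.csv
```

It is one pass over each file with a hash lookup per line, and unlike the `join` route it preserves the original order of table.csv.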
Are there any good solutions for this? Any specific tools for tab-separated files? Extra functionality will be most welcome too.
UPD: Corrected `sort` -> `join`.

If `grep -f` is too slow, maintaining this strategy sounds like more trouble than it is worth -- variations will likely fall prey to the same O(N*M) performance issues. Maybe your time would be better spent learning how to use a normalized SQL DB... – goldilocks Jan 23 '14 at 20:00
`awk`. – cjm Jan 23 '14 at 20:04

`sort` can do all kinds of sorting: numeric, alphabetical and others. See `man sort`. – terdon Jan 23 '14 at 20:31
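Following up on the SQL suggestion in the comments: a minimal sqlite3 sketch, assuming the `sqlite3` command-line shell is available and that table.csv has exactly the three columns shown earlier (the database file, table names, and schema are made up for illustration):

```
sqlite3 demo.db <<'EOF'
.mode tabs
-- create the tables first, so .import does not treat the first
-- data row of each file as a header line
CREATE TABLE t   (id INTEGER, foo TEXT, bar TEXT);
CREATE TABLE ids (id INTEGER);
.import table.csv t
.import ids.csv ids
-- the index turns each id lookup into an index probe instead of a scan
CREATE INDEX t_id ON t(id);
SELECT t.* FROM t JOIN ids USING (id);
EOF
```

Thanks to `.mode tabs`, the result comes back tab-separated again.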