20

I use grep, awk and sort a lot in my Unix shell to work with medium-sized (around 10M-100M lines) tab-separated column text files. In this respect, the Unix shell is my spreadsheet.

But I have one huge problem: selecting records given a list of IDs.

Given a table.csv file with the format id\tfoo\tbar... and an ids.csv file containing a list of ids, select only those records from table.csv whose id is present in ids.csv.
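
For example (purely illustrative files, not my real data), with ids.csv containing

1
42

and table.csv containing

1	foo1	bar1
13	foo13	bar13
42	foo42	bar42

the desired output is the id-1 and id-42 lines only.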

Kind of like https://stackoverflow.com/questions/13732295/extract-all-lines-from-text-file-based-on-a-given-list-of-ids, but with the shell, not Perl.

grep -F obviously produces false positives if the ids are of variable width. join is a utility I could never figure out. First of all, it requires alphabetic sorting (my files are usually numerically sorted), but even then I can't get it to work without it complaining about incorrect order and skipping some records. So I don't like it. grep -f against a file of ^id\t patterns is very slow when the number of ids is large. awk is cumbersome.
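
With the illustrative files above, for instance:

grep -F -f ids.csv table.csv
# also prints the id-13 line, because "1" is a substring of "13"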

Are there any good solutions for this? Any specific tools for tab-separated files? Extra functionality will be most welcome too.

UPD: Corrected sort -> join

alamar
  • 369
  • If grep -f is too slow, maintaining this strategy sounds like more trouble than it is worth -- variations will likely fall prey to the same O(N*M) performance issues. Maybe your time would be better spent learning how to use a normalized SQL DB... – goldilocks Jan 23 '14 at 20:00
  • 1
    Why not use the Perl script from the question you linked? Alternatively, it should be possible to write a similar script in awk. – cjm Jan 23 '14 at 20:04
  • Bash 4 has associative arrays, which is what you need to circumvent the nested loops a la the perl example. – goldilocks Jan 23 '14 at 20:12
  • 1
    sort can do all kinds of sorting, numeric, alphabetical and others. See man sort. – terdon Jan 23 '14 at 20:31
  • I have a query here: how do we do the same thing if the source file from which we want to extract the data is a non-delimited file? –  Oct 10 '14 at 20:23

4 Answers

29

I guess you meant grep -f, not grep -F, but you actually need a combination of both plus -w:

grep -Fwf ids.csv table.csv

The reason you were getting false positives is (I guess, since you did not explain) that if one id can be contained within another, both will be printed. -w removes this problem and -F makes sure your patterns are treated as fixed strings, not regular expressions. From man grep:

   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched.  (-F is specified by
          POSIX.)
   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.  Similarly, it must be either at the  end
          of  the  line  or  followed by a non-word constituent character.
          Word-constituent  characters  are  letters,  digits,   and   the
          underscore.

   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.   The  empty file
          contains zero patterns, and therefore matches nothing.   (-f  is
          specified by POSIX.)
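
As a quick illustration (with made-up data) of what -w adds: suppose ids.csv contains just the id 7 and table.csv has rows for ids 7 and 73. Then:

grep -Ff ids.csv table.csv    # also prints the id-73 row, since "7" is a substring of "73"
grep -Fwf ids.csv table.csv   # prints only the id-7 row, because "7" must stand as a whole word there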

If your false positives are because an ID can be present in a non-ID field, loop through your file instead:

while read pat; do grep -w "^$pat" table.csv; done < ids.csv

or, faster:

xargs -I {} grep "^{}" table.csv < ids.csv

Personally, I would do this in perl though:

perl -lane 'BEGIN{open(A,"ids.csv"); while(<A>){chomp; $k{$_}++}} 
            print $_ if defined($k{$F[0]}); ' table.csv
jan
  • 105
terdon
  • 242,166
  • 1
    +1 But: What if there are potential false positives that match the id exactly word-wise, just not in the id column? If you can't use ^ with -F, you can't target the first column specifically. – goldilocks Jan 23 '14 at 20:14
  • @goldilocks if they match exactly, they're not false positives. I get what you mean, but in that case, the OP should show their input files. – terdon Jan 23 '14 at 20:15
  • The way we used to do this was to create temporary files (using awk or sed) that added a unique character (say, control-A) delimiting the field we wanted to search for, then use grep -F -f temppatternfile temptargetfile | tr -d '\001' – Mark Plotnick Jan 23 '14 at 23:14
  • @terdon: could you please add that -f WILL SILENTLY FAIL if the lines are DOS (\r\n) terminated. See http://stackoverflow.com/a/8864345/798677 – That Brazilian Guy Jul 17 '20 at 15:18
  • 1
  • @ThatBrazilianGuy Sorry, but I don't really see how that's relevant. This is a site about *nix systems and only Windows uses the \r\n line endings. That said, it is also unsurprising since the \r is part of the pattern you are passing to grep so it doesn't SILENTLY FAIL as you say, it works perfectly: it simply doesn't find something that isn't there to be found. – terdon Jul 17 '20 at 15:38
  • I see. It's just that this answer was exactly what I was looking for, but it worked for some cases and didn't work for others. Some files I had were \r terminated and others were \r\n terminated. The files were sent to me and wouldn't care if I was using Windows or *nix. It took me hours to figure out why, and I thought I could share my discovery with others to avoid more wasted hours in the future. Maybe what's so obvious to more seasoned people isn't so obvious to less experienced users. – That Brazilian Guy Jul 17 '20 at 15:48
  • 1
    @ThatBrazilianGuy oh, we've all been bitten by the \r\n thing. There are just several dozen posts about it here, and it isn't something that I feel is relevant unless we have reason to think the file came from Windows. Since macs have also abandoned their \r since becoming UNIXes, and given that Windows is off topic here, it doesn't seem like something that needs to be added to the answers here. I mean, we'd have to do this to pretty much every single answer that deals with text parsing! – terdon Jul 17 '20 at 15:53
  • I do realise now that it would "silently fail" not because of the presence of \r\n line endings, but because of a difference in line endings between the two files. My other points still stand, though. A user unaware of this would be as frustrated as I was. – That Brazilian Guy Jul 17 '20 at 15:53
  • 1
    @ThatBrazilianGuy just to get an idea: https://unix.stackexchange.com/search?tab=votes&q=dos2unix (and that's not anywhere near all relevant posts, just the ones that mention dos2unix a common tool for fixing this issue). – terdon Jul 17 '20 at 15:54
9

The join utility is what you want. It does require the input files to be lexically sorted.

Assuming your shell is bash or ksh:

join -t $'\t' <(sort ids.csv) <(sort table.csv)

Without needing to sort, the usual awk solution is

awk -F '\t' 'NR==FNR {id[$1]; next} $1 in id' ids.csv table.csv
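
In case the compact form is opaque, here is the same logic spelled out with comments (a functionally equivalent sketch, nothing more):

awk -F '\t' '
    NR == FNR { id[$1]; next }   # first file (ids.csv): record each id as an array key
    $1 in id                     # second file (table.csv): print lines whose first field was recorded
' ids.csv table.csv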
glenn jackman
  • 85,964
  • As I tried but ultimately failed to convey, join is a kludge. Doesn't work for me so well. – alamar Jan 23 '14 at 22:55
  • 1
    join is not a kludge: your words were you couldn't figure it out. Open your mind and learn. What output did you get, and how does that differ from what you expect? – glenn jackman Jan 24 '14 at 00:12
  • +1, this is a job for join. – don_crissti May 01 '15 at 00:46
  • 1
    The awk solution here is very quick and efficient for my purposes (I'm extracting subsets of a few hundred from files with ~100M lines) – Luke May 30 '18 at 09:52
2

The answers to this SO question helped me get around the niggles with join. Essentially, when you sort a file in preparation for sending it to join, you need to make sure you're sorting on the column you're joining on. So if that's the first one, you need to tell sort what the separator character in the file is and that you want it to sort on the first field (and only the first field). Otherwise, if the first field has variable widths (for example), your separators and possibly other fields may start affecting the sort order.

So, use the -t option of sort to specify your separating character, and use the -k option to specify the field (remembering that you need a start and end field - even if it's the same - or it'll sort from that character to the end of the line).

So for a tab-separated file like in this question, the following should work (with thanks to glenn's answer for structure):

join -t$'\t' <(sort -d ids.csv) <(sort -d -t$'\t' -k1,1 table.csv) > output.csv

(For reference, the -d flag means dictionary sort. You might also want to use the -b flag to ignore leading whitespace; see man sort and man join.)

As a more general example, suppose you're joining two comma-separated files - input1.csv on the third column and input2.csv on the fourth. You could use

join -t, -1 3 -2 4 <(sort -d -t, -k3,3 input1.csv) <(sort -d -t, -k4,4 input2.csv) > output.csv

Here the -1 and -2 options specify which fields to join on in the first and second input files respectively.
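
A quick sanity check with made-up data (file names and contents are hypothetical):

printf 'a,b,42,x\nc,d,7,y\n' > input1.csv    # join key is column 3
printf 'p,q,r,42\ns,t,u,99\n' > input2.csv   # join key is column 4
join -t, -1 3 -2 4 <(sort -t, -k3,3 input1.csv) <(sort -t, -k4,4 input2.csv)
# prints: 42,a,b,x,p,q,r  (the join field first, then the remaining fields of each file)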

1

You can also use Ruby to do something similar, reading the ids into a hash once and then printing only the table.csv lines whose first field is one of them:

ruby -ne 'BEGIN { @ids = {}; File.foreach("ids.csv") { |l| @ids[l.chomp] = true } }; print if @ids[$_.split("\t").first]' table.csv
Jay
  • 29