As per the title, I need to filter lines from about a dozen csv files totaling about 20GB.

The file contents are formatted as follows:

IdSensore,Data,Valore,Stato,idOperatore
2001,01/01/2019 00:00:00,3.4,VA,1
2002,01/01/2019 00:00:00,89,VA,1
2008,01/01/2019 00:00:00,0,VA,1
2039,01/01/2019 00:00:00,2.1,VA,1

I need to select all lines such that the first item is a specific ID of my choice, for instance 8010, and such that the next-to-last value (which is always a 2-character code) begins with V. I also need to only output the second and third values in each line (the date and the following number).
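
For example, from a (hypothetical) line such as

8010,01/01/2019 00:00:00,3.4,VA,1

the output should be just

01/01/2019 00:00:00,3.4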

I managed to match the desired lines with:

grep -h "^8010,.*,.*,V.,.*" *.csv

However, this matches the whole line. I'd like to match only the parts I need, so that I can have grep output only the match and redirect it into a file.

I tried using lookbehind to check for the correct ID without matching it, as follows:

grep -h "(?<=^8010,).*,.*,V.,.*" *.csv

However, this matches nothing: grep's default basic regular expressions don't support Perl-style lookbehind, so the pattern is taken literally.

This is solved by using the -P (Perl-compatible regex) option, together with -o so that grep prints only the matched part instead of the whole line, yielding the following final command:

grep -oPh "(?<=^8010,).*,.*(?=,V.,.*)" *.csv
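
The matched parts can then be redirected into a file, for instance (the output filename 8010.csv is just an example):

grep -oPh "(?<=^8010,).*,.*(?=,V.,.*)" *.csv > 8010.csv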

I now need to do this filtering for dozens of IDs: can I do this via the command line, or should I try and write a script?
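
A sketch of one approach I could take, assuming the IDs of interest are listed one per line in a (hypothetical) file ids.txt, and writing one output file per ID (e.g. sensor_8010.csv):

# First file: load the IDs of interest from ids.txt (one ID per line).
# Remaining files: for every row whose IdSensore is in that list and whose
# Stato starts with V, print Data,Valore into a per-ID output file.
awk -F, '
    NR == FNR { ids[$1]; next }
    ($1 in ids) && ($4 ~ /^V/) { print $2 "," $3 > ("sensor_" $1 ".csv") }
' ids.txt *.csv

This would need only a single pass over the 20GB instead of one grep run per ID, but I don't know whether it is the idiomatic way to do it.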

Note that I am used to MATLAB and such, but resorted to the command line because attempting to import just 2GB of that data into MATLAB made my poor laptop run out of RAM - never mind the minutes of waiting for MATLAB itself to get a grip.

  • "PCRE ... adopted by GNU grep -P: ... (?<=FOO)BAR (lookbehind) " – muru May 28 '20 at 15:42
  • you can use the great Miller (https://github.com/johnkerl/miller) and run something like mlr --csv filter -S '$IdSensore=="2039" && $Stato=~"^V"' input.csv >output.csv – aborruso May 29 '20 at 10:12
