As per the title, I need to filter lines from about a dozen CSV files totaling about 20 GB.
The file contents are formatted as follows:
IdSensore,Data,Valore,Stato,idOperatore
2001,01/01/2019 00:00:00,3.4,VA,1
2002,01/01/2019 00:00:00,89,VA,1
2008,01/01/2019 00:00:00,0,VA,1
2039,01/01/2019 00:00:00,2.1,VA,1
I need to select all lines whose first item is a specific ID of my choice (for instance 8010) and whose next-to-last value (always a two-character code) begins with V. I also need to output only the second and third values of each matching line (the date and the number that follows it).
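For example, running this filter for ID 2039 against the sample above should produce just the line:
01/01/2019 00:00:00,2.1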
I managed to match the desired lines with:
grep -h "^8010,.*,.*,V.,.*" *.csv
However, this matches the whole line. I'd like to match only the parts I need, so that I can have grep output only the match and redirect it into a file.
I tried using lookbehind to check for the correct ID without matching it, as follows:
grep -h "(?<=^8010,).*,.*,V.,.*" *.csv
However, this appears to match nothing.
This is solved by using grep's -P option, which enables Perl-compatible regular expressions (the default grep syntax does not support lookarounds), together with -o to print only the matched part of each line, yielding the following final command:
grep -oPh "(?<=^8010,).*,.*(?=,V.,.*)" *.csv
I now need to do this filtering for dozens of IDs: can I do this via the command line, or should I try and write a script?
Note that I am used to MATLAB and the like, but I resorted to the command line because attempting to import just 2 GB of that data into MATLAB made my poor laptop run out of RAM, never mind the minutes of waiting for MATLAB itself to get a grip.
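To be concrete, the kind of command-line approach I had in mind is a simple loop, roughly like this (untested sketch; ids.txt stands for a hypothetical file listing one sensor ID per line):
# ids.txt: one sensor ID per line (example filename); writes one output file per ID
while read -r id; do
    grep -oPh "(?<=^${id},).*,.*(?=,V.,.*)" *.csv > "${id}.csv"
done < ids.txt
Is something along these lines workable, or would a proper script be more robust?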
Comments:
grep -P: ...(?<=FOO)BAR (lookbehind) – muru May 28 '20 at 15:42
mlr --csv filter -S '$IdSensore=="2039" && $Stato=~"^V"' input.csv >output.csv – aborruso May 29 '20 at 10:12