CSV files are tricky. Assuming that SomeData is a properly quoted data field that may contain commas, we can replace the delimiter with a character that definitely does not occur in the data, for example a tab ($'\t' in most modern shells). Change this to whatever character you know is safe for your data. If the data fields are free from commas, just skip the csvformat bits here.
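Before settling on tab as the replacement delimiter, it may be worth confirming that the data really contains no tabs. The following is a minimal sketch of such a check; the function name contains_tab and the sample record are made up for illustration.

```shell
# Succeeds (exit status 0) if the given file contains at least one tab.
contains_tab() {
    grep -q "$(printf '\t')" "$1"
}

# Hypothetical sample file, created only for this demonstration.
sample=$(mktemp)
printf '%s\n' '"Data, with comma",SomeData,1,SomeData' > "$sample"

if contains_tab "$sample"; then
    echo "tab found: pick another delimiter"
else
    echo "no tabs: tab should be safe as the delimiter"
fi

rm -f "$sample"
```

With your own file, something like `contains_tab data.csv` would do the same check.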
Using csvkit:
$ csvformat -D$'\t' data.csv
SomeData SomeData 1 SomeData
SomeData SomeData 1 SomeData
SomeData SomeData 2 SomeData
SomeData SomeData 3 SomeData
SomeData SomeData 1 SomeData
etc.
We can then pass this to an awk script (saved here as script.awk) that does the actual work of finding the groups.
NR > 1 && $3 == count + 1 {
    # This line is part of the set.
    ++count;        # We expect this value on the next line.
    ++set_size;     # This is the number of lines in the set.

    # Output previous line and remember this line.
    print previous_line;
    previous_line = $0;

    # Continue with next line.
    next;
}

set_size > 0 && $3 != count + 1 {
    # This line is not part of the set, but we're currently tracking a
    # set. This means that the set ended, so output the last line of
    # the set.
    print previous_line;
    set_size = 0;
}

{
    # This line might be part of the next set.
    count = $3;
    previous_line = $0;
}

END {
    # If the input ends while a set is still being tracked, output the
    # last line of that set too.
    if (set_size > 0) print previous_line;
}
Running it:
$ csvformat -D$'\t' data.csv | awk -F$'\t' -f script.awk
SomeData SomeData 1 SomeData
SomeData SomeData 2 SomeData
SomeData SomeData 3 SomeData
SomeData SomeData 1 SomeData
SomeData SomeData 2 SomeData
SomeData SomeData 3 SomeData
SomeData SomeData 4 SomeData
SomeData SomeData 5 SomeData
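To trace the logic without csvkit installed, the same rules can be tried on a small inline sample. This is only a sketch: the function name find_runs and the row labels (a to g) are made up, the input is plain comma-separated with no quoted commas, and an END block is included so that a run still open at end of input gets flushed.

```shell
# Run the grouping logic on standard input (comma-separated fields).
find_runs() {
    awk -F',' '
        NR > 1 && $3 == count + 1 {
            # Line continues the run: emit the held line, hold this one.
            ++count
            ++set_size
            print previous_line
            previous_line = $0
            next
        }
        set_size > 0 && $3 != count + 1 {
            # The run just ended: emit its last line.
            print previous_line
            set_size = 0
        }
        {
            # This line may start the next run.
            count = $3
            previous_line = $0
        }
        END {
            # Flush a run that is still open at end of input.
            if (set_size > 0) print previous_line
        }
    '
}

# Third column: 1 1 2 3 7 4 5
printf '%s\n' \
    'a,x,1,p' 'b,x,1,p' 'c,x,2,p' 'd,x,3,p' \
    'e,x,7,p' 'f,x,4,p' 'g,x,5,p' | find_runs
# prints:
# b,x,1,p
# c,x,2,p
# d,x,3,p
# f,x,4,p
# g,x,5,p
```

Rows a and e are dropped because neither is followed by its own value plus one, so neither starts a run.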
Then just convert it back to standard comma-delimited form:
$ csvformat -D$'\t' data.csv | awk -F$'\t' -f script.awk | csvformat -d$'\t'
SomeData,SomeData,1,SomeData
SomeData,SomeData,2,SomeData
SomeData,SomeData,3,SomeData
SomeData,SomeData,1,SomeData
SomeData,SomeData,2,SomeData
SomeData,SomeData,3,SomeData
SomeData,SomeData,4,SomeData
SomeData,SomeData,5,SomeData
If the data is free of commas inside the data fields, you may leave csvformat out of it completely:
$ awk -F',' -f script.awk data.csv
SomeData,SomeData,1,SomeData
SomeData,SomeData,2,SomeData
SomeData,SomeData,3,SomeData
SomeData,SomeData,1,SomeData
SomeData,SomeData,2,SomeData
SomeData,SomeData,3,SomeData
SomeData,SomeData,4,SomeData
SomeData,SomeData,5,SomeData
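One rough way to test the "free of commas" assumption is to count the fields awk sees on each line when splitting on every comma. The sketch below assumes every record should have exactly four fields, as in the sample data; the function name odd_lines and the sample records are made up for illustration. It is a heuristic only and does not parse CSV quoting.

```shell
# Print the numbers of the lines whose comma-split field count differs
# from the expected count (4 here, matching the sample; adjust as needed).
odd_lines() {
    awk -F',' -v expected=4 'NF != expected { print NR }' "$@"
}

# The second record has a quoted comma, so awk sees 5 fields there.
printf '%s\n' \
    'SomeData,SomeData,1,SomeData' \
    '"Data, with comma",SomeData,2,SomeData' | odd_lines
# prints:
# 2
```

If `odd_lines data.csv` prints nothing, the plain awk -F',' invocation is probably fine; otherwise go through csvformat as shown earlier.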
$line in the awk script but should have written line. Sorry about that. – Chris Davies Feb 02 '17 at 17:03