
I have a CSV file of approximately 1000 rows, and when I import it where it is supposed to go, I get an error on row 700. However, the entries in this CSV contain newlines (inside quoted fields), so I cannot quickly use awk or similar tools to show what row 700 is.

So I found Is there a robust command line tool for processing csv files?, and installed both csvfix and csvkit; however, it seems neither of these applications supports simply specifying a row number (or a range of rows) and outputting those rows. For instance:

$ csvfix help echo
echo input CSV data to output
usage: csvfix echo [flags] [file ...]
where flags are:
  -ibl      ignore blank input lines
  -sep s    specify CSV field separator character
  -rsep s   as for -sep but retain separator on output
  -osep s   specifies output separator
  -hdr s    write the string s out as a header record
  -ifn      ignore field name record
  -smq      use smart quotes on output
  -sqf fields   specify fields that must be quoted
  -o file   write output to file rather than standard output
  -skip t   if test t is true, do not process or output record

I would have thought echo is what I need, if only I could specify which row(s) are to be echoed, but when I look at http://neilb.bitbucket.org/csvfix/manual/csvfix16/csvfix.html?unique.html, only columns are described.

How could I use these tools - or other tools - to simply dump say row 700 (or rows 702-705) from a 1000-row CSV to stdout?


EDIT: I found (at http://neilb.bitbucket.org/csvfix/manual/csvfix16/ExpressionLanguage.html) that csvfix has:

csvfix find -if '$line == 407' data.csv

... however, this is indeed a line number, not a row number; so if the row starts at line 406, breaks onto line 407, and ends there, then the above command will output nothing - but if you go one line back, -if '$line == 406', then the row is dumped. This is useful too, but still isn't a row number...

sdbbs

3 Answers


The csvfix find command does support dumping rows by number or range. The following command would extract lines 3 and 4 from a file called file.csv:

csvfix find -if '$line >= 3 && $line < 5' file.csv
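Assuming the same expression syntax scales to larger files, the range asked about in the question (rows 702-705) could presumably be dumped with something like:

csvfix find -if '$line >= 702 && $line <= 705' file.csv

As the question's edit notes, $line refers to physical line numbers rather than logical records, so the range may need adjusting when records span multiple lines.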

You could temporarily remove all quoted newlines so that normal text tools can be used, and re-add the newlines afterwards.

For example, in the case of double quotes:

gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "%NEWLINE%") } { printf("%s%s", $0, RT) }' file.csv > tmp.csv
head -n 700 tmp.csv | sed 's/%NEWLINE%/\n/g' > file_1-700.csv
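Once the newlines are flattened, any line-oriented selection works on tmp.csv; for instance, a specific range such as rows 702-705 could be pulled out and restored with (assuming the same GNU sed \n replacement as above):

sed -n '702,705p' tmp.csv | sed 's/%NEWLINE%/\n/g'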
rudimeier
  • Thanks @rudimeier - this is a good suggestion; if I don't manage to find anything about a csv tool that can do that, then I'll have to use this. – sdbbs Sep 27 '16 at 16:24

You can get a position out of perl's Text::CSV_XS like this:

perl -MText::CSV_XS -E 'open(my $fh, "<:encoding(utf8)", $ARGV[0]) or die "open: $!"; $csv = Text::CSV_XS->new({binary => 1, auto_diag => 9, diag_verbose => 1 } ); while (my $row = $csv->getline($fh)) { say tell $fh }' FILENAME.csv

Note the FILENAME.csv at the end of the line.

After successfully parsing each row, it'll print the byte offset.

Unpacking the one-liner:

use Text::CSV_XS;
use feature 'say';
open(my $fh, '<:encoding(utf8)', $ARGV[0]) or die "open: $!";
$csv = 'Text::CSV_XS'->new({'binary' => 1, 'auto_diag' => 9, 'diag_verbose' => 1});
while (my $row = $csv->getline($fh)) {
    say tell $fh
}
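If an actual row number is wanted alongside each offset, a counter can be added to the loop; a minimal variation of the above (the $n counter is an addition, not part of the original one-liner):

perl -MText::CSV_XS -E 'open(my $fh, "<:encoding(utf8)", $ARGV[0]) or die "open: $!"; my $csv = Text::CSV_XS->new({binary => 1, auto_diag => 9, diag_verbose => 1}); my $n = 0; while (my $row = $csv->getline($fh)) { say ++$n, ": ", tell $fh }' FILENAME.csv

The last number printed before the error is then the last row that parsed cleanly.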

I fed the original one-liner this faulty CSV (new.csv):

r1c1,"r1
c2",r1c3
r2c1,"r2c2,r2c3
r3c1,r3c2,r3c3

Output:

18
# CSV_XS ERROR: 2027 - EIQ - Quoted field not terminated @ rec 1 pos 15 field 2

(if there were more good rows before the corrupt one, there would be more byte offsets printed. Use the last one.)

So after byte 18, it found an error. Easy enough to get a line number from that: head -c 18 new.csv | wc -l, which says 2 (the number of good lines). So the error is on line 3 - and indeed it is, the quote around r2c2 is not closed.
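Since the last reported offset marks the end of the last good record, the offending record itself can also be inspected directly; tail -c +N starts output at byte N, so in this example:

tail -c +19 new.csv | head -n 1

prints the broken r2c1,"r2c2,r2c3 line.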

derobert