11

I have a huge CSV file with 10 fields separated by commas. Unfortunately, some lines are malformed and do not contain exactly 10 commas (which causes some problems when I want to read the file into R). How can I extract only the lines that contain exactly 10 commas?

sitems
  • Your question and the linked question are not the same question: you ask how to handle lines with no more and no fewer than a certain number of matches, whereas that question requires only a minimum match count. That question is also more easily answered - it doesn't require scanning a line in full, or (at least, as the sed here does) any further than one match more than is looked for, though this question does. You should not have closed this. – mikeserv Jan 13 '16 at 18:42
  • Actually, looking closer, the asker there does want no more and no fewer matches; that question needs a new title. But the grep answer there is not an acceptable answer for either question... – mikeserv Jan 13 '16 at 18:52

8 Answers

25

Another POSIX one:

awk -F , 'NF == 11' <file

If the line has 10 commas, then there will be 11 fields in this line. So we simply make awk use , as the field delimiter. If the number of fields is 11, the condition NF == 11 is true and awk performs the default action, print $0.
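
For example, a quick sanity check on a hypothetical three-line sample (the data is made up for illustration):

$ printf 'a,b,c,d,e,f,g,h,i,j,k\ntoo,few\na,b,c,d,e,f,g,h,i,j,k,l\n' | awk -F , 'NF == 11'
a,b,c,d,e,f,g,h,i,j,k

Only the first line has exactly 10 commas (11 fields); the others have too few or too many and are dropped.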

EightBitTony
cuonglm
  • That's actually the first thing that came to my mind on this question. I thought it was overkill, but looking at the code... it sure is clearer. For the benefit of others: -F sets the field separator and NF refers to the number of fields in a given line. Since no code block {statement} is appended to the condition NF == 11, the default action is to print the line. (@cuonglm, feel free to incorporate this explanation if you like.) – Wildcard Jan 13 '16 at 09:16
  • +1: Very elegant and readable solution that is also very general. I can, e.g., find all malformed lines with awk -F , 'NF != 11' <file – sitems Jan 13 '16 at 10:08
  • @gardenhead: It's easy to get, as you can see from the OP's comment. I sometimes answer from my mobile, so it's difficult to add a detailed explanation. – cuonglm Jan 13 '16 at 10:15
  • @mikeserv: No, sorry if I confused you, it's just my bad English. You can't have 11 fields with 1-9 commas. – cuonglm Jan 13 '16 at 10:27
  • Oh! Of course, right - if they're the splitter they make eleven fields. Thanks, man. – mikeserv Jan 13 '16 at 10:38
  • +1 for using the right tool for the job. But I wonder why you use '< file' instead of just 'file' to pass it as a parameter to awk (which also lets awk know its filename, etc.)? – Olivier Dulac Jan 13 '16 at 13:45
  • @OlivierDulac: It guards you against a file whose name starts with - or is just -. – cuonglm Jan 13 '16 at 15:42
  • Oh, then "./file" instead of "file" is the portable way ^^ (and it can be reused in many other situations, so I would rather get used to that one, with or without the "<", even if the latter makes it superfluous). Thanks for the reason. – Olivier Dulac Jan 13 '16 at 16:33
  • @OlivierDulac how is that "the portable way"? <file is defined by POSIX, very common and works with any command that takes a filename as input. Are you thinking of <(command) instead? – terdon Jan 13 '16 at 17:26
  • @terdon: When you say "any command that takes a filename as input", do you mean "any command (program) that takes *data* as *standard input*"? Because it's not true for any command that takes a filename as a command-line argument; for example, cp and rm. And some commands won't look at stdin unless you give them a - argument; e.g., file and stat. – G-Man Says 'Reinstate Monica' Jan 13 '16 at 18:25
  • I compared this to mine and to the grep, like so: yes "$(set 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1; IFS=,; for s in 1 2 3; do shift; echo "$*"; done)" | head -n 5000000 | ...cmd... | wc -l - and this was three times faster than the sed, which completed half again as fast as the grep, which took the longest. – mikeserv Jan 13 '16 at 19:15
  • Umm... another check and I may have been hasty - I guess which sed is very important. While still trailing awk by a full 3 seconds to completion, my statically-linked musl libc heirloom sed does in 8 seconds what my GNU sed requires 18 to do. The variation between greps was not significant, but the gawk was still faster than the mawk. – mikeserv Jan 13 '16 at 19:24
  • @G-Man I meant any command that takes a file name and then opens it to process the data it contains. I just wanted to point out that there is nothing non-portable about < file. – terdon Jan 13 '16 at 22:28
  • This solution will not work with empty fields – tashuhka May 03 '16 at 14:31
  • @tashuhka: It will, can you show an example? – cuonglm May 03 '16 at 14:33
  • @cuonglm True. Thank you for the correction :) – tashuhka May 03 '16 at 14:56
  • Or if you have lots of little (or big) files and you want to merge them, you could use a variant of cat `ls *.csv` | awk -F , 'NF == 11' > merged.log and then change that file's extension to csv when done. – mpag Mar 23 '18 at 21:54
8

Using egrep (or grep -E in POSIX):

egrep "^([^,]*,){10}[^,]*$" file.csv

This filters out anything not containing 10 commas: it matches full lines (^ at the start and $ at the end), containing exactly ten repetitions ({10}) of the sequence "any number of characters except ',', followed by a single ','" (([^,]*,)), followed again by any number of characters except ',' ([^,]*).
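
For example, on a couple of hypothetical lines (the data is invented for illustration):

$ printf '1,2,3,4,5,6,7,8,9,10,11\n1,2,3\n' | grep -E '^([^,]*,){10}[^,]*$'
1,2,3,4,5,6,7,8,9,10,11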

You can also use the -x parameter to drop the anchors:

grep -xE "([^,]*,){10}[^,]*" file.csv

This is less efficient than cuonglm's awk solution though; the latter is typically six times faster on my system for lines with around 10 commas. Longer lines will cause huge slowdowns.
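
If you want to reproduce the comparison, a rough timing harness along these lines will do (a sketch: the sample line and the 5,000,000-line count are arbitrary choices, loosely following the benchmark described in the comments above):

$ yes 'a,b,c,d,e,f,g,h,i,j,k' | head -n 5000000 > big.csv
$ time awk -F , 'NF == 11' <big.csv >/dev/null
$ time grep -xE '([^,]*,){10}[^,]*' big.csv >/dev/null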

Stephen Kitt
5

The simplest grep code that will work:

grep -xE '([^,]*,){10}[^,]*'

Explanation:

-x ensures that the pattern must match the entire line, rather than just part of it. This is important so you don't match lines with more than 10 commas.

-E means "extended regex", which makes for less backslash-escaping in your regex.

Parentheses are used for grouping, and the {10} afterwards means there must be exactly ten matches in a row of the pattern within the parentheses.

[^,] is a character class—for instance, [c-f] would match any single character that is a c, a d, an e or an f, and [^A-Z] would match any single character that is NOT an uppercase letter. So [^,] matches any single character except a comma.

The * after the character class means "zero or more of these."

So the regex part ([^,]*,) means "any character except a comma, any number of times (including zero times), followed by a comma," and the {10} specifies 10 of these in a row. Finally, [^,]* matches the rest of the non-comma characters to the end of the line.
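
To illustrate why -x matters, take a hypothetical line with 11 commas (made-up data): without whole-line matching, the pattern still finds a 10-comma substring inside it, so the line would slip through.

$ printf 'a,b,c,d,e,f,g,h,i,j,k,l\n' | grep -E '([^,]*,){10}[^,]*' | wc -l
1
$ printf 'a,b,c,d,e,f,g,h,i,j,k,l\n' | grep -xE '([^,]*,){10}[^,]*' | wc -l
0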

Wildcard
5
sed -ne's/,//11;t' -e's/,/&/10p' <in >out

That first branches out any line with 11 or more commas, and then, of what remains, prints only those lines that match 10 commas.

Apparently I answered this before... Here's a me-plagiarism from a question looking for exactly 4 occurrences of some pattern:

You can target the [num]th occurrence of a pattern with a sed s///ubstitution command just by adding the [num] to the command. When you test for a successful substitution and don't specify a target :label, the test branches out of the script. This means all you have to do is test for s///5 or more commas, then print what remains.

Or, at least, that handles the lines which exceed your maximum of 4. Apparently you also have a minimum requirement. Luckily, that is just as simple:

sed -ne 's|,||5;t' -e 's||,|4p'

...just replace the 4th occurrence of , on a line with itself and tack your print on to the s///ubstitution flags. Because any lines matching , 5 or more times have already been pruned, the lines containing 4 , matches contain only 4.
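
Back to this question's 10-comma case, the one-liner at the top can be spelled out as a commented multi-line script (a sketch in GNU sed syntax; the behavior is identical):

sed -n '
  # try to delete an 11th comma; this only succeeds on lines with more than 10
  s/,//11
  # t branches to the end of the script after a successful substitution,
  # so those lines are silently dropped
  t
  # replace the 10th comma with itself; the p flag prints the line on success
  s/,/&/10p
' <in >out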

mikeserv
  • @cuonglm - that is what I had at first, actually, but people are always telling me I should write more readable code. Since I can read the stuff others dispute as unreadable, I'm not sure what to keep and what to drop...? So I put in the second comma. – mikeserv Jan 13 '16 at 10:04
  • @cuonglm - you can mock me - it won't hurt my feelings. I can take a joke. If you were mocking me, it was a little funny. It's ok - I just wasn't sure and wanted to know. In my opinion, people should be able to laugh at themselves. Anyway, I still don't get it! – mikeserv Jan 13 '16 at 10:34
  • Haha, right, that's very positive thinking. Anyway, it's very funny to chat with you, and sometimes you stress my brain. – cuonglm Jan 13 '16 at 10:42
  • It's interesting that in this answer, if I replace s/hello/world/2 with s//world/2, GNU sed works fine. With the two seds from heirloom, /usr/5bin/posix/sed segfaults and /usr/5bin/sed goes into an infinite loop. – cuonglm Jan 14 '16 at 04:23
  • @mikeserv, in reference to our earlier discussion about sed and awk (in comments)—I like this answer and upvoted it, but notice the translation of the accepted awk answer is: "Print lines with 11 fields" and the translation of this sed answer is: "Attempt to remove the 11th comma; skip to next line if you fail. Attempt to replace the 10th comma with itself; print line if you succeed." The awk answer gives the instructions to the computer just the way you would express them in English. (awk is good for field based data.) – Wildcard Jan 14 '16 at 06:24
  • @Wildcard - I never argued that it wasn't, only that sed was more flexible. And it is - you can skip to next line if it fails because the tests are your own to construct - you have more control. And so on the next line the field delimiter could change entirely if you like - because you define fields with each match - or even in the middle of this one. It is more basic - more fundamental - and, in my hands at least, far more powerful. I don't see how the points you bring up here are supposed to contradict any I offered there. – mikeserv Jan 14 '16 at 06:54
  • Anyway, I don't consider that English would make a very good programming language. – mikeserv Jan 14 '16 at 07:00
  • @cuonglm - it didn't for me - it processed 5 million lines without issue - several times. – mikeserv Jan 14 '16 at 07:15
4

Throwing in some short Python:

#!/usr/bin/env python2
with open('file.csv') as f:
    print '\n'.join(line.rstrip('\n') for line in f if line.count(',') == 10)

This reads each line and checks whether the number of commas in the line is equal to 10 (line.count(',') == 10); if so, it prints the line. The rstrip('\n') strips each kept line's trailing newline, since join adds newlines back between lines; without it the output would be double-spaced.
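
Since the file is huge, you may prefer a streaming variant that writes matching lines as it reads them, instead of building the whole result in memory (a sketch of the same logic):

#!/usr/bin/env python2
import sys

with open('file.csv') as f:
    for line in f:
        # keep the line only if it contains exactly 10 commas
        if line.count(',') == 10:
            sys.stdout.write(line)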

heemayl
2

And here's a Perl way:

perl -F, -ane 'print if $#F==10'

The -n causes perl to read its input file line by line and execute the script given by -e on each line. The -a turns on automatic splitting: each input line will be split on the value given by -F (here, a comma) and saved as the array @F.

The $#F (or, more generally, $#array) is the highest index of the array @F. Since arrays start at 0, a line with 11 fields will have a $#F of 10. The script, therefore, prints the line if it has exactly 11 fields.
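
A quick sample run on made-up data:

$ printf 'a,b\n1,2,3,4,5,6,7,8,9,10,11\n' | perl -F, -ane 'print if $#F==10'
1,2,3,4,5,6,7,8,9,10,11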

terdon
  • You could also do print if @F==11, as an array in a scalar context returns the number of elements. – Sobrique Jan 14 '16 at 09:34
2

If fields can contain commas or newlines, your code needs to understand CSV. Example (with three columns):

$ cat filter.csv
a,b,c
d,"e,f",g
1,2,3,4
one,two,"three
...continued"

$ cat filter.csv | python3 -c 'import sys, csv
> csv.writer(sys.stdout).writerows(
> row for row in csv.reader(sys.stdin) if len(row) == 3)
> '
a,b,c
d,"e,f",g
one,two,"three
...continued"

I suppose that most solutions so far would mishandle the second row and the final multi-line record.

0

To display those lines NOT containing the exact number of delimiters:

[ $(awk -F, 'NF != 11' <myfile.csv | wc -l) -gt 0 ] &&
    printf "\nPlease maintain the comma separator for the following line(s) in the myfile.csv: \n\n" &&
    awk -F, 'NF != 11' <myfile.csv &&
    echo &&
    exit 1

Useful for those looking to validate their input data files.

This one-liner is taken from my open source project called Automated_Quiz, which is hosted here: https://sourceforge.net/projects/automated-quiz/
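
If you'd rather not scan the file twice, a single-pass variant along the same lines (a sketch, using the same hypothetical myfile.csv and field count as above):

bad=$(awk -F, 'NF != 11' <myfile.csv)
[ -n "$bad" ] && printf "\nPlease maintain the comma separator for the following line(s) in the myfile.csv: \n\n%s\n\n" "$bad" && exit 1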