I have a huge CSV file with 10 fields separated by commas. Unfortunately, some lines are malformed and do not contain exactly 10 commas (which causes problems when I want to read the file into R). How can I filter out only the lines that contain exactly 10 commas?
8 Answers
Another POSIX one:
awk -F , 'NF == 11' <file
If the line has 10 commas, then there will be 11 fields in that line. So we simply make awk use , as the field delimiter. If the number of fields is 11, the condition NF == 11 is true, and awk then performs the default action, print $0.
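This behavior is easy to check on a small sample (the file names and line contents below are made up for illustration):

```shell
# Create a sample file: two well-formed lines (10 commas each)
# and one malformed line.
printf '%s\n' 'a,b,c,d,e,f,g,h,i,j,k' '1,2,3' 'q,r,s,t,u,v,w,x,y,z,0' > sample.csv

# Keep only lines that split into exactly 11 comma-separated fields.
awk -F , 'NF == 11' sample.csv > clean.csv

cat clean.csv
# a,b,c,d,e,f,g,h,i,j,k
# q,r,s,t,u,v,w,x,y,z,0
```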

- That's actually the first thing that came to my mind on this question. I thought it was overkill, but looking at the code... it sure is clearer. For the benefit of others: -F sets the field separator and NF refers to the number of fields in a given line. Since no code block {statement} is appended to the condition NF == 11, the default action is to print the line. (@cuonglm, feel free to incorporate this explanation if you like.) – Wildcard Jan 13 '16 at 09:16
- +1: Very elegant and readable solution that is also very general. I can, e.g., find all malformed lines with awk -F , 'NF != 11' <file – sitems Jan 13 '16 at 10:08
- @gardenhead: It's easy to get, as you can see from the OP's comment. I sometimes answer from my mobile, so it's difficult to add a detailed explanation. – cuonglm Jan 13 '16 at 10:15
- @mikeserv: No, sorry if I confused you, it's just my bad English. You can't have 11 fields with 1-9 commas. – cuonglm Jan 13 '16 at 10:27
- Oh! Of course! Right - if they're the splitter they make eleven fields. Thanks, man. – mikeserv Jan 13 '16 at 10:38
- +1 for using the right tool for the job. But I wonder why you use '< file' instead of just 'file' to pass it as a parameter to awk (which also lets awk know its filename, etc.)? – Olivier Dulac Jan 13 '16 at 13:45
- Oh, then "./file" instead of "file" is the portable way ^^ (as it can be reused in many other situations, so I would rather get used to that one, with or without the "<", even if the latter makes it superfluous). Thanks for the reason. – Olivier Dulac Jan 13 '16 at 16:33
- @OlivierDulac how is that "the portable way"? <file is defined by POSIX, very common, and works with any command that takes a filename as input. Are you thinking of <(command) instead? – terdon Jan 13 '16 at 17:26
- @terdon: When you say "any command that takes a filename as input", do you mean "any command (program) that takes *data* as *standard input*"? Because it's not true for any command that takes a filename as a command-line argument; for example, cp and rm. And some commands won't look at stdin unless you give them a - argument; e.g., file and stat. – G-Man Says 'Reinstate Monica' Jan 13 '16 at 18:25
- I compared this to mine and to the grep like: yes "$(set 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1; IFS=,; for s in 1 2 3; do shift; echo "$*"; done)" | head -n 5000000 | ...cmd... | wc -l and this was three times faster than the sed, which completed half again as fast as the grep, which took the longest. – mikeserv Jan 13 '16 at 19:15
- Umm... another check and I may have been hasty - I guess which sed is very important. While still trailing awk by a full 3 seconds to completion, my statically-linked musl libc heirloom sed does in 8 seconds what my GNU sed requires 18 to do. The variation between greps was not significant, but the gawk was still faster than the mawk. – mikeserv Jan 13 '16 at 19:24
- @G-Man I meant any command that takes a file name and then opens it to process the data it contains. I just wanted to point out that there is nothing non-portable about < file. – terdon Jan 13 '16 at 22:28
- Or if you have lots of little (or big) files and you want to merge them, you could use a variant of cat `ls *.csv` | awk -F , 'NF == 11' > merged.log then change that file's extension to csv when through. – mpag Mar 23 '18 at 21:54
Using egrep (or grep -E in POSIX):
egrep "^([^,]*,){10}[^,]*$" file.csv
This filters out anything not containing 10 commas: it matches full lines (^ at the start and $ at the end) containing exactly ten repetitions ({10}) of the sequence "any number of characters except ',', followed by a single ','" (([^,]*,)), followed again by any number of characters except ',' ([^,]*).
You can also use the -x parameter to drop the anchors:
grep -xE "([^,]*,){10}[^,]*" file.csv
This is less efficient than cuonglm's awk solution though; the latter is typically six times faster on my system for lines with around 10 commas. Longer lines will cause huge slowdowns.
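A quick check on made-up sample lines:

```shell
# One line with exactly 10 commas, one with too few, one with too many;
# only the first survives the anchored pattern.
printf '%s\n' '1,2,3,4,5,6,7,8,9,10,11' 'a,b,c' '1,2,3,4,5,6,7,8,9,10,11,12' |
  grep -E '^([^,]*,){10}[^,]*$'
# 1,2,3,4,5,6,7,8,9,10,11
```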

The simplest grep code that will work:
grep -xE '([^,]*,){10}[^,]*'
Explanation:
-x ensures that the pattern must match the entire line, rather than just part of it. This is important so you don't match lines with more than 10 commas.
-E means "extended regex", which makes for less backslash-escaping in your regex.
Parentheses are used for grouping, and the {10} afterwards means there must be exactly ten matches in a row of the pattern within the parentheses.
[^,] is a character class - for instance, [c-f] would match any single character that is a c, a d, an e or an f, and [^A-Z] would match any single character that is NOT an uppercase letter. So [^,] matches any single character except a comma.
The * after the character class means "zero or more of these."
So the regex part ([^,]*,) means "any character except a comma, any number of times (including zero times), followed by a comma," and the {10} specifies 10 of these. Then [^,]* matches the rest of the non-comma characters to the end of the line.
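The effect of -x can be seen by comparing against a plain unanchored match: a line with more than 10 commas still contains a 10-comma substring, so it would slip through without -x (the sample line here is invented):

```shell
line='1,2,3,4,5,6,7,8,9,10,11,12'   # 11 commas - malformed

# Unanchored: the pattern matches a substring, so the malformed line passes.
printf '%s\n' "$line" | grep -cE '([^,]*,){10}[^,]*'
# 1

# With -x the pattern must cover the whole line, so the line is rejected.
printf '%s\n' "$line" | grep -cxE '([^,]*,){10}[^,]*'
# 0
```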

sed -ne's/,//11;t' -e's/,/&/10p' <in >out
That first branches out any line with 11 or more commas, and then prints, of what remains, only those lines that match 10 commas.
Apparently I answered this before... Here's some self-plagiarism from a question looking for exactly 4 occurrences of some pattern:
You can target the [num]th occurrence of a pattern with a sed s///ubstitution command by just adding the [num] to the command. When you test for a successful substitution and don't specify a target :label, the test branches out of the script. This means all you have to do is test for 5 or more commas with s///5, then print what remains. Or, at least, that handles the lines which exceed your maximum of 4. Apparently you also have a minimum requirement. Luckily, that is just as simple:
sed -ne 's|,||5;t' -e 's||,|4p'
...just replace the 4th occurrence of , on a line with itself and tack your print on to the s///ubstitution flags. Because any lines matching , 5 or more times have already been pruned, the lines containing 4 , matches contain only 4.
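Here is the two-command script from the top of this answer run over the same kind of sample (file name and contents are illustrative):

```shell
printf '%s\n' '1,2,3,4,5,6,7,8,9,10,11' 'too,few' '1,2,3,4,5,6,7,8,9,10,11,12' > in

# s/,//11 succeeds only if an 11th comma exists; t then branches out,
# so lines with 11+ commas are dropped. s/,/&/10 replaces the 10th
# comma with itself, and its p flag prints the lines where it succeeds.
sed -ne 's/,//11;t' -e 's/,/&/10p' < in
# 1,2,3,4,5,6,7,8,9,10,11
```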
- @cuonglm - that is what I had actually, at first, but people are always telling me I should write more readable code. Since I can read the stuff others dispute as unreadable, I'm not sure what to keep and what to drop...? So I put the second comma. – mikeserv Jan 13 '16 at 10:04
- @cuonglm - you can mock me - it won't hurt my feelings. I can take a joke. If you were mocking me it was a little funny. It's ok - I just wasn't sure and wanted to know. In my opinion, people should be able to laugh at themselves. Anyway, I still don't get it! – mikeserv Jan 13 '16 at 10:34
- Haha, right, that's very positive thinking. Anyway, it's very funny to chat with you and sometimes you stress my brain. – cuonglm Jan 13 '16 at 10:42
- It's interesting that in this answer, if I replace s/hello/world/2 with s//world/2, GNU sed works fine. With the two seds from heirloom, /usr/5bin/posix/sed raises a segfault and /usr/5bin/sed goes into an infinite loop. – cuonglm Jan 14 '16 at 04:23
- @mikeserv, in reference to our earlier discussion about sed and awk (in comments): I like this answer and upvoted it, but notice the translation of the accepted awk answer is: "Print lines with 11 fields," while the translation of this sed answer is: "Attempt to remove the 11th comma; skip to the next line if you fail. Attempt to replace the 10th comma with itself; print the line if you succeed." The awk answer gives the instructions to the computer just the way you would express them in English. (awk is good for field-based data.) – Wildcard Jan 14 '16 at 06:24
- @Wildcard - I never argued that it wasn't, only that sed was more flexible. And it is - you can skip to the next line if it fails because the tests are your own to construct - you have more control. And so on the next line the field delimiter could change entirely if you like - because you define fields with each match - or even in the middle of this one. It is more basic - more fundamental - and, in my hands at least, far more powerful. I don't see how the points you bring up here are supposed to contradict any I offered there. – mikeserv Jan 14 '16 at 06:54
- Anyway, I don't think English would make a very good programming language. – mikeserv Jan 14 '16 at 07:00
- @cuonglm - it didn't for me - it processed 5 million lines without issue - several times. – mikeserv Jan 14 '16 at 07:15
Throwing in some short python:
#!/usr/bin/env python2
with open('file.csv') as f:
    print '\n'.join(line for line in f if line.count(',') == 10)
This will read each line and check whether the number of commas in the line is equal to 10 (line.count(',') == 10); if so, it will print the line.
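The same idea can be tried from the command line (using python3 syntax here, since the print statement above is python2; the file name is illustrative):

```shell
printf '%s\n' 'a,b,c,d,e,f,g,h,i,j,k' 'short,line' > file.csv

# Same filter as the script above, as a python3 one-liner.
python3 -c "
with open('file.csv') as f:
    print(''.join(line for line in f if line.count(',') == 10), end='')
"
# a,b,c,d,e,f,g,h,i,j,k
```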

And here's a Perl way:
perl -F, -ane 'print if $#F==10'
The -n causes perl to read its input file line by line and execute the script given by -e on each line. The -a turns on automatic splitting: each input line will be split on the value given by -F (here, a comma) and saved as the array @F.
The $#F (or, more generally, $#array) is the highest index of the array @F. Since arrays start at 0, a line with 11 fields will have a $#F of 10. The script, therefore, prints the line if it has exactly 11 fields.

- You could also do print if @F==11, as an array in a scalar context returns the number of elements. – Sobrique Jan 14 '16 at 09:34
If fields can contain commas or newlines, your code needs to understand csv. Example (with three columns):
$ cat filter.csv
a,b,c
d,"e,f",g
1,2,3,4
one,two,"three
...continued"
$ cat filter.csv | python3 -c 'import sys, csv
> csv.writer(sys.stdout).writerows(
> row for row in csv.reader(sys.stdin) if len(row) == 3)
> '
a,b,c
d,"e,f",g
one,two,"three
...continued"
I suppose that most solutions so far would discard the second and fourth rows.

To display those lines NOT containing the exact number of delimiters:
[ $(awk -F, '!(NF == 11)' <myfile.csv | wc -l) -gt 0 ] && printf "\nPlease maintain the comma separator for the following line(s) in the myfile.csv: \n\n" && awk -F, '!(NF == 11)' <myfile.csv && echo && exit 1
Useful for those looking to validate their input data files.
This one-liner is taken from my open source project called Automated_Quiz, which is hosted here: https://sourceforge.net/projects/automated-quiz/
