I have a .tsv file (values separated by tabs) with four values per line. So each line should have exactly three tabs, with some text around each tab, like this:
value value2 value3 value4
But it looks like some lines are broken (they contain more than three tabs). I need to find these lines.
I came up with the following grep pattern:
grep -v "^[^\t]+\t[^\t]+\t[^\t]+\t[^\t]+$"
My thinking:
- the first ^ matches the beginning of the line
- [^\t]+ matches one or more non-tab characters
- \t matches single tab character
- $ matches end
Then I just put these pieces in the right order, the correct number of times. That should match the correct lines, so I inverted the match with the -v option to get the wrong lines.
But with the -v option it matches every line in the file, and also some random text I tried that doesn't have any tabs in it.
What is my mistake please?
EDIT: I am using Debian and bash.
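The symptom described above can be reproduced with a small sample file. The sketch below assumes GNU grep (as shipped with Debian) and bash; the sample data and the variable names `sample`, `all_lines` and `bad_lines` are made up for illustration. It shows the likely cause: in grep's default (basic) syntax `+` is a literal plus sign and `\t` is not a tab, so the pattern matches nothing and `-v` prints everything. A hedged variant uses `-E` together with bash's `$'...'` quoting, which turns `\t` into a real tab before grep sees it.

```shell
# Build a small sample: two good lines (3 tabs) and one broken line (4 tabs).
sample=$(mktemp)
printf 'a\tb\tc\td\n'    >  "$sample"
printf 'e\tf\tg\th\n'    >> "$sample"
printf 'i\tj\tk\tl\tm\n' >> "$sample"

# Original pattern: with basic regex syntax, + is a literal plus sign and
# \t is not a tab, so no line matches and -v prints every line.
# (Newer GNU grep versions may warn about the stray backslash, hence 2>/dev/null.)
all_lines=$(grep -v "^[^\t]+\t[^\t]+\t[^\t]+\t[^\t]+$" "$sample" 2>/dev/null)

# Variant: -E makes + a quantifier, and $'...' lets bash insert real tabs.
bad_lines=$(grep -Ev $'^[^\t]+\t[^\t]+\t[^\t]+\t[^\t]+$' "$sample")

printf 'original pattern prints:\n%s\n' "$all_lines"
printf 'fixed pattern prints:\n%s\n' "$bad_lines"
rm -f "$sample"
```

Note that `[^\t]+` requires every field to be non-empty, so a line with an empty value would also be reported as broken; replacing `+` with `*` relaxes that if empty fields are acceptable.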
grep for lines with at least four tabs: grep '\t.*\t.*\t.*\t' ? – Philippos Aug 16 '22 at 11:53
.* ends immediately before a tab, so it need never backtrack into the middle of one, only to a different tab-position, which is either unnecessary (in the case that the pattern matches) or proven impossible by a fixed-string search (in the case that it doesn't). – hobbs Aug 17 '22 at 03:40
.*? is also a Perl-ism, and won't work in standard grep. The awk solution terdon posted half a day earlier runs in one tenth of the time. – ilkkachu Aug 17 '22 at 07:48
grep -P made it even faster than the awk, with grep -E at around 3 s, awk at 0.3 s and grep -P at less than 0.1 s, even without changing the regexes. – ilkkachu Aug 17 '22 at 08:12
$'^([^\t]*\t){24}' and 20 to 30 tab input lines I got about a 10x speed difference between Busybox and GNU grep's -E engine, and another ~10x between Busybox and GNU grep -P. With -E being slowest, Busybox in the middle and grep -P being fastest. And awk slightly slower than grep -P. – ilkkachu Aug 17 '22 at 11:46
grep -Pv "^(.+\t){29}.+$" test.csv fails with grep: exceeded PCRE's backtracking limit. grep -Pv "^([^\t]+\t){29}[^\t]+$" test.csv and grep -Pv "^(.+?\t){29}.+$" test.csv both return instantly. With more lines but "only" 19 tabs, the .+ version returns, but is ~10000 times slower than with .+? or [^\t]+. – Eric Duminil Aug 20 '22 at 11:32
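For the four-column case from the question, the two approaches the comments compare could look roughly like the sketch below. It assumes a GNU grep built with PCRE support (for `-P`) and any POSIX awk; the sample data and the variable names `pcre_bad` and `awk_bad` are hypothetical, and the awk one-liner is only a guess at the kind of solution the comments refer to, not a copy of it.

```shell
# Sample with one good line (3 tabs) and one broken line (4 tabs).
sample=$(mktemp)
printf 'a\tb\tc\td\n'    >  "$sample"
printf 'i\tj\tk\tl\tm\n' >> "$sample"

# PCRE understands \t itself, so plain single quotes are enough.
# [^\t]+ cannot run across a tab, which leaves little room for backtracking.
pcre_bad=$(grep -Pv '^([^\t]+\t){3}[^\t]+$' "$sample")

# awk: split each line on tabs and print lines whose field count is not 4.
awk_bad=$(awk -F'\t' 'NF != 4' "$sample")

printf 'grep -P reports:\n%s\n' "$pcre_bad"
printf 'awk reports:\n%s\n' "$awk_bad"
rm -f "$sample"
```

The awk version only counts fields, so unlike the `[^\t]+` pattern it accepts lines with empty values as long as there are exactly three tabs.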