
I have a .tsv file (values separated by tabs) with four values per line. So each line should have exactly three tabs, with some text around each tab, like this:

value   value2  value3  value4

But it looks like some lines are broken (they have more than three tabs). I need to find those lines.


I came up with the following grep pattern.

grep -v "^[^\t]+\t[^\t]+\t[^\t]+\t[^\t]+$"

My thinking:

  • the first ^ matches the beginning of the line
  • [^\t]+ matches one or more non-tab characters
  • \t matches a single tab character
  • $ matches the end of the line

Then I just put these together in the right order, repeated the correct number of times. That should match the correct lines, so I inverted it with the -v option to get the broken ones.

But with the -v option it matches every line in the file, including some random text I tried that doesn't have any tabs in it.

What is my mistake, please?

EDIT: I am using Debian and bash.

TGar
  • Why not simply grep for lines with at least four tabs: grep '\t.*\t.*\t.*\t'? – Philippos Aug 16 '22 at 11:53
  • @Philippos I thought that my solution was more bulletproof. But you are right, your solution should be enough for most cases. Actually, though, your solution also doesn't work for my file. It looks like \t matches regular 't' characters... Could it be? – TGar Aug 16 '22 at 12:02
  • @Philippos Ok, so it was really that. \t cannot be used simply like this. Both solutions work great with -P option. See https://stackoverflow.com/a/1825573/6372038 – TGar Aug 16 '22 at 12:07
  • @steeldriver not sure about the rules now. If I had seen the question you are posting before posting mine, it would have helped me for sure, and I would probably have been able to come up with a solution. But it is not a direct answer to my specific case. – TGar Aug 16 '22 at 12:14
  • @Philippos: "why not simply... ?" As far as I can tell, your regex is much less efficient than OP's, and involves a lot of backtracking. You could replace the greedy '.*' with a non-greedy '.*?', but it wouldn't be much more readable than the original. – Eric Duminil Aug 17 '22 at 02:47
  • @EricDuminil a sensible regex engine doesn't ever need to backtrack on that pattern because it knows that every .* ends immediately before a tab, so it need never backtrack into the middle of one, only to a different tab-position, which is either unnecessary (in the case that the pattern matches) or proven impossible by a fixed-string search (in the case that it doesn't). – hobbs Aug 17 '22 at 03:40
  • @EricDuminil, on a quick test, the difference in execution times for a grep of those two regexes over a 317250-line file was about 9% in favor of the original, hardly making the other "much less efficient". (On my Debian, with GNU grep, after fixing the regexes to actually work.) .*? is also a Perl-ism, and won't work in standard grep. The awk solution terdon posted half a day earlier runs in one tenth of the time. – ilkkachu Aug 17 '22 at 07:48
  • @ilkkachu thanks a lot for checking, and the info about '.*?'. 9% is nothing to sneeze at, especially with just 4 tabs. CSV files can have many more. For what it's worth, I find the original regex the most readable and straightforward. No tabs, tab, no tabs, tab, no tabs, ..... – Eric Duminil Aug 17 '22 at 07:55
  • @EricDuminil, well, it depends, I guess. I was thinking that if those 9% matter, then one is probably using a dedicated tool already, and not grep or regexes (given we can just scan the lines character-by-character and count the tabs here, and any other processing would probably be easier by splitting the fields first, instead of trying to figure out regexes for the full line). But what's hilarious is that switching to grep -P made it even faster than the awk, with grep -E at around 3 s, awk at 0.3 s and grep -P at less than 0.1 s, even without changing the regexes. – ilkkachu Aug 17 '22 at 08:12
  • @ilkkachu, thanks for the info, once again. It would be interesting to check with, let's say, 20 or 30 tabs. – Eric Duminil Aug 17 '22 at 09:33
  • @EricDuminil, hmm, increasing the number of tabs on each line looks to make it even worse for GNU grep. But it looks like there's big differences between regex engines. With a regex like $'^([^\t]*\t){24}' and 20 to 30 tab input lines I got about a 10x speed difference between Busybox and GNU grep's -E engine, and another ~10x between Busybox and GNU grep -P. With -E being slowest, Busybox in the middle and grep -P being fastest. And awk slightly slower than grep -P. – ilkkachu Aug 17 '22 at 11:46
  • (I didn't check if some of those are the glibc implementation, or if GNU grep and/or Busybox have regex engines of their own. Also I didn't bother with different variants of the regexes themselves; anything would just get swamped in the difference between the engines.) – ilkkachu Aug 17 '22 at 11:48
  • @ilkkachu impressive work. "Would get swamped..." I don't know much about regexen, but I'd expect the greedy version from Philippos to backtrack a lot, possibly with exponential complexity depending on the number of tabs (?). – Eric Duminil Aug 17 '22 at 13:11
  • @ilkkachu: I just checked, with 5 lines, 30 values on each line, and 29 tabs in-between (https://pastebin.com/si0NPzYi). grep -Pv "^(.+\t){29}.+$" test.csv fails with grep: exceeded PCRE's backtracking limit. grep -Pv "^([^\t]+\t){29}[^\t]+$" test.csv and grep -Pv "^(.+?\t){29}.+$" test.csv both return instantly. With more lines but "only" 19 tabs, the .+ version returns, but is ~10000 times slower than with .+? or [^\t]+. – Eric Duminil Aug 20 '22 at 11:32
  • @hobbs: If you're interested, here's a comparison between both regexen : https://regex101.com/r/ms7tYb/1 requires ~million steps. One tab more and regex101 returns a "catastrophic backtracking" error. https://regex101.com/r/634lCP/1 requires 500 steps, and would work with hundreds of tabs. – Eric Duminil Aug 20 '22 at 11:44
  • @EricDuminil no, that's a comparison of two regexes that are both different from the ones that were under discussion. You added capturing and a quantifier, which turns something that would have been okay into something disastrously stupid. – hobbs Aug 21 '22 at 04:02
  • @hobbs No need for derogatory remarks. My point still stands without capturing group or quantifiers, which were only used for quick experimenting. Feel free to try with https://regex101.com/r/ms7tYb/2 and https://regex101.com/r/634lCP/2 , then. The difference in performance is the same as with quantifiers, and their structure is the same as the original regexen. https://regex101.com/r/65iUMi/1 still leads to "catastrophic backtracking", without any capturing group. – Eric Duminil Aug 21 '22 at 07:59

4 Answers


As you already saw, \t isn't special in Basic Regular Expressions (BRE), and grep uses BRE by default. GNU grep, the default on Linux, has the -P option for Perl-Compatible Regular Expressions, which lets you use \t for tab characters.

However, what you want is much easier to do with awk. Just set the input field separator to a tab (-F'\t') and then print any line whose number of fields (NF) is not 4:

awk -F'\t' 'NF!=4' file

That will print all lines in file with more or fewer than four fields. To limit it to only lines with more than four fields, use:

awk -F'\t' 'NF>4' file
terdon
  • standard ERE doesn't support \t for a tab either (and neither does GNU grep as far as I can tell, though it supports some of the Perl backslash-letter escapes). [\t] matches a backslash or the letter t in standard BRE or ERE. – ilkkachu Aug 16 '22 at 13:36
  • Oh wow. All these years, I never knew that! I tend to use -P by default but I really thought ERE also handled that. Maybe because GNU sed does support it or something. Thanks, @ilkkachu, and fixed. – terdon Aug 16 '22 at 13:58
  • The behaviour of \t in both BRE and ERE is unspecified by POSIX, so what it matches will depend on the implementation. It may match TAB, t (like GNU grep) or \t (like ksh93's builtin grep) or anything. [\t] is required to match either a backslash or t, though (though some implementations of some tools, such as GNU sed, are not compliant by default in that regard). – Stéphane Chazelas Aug 16 '22 at 16:38
grep -v "^[^\t]+\t[^\t]+\t[^\t]+\t[^\t]+$"

Here, grep will use basic regular expressions (BRE), since it wasn't given the -E option. Unlike in extended regular expressions (ERE), + is not special in BRE and matches itself. Also, in standard regular expressions, the backslash is not special inside bracket expressions, so [\t] matches a backslash or the letter t, and [^\t] matches anything but those.

Outside of bracket groups, what \t matches is unspecified by the standard and in practice varies with the implementation. For instance, with GNU grep, it matches t, whilst with ast-open's grep, it matches a TAB character.

If you want to match a tab in standard regexes, you'll need to pass a literal tab to grep, e.g. with the $'...' form of quoting supported by many shells. (That form isn't standard either, yet; in a strictly standard shell you'd have to use printf to produce the tab character.)

So grep $'a\tb' would look for a and b separated by a tab, and grep $'a\t\t*b' or grep $'a\t\\{1,\\}b' or grep -E $'a\t+b' would look for a and b separated by at least one tab.
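For example, applying this to the question's own pattern in bash (a sketch; sample.tsv and its contents are invented): the $'...' quoting turns each \t into a real tab before grep sees it, and -E makes + work as a quantifier.

```shell
printf 'ok1\tok2\tok3\tok4\n'           >  sample.tsv
printf 'bad1\tbad2\tbad3\tbad4\tbad5\n' >> sample.tsv

# The $'...' string contains real tabs; grep -Ev prints lines that do
# NOT consist of exactly four non-empty tab-separated fields
# (prints only the "bad1..." line)
grep -Ev $'^[^\t]+\t[^\t]+\t[^\t]+\t[^\t]+$' sample.tsv
```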

ilkkachu
  • Re: "you'll need to pass a literal tab to grep": the easiest way to do this in Linux is Ctrl+V followed by Tab; no need for $''. – Reinstate Monica Aug 17 '22 at 18:15
  • @ReinstateMonica, well, "easiest" is up to opinion anyway, but I rather like using $'\t' over a hard tab because you can see it's a tab and not just a pile of spaces. Also tabs sometimes break when copy-pasting, and e.g. SE's editors destroy them. That Ctrl+V is a shell thing (not Linux), and depending on the shell it might not be supported or required, though almost all shells I tried support it (but not Busybox, which does support $'...' though). And if you're writing a script in a separate file, you might well be able to just hit Tab to get the character, but that's up to your editor. – ilkkachu Aug 18 '22 at 08:32

OK, so I found out the issue. I cannot use \t in grep like this; it just matches the normal letter t.

Options for how to match the tab character are shown in this question on SO.

I solved my case by adding the -P option to my command, so this works:

grep -Pv "^[^\t]+\t[^\t]+\t[^\t]+\t[^\t]+$"

Another option was pointed out by @Philippos in the comments (just matching lines with at least four tabs). It also needs the -P option, though:

grep -P '\t.*\t.*\t.*\t'
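Both -P commands can be checked against a small invented sample file; each should report only the second (broken) line:

```shell
printf 'a\tb\tc\td\n'    >  sample.tsv
printf 'a\tb\tc\td\te\n' >> sample.tsv

# Exactly four non-empty fields, inverted: reports malformed lines
grep -Pv "^[^\t]+\t[^\t]+\t[^\t]+\t[^\t]+$" sample.tsv

# At least four tabs anywhere in the line
grep -P '\t.*\t.*\t.*\t' sample.tsv
```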
TGar

As others have already pointed out, \t in a regex does not stand for TAB. So the obvious solution would be to add a literal TAB character, which bash may make a bit tricky. However, you can enter a literal TAB using ^V (Ctrl+V, then Tab).

Maybe it's more convenient to set a variable, TAB='<Ctrl+V Tab>'. The other thing is that + is treated literally in non-extended (basic) regular expressions (BRE; see "Basic vs Extended Regular Expressions" in man grep), so use:

grep -v "^[^$TAB]\+$TAB[^$TAB]\+$TAB[^$TAB]\+$TAB[^$TAB]\+$"

(Here you could also shorten the variable name to just T. It's not necessary to write ${TAB} (or ${T}), but be prepared for surprises.)

Also, if you prefer egrep (i.e. grep -E), you can use a repeated group like this:

egrep -v "^([^$TAB]+$TAB){3}[^$TAB]+$"
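Putting the variable approach together (a sketch using TAB=$(printf '\t') rather than Ctrl+V so the snippet survives copy-paste, and grep -E in place of egrep; the file name is invented):

```shell
TAB=$(printf '\t')       # a literal tab character, copy-paste safe
printf 'a\tb\tc\td\n'    >  sample.tsv
printf 'a\tb\tc\td\te\n' >> sample.tsv

# Repeated group: three "field + tab" units, then a final field.
# -v inverts the match, so only the broken (five-field) line is printed.
grep -Ev "^([^$TAB]+$TAB){3}[^$TAB]+$" sample.tsv
```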
U. Windl
  • Or TAB=$(printf '\t'). Note that \+ is a GNUism. The standard equivalent is \{1,\}. Or you can switch to ERE with -E and then use +. But here, maybe * would make more sense to allow those TAB-separated fields to be empty, like the awk-based solution does. – Stéphane Chazelas Aug 17 '22 at 08:06
  • Or you can wrap the regex in $'' to have the shell expand \t into literal TAB characters for you. You will however have to escape any backslashes you want to retain. – TooTea Aug 17 '22 at 15:25
  • @TooTea: That's also a bashism, not necessarily valid with non-bash shells that are often preferred for running scripts. – R.. GitHub STOP HELPING ICE Aug 17 '22 at 22:29
  • @R..GitHubSTOPHELPINGICE True, but OP's question is explicitly about Bash. Also, if you're talking about dash, note that it does support $'', just like many other common shells. – TooTea Aug 18 '22 at 11:45