My task is to find lines with egrep in which the first word contains exactly three of the same letter. I've tried to use backreferences, but I only found a way to build a pattern that finds words containing 3 or more of the same character:

egrep -i '^[^[:alpha:]]*\<[a-z]*([a-z])[a-z]*(\1[a-z]*){2}\>'
steeldriver
  • 81,074

3 Answers

The following matches any "word" at the beginning of a line consisting of only 3 of the same [:alpha:] character:

grep -i '^\([[:alpha:]]\)\1\1\b' 

Or, with grep's -E (--extended-regexp) or -P (aka --perl-regexp) options:

grep -iE '^([[:alpha:]])\1\1\b'

grep -iP '^([[:alpha:]])\1\1\b'

These work with GNU grep and (except for the -P version) with FreeBSD's grep. They may not work with other versions of grep.
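As a quick sanity check of the pattern above, here is a minimal sketch that runs the -E version against a few made-up sample lines. The function name first_word_triple and the sample lines are illustrative assumptions, not part of the original answer, and the sketch assumes GNU grep (for \b and back-references under -E).

```shell
# Hypothetical helper: wraps the -E pattern from the answer and reads stdin.
first_word_triple() {
  grep -iE '^([[:alpha:]])\1\1\b'
}

# Made-up sample lines, one per line of input.
printf '%s\n' 'aaa' 'zzz top' 'aaab' 'aaaa' 'banana' | first_word_triple
```

Only aaa and zzz top should be printed: aaab and aaaa fail because \b requires the word to end right after the third letter, and banana does not start with a tripled letter.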


If you want to match words of any length containing 3 or more of the same alpha character anywhere within them, it's a bit more difficult. The pattern below uses a negative lookahead, which requires Perl-compatible regular expressions.

i.e. this pattern cannot be used with grep -E (aka egrep, which has been deprecated).

For example:

$ grep -iP '^[[:alpha:]]*([[:alpha:]])((?:(?!\1)[[:alpha:]])*\1){2}[[:alpha:]]*\b' /usr/share/dict/words
Aaliyah
Aaliyah's
Aarau
Aargau
Aaronical
Abadan
Abbottstown
Abbottstown's
Aberdeen
Aberdeen's
...
zoozoo
zoozoos
zuzzes
zwitterionic
zygogeneses
zygomorphous
zymogeneses
zyzzyva
zyzzyvas
zzz

(according to wc -l, this matches 67117 out of the 344817 words in my /usr/share/dict/words file)
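To see what the pattern does without a dictionary file, the following sketch feeds it a handful of sample words. The function name three_or_more and the word list are assumptions made for illustration, and the sketch assumes a GNU grep built with PCRE support (-P).

```shell
# Hypothetical helper: the "3 or more of the same letter" pattern, reading stdin.
three_or_more() {
  grep -iP '^[[:alpha:]]*([[:alpha:]])((?:(?!\1)[[:alpha:]])*\1){2}[[:alpha:]]*\b'
}

# banana     has 3 a's            -> printed
# tattooists has 4 t's            -> printed (4 counts as "3 or more")
# zoologist  has 3 o's            -> printed
# apple      has no tripled letter -> not printed
# o'keeffe   the apostrophe ends the first word after "o" -> not printed
printf '%s\n' banana tattooists zoologist apple "o'keeffe" | three_or_more
```

The (?!\1) lookahead keeps each filler letter different from the captured one, so every repetition of the group consumes exactly one further occurrence of that letter.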


And, finally, to match only words with exactly 3 of the same [:alpha:] character anywhere within them:

$ grep -iP '^[[:alpha:]]*([[:alpha:]])((?:(?!\1)[[:alpha:]])*\1){2}[[:alpha:]]*\b' /usr/share/dict/words | 
  grep -viP '^[[:alpha:]]*([[:alpha:]])((?:(?!\1)[[:alpha:]])*\1){3}'

The first grep finds words with 3 or more of the same character, and the second excludes those with 4 or more.

I'm not sure if this can be done with a single regex or not.

(this matches 56820 words in my /usr/share/dict/words file).
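The two-stage filter can be tried on a few sample words, and compared with one possible single-regex variant. This is only a sketch: the function names, the word list, and in particular the single pattern (a positive lookahead for "some letter at least 3 times" plus a negative lookahead for "any letter 4 or more times") are assumptions of this example rather than something from the answer, and it assumes a GNU grep built with PCRE support.

```shell
# Patterns from the answer: at least 3, and at least 4, of the same letter.
at_least_three='^[[:alpha:]]*([[:alpha:]])((?:(?!\1)[[:alpha:]])*\1){2}[[:alpha:]]*\b'
at_least_four='^[[:alpha:]]*([[:alpha:]])((?:(?!\1)[[:alpha:]])*\1){3}'

# Two-stage filter, as in the answer: keep ">= 3", then drop ">= 4".
exactly_three_pipeline() {
  grep -iP "$at_least_three" | grep -viP "$at_least_four"
}

# Hypothetical single-regex variant (an assumption, not the author's method):
# the first lookahead requires a letter that occurs at least 3 times,
# the second rejects the word if any letter occurs 4 or more times.
single='^(?=[[:alpha:]]*?([[:alpha:]])(?:(?:(?!\1)[[:alpha:]])*\1){2})(?![[:alpha:]]*([[:alpha:]])(?:[[:alpha:]]*\2){3})[[:alpha:]]+\b'
exactly_three_single() {
  grep -iP "$single"
}

# Made-up sample words.
sample_words() {
  printf '%s\n' banana puppy tattooists recklessness apple
}

sample_words | exactly_three_pipeline
sample_words | exactly_three_single
```

Both commands should print banana and puppy. tattooists (4 t's) and recklessness (3 e's but 4 s's) are dropped because a letter occurs 4 times, which follows the "can't have more than 3 of the same character" interpretation; apple never passes the first stage.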

cas
  • 78,579
  • Your 2nd version will match tattooists (4 a's and 4 t's, but no letter 3 times) but not zoologist (3 s's). Backreferences are NOT the way to go about it. –  Sep 02 '19 at 04:13
  • huh? there's only 1 a in tattooist. and 4 ts. the 2nd version matches 3 or more, as mentioned. zoologist is a problem, i'll look into that. – cas Sep 02 '19 at 04:22
  • BTW, back-references with negative lookaheads are the way to do it. In fact, that is the ONLY way to do it with a single regular expression alone. – cas Sep 02 '19 at 04:25
  • messed the letters a bit when copy-pasted, sorry: so 4 t's in tattooists, no letter exactly 3 times as required in the OP. –  Sep 02 '19 at 04:25
  • can you not read? 2nd version is explicitly documented as "If you want to match words of any length containing 3 or more of the same alpha character anywhere within them". 3 or more encompasses 4. – cas Sep 02 '19 at 04:27
  • Could it be done with regexps + backreferences at all? that's an interesting question. –  Sep 02 '19 at 04:40
  • @UncleBilly it seems that it can. and it also seems that you need to know that back-references and look-aheads (or look-arounds in general) are not the same thing. Related, in that they're both part of regular expressions (as are character classes and quantifiers), but not the same. – cas Sep 02 '19 at 05:40
  • Still not there: recklessness, underdeveloped, assassination, inconvenience, transcontinental, maintainability, etc. –  Sep 02 '19 at 06:26
  • you still can't read. the 2nd version (without the grep -viP) will match 3 or more of the same alpha char. as has been mentioned multiple times. The version with grep -viP excludes any 4-or-more matches. – cas Sep 02 '19 at 06:41
  • recklessness has exactly 3 es and your final code doesn't match it. Read your own "finally, to match only words with exactly 3 of the same ...". Do you need some code to check your regexps against? –  Sep 02 '19 at 07:52
  • recklessness also has 4 s chars in it, so it is excluded. The OP hasn't specified how that conflict is to be resolved, or even if it IS a conflict, so I'm going with "can't have more than 3 of the same character". If you have a different interpretation, feel free to write your own answer. – cas Sep 02 '19 at 08:46
  • Then banana should be excluded too, because it has 3 as, but only 2 ns. And puppy too, because it has only 1 u and 1 y. –  Sep 02 '19 at 10:17
  • 1 or 2 of any letter is irrelevant. Stop trying so hard for faulty excuses to be pedantic, it's both undignified and demeaning. – cas Sep 02 '19 at 10:27
  • behave yourself 2. you're not fooling anyone –  Sep 02 '19 at 10:28
  • I am not attempting to fool anyone. Go pester someone else with your false pedantry - you don't even have the excuse of being right. – cas Sep 02 '19 at 10:32
  • Fails to match gewürztraminer (3 r's) or o'keeffe (with 3 e's). I am not sure why .... –  Sep 02 '19 at 22:53
  • Should tastelessness teleconferencing teletypewriter teletypewriters tempestuousness timelessness tintinnabulation tintinnabulations tirelessness transcontinental transgressors transubstantiation (just to list a few) be rejected? All have exactly 3 of one character while also having 4 of some other character. –  Sep 02 '19 at 23:00
  • @Isaac gewürztraminer may be due to locale or unicode issues. o'keeffe definitely because the embedded ' triggers \b. The other words you listed - as already mentioned more than 3 of same character = exclude, even if there's exactly 3 of another. – cas Sep 02 '19 at 23:42
  • Allowing apostrophes is easy enough. I decided against it because that would match on the possessive form of all words with 2 s characters. – cas Sep 02 '19 at 23:46