There is no way to build such a regex with only ERE (extended regular expressions).
The closest you can get with GNU grep and Perl regexes (-P), which matches words with 3 or more repeats of the same character, is:
grep -P '(\w)(((?!\1)\w)*\1){2}' filename
So, by then removing the words with 4 or more repeats, you get an answer:
grep -P '(\w)(((?!\1)\w)*\1){2}' filename |
grep -Pv '(\w)(((?!\1)\w)*\1){3}'
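As a quick check of the two-stage filter, here is a minimal sketch that runs it on a few sample words. The function name exactly3_grep and the sample words are chosen only for illustration, and it assumes a GNU grep built with -P support.

```shell
# Hypothetical wrapper around the two-stage filter shown above.
exactly3_grep() {
    # first stage keeps lines where some word character occurs 3+ times,
    # second stage drops lines where some word character occurs 4+ times
    grep -P '(\w)(((?!\1)\w)*\1){2}' |
    grep -Pv '(\w)(((?!\1)\w)*\1){3}'
}

printf '%s\n' banana puppy abc tattooists independence | exactly3_grep
```

Only banana and puppy should get through: abc has no tripled character, tattooists has 4 t's, and independence has 4 e's.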
An alternative with GNU awk is:
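awk '{
    a = $1                               # work on a copy of the word
    while (length(a)) {
        # delete every occurrence of the first character (it is used as a regex)
        b = gensub(substr(a, 1, 1), "", "g", a)
        # exactly 3 characters removed: print the line and go to the next one
        if (length(a) - length(b) == 3) { print $0; next }
        a = b                            # otherwise retry with what is left
    }
}' filename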
It works by removing every occurrence of the first character. If exactly 3 characters were removed, the line is printed; otherwise the same test is repeated on the first character of what remains, until there are no characters left. (An improvement is to keep testing only while the remaining length is equal to or greater than the required number of repeats.)
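The improvement mentioned in the parenthesis could look roughly like the sketch below. It is not the original code: it swaps gensub for a plain character-by-character comparison, so it should run on any POSIX awk and does not treat the character as a regex. The names exactly3_awk and need are made up for this example.

```shell
# Hypothetical variant: stop as soon as fewer than "need" characters remain.
exactly3_awk() {
    awk -v need=3 '{
        a = $1
        # fewer than "need" characters left can never give a hit
        while (length(a) >= need) {
            c = substr(a, 1, 1); n = 0; rest = ""
            # count occurrences of c and collect every other character
            for (i = 1; i <= length(a); i++) {
                ch = substr(a, i, 1)
                if (ch == c) n++; else rest = rest ch
            }
            if (n == need) { print; next }
            a = rest
        }
    }' "$@"
}

printf '%s\n' banana independence tattooists abc | exactly3_awk
```

With these four words it should print banana and independence. It can be used like the original, for example exactly3_awk filename.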
Assuming that you want to count A as equivalent to a, first filter your file with:
tr '[:upper:]' '[:lower:]' < /usr/share/dict/words > words
The two solutions are similar but not equal. They differ on words like independence from the dictionary file generated above.
Yes, independence contains 3 n's but 4 e's. Depending on which is found first, the word may or may not be included. The awk solution is stable and will include words in which any character is repeated exactly 3 times. A regex solution is more slippery and will match in some conditions and not in others.
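To see why a word like this is ambiguous, it helps to look at its per-letter counts. A small sketch using standard tools follows; the helper name letter_counts is invented for this example, and it assumes plain ASCII input.

```shell
# Hypothetical helper: print how often each character occurs in a word.
letter_counts() {
    # one character per line, then count identical lines
    printf '%s\n' "$1" | fold -w 1 | sort | uniq -c
}

letter_counts independence
```

For independence the listing should show n with a count of 3 and e with a count of 4, which is exactly the conflict described above.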
Additionally, the regex matches only word characters, which do not include ' (and the file contains several words with that character).
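The effect of the apostrophe can be seen on a pair of similar words. This is only an illustration: the word glass's is picked for the example (not taken from the counts in this answer), three_or_more is a made-up name, and GNU grep with -P is assumed.

```shell
# Hypothetical wrapper around the first regex from this answer.
three_or_more() {
    grep -P '(\w)(((?!\1)\w)*\1){2}'
}

# both words contain three s characters, but in the first one an
# apostrophe (not a \w character) sits between the second and third s
printf '%s\n' "glass's" glasses | three_or_more
```

Only glasses should be printed, because the regex cannot step over the apostrophe to reach the third s.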
In all, the number of lines matched is (1527 more with awk):
13758 awklist
12231 greplist
And, removing the ' (184 more with awk):
9236 awklist2
9052 greplist2
Should tastelessness teleconferencing teletypewriter teletypewriters tempestuousness timelessness tintinnabulation tintinnabulations tirelessness transcontinental transgressors transubstantiation (just to list a few) be rejected?
All of them have exactly 3 of one character and four (or more) of another.
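A list of such words can be obtained by comparing the two outputs. The sketch below is one possible way, not necessarily how the list above was produced; the function name only_in_first is hypothetical, and it assumes the awklist and greplist files counted earlier hold one word per line.

```shell
# Hypothetical helper: print the lines of file $1 that do not appear in file $2.
only_in_first() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
    grep -vxFf "$2" "$1"
}

# usage, assuming both files exist:
#   only_in_first awklist greplist
```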
tattooists (4 a's and 4 t's, but no letter 3 times) but not zoologist (3 s's). Backreferences are NOT the way to go about it. – Sep 02 '19 at 04:13
tattooists, no letter exactly 3 times as required in the OP. – Sep 02 '19 at 04:25
grep -viP) will match 3 or more of the same alpha char. as has been mentioned multiple times. The version with grep -viP excludes any 4-or-more matches. – cas Sep 02 '19 at 06:41
recklessness has exactly 3 e's and your final code doesn't match it. Read your own "finally, to match only words with exactly 3 of the same ...". Do you need some code to check your regexps against? – Sep 02 '19 at 07:52
recklessness also has 4 s chars in it, so it is excluded. The OP hasn't specified how that conflict is to be resolved, or even if it IS a conflict, so I'm going with "can't have more than 3 of the same character". If you have a different interpretation, feel free to write your own answer. – cas Sep 02 '19 at 08:46
banana should be excluded too, because it has 3 a's, but only 2 n's. And puppy too, because it has only 1 u and 1 y. – Sep 02 '19 at 10:17
gewürztraminer (3 r's) or o'keeffe (with 3 e's). I am not sure why .... – Sep 02 '19 at 22:53
tastelessness teleconferencing teletypewriter teletypewriters tempestuousness timelessness tintinnabulation tintinnabulations tirelessness transcontinental transgressors transubstantiation (just to list a few) be rejected? All have exactly 3 of one character while also having 4 of some other character. – Sep 02 '19 at 23:00
' triggers \b. The other words you listed - as already mentioned more than 3 of same character = exclude, even if there's exactly 3 of another. – cas Sep 02 '19 at 23:42
s characters. – cas Sep 02 '19 at 23:46