check if a string has a character more than once

Question

I would like to check if a string contains a letter (not a specific letter, really any letter) more than once.

for example:

user:

test.sh this list

script:

if [ "$1" has some letter more than once ]
then 
do something
fi

andcoz · Answer 1 · 2015-12-13T13:42:19.740

5

You can use grep.

The regexp \(.\).*\1 matches any single character, followed by anything, followed by the same first character.

grep returns success if at least one line of its input matches the regex.

if echo "$1" | grep -q '\(.\).*\1' ; then  
  echo "match" ; 
fi

Note that \(.\) matches any character, not just letters, so you may have to restrict the regex to your own definition of "really any letter". You can use something like \([[:alnum:]]\).*\1, \([[:alpha:]]\).*\1 or \([a-df-z1245]\).*\1.
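As a minimal sketch of the letter-only variant, the test can be wrapped in a function. The name has_repeated_letter is hypothetical, and using printf instead of echo is an assumption made here so that strings starting with a dash or containing backslashes are passed through unchanged. Like the command above, it only finds repeats that sit on the same line.

```bash
#!/bin/sh
# has_repeated_letter STRING (hypothetical helper):
# succeed if some letter occurs more than once on one line of STRING.
has_repeated_letter() {
    # printf avoids echo's special handling of options and backslashes.
    printf '%s\n' "$1" | grep -q '\([[:alpha:]]\).*\1'
}

if has_repeated_letter "this list"; then
    echo "match"
fi
```

With this definition a string such as a1b1 does not match, because the only repeated character is a digit.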

edited Dec 13 '15 at 13:42

answered Dec 13 '15 at 13:35

Note that it will only work if those characters are on the same line (won't work if $1 is $'a\na' for instance), won't work for the newline characters. Depending on the implementation of echo, it won't work with strings like -nene or strings containing backslashes. You should avoid echo for arbitrary data – Stéphane Chazelas Dec 13 '15 at 14:05

Stéphane Chazelas · Answer 2 · 2023-12-24T14:09:11.470

c=$(expr " $string" : " .*\(.\).*\1") || [ "$c" = 0 ] &&
  printf '"%s" has "%s" (at least) more than once\n' "$string" "${c:-<newline>}"

(Two results have to be treated specially: 0, for which expr returns false even though it matched, and newline, which command substitution strips.)

To get a report of duplicate bytes, on a GNU system, you could do:

$ string=$'This is a string\nwith «multi-byte» «characters»\n'
$ printf %s "$string" | od -An -vtc -w1 | LC_ALL=C sort | LC_ALL=C uniq -dc
      5
      3    a
      2    c
      2    e
      3    h
      5    i
      3    r
      4    s
      5    t
      2   \n
      2  253
      2  273
      4  302

Bytes outside the ASCII range are shown as their octal value; control characters are shown either as their octal value or as their C-style backslash escape (such as \n).

To get a report of duplicate characters:

$ printf %s "$string" | recode ..dump | sort | uniq -dc
      2 000A   LF    line feed (lf)
      5 0020   SP    space
      3 0061   a     latin small letter a
      2 0063   c     latin small letter c
      2 0065   e     latin small letter e
      3 0068   h     latin small letter h
      5 0069   i     latin small letter i
      3 0072   r     latin small letter r
      4 0073   s     latin small letter s
      5 0074   t     latin small letter t
      2 00AB   <<    left-pointing double angle quotation mark
      2 00BB   >>    right-pointing double angle quotation mark

Note however that recode doesn't know about all Unicode characters (especially not the recent ones).

Using shell builtins.

In ksh93:

if [[ $string = *@(?)*\1* ]]; then
  print -r -- "$string contains duplicate characters"
fi

In zsh:

set -o rematchpcre
if [[ $string =~ '(.).*\1' ]]; then
  print -r -- "$string contains duplicate characters ($match[1] at least)"
fi

(would also work without set -o rematchpcre but only on systems where EREs support back-references as an extension over the standard).

Or to get the list of all duplicated characters:

typeset -A count=()
for c (${(s[])string}) if (( ++count[\$c] == 2 )) print -r -- $c is found more than once

score 2 · Answer 3 · edited Dec 13 '15 at 14:24

You could use fold to print the string one character per line, then uniq -c to count them and awk to print only those that appeared more than once:

$ string="foobar"
$ fold -w 1 <<< "$string" | sort | uniq -c | awk '$1>1'
      2 o

Or, if your shell doesn't support here strings:

printf '%s\n' "$string" | fold -w 1 | sort | uniq -c | awk '$1>1'

Then, you could test whether the command above returns an empty string or not:

$ string="foobar"
$ [ -n "$(fold -w 1 <<<"$string" | sort | uniq -c | awk '$1>1')" ] && echo repeated
repeated

You could then easily extend it to print the repeated character and the number of times it was repeated:

$ rep="$(fold -w 1 <<<"$string" | sort | uniq -c | awk '$1>1')"
$ [ -n "$rep" ] && printf -- "%s\n" "$rep"
    2 o

Note that it doesn't work for newline characters. With GNU fold, it doesn't work for multi-byte characters. — Stéphane Chazelas, Dec 13 '15 at 14:25

jubilatious1 · Answer 4 · 2024-01-18T20:11:51.767

Using Raku (formerly known as Perl_6)

At zsh command line:

% string=$'«AAÁÁÅÅÀÀÄÄBBßßœœþþ» CDE\n«X Y Z»\n'
% printf %s "$string"
«AAÁÁÅÅÀÀÄÄBBßßœœþþ» CDE
«X Y Z»
% printf %s "$string" | raku -e '$*IN.comb.raku.put;'
("«", "A", "A", "Á", "Á", "Å", "Å", "À", "À", "Ä", "Ä", "B", "B", "ß", "ß", "œ", "œ", "þ", "þ", "»", " ", "C", "D", "E", "\n", "«", "X", " ", "Y", " ", "Z", "»", "\n").Seq
% printf %s "$string" | raku -e 'slurp.comb.raku.put;'
("«", "A", "A", "Á", "Á", "Å", "Å", "À", "À", "Ä", "Ä", "B", "B", "ß", "ß", "œ", "œ", "þ", "þ", "»", " ", "C", "D", "E", "\n", "«", "X", " ", "Y", " ", "Z", "»", "\n").Seq

Using Hash:

% printf %s "$string" | raku -e 'my %h; %h{$_}++ for slurp.comb(); %h.pairs.sort.map({ print $_.key if $_.value > 1 });'
AB«»ÀÁÄÅßþœ%

OR using BagHash:

% printf %s "$string" | raku -e 'my %h = slurp.comb.BagHash; %h.pairs.sort.map({ print $_.key if $_.value > 1 });'
AB«»ÀÁÄÅßþœ%

Here are answers coded in Raku, a member of the Perl-family of programming languages that features high-level support for Unicode. Answers above use either a standard %-sigiled Hash, or a %-sigiled BagHash (second answer above). [Note zsh adds a % at the terminus to signify an incomplete final line].

In Raku, all text (excepting filenames) is normalized by default. For example, graphemes encoded via combining characters will be turned into one codepoint, per the Normalization Form C (NFC) specification. To get more grapheme/character information from the test string, you can use Raku's ord, encode and uniname functions:

% printf %s "$string" | raku -e 'my %h = slurp.comb.BagHash; %h.pairs.sort.map: { if .value > 1 { .put for ( .key, .key.ord, .key.encode.gist, .key.uniname ).join: "  " }};'
10  utf8:0x<0A>  <control-000A>
   32  utf8:0x<20>  SPACE
A  65  utf8:0x<41>  LATIN CAPITAL LETTER A
B  66  utf8:0x<42>  LATIN CAPITAL LETTER B
«  171  utf8:0x<C2 AB>  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
»  187  utf8:0x<C2 BB>  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
À  192  utf8:0x<C3 80>  LATIN CAPITAL LETTER A WITH GRAVE
Á  193  utf8:0x<C3 81>  LATIN CAPITAL LETTER A WITH ACUTE
Ä  196  utf8:0x<C3 84>  LATIN CAPITAL LETTER A WITH DIAERESIS
Å  197  utf8:0x<C3 85>  LATIN CAPITAL LETTER A WITH RING ABOVE
ß  223  utf8:0x<C3 9F>  LATIN SMALL LETTER SHARP S
þ  254  utf8:0x<C3 BE>  LATIN SMALL LETTER THORN
œ  339  utf8:0x<C5 93>  LATIN SMALL LIGATURE OE

Of course, if all you want to do is detect duplicate characters and exit, the following code works:

% printf %s "$string" | raku -e 'slurp.comb.BagHash.pairs.map: { $_.value > 1 && say("duplicate characters exist") && last };'
duplicate characters exist
#OR
% printf %s "$string" | raku -e 'any(slurp.comb.BagHash.pairs.map: *.value > 1).so && say("duplicate characters exist");'
duplicate characters exist

https://docs.raku.org/language/unicode#Normalization
https://docs.raku.org/language/traps#All_text_is_normalized_by_default
https://docs.raku.org/language/faq.html#String:_How_can_I_get_the_hexadecimal_representation_of_a_string%3F

duise · Answer 5 · 2023-12-24T14:21:35.860

Although the question was asked 8 years ago, it is tagged with bash, and the previous answers all require external tools or long pipelines that spawn multiple subshells, so I'd like to present a solution that uses only bash internals.

The function count_chars() works similarly to the PHP function of the same name. It takes a string as input and, for each character, records its number of occurrences in an associative array. The array that holds the result is passed by reference as the first argument.

It's then easy to get all the characters which fulfill the filter condition by looping through the index (keys).

EDIT: The updated code should work with Bash 4.3 and newer.

#!/bin/bash
# Count character occurrences in string $2. For each contained character, return
# the number of occurrences in the associative array $1.
# This is similar to the PHP function count_chars(), mode 1.
count_chars() {
    [ "$1" = "arr" ] || { declare -n arr 2>/dev/null || return 1; arr="$1"; }
    arr=( )
    local -i i
    local ch
    for (( i=0; i<${#2}; i++ )); do
        ch=${2:$i:1}
        # http://mywiki.wooledge.org/BashPitfalls#A.5B.5B_-v_hash.5B.24key.5D_.5D.5D
        [[ -v 'arr["$ch"]' ]] || arr["$ch"]="0"
        # Surprise, surprise--the increment works, despite
        # http://mywiki.wooledge.org/BashPitfalls#A.28.28_hash.5B.24key.5D.2B-.2B-_.29.29
        # (( ++arr["$ch"] )) EDIT: Bash 5.2+ only
        let '++arr["$ch"]'
    done
}
declare -A A=
count_chars A "Die Hoffnung stirbt zuletzt!"
for k in "${!A[@]}"; do
    (( ${A[$k]} > 1 )) && printf '%s|' "$k"
done
echo

This script will print out:

 |z|u|t|n|i|f|e|

The first result character is the blank. You can easily verify that this is correct:

$ declare -p A
declare -A A=(["!"]="1" [" "]="3" [H]="1" [D]="1" [z]="2" [u]="2" [t]="4" [s]="1" [r]="1" [o]="1" [n]="2" [l]="1" [i]="2" [g]="1" [f]="2" [e]="2" [b]="1" )

If you prefer an array to continue to work on, you could remove the nonmatching elements from the array:

for k in "${!A[@]}"; do
    (( ${A[$k]} > 1 )) || unset -v 'A[$k]'
done
declare -p A

Result:

declare -A A=([" "]="3" [z]="2" [u]="2" [t]="4" [n]="2" [i]="2" [f]="2" [e]="2" )
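If all that is needed is the yes/no test from the question, the loop can stop at the first repeat. The following is only a sketch under that assumption: has_repeated_char is a hypothetical name, and instead of an associative array it compares each character with the rest of the string, which costs more comparisons on long strings but sidesteps the array subscript pitfalls mentioned above.

```bash
#!/bin/bash
# has_repeated_char STRING (hypothetical helper):
# succeed if any character occurs at least twice in STRING.
has_repeated_char() {
    local str=$1 ch rest
    local -i i
    for (( i = 0; i < ${#str} - 1; i++ )); do
        ch=${str:i:1}       # current character
        rest=${str:i+1}     # everything after it
        # Quoting "$ch" makes [[ ]] match it literally, not as a pattern.
        [[ $rest == *"$ch"* ]] && return 0
    done
    return 1
}

if has_repeated_char "this list"; then
    echo "some character appears more than once"
fi
```

Because it uses only parameter expansion and [[ ]], newline characters are handled like any other character.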

See also How to use associative arrays safely inside arithmetic expressions? (where I've just edited my answer to cover the new bash 5.2 behaviour) — Stéphane Chazelas, Dec 24 '23 at 12:33
Thanks for the link, @StéphaneChazelas. If I understand you right, the one line in the function doing the increment (( ++arr["$ch"] )) should be changed to let 'arr[$ch]++' to make this work with older Bash versions 4.3+, which is the first approach in your 2023 edit in the linked answer, is that correct? — duise, Dec 24 '23 at 13:21
That [ "$1" = "arr" ] || { declare -n arr 2>/dev/null || return 1; arr="$1"; } to guard against arr being passed as the variable name may be bit pointless given you'll have similar problems with i or ch. Maybe better to namespace all the internal variables like local -n _count_chars_arr; local _count_chars_i or use ksh93 (where bash copied namerefs and most of its array design from) instead of bash where that has been more thought through. — Stéphane Chazelas, Dec 24 '23 at 15:06
@StéphaneChazelas: No, because i and ch are local variables. You only have those "circular name reference" issues with name references. Actually I was considering dropping bash when I first faced this but the guard so far works for me. — duise, Dec 24 '23 at 15:13
OIC. Bash really sucks in that regard compared to ksh. As a workaround, I'd change the name of all local variables to begin with an underscore. It is understood that this won't completely eliminate the issue when nesting function calls. One would need to make sure that local variable names are unique, e.g. by adding digits (one figure per function :-| ) — duise, Dec 24 '23 at 15:33
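The namespacing idea from these comments could look roughly like the sketch below. It is not the code from the answer: count_chars_ns and the _count_chars_ prefix are hypothetical names, it assumes bash 4.3 or newer for namerefs, and it swaps the arithmetic increment for a plain assignment.

```bash
#!/bin/bash
# Hypothetical variant of count_chars(): every internal name carries a
# _count_chars_ prefix so it is unlikely to clash with the caller's array name.
count_chars_ns() {
    local -n _count_chars_arr=$1 || return 1
    local _count_chars_str=$2 _count_chars_ch
    local -i _count_chars_i
    _count_chars_arr=()
    for (( _count_chars_i = 0; _count_chars_i < ${#_count_chars_str}; _count_chars_i++ )); do
        _count_chars_ch=${_count_chars_str:_count_chars_i:1}
        # Plain assignment instead of (( ++... )) or let; a missing key counts as 0.
        _count_chars_arr[$_count_chars_ch]=$(( ${_count_chars_arr[$_count_chars_ch]:-0} + 1 ))
    done
}

# The caller's array may now be called ch, i or arr without a guard.
declare -A ch=()
count_chars_ns ch "foobar"
printf '%s\n' "${ch[o]}"   # prints 2
```

As noted in the comments, a prefix only makes clashes unlikely; nested calls that reuse the same prefix can still collide.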
