I would like to check if a string contains a letter (not a specific letter, really any letter) more than once.
for example:
user:
test.sh this list
script:
if [ "$1" has some letter more than once ]
then
do something
fi
You can use grep.
The regexp \(.\).*\1 matches any single character, followed by anything, followed by the same first character. grep returns success if at least one line matches the regex.
if echo "$1" | grep -q '\(.\).*\1' ; then
echo "match" ;
fi
Note that \(.\) matches any character, not just letters, so you may need to restrict the regex to your own definition of "really any letter". You can use something like \([[:alnum:]]\).*\1, \([[:alpha:]]\).*\1 or \([a-df-z1245]\).*\1.
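As a minimal sketch of the letters-only variant, the test can be wrapped in a function. The name has_repeated_letter is hypothetical, and printf is used instead of echo so that arguments such as -n or strings containing backslashes are passed to grep unchanged. grep matches line by line, so a letter repeated only across two different lines is not detected.

```sh
# Succeeds if some letter occurs at least twice on one line of "$1".
# has_repeated_letter is an illustrative name, not part of the answer above.
has_repeated_letter() {
    printf '%s\n' "$1" | grep -q '\([[:alpha:]]\).*\1'
}

if has_repeated_letter "this list"; then
    printf '%s\n' "match"
fi
```

With [[:alpha:]] the digits in a string like a1b1 do not count, because only a captured letter can satisfy the back-reference.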
c=$(expr " $string" : " .*\(.\).*\1") || [ "$c" = 0 ] &&
printf '"%s" has "%s" (at least) more than once\n' "$string" "${c:-<newline>}"
(Two results have to be treated specially: 0, for which expr returns false, and newline, which command substitution strips.)
To get a report of duplicate bytes, on a GNU system, you could do:
$ string=$'This is a string\nwith «multi-byte» «characters»\n'
printf %s "$string" | od -An -vtc -w1 | LC_ALL=C sort | LC_ALL=C uniq -dc
5
3 a
2 c
2 e
3 h
5 i
3 r
4 s
5 t
2 \n
2 253
2 273
4 302
The bytes outside of the range covered by ASCII are represented as their octal value, the control characters with their octal value or their \x C representation (such as \n for newline).
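On systems where od lacks the GNU -w option, a similar byte-level report can be sketched with hexadecimal output split onto one byte per line by tr. The function name dup_bytes is hypothetical, and the output lists each duplicated byte once as a hex value rather than with a count.

```sh
# Print, as hex values, every byte that occurs more than once in "$1".
# dup_bytes is an illustrative name; only standard od/tr/sed/sort/uniq options are used.
dup_bytes() {
    printf %s "$1" |
        od -An -v -tx1 |      # hex dump, no offsets, no line squeezing
        tr -s ' ' '\n' |      # one byte per line
        sed '/^$/d' |         # drop the empty line left by leading blanks
        LC_ALL=C sort |
        LC_ALL=C uniq -d      # keep only values seen at least twice
}

dup_bytes 'abca'
```

An empty result means no byte is repeated, so [ -n "$(dup_bytes "$string")" ] can serve as the condition.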
To get a report of duplicate characters:
$ printf %s "$string" | recode ..dump | sort | uniq -dc
2 000A LF line feed (lf)
5 0020 SP space
3 0061 a latin small letter a
2 0063 c latin small letter c
2 0065 e latin small letter e
3 0068 h latin small letter h
5 0069 i latin small letter i
3 0072 r latin small letter r
4 0073 s latin small letter s
5 0074 t latin small letter t
2 00AB << left-pointing double angle quotation mark
2 00BB >> right-pointing double angle quotation mark
Note however that recode doesn't know about all Unicode characters (especially not the recent ones).
Using shell builtins.
In ksh93:
if [[ $string = *@(?)*\1* ]]; then
print -r -- "$string contains duplicate characters"
fi
In zsh:
set -o rematchpcre
if [[ $string =~ '(.).*\1' ]]; then
print -r -- "$string contains duplicate characters ($match[1] at least)"
fi
(This would also work without set -o rematchpcre, but only on systems where EREs support back-references as an extension over the standard.)
Or to get the list of all duplicated characters:
typeset -A count=()
for c (${(s[])string}) if (( ++count[\$c] == 2 )) print -r -- $c is found more than once
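For shells without back-reference matching in their patterns, a builtin-only check can be sketched with standard parameter expansion. This is an illustration, not taken from the answer above: the function name and the _hrc_ variable prefix are hypothetical, the variables are global because POSIX sh has no local, and the loop assumes the string is valid text in the current locale. Shells that are not multi-byte aware (dash, for instance) compare bytes rather than characters.

```sh
# Succeeds if any character of "$1" occurs again later in the string.
# Uses only POSIX parameter expansion and case, no external commands.
has_repeated_char() {
    _hrc_rest=$1
    while [ -n "$_hrc_rest" ]; do
        _hrc_tail=${_hrc_rest#?}              # string minus its first character
        _hrc_ch=${_hrc_rest%"$_hrc_tail"}     # the first character itself
        case $_hrc_tail in
            *"$_hrc_ch"*) return 0 ;;         # quoted, so matched literally
        esac
        _hrc_rest=$_hrc_tail
    done
    return 1
}

if has_repeated_char "this list"; then
    printf '%s\n' "contains duplicate characters"
fi
```

The scan is quadratic in the length of the string, which is usually acceptable for command-line arguments.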
You could use fold to print the string one character per line, then sort and uniq -c to count them, and awk to print only those that appeared more than once:
$ string="foobar"
$ fold -w 1 <<< "$string" | sort | uniq -c | awk '$1>1'
2 o
Or, if your shell doesn't support here strings:
printf '%s\n' "$string" | fold -w 1 | sort | uniq -c | awk '$1>1'
Then, you could test whether the command above returns an empty string or not:
$ string="foobar"
$ [ -n "$(fold -w 1 <<<"$string" | sort | uniq -c | awk '$1>1')" ] && echo repeated
repeated
You could then easily extend it to print the repeated character and the number of times it was repeated:
$ rep="$(fold -w 1 <<<"$string" | sort | uniq -c | awk '$1>1')"
$ [ -n "$rep" ] && printf -- "%s\n" "$rep"
2 o
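Since the question asks about letters, the same pipeline can be narrowed with a grep filter. The following is a sketch under two assumptions: the function name repeated_letters is hypothetical, and fold -w 1 is taken to split on characters, which holds for ASCII input but not for every fold implementation with multi-byte text (some count bytes).

```sh
# Print each letter that occurs more than once in "$1", one per line.
# repeated_letters is an illustrative name.
repeated_letters() {
    printf '%s\n' "$1" |
        fold -w 1 |             # one character per line
        grep '[[:alpha:]]' |    # keep letters only (drops blanks, digits, ...)
        sort |
        uniq -d                 # lines that appear at least twice
}

repeated_letters "this list"
```

Here uniq -d replaces the uniq -c and awk pair because only the repeated letters are wanted, not their counts.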
Using Raku (formerly known as Perl_6)
At the zsh command line:
% string=$'«AAÁÁÅÅÀÀÄÄBBßßœœþþ» CDE\n«X Y Z»\n'
% printf %s "$string"
«AAÁÁÅÅÀÀÄÄBBßßœœþþ» CDE
«X Y Z»
% printf %s "$string" | raku -e '$*IN.comb.raku.put;'
("«", "A", "A", "Á", "Á", "Å", "Å", "À", "À", "Ä", "Ä", "B", "B", "ß", "ß", "œ", "œ", "þ", "þ", "»", " ", "C", "D", "E", "\n", "«", "X", " ", "Y", " ", "Z", "»", "\n").Seq
% printf %s "$string" | raku -e 'slurp.comb.raku.put;'
("«", "A", "A", "Á", "Á", "Å", "Å", "À", "À", "Ä", "Ä", "B", "B", "ß", "ß", "œ", "œ", "þ", "þ", "»", " ", "C", "D", "E", "\n", "«", "X", " ", "Y", " ", "Z", "»", "\n").Seq
Using Hash:
% printf %s "$string" | raku -e 'my %h; %h{$_}++ for slurp.comb(); %h.pairs.sort.map({ print $_.key if $_.value > 1 });'
AB«»ÀÁÄÅßþœ%
OR using BagHash:
% printf %s "$string" | raku -e 'my %h = slurp.comb.BagHash; %h.pairs.sort.map({ print $_.key if $_.value > 1 });'
AB«»ÀÁÄÅßþœ%
The examples above are written in Raku, a member of the Perl family of programming languages that features high-level support for Unicode. They use either a standard %-sigiled Hash (first example) or a %-sigiled BagHash (second example). [Note that zsh adds a % at the end of the output to signify an incomplete final line.]
In Raku, all text (excepting filenames) is normalized by default. For example, graphemes encoded via combining characters will be turned into one codepoint, per the Normalization Form C (NFC) specification. To get more grapheme/character information from the test string, you can use Raku's ord, encode and uniname routines:
% printf %s "$string" | raku -e 'my %h = slurp.comb.BagHash; %h.pairs.sort.map: { if .value > 1 { .put for ( .key, .key.ord, .key.encode.gist, .key.uniname ).join: " " }};'
10 utf8:0x<0A> <control-000A>
32 utf8:0x<20> SPACE
A 65 utf8:0x<41> LATIN CAPITAL LETTER A
B 66 utf8:0x<42> LATIN CAPITAL LETTER B
« 171 utf8:0x<C2 AB> LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
» 187 utf8:0x<C2 BB> RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
À 192 utf8:0x<C3 80> LATIN CAPITAL LETTER A WITH GRAVE
Á 193 utf8:0x<C3 81> LATIN CAPITAL LETTER A WITH ACUTE
Ä 196 utf8:0x<C3 84> LATIN CAPITAL LETTER A WITH DIAERESIS
Å 197 utf8:0x<C3 85> LATIN CAPITAL LETTER A WITH RING ABOVE
ß 223 utf8:0x<C3 9F> LATIN SMALL LETTER SHARP S
þ 254 utf8:0x<C3 BE> LATIN SMALL LETTER THORN
œ 339 utf8:0x<C5 93> LATIN SMALL LIGATURE OE
Of course, if all you want to do is detect duplicate characters and exit, the following code works:
% printf %s "$string" | raku -e 'slurp.comb.BagHash.pairs.map: { $_.value > 1 && say("duplicate characters exist") && last };'
duplicate characters exist
#OR
% printf %s "$string" | raku -e 'any(slurp.comb.BagHash.pairs.map: *.value > 1).so && say("duplicate characters exist");'
duplicate characters exist
https://docs.raku.org/language/unicode#Normalization
https://docs.raku.org/language/traps#All_text_is_normalized_by_default
https://docs.raku.org/language/faq.html#String:_How_can_I_get_the_hexadecimal_representation_of_a_string%3F
Although this question was asked 8 years ago, the previous answers rely on external tools or on long pipelines that spawn multiple subshells, even though the question is tagged bash. I'd like to present a solution that uses only Bash builtins.
The function count_chars() works similarly to the PHP function of the same name. It takes a string as input and, for each character, records its number of occurrences in an associative array. The array that holds the result is passed by reference as the first argument.
It's then easy to get all the characters which fulfill the filter condition by looping through the index (keys).
EDIT: The updated code should work with Bash 4.3 and newer.
#!/bin/bash
# Count character occurrences in string $2. For each contained character, return
# the number of occurrences in the associative array $1.
# This is similar to the PHP function count_chars(), mode 1.
count_chars() {
[ "$1" = "arr" ] || { declare -n arr 2>/dev/null || return 1; arr="$1"; }
arr=( )
local -i i
local ch
for (( i=0; i<${#2}; i++ )); do
ch=${2:$i:1}
# http://mywiki.wooledge.org/BashPitfalls#A.5B.5B_-v_hash.5B.24key.5D_.5D.5D
[[ -v 'arr["$ch"]' ]] || arr["$ch"]="0"
# Surprise, surprise--the increment works, despite
# http://mywiki.wooledge.org/BashPitfalls#A.28.28_hash.5B.24key.5D.2B-.2B-_.29.29
# (( ++arr["$ch"] )) EDIT: Bash 5.2+ only
let '++arr["$ch"]'
done
}
declare -A A=
count_chars A "Die Hoffnung stirbt zuletzt!"
for k in "${!A[@]}"; do
(( ${A[$k]} > 1 )) && printf '%s|' "$k"
done
echo
This script will print out:
 |z|u|t|n|i|f|e|
The first result character is the blank. You can easily verify that this is correct:
$ declare -p A
declare -A A=(["!"]="1" [" "]="3" [H]="1" [D]="1" [z]="2" [u]="2" [t]="4" [s]="1" [r]="1" [o]="1" [n]="2" [l]="1" [i]="2" [g]="1" [f]="2" [e]="2" [b]="1" )
If you prefer an array to continue to work on, you could remove the nonmatching elements from the array:
for k in "${!A[@]}"; do
(( ${A[$k]} > 1 )) || unset -v 'A[$k]'
done
declare -p A
Result:
declare -A A=([" "]="3" [z]="2" [u]="2" [t]="4" [n]="2" [i]="2" [f]="2" [e]="2" )
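If only a yes/no answer is needed, the same associative-array idea can stop at the first repeat instead of counting everything. This is a sketch, not part of the answer above: the function name first_repeated_char is hypothetical, it assumes Bash 4.0 or newer, and the "k" key prefix is an illustrative way to keep keys such as @ or * from being read as special subscripts.

```bash
# Print the first character of "$1" that has already been seen and return 0;
# return 1 if every character is unique. Requires Bash 4+ (associative arrays).
first_repeated_char() {
    local -A seen=()
    local ch
    local -i i
    for (( i = 0; i < ${#1}; i++ )); do
        ch=${1:i:1}
        # ${var+x} tests whether the key is set without the [[ -v ]] pitfalls.
        if [[ -n ${seen["k$ch"]+x} ]]; then
            printf '%s\n' "$ch"
            return 0
        fi
        seen["k$ch"]=1
    done
    return 1
}

first_repeated_char "Die Hoffnung stirbt zuletzt!"
```

For the sample sentence the scan stops at the second f of "Hoffnung", so the rest of the string is never examined.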
(( ++arr["$ch"] )) should be changed to let 'arr[$ch]++' to make this work with older Bash versions (4.3+), which is the first approach in your 2023 edit in the linked answer. Is that correct?
– duise
Dec 24 '23 at 13:21
[ "$1" = "arr" ] || { declare -n arr 2>/dev/null || return 1; arr="$1"; } to guard against arr being passed as the variable name may be a bit pointless, given you'll have similar problems with i or ch. Maybe better to namespace all the internal variables, like local -n _count_chars_arr; local _count_chars_i, or to use ksh93 (from which bash copied namerefs and most of its array design) instead of bash, as that has been more thoroughly thought through there.
– Stéphane Chazelas
Dec 24 '23 at 15:06
i and ch are local variables. You only have those "circular name reference" issues with name references. Actually, I was considering dropping bash when I first faced this, but the guard has worked for me so far.
– duise
Dec 24 '23 at 15:13
$1 is $'a\na' for instance), won't work for the newline characters. Depending on the implementation of echo, it won't work with strings like -nene or strings containing backslashes. You should avoid echo for arbitrary data – Stéphane Chazelas Dec 13 '15 at 14:05