Bracket expression (without ranges) matching unexpected character in bash

Question

I'm using bash on Linux. I am getting a success from the following if statement, but shouldn't this return a fail code?

if [[ ■ = [⅕⅖⅗] ]] ; then echo yes ; fi

The square does NOT equal any of the characters, so I don't see why I get a success code.

It is important for me to keep the double brackets in my case.

Is there any other way to do a range in this scenario, or are there any other suggestions?

Probably a consequence of all those characters having an undefined sorting order in your locale (and thus sorting the same). See the ongoing, related discussion at the Austin group. Change the locale to C to fix it. — Stéphane Chazelas, Apr 03 '15 at 08:32
Sorry, C won't do here as it's not single-byte characters. C.UTF-8 would do where available. — Stéphane Chazelas, Apr 03 '15 at 08:38
Congratulations, you managed to summon Stéphane wielding an Austin Group thread on your first question. That's got to be worth at least ⅗ of an Internets. Or ⅘ or even ■ Internets, as apparently those are the same. Welcome to [unix.se], and please keep bringing interesting questions. — derobert, Apr 03 '15 at 16:37

Stéphane Chazelas · Accepted Answer · 2018-01-26T14:49:32.713

That's a consequence of those characters having the same sorting order.

You'll also notice that

sort -u << EOF
■
⅕
⅖
⅗
EOF

returns only one line.

Or that:

expr ■ = ⅕

returns true (as required by POSIX).
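Whether this reproduces depends on the locale data installed, so it can help to run the same check under more than one locale. The following is only a sketch: the helper name count_distinct is made up for illustration, and it assumes a sort that honours LC_ALL.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how many of the arguments survive sort -u,
# that is, how many distinct sorting positions the locale gives them.
count_distinct() {
  printf '%s\n' "$@" | sort -u | wc -l
}

# Current locale: a result below 4 means some of these characters sort the same.
count_distinct ■ ⅕ ⅖ ⅗

# C locale: sort compares bytes, so all four lines are kept.
(export LC_ALL=C; count_distinct ■ ⅕ ⅖ ⅗)
```

If the first number is smaller than the second, the current locale is collapsing some of those characters into one sorting position.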

Most locales shipped with GNU systems have a number of characters (and even sequences of characters (collating sequences)) that have the same sorting order. In the case of those ■⅕⅖⅗ ones, it's because the order is not defined, and those characters whose order is not defined end up having the same sorting order in GNU systems. There are characters that are explicitly defined as having the same sorting order like Ș and Ş (though there's no apparent (to me anyway) real logic or consistency on how it is done).

That is the source of quite surprising and bogus behaviours. I have raised the issue very recently on the Austin group (the body behind POSIX and the Single UNIX Specification) mailing list and the discussion is still ongoing as of 2015-04-03.

In this case, whether [y] should match x where x and y sort the same is unclear to me, but since a bracket expression is meant to match a collating element, that suggests that the bash behaviour is expected.

In any case, I suppose [⅕-⅕] or at least [⅕-⅖] should match ■.

You'll notice that different tools behave differently. ksh93 behaves like bash; GNU grep and sed don't. Some other shells behave differently again, and some, like yash, are even more buggy.

To get consistent behaviour, you need a locale where all characters sort differently. The C locale is the typical one. However, the character set in the C locale on most systems is ASCII. On GNU systems, you generally have access to a C.UTF-8 locale that can be used instead to work on UTF-8 characters.

So:

(export LC_ALL=C.UTF-8; [[ ■ = [⅕⅖⅗] ]])

or the standard equivalent:

(export LC_ALL=C.UTF-8
 case ■ in ([⅕⅖⅗]) true;; (*) false; esac)

should return false.
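Since C.UTF-8 is not available everywhere, a script may want to check for it before relying on it. This is a minimal sketch under a few assumptions: the names have_c_utf8 and matches_fraction are hypothetical, locale -a is assumed to exist, and the locale is assumed to be listed as either C.UTF-8 or C.utf8.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the system lists a C.UTF-8 locale.
have_c_utf8() {
  locale -a 2>/dev/null | grep -qiE '^C\.utf-?8$'
}

# Hypothetical helper: run the match from the question in a subshell so the
# locale change does not leak into the rest of the script.
# Returns 0 on match, 1 on no match, 2 if C.UTF-8 is not available.
matches_fraction() (
  if ! have_c_utf8; then
    echo "C.UTF-8 locale not available" >&2
    exit 2
  fi
  export LC_ALL=C.UTF-8
  [[ $1 = [⅕⅖⅗] ]]
)

for c in ■ ⅖; do
  if matches_fraction "$c"; then
    echo "$c matches"
  else
    echo "$c does not match"
  fi
done
```

The function body is written with ( ) rather than { } so that the whole thing runs in a subshell, which keeps both the exit and the LC_ALL export local to the call.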

Another alternative would be to set only LC_COLLATE to C, which would work on GNU systems but not necessarily on others, where it could fail to specify the sorting order of multi-byte characters.

One lesson from this is that equality is not as clear a notion as one would expect when it comes to comparing strings. Equality might mean, from strictest to least strict:

1. Same number of bytes and all byte constituents have the same value.
2. Same number of characters and all characters are the same (for instance, refer to the same codepoint in the current charset).
3. The two strings have the same sorting order as per the locale's collation algorithm (that is, neither a < b nor b < a is true).
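The first and third of those definitions can be sketched in bash as follows. This is an illustration only: the function names bytes_equal and collate_equal are made up, the sketch relies on bash's [[ a < b ]] comparing according to the current locale's collation, and the second definition is left out because it needs decoding that plain shell does not offer portably.

```shell
#!/usr/bin/env bash
# Byte equality (first definition). Hypothetical helper: in the C locale
# every byte is its own character, so a plain string comparison compares bytes.
bytes_equal() (
  export LC_ALL=C
  [ "$1" = "$2" ]
)

# Collation equality (third definition). Hypothetical helper: neither string
# sorts before the other in the current locale.
collate_equal() {
  [[ ! $1 < $2 && ! $2 < $1 ]]
}

a=■ b=⅕
if bytes_equal "$a" "$b"; then echo "same bytes"; else echo "different bytes"; fi
# The result of the next test depends on the locale in effect.
if collate_equal "$a" "$b"; then echo "same sorting order"; else echo "different sorting order"; fi
```

In a locale where those two characters have no defined order, the first test would report different bytes while the second could still report the same sorting order.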

Now, 2 and 3 assume that both strings contain valid characters. In UTF-8 and some other encodings, some sequences of bytes don't form valid characters.

1 and 2 are not necessarily equivalent because of that, or because some characters may have more than one possible encoding. That's typically the case with stateful encodings like ISO-2022-JP, where A can be expressed as 41 or 1b 28 42 41 (1b 28 42 being the sequence to switch to ASCII; you can insert as many of those as you want, it won't make a difference), though I wouldn't expect those types of encoding to still be in use, and GNU tools at least generally don't work properly with them.
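To see the difference at the byte level, the two spellings of A can be dumped as hex. A small sketch, where hex_bytes is a hypothetical helper and the decoding step assumes an iconv that knows ISO-2022-JP (it is skipped silently otherwise):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the bytes of a string as lowercase hex digits.
hex_bytes() {
  printf %s "$1" | od -An -v -tx1 | tr -d ' \n'
  echo
}

plain=$(printf 'A')            # the single byte 41
escaped=$(printf '\033(BA')    # 1b 28 42 (switch to ASCII) followed by 41

hex_bytes "$plain"
hex_bytes "$escaped"

# Optional: decode the escaped form, if this iconv supports the encoding.
if decoded=$(printf %s "$escaped" | iconv -f ISO-2022-JP -t UTF-8 2>/dev/null); then
  echo "decoded: $decoded"
fi
```

The two dumps differ, so the strings are not equal by the first definition even though they spell the same character.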

Also beware that most non-GNU utilities can't deal with the 0 byte value (the NUL character in ASCII).

Which of those definitions is used depends on the utility and utility implementation or version. POSIX is not 100% clear on that. In the C locale, all 3 are equivalent. Outside of that YMMV.

Another common case where 1 and 2 differ is in Unicode with things like combining characters. — Gilles 'SO- stop being evil', Apr 03 '15 at 13:43
@Gilles, combining characters are characters of their own. The combination forms a grapheme/cell, but is still formed of several characters. é (U+00E9) and é (e followed by U+0301) are the same grapheme, but two different sequences of characters (at least from the POSIX APIs' point of view). By 1 and 2, they would be different. By 3, they could be considered the same if U+0301 had all its collation weights set to "IGNORE", but that's generally not the case as one generally wants to decide on the order of diacritics. — Stéphane Chazelas, Apr 03 '15 at 13:53
It is usually desirable to consider é and é to be the same string, but not e. POSIX's notion of collation order is rarely right, it's too heavily based on characters and does not account for most common ways of sorting strings (e.g. French dictionaries do not use a lexicographic order to sort words: they do a first lexicographic pass with accents ignored and then use accents to decide ties). — Gilles 'SO- stop being evil', Apr 03 '15 at 14:05
@Gilles, yes. That's why I'd say those characters having the same sorting order (intentionally) in glibc locales makes little sense. The é vs é case is usually addressed by doing some transformation on the strings first, like canonical decomposition (similar to converting to lower case first when you want to do case-insensitive sorting/matching). See also the ICU guide for some good reference on the subject. — Stéphane Chazelas, Apr 03 '15 at 14:20
@Gilles, the weights in POSIX locale collation algorithm can do that French dictionary sorting. That's how the weights work. A first pass uses the primary weights (where e and é (and E and É) have the same and the combining acute accent is ignored) a second pass (if equal) checks the accents, a 3rd pass capitalisation... — Stéphane Chazelas, Apr 03 '15 at 14:24
Ah, I didn't know you could do this with weights, thanks. Do many systems implement this correctly? — Gilles 'SO- stop being evil', Apr 03 '15 at 14:33
@Gilles, actually I was wrong. Except in a few specific locales, on GNU systems, the combining accents seem to be "undefined" like a bunch of other characters. So eƔ sorts the same as é for instance... — Stéphane Chazelas, Apr 03 '15 at 14:35
@Gilles, hmmm. Solaris does it properly (as in print -l e é 'e\u301' ê 'e\u302' E 'E\u301' É Ê 'E\u302' | sort). So looks like I don't know the full picture or they define a collating element for every combination. See the POSIX spec for details. — Stéphane Chazelas, Apr 03 '15 at 14:46
@Gilles, yes. That would be the proper way. If you define e\u0301 as a collating element with same weights as \u00e9, then they sort identically. There aren't that many combined characters, so doing it for every one is completely feasible. Solaris must do it, they must also have a last-resort weight based on code point to make sure they don't sort identically to have a strict total order. — Stéphane Chazelas, Apr 03 '15 at 14:56
Stephane, you are awesome, thank you for the thorough answer, and also thank you for taking the time to give me alternatives for my other questions. I hope you have a great weekend. — TuxForLife, Apr 04 '15 at 06:25

score -3 · Answer 2 · edited Apr 03 '15 at 08:00


You are doing it wrong, = and == are not the same.

Try these examples:

if [[ "■" == "[⅕⅖⅗]" ]] ; then echo yes ; else echo no ; fi

if [[ "1" == "1" ]] ; then echo yes ; else echo no ; fi

if [[ "■" == "■" ]] ; then echo yes ; else echo no ; fi

answered Apr 03 '15 at 07:45 by Xnap · edited Apr 03 '15 at 08:00 by Archemar

That's not true. POSIX specifies that operator = should be used for checking equality. The problem is the missing quotes, not the operator. – scai Apr 03 '15 at 08:00
Also man bash says in the [[ section: "The = operator is equivalent to ==." – michas Apr 03 '15 at 08:33
@scai, POSIX doesn't specify the [[...]] operator. And = and == are the same in the shells where it's implemented (ksh/bash/zsh), and are for pattern matching, not equality. – Stéphane Chazelas Apr 03 '15 at 08:35
When comparing to a pattern, the pattern must not be quoted, else it is taken as a literal string, hence the "no" in the first test. – xhienne Dec 18 '16 at 09:26
