Using uniq on unicode text

Question

I want to remove duplicate lines from a file containing words in Syriac script. The source file has 3 lines; the 1st and 3rd are identical.

$ cat file.txt 
ܐܒܘܢ
ܢܗܘܐ
ܐܒܘܢ

When I use sort and uniq, the result treats all 3 lines as identical, which is wrong:

$ cat file.txt | sort | uniq -c
      3 ܐܒܘܢ

Explicitly setting locale to Syriac doesn't help either.

$ LC_COLLATE=syr_SY.utf8 cat file.txt | sort | uniq -c      
     3 ܐܒܘܢ

Why would that happen? I'm using Kubuntu 18 and bash, if that matters.

Note that both the sort and the uniq need to have the right collation to work here, so you'd want LC_COLLATE=syr_SY.utf8 sort file.txt | LC_COLLATE=syr_SY.utf8 uniq -c (or perhaps better yet in the regular environment). — Michael Homer, Sep 16 '18 at 09:13
I don't agree it's a duplicate here as there are several things at play here, not only the fact that LC_COLLATE applies only to cat. — Stéphane Chazelas, Sep 16 '18 at 19:19

Stéphane Chazelas · Accepted Answer · 2019-12-27T13:59:41.307

The GNU implementation of uniq as found on Ubuntu, with -c, doesn't report counts of contiguous identical lines but counts of contiguous lines that sort the same¹.

Most international locales on GNU systems have a bug whereby many completely unrelated characters are defined with the same sort order, in most cases because their sort order is not defined at all. Most other OSes make sure all characters have a different sort order.

$ expr ܐ = ܒ
1

(expr's = operator, for arguments that are not numerical, returns 1 if operands sort the same, 0 otherwise).

That's the same with ar_SY.UTF-8 or en_GB.UTF-8.
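To repeat that check across several locales, a small wrapper around expr can help. This is only a sketch: collates_equal is a hypothetical helper name, not a standard utility, and the locale list in the loop is just an example. Keep in mind that a locale name that is not installed silently behaves like C, so the result is only meaningful for names listed by locale -a.

```shell
# collates_equal LOCALE A B
# Succeeds when A and B compare equal under LOCALE according to expr.
# The "X" prefix forces a string comparison even if A or B looks like
# a number or an expr operator.
collates_equal() {
    [ "$(LC_ALL=$1 expr "X$2" = "X$3")" = 1 ]
}

for loc in C en_GB.UTF-8; do
    if collates_equal "$loc" 'ܐ' 'ܒ'; then
        printf '%s: same sort order\n' "$loc"
    else
        printf '%s: different sort order\n' "$loc"
    fi
done
```

What the non-C locales report depends on the locale data shipped with the system.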

What you'd need is a locale where those characters have been given a different sorting order. If Ubuntu had locales for the Syriac language, you could expect those characters to have been given a different sorting order, but Ubuntu doesn't have such locales.

You can look at the output of locale -a for a list of supported locales. You can enable more locales by running dpkg-reconfigure locales as root. You can also define more locales manually using localedef based on the definition files in /usr/share/i18n/locales, but you'll find no data for the Syriac language there.
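A quick way to test for one specific locale, rather than reading the whole list, could look like the sketch below. have_locale is a hypothetical helper name. The comparison ignores case and hyphens on the assumption that locale -a may print C.utf8 where you typed C.UTF-8.

```shell
# have_locale NAME
# Succeeds when NAME appears in the output of "locale -a".
# Case and hyphens are ignored so that "C.UTF-8" matches "C.utf8".
have_locale() {
    want=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')
    locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | sed 's/-//g' |
        grep -Fxq -e "$want"
}

for name in C.UTF-8 syr_SY.utf8; do
    if have_locale "$name"; then
        echo "$name: installed"
    else
        echo "$name: not installed"
    fi
done
```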

Note that in:

LC_COLLATE=syr_SY.utf8 cat file.txt | sort | uniq -c

You're only setting the LC_COLLATE variable for the cat command (which doesn't affect the way it outputs the content of the file, cat doesn't care about collation nor even character encoding as it's not a text utility). You'd want to set it for both sort and uniq. You'd also want to set LC_CTYPE to a locale that has a UTF-8 charset.
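The scope of such a prefix assignment is easy to verify with a throwaway variable. In this sketch DEMO_VAR is a made-up name used only for the demonstration; it stands in for LC_COLLATE.

```shell
# A VAR=value prefix is exported to the command it precedes, not to the
# other commands of the pipeline.
unset DEMO_VAR   # made-up variable, make sure it is not inherited

first=$(DEMO_VAR=set sh -c 'echo "${DEMO_VAR:-unset}"' | cat)
second=$(DEMO_VAR=set cat /dev/null | sh -c 'echo "${DEMO_VAR:-unset}"')

echo "command carrying the prefix sees: $first"
echo "next command in the pipeline sees: $second"
```

The first line reports set and the second reports unset, which is why sort and uniq never saw the LC_COLLATE value in the command from the question.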

As your system doesn't have a syr_SY.utf8 locale, that's the same as using the C locale (the default locale).

Actually, here the C locale or C.UTF-8 is probably the locale you'd want to use.

In those locales, the collation order is based on code point: Unicode code point for C.UTF-8, byte value for C. Both end up giving the same order, because UTF-8 is designed so that comparing encoded strings byte by byte preserves code point order.

$ LC_ALL=C expr ܐ = ܒ
0
$ LC_ALL=C.UTF-8 expr ܐ = ܒ
0
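The claim that byte order and code point order coincide for UTF-8 can be checked with a handful of characters of different encoded lengths. This is a sketch with arbitrarily chosen sample characters, and it assumes the script itself is saved as UTF-8.

```shell
# Characters picked from the 1-, 2-, 3- and 4-byte ranges of UTF-8:
#   z = U+007A, é = U+00E9, ܐ = U+0710, € = U+20AC, 😀 = U+1F600
# They are fed in scrambled order and sorted by raw byte value.
sorted=$(printf '%s\n' '€' '😀' 'z' 'ܐ' 'é' | LC_ALL=C sort | paste -sd ' ' -)
echo "$sorted"
```

The characters come out in ascending code point order even though sort only looked at bytes.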

So with:

(export LANG=ar_SY.UTF-8 LC_COLLATE=C.UTF-8 LANGUAGE=syr:ar:en
 unset LC_ALL
 sort <file | uniq -c)

You'd have an LC_CTYPE with UTF-8 as the charset, a collation order based on code point, and the other settings relevant to your region, so for instance error messages in Syriac or Arabic if the GNU coreutils sort or uniq messages had been translated into those languages (they haven't yet).

If you don't care about those other settings, it's just as easy (and also more portable) to use:

<file LC_ALL=C sort | LC_ALL=C uniq -c

Or

(export LC_ALL=C; <file sort | uniq -c)

as @isaac has already shown.
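For anyone who wants to try this without touching their own files, here is a self-contained sketch that recreates the three-line file from the question in a scratch location (the use of mktemp and the variable names are my own choices, not part of the original commands):

```shell
# Recreate the three-line file from the question in a scratch location.
tmp=$(mktemp) || exit 1
printf '%s\n' 'ܐܒܘܢ' 'ܢܗܘܐ' 'ܐܒܘܢ' > "$tmp"

# Both sort and uniq get LC_ALL=C, so lines are compared byte by byte.
result=$(LC_ALL=C sort "$tmp" | LC_ALL=C uniq -c)
rm -f "$tmp"

printf '%s\n' "$result"
```

With byte-wise comparison the duplicated line is counted twice and the other line once.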

¹ Note that POSIX-compliant uniq implementations are not meant to compare strings using the locale's collation algorithm but instead do a byte-to-byte equality comparison. That was further clarified in the 2018 edition of the standard (see the corresponding Austin group bug). But GNU uniq currently does use strcoll(), even under POSIXLY_CORRECT; it also has a -i option for case-insensitive comparison which, ironically, doesn't use locale information and only works correctly on ASCII input.

Sorry for a naïve question, but wouldn't it then be a good idea to set C locale as default? — evb, Sep 16 '18 at 20:42
@evb, it is the default. But then as a user you generally prefer having messages in your own language, use your usual decimal point separator, date format... The administrator may want to set a default for all the users as well on systems where most users log in locally. — Stéphane Chazelas, Sep 16 '18 at 22:09
@StéphaneChazelas I can't seem to find anything on the POSIX page for uniq that explicitly says whether adjacent lines should be compared using sort order or whether they should be compared byte-by-byte. Should a POSIX-compliant uniq care about locale at all when checking whether two consecutive lines are identical? — Harold Fischer, Dec 27 '19 at 02:08
@HaroldFischer, see the discussion I started on Sun, 29 Mar 2015 21:48:54 +0100 ("May strcoll return 0 if strcmp returns 0" later corrected to "non-0") on the austin-group mailing list (you can use the gmane NNTP interface as publicly available archives). It was clarified in the 2018 edition. See http://austingroupbugs.net/view.php?id=963 — Stéphane Chazelas, Dec 27 '19 at 10:23

score 7 · Answer 2 · 2018-09-17T23:44:48.767

A (simplistic) portable solution:

$ ( LC_ALL=C sort syriac.txt | LC_ALL=C uniq -c )
      2 ܐܒܘܢ
      1 ܢܗܘܐ

For those of you that do not have a font that could render the Syriac script:

$ ( LC_ALL=C sort syriac.txt | LC_ALL=C uniq -c ) | xxd
00000000: 2020 2020 2020 3220 dc90 dc92 dc98 dca2        2 ........
00000010: 0a20 2020 2020 2031 20dc a2dc 97dc 98dc  .      1 .......
00000020: 900a                                     ..

EDIT: That is closer to a hack than to a real solution. It works by making both sort and uniq compare lines by the values of their individual bytes instead of by the collation order given by a locale table. An equivalent locale to use (as UTF-8 "code point sort order" turns out to be the same order as "byte value sort order") is C.UTF-8.

This works on most systems, AFAICT.

An equivalent solution is:

$ ( export LC_COLLATE=C.UTF-8; <syriac.txt sort | uniq -c )

The basic problem is that the characters of the Syriac script (Unicode code points U+0700–U+074F Syriac and U+0860–U+086F Syriac Supplement) do not have any collation sort order set yet.

That is a problem with the locale definition files inside /usr/share/i18n/locales (Debian/Ubuntu); Syriac is not even listed as a possible language there (see less /usr/share/i18n/SUPPORTED). That means that the information for that language needs to be reported to Debian i18n and built into valid locale files.

A locale name usually has the form ‘ll_CC’. Here ‘ll’ is an ISO 639 two-letter language code, and ‘CC’ is an ISO 3166 two-letter country code. (Syrj is the ISO 15924 script code for Syriac, Western variant.)

But Syriac already has a three-letter code assigned in ISO 639-2 (see the official list of ISO 639-2 codes).

The country code (ISO 3166) is usually a two-letter code and probably should be SY (see the list of ISO 3166 country codes).

Just setting one or all of the environment variables related to locale is not enough and may fail (as it happens in your case) as all the tables are missing. Those tables set names of months, weekdays, year formulas, format for time, format for currency, language for reported errors (if a translation is available), etc. Please read: What should I set my locale to and what are the implications of doing so?

When the Unicode code points do not have a collation order explicitly defined they may become all the same: undefined. That is what happens here.

We may list the code points from your file (just to use one example point) with:

$ echo $(cat syriac.txt | grep -oP '\X' | sort)
ܐ ܒ ܘ ܢ ܢ ܗ ܘ ܐ ܐ ܒ ܘ ܢ

but if we try to get only unique values, all get erased:

$ echo $(cat syriac.txt | grep -oP '\X' | sort -u )
ܐ

that's because all the characters have the same collation value (weight):

$ a=ܐ
$ b=ܒ
$ [[ $a == [=$b=] ]] && echo yes
yes

that means that the value of var a is in the same collation equivalence class [=…=] as the value of var b.

Instead, this lists the non-repeated characters:

$ echo $(cat syriac.txt | grep -oP '\X' | LC_COLLATE=C.UTF-8 sort -u )
ܐ ܒ ܗ ܘ ܢ

score 2 · Answer 3 · edited Sep 16 '18 at 16:01

First set LC_CTYPE:

$ export LC_CTYPE=syr_SY.utf8
$ <infile sort |uniq -c
      2 ܐܒܘܢ
      1 ܢܗܘܐ

answered Sep 16 '18 at 09:09 by Ipor Sircer · edited Sep 16 '18 at 16:01 by αғsнιη

thanks! The only problem is that I get a warning: bash: warning: setlocale: LC_CTYPE: cannot change locale (syr_SY.utf8) – evb Sep 16 '18 at 09:16
There is no syr_SY locale defined yet. As it doesn't exist, the locale in effect drops to default: C. That is why the commands did work. @evb – Sep 16 '18 at 14:35
@Isaac So how should I define syr_SY locale? – evb Sep 16 '18 at 17:47
