The GNU implementation of uniq as found on Ubuntu, with -c, doesn't report counts of contiguous identical lines but counts of contiguous lines that sort the same¹.
Most international locales on GNU systems have a bug whereby many completely unrelated characters have been defined with the same sort order, most of them because their sort order is not defined at all. Most other OSes make sure all characters have a different sort order.
$ expr ܐ = ܒ
1
(expr's = operator, for arguments that are not numerical, returns 1 if the operands sort the same, 0 otherwise).
That's the same with ar_SY.UTF-8 or en_GB.UTF-8.
What you'd need is a locale where those characters have been given a different sorting order. If Ubuntu had locales for the Syriac language, you could expect those characters to have been given a different sorting order, but Ubuntu doesn't have such locales.
You can look at the output of locale -a for a list of supported locales. You can enable more locales by running dpkg-reconfigure locales as root. You can also define more locales manually using localedef based on the definition files in /usr/share/i18n/locales, but you'll find no data for the Syriac language there.
Note that in:
LC_COLLATE=syr_SY.utf8 cat file.txt | sort | uniq -c
You're only setting the LC_COLLATE variable for the cat command (which doesn't affect the way it outputs the content of the file: cat doesn't care about collation nor even character encoding, as it's not a text utility). You'd want to set it for both sort and uniq. You'd also want to set LC_CTYPE to a locale that has a UTF-8 charset.
As your system doesn't have a syr_SY.utf8 locale, that's the same as using the C locale (the default locale).
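The scoping of such an assignment prefix can be checked with a small sketch. DEMO_VAR is a hypothetical variable name used only for illustration (it stands in for LC_COLLATE), and the two sh -c commands stand in for cat and sort:

```shell
# DEMO_VAR is a made-up name for this illustration; make sure it starts unset.
unset DEMO_VAR

show_scope() {
  # The assignment prefix reaches the first sh only, not the second one.
  DEMO_VAR=set-here sh -c 'echo "first: ${DEMO_VAR-unset}"' |
    sh -c 'cat; echo "second: ${DEMO_VAR-unset}"'
}

show_scope
# first: set-here
# second: unset
```

The same applies to LC_COLLATE=... cat file.txt | sort | uniq -c: only cat sees the variable.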
Actually, here the C locale or C.UTF-8 is probably the locale you'd want to use.
In those locales, the collation order is based on code point: Unicode code point for C.UTF-8, byte value for C. Both end up giving the same order, as the UTF-8 encoding has the property that comparing byte by byte gives the same result as comparing code points.
$ LC_ALL=C expr ܐ = ܒ
0
$ LC_ALL=C.UTF-8 expr ܐ = ܒ
0
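As a quick check of that byte-order property, here is a minimal sketch that sorts a few characters in the C locale. The sample characters and the sort_by_code_point helper name are arbitrary choices for illustration; the input is assumed to be UTF-8 encoded:

```shell
sort_by_code_point() {
  # In the C locale, sort compares lines byte by byte; for UTF-8 input
  # that yields the same order as comparing Unicode code points.
  LC_ALL=C sort
}

printf '%s\n' € ܒ é a ܐ | sort_by_code_point
# a   (U+0061)
# é   (U+00E9)
# ܐ   (U+0710)
# ܒ   (U+0712)
# €   (U+20AC)
```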
So with:
(export LANG=ar_SY.UTF-8 LC_COLLATE=C.UTF-8 LANGUAGE=syr:ar:en
unset LC_ALL
sort <file | uniq -c)
You'd have an LC_CTYPE with UTF-8 as the charset, a collation order based on code point, and the other settings relevant to your region, so for instance error messages in Syriac or Arabic if GNU coreutils sort or uniq messages had been translated into those languages (they haven't yet).
If you don't care about those other settings, it's just as easy (and also more portable) to use:
<file LC_ALL=C sort | LC_ALL=C uniq -c
Or
(export LC_ALL=C; <file sort | uniq -c)
as @isaac has already shown.
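For a self-contained check that doesn't depend on file.txt, here is a small sketch feeding inline sample data through the same kind of pipeline. The count_c helper name and the sample lines are made up for illustration; the padding in front of the counts varies between uniq implementations:

```shell
count_c() {
  # Both sort and uniq run in the C locale, so lines are grouped and
  # compared byte by byte rather than by the locale's collation order.
  LC_ALL=C sort | LC_ALL=C uniq -c
}

# Three lines of ܐ and two of ܒ: the counts reported are 3 and 2.
printf '%s\n' ܒ ܐ ܒ ܐ ܐ | count_c
```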
¹ Note that POSIX-compliant uniq implementations are not meant to compare strings using the locale's collation algorithm but instead to do a byte-to-byte equality comparison. That was further clarified in the 2018 edition of the standard (see the corresponding Austin group bug). But GNU uniq currently does use strcoll(), even under POSIXLY_CORRECT; it also has a -i option for case-insensitive comparison which, ironically, doesn't use locale information and only works correctly on ASCII input.
sort and the uniq need to have the right collation to work here, so you'd want LC_COLLATE=syr_SY.utf8 sort file.txt | LC_COLLATE=syr_SY.utf8 uniq -c (or perhaps better yet in the regular environment). – Michael Homer Sep 16 '18 at 09:13
cat. – Stéphane Chazelas Sep 16 '18 at 19:19