
In Unicode, some character combinations have more than one representation.

For example, the character ä can be represented as

  • "ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
  • "ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).

According to the Unicode standard, the two representations are equivalent but in different "normalization forms", see UAX #15: Unicode Normalization Forms.
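
The equivalence can be verified programmatically; for example, with Python's standard unicodedata module (just an illustration of the bytes quoted above):

```python
import unicodedata

precomposed = "\u00e4"   # "ä" as the single codepoint U+00E4
decomposed = "a\u0308"   # "a" followed by U+0308 COMBINING DIAERESIS

# Different codepoint sequences, different UTF-8 bytes...
assert precomposed != decomposed
assert precomposed.encode("utf-8") == b"\xc3\xa4"
assert decomposed.encode("utf-8") == b"\x61\xcc\x88"

# ...yet equivalent once both are brought to the same normalization form.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```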

The unix toolbox has all kinds of text transformation tools, sed, tr, iconv, Perl come to mind. How can I do quick and easy NF conversion on the command-line?

glts
  • Looks like there is a "Unicode::Normalize" module for perl which should do this kind of thing: http://search.cpan.org/~sadahiro/Unicode-Normalize-1.16/Normalize.pm – goldilocks Sep 10 '13 at 19:36
  • @goldilocks if it had a CLI… I mean, I do perl -MUnicode::Normalization -e 'print NFC(… er what comes here now… – mirabilos Nov 15 '16 at 14:33

7 Answers


You can use the uconv utility from ICU. Normalization is achieved through transliteration (-x).

$ uconv -x any-nfd <<<ä | hd
00000000  61 cc 88 0a                                       |a...|
00000004
$ uconv -x any-nfc <<<ä | hd
00000000  c3 a4 0a                                          |...|
00000003

On Debian, Ubuntu and other derivatives, uconv is in the libicu-dev package. On Fedora, Red Hat and other derivatives, and in BSD ports, it's in the icu package.

niels
  • This works, thanks. You have to install a 30M dev library alongside it though. What's worse, I haven't been able to find proper documentation for uconv itself: where did you find any-nfd? It looks like development of this tool has been abandoned, last update was in 2005. – glts Sep 14 '13 at 16:07
  • 3
    @glts I found any-nfd by browsing through the list displayed by uconv -L. – Gilles 'SO- stop being evil' Sep 14 '13 at 23:38
  • 1
    On Ubuntu using sudo apt install icu-devtools to run uconv -x any-nfc, but not solve the simplest problem, e.g. a bugText.txt file with "Iglésias, Bad-á, Good-á" converted by uconv -x any-nfc bugText.txt > goodText.txt stay the same text. – Peter Krauss Nov 16 '18 at 11:40
  • @PeterKrauss Did that very test (Ubuntu 22-04.1), hd file before uconv shows the composite chars, hd after shows that it's been fixed... Worked as intended. – Déjà vu Feb 16 '23 at 03:43

Python has the unicodedata module in its standard library, which allows converting between Unicode representations through the unicodedata.normalize() function:

import unicodedata

s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'

t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2)
print(ascii(t1))

t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
print(t3 == t4)
print(ascii(t3))

Running with Python 3.x:

$ python3 test.py
True
'Spicy Jalape\xf1o'
True
'Spicy Jalapen\u0303o'

Python isn't well suited for shell one-liners, but it can be done if you don't want to create an external script:

$ python3 -c $'import unicodedata\nprint(unicodedata.normalize("NFC", "ääääää"))'
ääääää

For Python 2.x you have to add an encoding line (# -*- coding: utf-8 -*-) and mark strings as Unicode with a u prefix:

$ python -c $'# -*- coding: utf-8 -*-\nimport unicodedata\nprint(unicodedata.normalize("NFC", u"ääääää"))'
ääääää
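
The one-liner above handles a literal string; to normalize whole files, a short Python script along these lines works (a sketch; the function name nfc_filter and the script name are illustrative):

```python
import sys
import unicodedata

def nfc_filter(text):
    """Return text normalized to NFC (precomposed) form."""
    return unicodedata.normalize("NFC", text)

if __name__ == "__main__":
    # Normalize everything on stdin and write it to stdout.
    sys.stdout.write(nfc_filter(sys.stdin.read()))
```

Run it as python3 nfc.py <infile >outfile; replace "NFC" with "NFD", "NFKC" or "NFKD" for the other normalization forms.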
Nykakin

Check it with the tool hexdump:

echo -e "ä\c" | hexdump -C

00000000  61 cc 88                                          |a..|
00000003  

Convert with iconv and check again with hexdump:

echo -e "ä\c" | iconv -f UTF-8-MAC -t UTF-8 | hexdump -C

00000000  c3 a4                                             |..|
00000002

printf '\xc3\xa4'
ä
mtt2p
  • This only works on macOS. There is no 'utf-8-mac' on Linux, on FreeBSDs, etc. Also, decomposition by using this encoding does not follow the specification (it does follow the macOS filesystem normalization algorithm though). More info: http://search.cpan.org/~tomita/Encode-UTF8Mac-0.04/lib/Encode/UTF8Mac.pm – antekone Feb 14 '17 at 11:56
  • @antonone to be fair though there was no OS specified in the question. – Chris Davies Sep 15 '17 at 07:47
  • 3
    @roaima Yes, that's why I've assumed that the answer should work on all systems that are based on Unix/Linux. The answer above works only on macOS. If one's looking for a macOS-specific answer, then it'll work, in part. I just wanted to point that out, because the other day I've lost some time wondering why I have no utf-8-mac on Linux and if this is normal. – antekone Sep 15 '17 at 10:55

For completeness, with perl:

$ perl -CSA -MUnicode::Normalize=NFD -e 'print NFD($_) for @ARGV' $'\ue1' | uconv -x name
\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}
$ perl -CSA -MUnicode::Normalize=NFC -e 'print NFC($_) for @ARGV' $'a\u301' | uconv -x name
\N{LATIN SMALL LETTER A WITH ACUTE}
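
The same decomposition can be cross-checked with Python's standard unicodedata module, which also exposes character names (an illustration, not part of the answer's perl commands):

```python
import unicodedata

# NFD splits the precomposed á (U+00E1) into a base letter plus a combining mark.
for ch in unicodedata.normalize("NFD", "\u00e1"):
    print(unicodedata.name(ch))
# prints:
# LATIN SMALL LETTER A
# COMBINING ACUTE ACCENT
```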

coreutils has a patch adding a proper unorm tool; it works fine for me on systems with 4-byte wchar_t. See http://crashcourse.housegordon.org/coreutils-multibyte-support.html#unorm. The remaining problem is 2-byte wchar_t systems (Cygwin, Windows, plus AIX and Solaris on 32-bit), which need to transform codepoints from the upper planes into surrogate pairs and vice versa; the underlying libunistring/gnulib cannot handle that yet.

I do maintain these patches at https://github.com/rurban/coreutils/tree/multibyte

Perl has the unichars tool, which can also apply the various normalization forms on the command line: http://search.cpan.org/dist/Unicode-Tussle/script/unichars

rurban

There's a perl utility called Charlint available from

https://www.w3.org/International/charlint/

which does what you want. You'll also have to download a file from

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

After the first run you'll see Charlint complaining about incompatible entries in that file, so you'll have to delete those lines from UnicodeData.txt.


Since uconv doesn't seem to be well documented, and the Python solution posted here isn't actually a one-liner, here's a one-liner using Ruby:

ruby -e '$stdin.each_line {|line| puts line.unicode_normalize(:nfd)}' <infile >outfile

Documentation: https://apidock.com/ruby/v2_5_5/String/unicode_normalize