GNU awk (and possibly some other awk variants):
$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
| Ü| X|
Bash 3.0+ (and possibly some other shells, possibly with tweaks):
$ LC_ALL='en_US.UTF-8' a='Ü' b='X'
$ printf '|%*s%s|%*s%s|\n' "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
| Ü| X|
Note that the bash version has to set LC_ALL in the shell that expands ${#a}, not just in printf's environment as happens in the awk version. So if you don't want LC_ALL to change in the calling shell, you need to either save and restore it, e.g. o="$LC_ALL"; LC_ALL='en_US.UTF-8' ... "$b"; LC_ALL="$o", or do everything in a subshell, e.g. ( LC_ALL='en_US.UTF-8' ... "$b" ).
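As a minimal sketch of the subshell approach, the padding can be wrapped in a function whose body runs in a subshell. The name pad_cells and the fixed width of 2 are illustrative assumptions, not part of the commands above, and the en_US.UTF-8 locale is assumed to be installed:

```shell
#!/usr/bin/env bash
# pad_cells is a hypothetical helper; its body is a ( ... ) subshell, so the
# LC_ALL assignment cannot leak into the calling shell.
pad_cells() (
    LC_ALL='en_US.UTF-8'        # assumed installed; must be set where ${#a} is expanded
    a=$1
    b=$2
    wa=$(( 2 - ${#a} ))         # padding needed to reach 2 characters
    wb=$(( 2 - ${#b} ))
    # Clamp at 0 so strings longer than the field get no padding.
    [ "$wa" -lt 0 ] && wa=0
    [ "$wb" -lt 0 ] && wb=0
    printf '|%*s%s|%*s%s|\n' "$wa" '' "$a" "$wb" '' "$b"
)

pad_cells 'Ü' 'X'
```

When that locale is available, the call should print | Ü| X| as in the example above, and LC_ALL in the calling shell is left as it was. Clamping the width is a design choice for inputs longer than the field, which the original one-liner does not need to handle.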
Explanations:
From the GNU awk documentation:
-b
--characters-as-bytes
Cause gawk to treat all input data as single-byte characters. In addition, all output written with print or printf is treated as
single-byte characters.
Normally, gawk follows the POSIX standard and attempts to process its
input data according to the current locale (see Where You Are Makes a
Difference). This can often involve converting multibyte characters
into wide characters (internally), and can lead to problems or
confusion if the input data does not contain valid multibyte
characters. This option is an easy way to tell gawk, “Hands off my
data!”
With GNU awk 5.2.2, setting a UTF-8 locale makes awk treat each multibyte character as a single character:
$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
| Ü| X|
whereas using a single-byte locale such as C, or using -b, makes it treat every byte of the input as a separate character:
$ echo 'Ü X' | LC_ALL='C' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|
$ echo 'Ü X' | awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|
When -b is used, the result is independent of your locale:
$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|
$ echo 'Ü X' | LC_ALL='C' awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|
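One way to see why no padding appears in the C locale is to print the lengths awk assigns to each field. This is only a sketch: field_lengths is a hypothetical wrapper name, and the input is assumed to be UTF-8 encoded, so that Ü occupies two bytes:

```shell
# field_lengths is a hypothetical helper: it prints the length awk reports
# for the first two fields of its argument when run in the C locale.
field_lengths() {
    printf '%s\n' "$1" | LC_ALL='C' awk '{print length($1), length($2)}'
}

field_lengths 'Ü X'
```

In the C locale, Ü is counted as 2 characters (its two UTF-8 bytes), so it already fills the 2-wide field and % 2s adds nothing, while X is counted as 1 and gets one space of padding.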
As @StéphaneChazelas mentioned in a comment, see Why is printf "shrinking" umlaut? for the related behavior of printf in the shell, where @Léa Gris's answer suggests this will get the character counts, and so the formatted output, correct in bash 3.0 and later:
$ a='Ü' b='X' LC_ALL='en_US.UTF-8'
$ printf '|%*s%s|%*s%s|\n' "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
| Ü| X|
and that functionality is also affected by locale:
$ LC_ALL='C'
$ printf "|%*s%s|%*s%s|\n" "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
|Ü| X|
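The padding disappears because ${#a} counts bytes rather than characters in the C locale, so the computed width becomes 0. A small sketch that makes this visible, using a hypothetical helper named pad_width and assuming the script itself is saved as UTF-8:

```shell
#!/usr/bin/env bash
# pad_width is a hypothetical helper: it prints the padding that
# "$(( 2 - ${#a} ))" yields for a string when evaluated in a given locale.
pad_width() (
    LC_ALL=$1                   # set inside a subshell so the caller's locale is untouched
    s=$2
    printf '%s\n' "$(( 2 - ${#s} ))"
)

pad_width C 'Ü'    # Ü is 2 bytes in UTF-8, so no room is left for padding
pad_width C 'X'
```

Passing a UTF-8 locale such as en_US.UTF-8 as the first argument (if it is installed) should instead count Ü as one character, giving a width of 1.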
See also length-of-string-in-bash for more information on getting the length of a string, in characters or bytes, in bash.
en_GB.UTF-8, I get | Ü| X|, with a space for padding for both. And if I use LANG=C awk ..., then |Ü| X|. – muru Nov 09 '23 at 10:03