Can awk be told to count the character string length rather than byte string length for '%10s' printf formats?

Question

Try this for an output of |Ü| X|:

echo 'Ü X' | awk '{printf("|% 2s|% 2s|\n", $1, $2)}'

Obviously awk counts the byte length of the Ü, not its character length: the count is 2, so no left padding is added, whereas the X is padded with one space.

Is it possible to run awk in a mode which counts character lengths for the %<count>s printf pattern, not byte length?

The same question exists for bash's printf. I hope the answer is not the same: "passthrough to libc printf" :-/

I was not using gawk, but whatever version Ubuntu 22.04 (Jammy Jellyfish) had installed for me. It did not occur to me that anything but gawk could be installed these days :-/

Are you using a Unicode locale? Which awk? With GNU awk 5.2.2 on Arch Linux in en_GB.UTF-8, I get | Ü| X|, with a space for padding for both. And if I use LANG=C awk ..., then |Ü| X|. — muru, Nov 09 '23 at 10:03

Ed Morton · Accepted Answer · 2023-11-09T15:00:21.420

GNU awk (and possibly some other awk variants):

$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
| Ü| X|

Bash 3.0+ (and possibly some other shells, possibly with tweaks):

$ LC_ALL='en_US.UTF-8' a='Ü' b='X'
$ printf '|%*s%s|%*s%s|\n' "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
| Ü| X|

Note that the bash version has to set LC_ALL in the shell that evaluates ${#a}, not just in printf's environment as the awk version does. If you don't want LC_ALL to change in the calling shell, either save and restore it, i.e. o="$LC_ALL"; LC_ALL='en_US.UTF-8' ... "$b"; LC_ALL="$o", or do everything in a subshell, i.e. ( LC_ALL='en_US.UTF-8' ... "$b" ).
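The subshell variant can be wrapped in a small function. This is a minimal sketch, not part of the original answer: pad_left is a hypothetical name, en_US.UTF-8 is assumed to be installed (check with locale -a), and the clamp to zero is an addition, because printf treats a negative width given to %*s as a left-justified field and would still emit spaces.

```shell
#!/usr/bin/env bash
# pad_left WIDTH STRING [LOCALE]   (hypothetical helper, not from the answer)
# Right-aligns STRING in WIDTH character columns.
# The function body is ( ... ), so it runs in a subshell and the LC_ALL
# assignment never leaks into the calling shell.
pad_left() (
    width=$1
    str=$2
    LC_ALL=${3:-en_US.UTF-8}       # assumption: this UTF-8 locale is installed
    pad=$(( width - ${#str} ))     # ${#str} counts characters under LC_ALL
    if [ "$pad" -lt 0 ]; then      # string wider than the field: no padding
        pad=0
    fi
    printf '%*s%s' "$pad" '' "$str"
)

printf '|%s|%s|\n' "$(pad_left 2 'Ü')" "$(pad_left 2 'X')"
```

If the assumed locale is available, the last line should reproduce the | Ü| X| output shown above. If it is not, bash warns on stderr and ${#str} falls back to whatever the effective locale counts.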

Explanations:

From the GNU awk documentation:

-b
--characters-as-bytes
Cause gawk to treat all input data as single-byte characters. In addition, all output written with print or printf is treated as single-byte characters.

Normally, gawk follows the POSIX standard and attempts to process its input data according to the current locale (see Where You Are Makes a Difference). This can often involve converting multibyte characters into wide characters (internally), and can lead to problems or confusion if the input data does not contain valid multibyte characters. This option is an easy way to tell gawk, “Hands off my data!”

With GNU awk 5.2.2, setting an appropriate (UTF-8) locale makes awk treat each multibyte character as a single character:

$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
| Ü| X|

whereas using a single-byte locale such as C, or using -b, makes it treat all input as single-byte characters:

$ echo 'Ü X' | LC_ALL='C' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|
$ echo 'Ü X' | awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|

When -b is used, the result is independent of your locale:

$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|
$ echo 'Ü X' | LC_ALL='C' awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|
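For an awk that always counts bytes, or when no UTF-8 locale is installed, one possible workaround is to count characters by hand. The sketch below is an assumption-laden illustration rather than anything from the gawk documentation: it assumes the input is valid UTF-8, forces LC_ALL=C so that every awk sees bytes, and counts only the bytes that are not UTF-8 continuation bytes (0x80 to 0xBF). The names pad_utf8, u8len and lpad are hypothetical.

```shell
#!/bin/sh
# pad_utf8: reads lines on stdin and right-aligns the first two fields
# in 2 character columns, counting UTF-8 characters instead of bytes.
pad_utf8() {
    LC_ALL=C awk '
    # Build a lookup table of the 64 possible UTF-8 continuation bytes.
    BEGIN { for (i = 128; i <= 191; i++) cont[sprintf("%c", i)] = 1 }

    # u8len(s): number of bytes in s that are not continuation bytes,
    # which equals the number of characters for valid UTF-8 input.
    function u8len(s,    i, n) {
        n = 0
        for (i = 1; i <= length(s); i++)
            if (!(substr(s, i, 1) in cont))
                n++
        return n
    }

    # lpad(s, w): s preceded by enough spaces to fill w character columns.
    function lpad(s, w,    p) {
        p = w - u8len(s)
        return sprintf("%*s%s", (p > 0 ? p : 0), "", s)
    }

    { printf "|%s|%s|\n", lpad($1, 2), lpad($2, 2) }
    '
}

echo 'Ü X' | pad_utf8
```

This counts code points, not display columns, so combining characters and double-width characters would still be misaligned.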

As @StéphaneChazelas mentioned in a comment, see Why is printf "shrinking" umlaut? for the related behavior of printf in the shell. There, @Léa Gris's answer suggests the following to get the character counts, and so the formatted output, correct in bash 3.0 and later:

$ a='Ü' b='X' LC_ALL='en_US.UTF-8'
$ printf '|%*s%s|%*s%s|\n' "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
| Ü| X|

and that functionality is also affected by locale:

$ LC_ALL='C'
$ printf "|%*s%s|%*s%s|\n" "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
|Ü| X|

See also length-of-string-in-bash for more information on getting the length of a string in characters in bash.

Beware gawk -b triggers a warning when there's a POSIXLY_CORRECT variable in the environment. — Stéphane Chazelas, Nov 09 '23 at 11:11
@StéphaneChazelas and also when --posix is used, which makes sense as they're contradictory. — Ed Morton, Nov 09 '23 at 11:13
But despite the awk: warning: '--posix' overrides '--characters-as-bytes' warning, I find that echo é | POSIXLY_CORRECT= awk -b '{printf "|%2s|\n", $0}' outputs |é| not | é| which would contradict that statement (here with 5.0.1) — Stéphane Chazelas, Nov 09 '23 at 11:16
@StéphaneChazelas the output would then depend on your locale as that's the POSIX behavior. Try echo é | LC_ALL='en_US.UTF-8' POSIXLY_CORRECT= awk -b '{printf "|%2s|\n", $0}' — Ed Morton, Nov 09 '23 at 11:19
Yes, that's what I tried. Without -b, I get | é|, with I get |é| regardless of whether POSIXLY_CORRECT is in the environment or not. Maybe that was fixed in newer versions. — Stéphane Chazelas, Nov 09 '23 at 11:38
Could be. For me on gawk 5.2.2 echo é | LC_ALL='C' POSIXLY_CORRECT= awk -b '{printf "|%2s|\n", $0}' outputs |é| while echo é | LC_ALL='en_US.UTF-8' POSIXLY_CORRECT= awk -b '{printf "|%2s|\n", $0}' outputs | é| with the awk: warning: `--posix' overrides `--characters-as-bytes' warning in both cases. — Ed Morton, Nov 09 '23 at 11:55
