POSIX requires printf's %-20s to count those 20 in terms of bytes, not characters, even though that makes little sense as printf is meant to print formatted text (see the discussions on the Austin Group (POSIX) and bash mailing lists).
The printf builtin of bash and most other POSIX shells honour that.
zsh ignores that silly requirement (even in sh emulation), so printf works as you'd expect there. Same for the printf builtin of fish (not a POSIX-like shell).
The ü character (U+00FC), when encoded in UTF-8, is made of two bytes (0xc3 and 0xbc), which explains the discrepancy.
$ printf %s 'Früchte und Gemüse' | wc -mcL
18 20 18
That string is made of 18 characters and is 18 columns wide (-L being a GNU wc extension to report the display width of the widest line in the input), but is encoded on 20 bytes.
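To look at those bytes directly, something like the following can help. It is only a sketch: hex_bytes is a made-up helper name, and the ü is spelled with octal escapes so that the example does not depend on how this text itself is encoded.

```shell
# hex_bytes STRING (hypothetical helper): print the bytes of STRING in
# hexadecimal, separated by single spaces.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -s ' \n' '  ' | sed 's/^ //; s/ $//'
}

# \303\274 is the UTF-8 encoding of ü (U+00FC)
hex_bytes "$(printf '\303\274')"; echo   # prints: c3 bc
```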
In zsh or fish, the text would be aligned correctly.
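In bash, one possible workaround is to compute the padding by hand: ${#var} counts characters rather than bytes when the locale uses a multibyte charset like UTF-8. The following is a minimal sketch, not a drop-in printf replacement; pad_right is a hypothetical helper name, and it counts characters only, so it does not account for display width.

```shell
# pad_right WIDTH STRING (hypothetical helper): print STRING left-aligned,
# padded with spaces up to WIDTH characters, as counted by ${#...}.
# The function body is a subshell so that its variables stay local.
pad_right() (
  width=$1 str=$2
  pad=$(( width - ${#str} ))
  [ "$pad" -gt 0 ] || pad=0
  # %s without a width prints the string as-is; only the empty string
  # is padded, so no byte-vs-character issue arises.
  printf '%s%*s' "$str" "$pad" ''
)

for s in 'Früchte und Gemüse' 'Milchprodukte'; do
  pad_right 21 "$s"; printf '%s\n' '|'
done
```

In a UTF-8 locale, both | should end up in the same column.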
Now, there are also characters that have zero width (like combining characters such as U+0308, the combining diaeresis) or double width (as in many Asian scripts), not to mention control characters like Tab, and even zsh wouldn't align those properly.
Example, in zsh:
$ printf '%3s|\n' u ü $'u\u308' $'\u1100'
  u|
  ü|
 ü|
  ᄀ|
In bash:
$ printf '%3s|\n' u ü $'u\u308' $'\u1100'
  u|
 ü|
ü|
ᄀ|
ksh93 has a %Ls format specification to count the width in terms of display width.
$ printf '%3Ls|\n' u ü $'u\u308' $'\u1100'
  u|
  ü|
  ü|
 ᄀ|
That still doesn't work if the text contains control characters like TAB (how could it? printf would have to know how far apart the tab stops are in the output device and what position it starts printing at). It does work by accident with backspace characters (like in roff output, where a bold X is written as X\bX), though, as ksh93 considers all control characters as having a width of -1.
Other options
In zsh, you can use its padding parameter expansion flags (l for left-padding, r for right-padding), which, when combined with the m flag, consider the display width of characters (as opposed to the number of characters in the string):
$ () { printf '%s|\n' "${(ml[3])@}"; } u ü $'u\u308' $'\u1100'
  u|
  ü|
  ü|
 ᄀ|
With expand:
printf '%s\t|\n' u ü $'u\u308' $'\u1100' | expand -t3
That works with some expand implementations (not GNU's though).
On GNU systems, you could use GNU awk, whose printf counts in characters (not bytes, not display widths, so still not OK for the 0-width or 2-width characters, but OK for your sample):
gawk 'BEGIN {for (i = 1; i < ARGC; i++) printf "%-3s|\n", ARGV[i]}
' u ü $'u\u308' $'\u1100'
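Still on GNU systems, the wc -L extension mentioned above could be used to pad by display width instead. This is only a sketch under a few assumptions: pad_right_cols is a hypothetical helper name, the string is a single line without control characters, and the locale matches the encoding of the text.

```shell
# pad_right_cols WIDTH STRING (hypothetical helper): print STRING
# left-aligned, padded with spaces up to WIDTH display columns, using the
# width reported by wc -L (a GNU extension).
pad_right_cols() (
  width=$1 str=$2
  cols=$(printf '%s\n' "$str" | wc -L)
  pad=$(( width - cols ))
  [ "$pad" -gt 0 ] || pad=0
  printf '%s%*s' "$str" "$pad" ''
)

for s in 'Früchte und Gemüse' 'Milchprodukte'; do
  pad_right_cols 21 "$s"; printf '%s\n' '|'
done
```

That forks wc once per string, so it is better suited to a handful of strings than to large inputs.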
If the output goes to a terminal, you can also use cursor positioning escape sequences. Like:
forward21=$(tput cuf 21)
printf '%s\r%s%s\n' \
"Früchte und Gemüse" "$forward21" "foo" \
"Milchprodukte" "$forward21" "bar" \
"12345678901234567890" "$forward21" "baz"
echo Früchte und Gemüse | wc -c -m for the difference. – Stephen Kitt Mar 09 '17 at 11:47
printf is. – Stephen Kitt Mar 09 '17 at 11:48
printf that's UTF8 aware? And of course it has to count "visible" glyphs in exactly the same way as whatever program you are using to render the UTF8. Not so easy e.g. for Indian scripts... – dirkt Mar 09 '17 at 11:58
CSI n G where n is the n'th column. CSI is the escape character followed by a left block bracket. Example: printf "\\033[42GHello\\n" will write Hello at the 42nd column. – Oskar Skog Mar 09 '17 at 12:15
tput than hard-coding such sequences. – Toby Speight Mar 09 '17 at 13:12