Why is printf "shrinking" umlaut?

Question

If I execute the following simple script:

#!/bin/bash
printf "%-20s %s\n" "Früchte und Gemüse"   "foo"
printf "%-20s %s\n" "Milchprodukte"        "bar"
printf "%-20s %s\n" "12345678901234567890" "baz"

It prints:

Früchte und Gemüse foo
Milchprodukte        bar
12345678901234567890 baz

that is, text with umlauts (such as ü) is "shrunk" by one character per umlaut.

Certainly, I have some wrong setting somewhere, but I am not able to figure out which one that could be.

This occurs if the file's encoding is UTF-8.

If I change its encoding to latin-1, the alignment is correct, but the umlauts are rendered wrong:

Fr�chte und Gem�se   foo
Milchprodukte        bar
12345678901234567890 baz

You expect printf to be aware of UTF-8 and other multibyte charsets? — frostschutz, Mar 09 '17 at 11:47
Looks like it's counting bytes rather than characters; see echo Früchte und Gemüse | wc -c -m for the difference. — Stephen Kitt, Mar 09 '17 at 11:47
Write your own printf that's UTF8 aware? And of course it has to count "visible" glyphs in exactly the same way as whatever program you are using to render the UTF8. Not so easy e.g. for Indian scripts... — dirkt, Mar 09 '17 at 11:58
You could write terminal escape sequences to place the cursor at the correct column manually. https://en.wikipedia.org/wiki/ANSI_escape_code The code is CSI n G, where n is the column number; CSI is the escape character followed by a left square bracket. Example: printf "\033[42GHello\n" writes Hello starting at column 42. — Oskar Skog, Mar 09 '17 at 12:15
@Oskar - you can do that if you're sure that output is going to a terminal with suitable support. It's generally better to use tput than hard-coding such sequences. — Toby Speight, Mar 09 '17 at 13:12
@frostschutz Which printf implementation are you referring to? Thanks — jpaugh, Mar 09 '17 at 20:09

Stéphane Chazelas · Accepted Answer · 2020-06-12T08:37:11.600

POSIX requires printf's %-20s to count those 20 in bytes, not characters, even though that makes little sense given that printf exists to print formatted text (see the discussions at the Austin Group (POSIX) and on the bash mailing list).

The printf builtin of bash and most other POSIX shells honour that.

zsh ignores that silly requirement (even in sh emulation) so printf works as you'd expect there. Same for the printf builtin of fish (not a POSIX-like shell).

The ü character (U+00FC), when encoded in UTF-8 is made of two bytes (0xc3 and 0xbc), which explains the discrepancy.

$ printf %s 'Früchte und Gemüse' | wc -mcL
    18      20      18

That string is made of 18 characters and is 18 columns wide (-L being a GNU wc extension that reports the display width of the widest line in the input), but it is encoded in 20 bytes.
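
To make the arithmetic concrete, here is a minimal sketch that counts bytes with wc -c and derives how many spaces a byte-counting %-20s would add. The helper names byte_len and byte_pad are made up for illustration, and the UTF-8 string is written with octal escapes so the result does not depend on the encoding of the script file.

```shell
# byte_len STRING: number of bytes in STRING (hypothetical helper).
# tr strips the leading blanks that some wc implementations print.
byte_len() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# byte_pad WIDTH STRING: spaces a byte-counting %-WIDTHs would append
# (hypothetical helper; never negative).
byte_pad() {
  n=$(( $1 - $(byte_len "$2") ))
  [ "$n" -gt 0 ] || n=0
  echo "$n"
}

# "Früchte und Gemüse" with each ü spelled as its two UTF-8 bytes 0xc3 0xbc.
s=$(printf 'Fr\303\274chte und Gem\303\274se')

echo "bytes: $(byte_len "$s"), padding: $(byte_pad 20 "$s")"
# prints: bytes: 20, padding: 0
echo "bytes: $(byte_len Milchprodukte), padding: $(byte_pad 20 Milchprodukte)"
# prints: bytes: 13, padding: 7
```

With zero padding left for the first string, only the literal space in the format separates it from "foo", which is the one-space gap seen in the question's output.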

In zsh or fish, the text would be aligned correctly.

Now, there are also characters that have zero width (such as combining characters like U+0308, the combining diaeresis) or double width, as in many East Asian scripts (not to mention control characters like Tab), and even zsh wouldn't align those properly.

Example, in zsh:

$ printf '%3s|\n' u ü $'u\u308' $'\u1100'
  u|
  ü|
 ü|
  ᄀ|

In bash:

$ printf '%3s|\n' u ü $'u\u308' $'\u1100'
  u|
 ü|
ü|
ᄀ|

ksh93 has a %Ls format specification to count the width in terms of display width.

$ printf '%3Ls|\n' u ü $'u\u308' $'\u1100'
  u|
  ü|
  ü|
 ᄀ|

That still doesn't work if the text contains control characters like TAB (how could it? printf would have to know how far apart the tab stops are in the output device and what position it starts printing at). It does work by accident with backspace characters (like in the roff output where X (bold X) is written as X\bX) though as ksh93 considers all control characters as having a width of -1.

Other options

In zsh, you can use its padding parameter expansion flags (l for left-padding, r for right-padding), which, when combined with the m flag, consider the display width of characters (as opposed to the number of characters in the string):

$ () { printf '%s|\n' "${(ml[3])@}"; } u ü $'u\u308' $'\u1100'
  u|
  ü|
  ü|
 ᄀ|

With expand:

printf '%s\t|\n' u ü $'u\u308' $'\u1100' | expand -t3

That works with some expand implementations (not GNU's though).

On GNU systems, you could use GNU awk whose printf counts in chars (not bytes, not display-widths, so still not OK for the 0-width or 2-width characters, but OK for your sample):

gawk 'BEGIN {for (i = 1; i < ARGC; i++) printf "%-3s|\n", ARGV[i]}
     ' u ü $'u\u308' $'\u1100'

If the output goes to a terminal, you can also use cursor positioning escape sequences. Like:

forward21=$(tput cuf 21)
printf '%s\r%s%s\n' \
  "Früchte und Gemüse"    "$forward21" "foo" \
  "Milchprodukte"         "$forward21" "bar" \
  "12345678901234567890"  "$forward21" "baz"

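Since cursor movement only makes sense when the output is a terminal, one possible arrangement is to test for that and fall back to length-based padding otherwise. This is a sketch under assumptions, not part of the answer above: print_row is a hypothetical helper, tput and its cuf capability are assumed to be available, and the fallback relies on ${#var}, so it counts characters (or bytes in shells without multibyte support) rather than display columns.

```shell
# print_row WIDTH LEFT RIGHT (hypothetical helper)
# Prints LEFT, then RIGHT starting one column after a WIDTH-wide field.
print_row() {
  if [ -t 1 ] && fwd=$(tput cuf "$(( $1 + 1 ))" 2>/dev/null) && [ -n "$fwd" ]; then
    # Terminal: print LEFT, return to column 0, move right WIDTH+1 cells.
    printf '%s\r%s%s\n' "$2" "$fwd" "$3"
  else
    # Not a terminal (pipe, file): pad with spaces based on ${#var}.
    pad=$(( $1 - ${#2} ))
    if [ "$pad" -lt 0 ]; then pad=0; fi
    printf '%s%*s %s\n' "$2" "$pad" '' "$3"
  fi
}

print_row 20 'Milchprodukte' 'bar'
print_row 20 '12345678901234567890' 'baz'
```

On a terminal the right-hand column lines up regardless of how the shell counts the left-hand text; when piped, the result is only as good as the length count.
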
That is incorrect. The ü character can be composed as u + ¨, which is 3 bytes. In the case of the question, it is encoded as 2 characters, but not all ü are created equally. — Ismael Miguel, Mar 09 '17 at 17:02
@IsmaelMiguel, u\u308 is two characters (in the Unix/wc -m sense at least) for one glyph/grapheme/grapheme cluster and is already mentioned and included in this answer. — Stéphane Chazelas, Mar 09 '17 at 17:27
"that makes little sense as printf is to print text" Well, one could argue that printf deals with C chars (bytes); it shouldn't deal with text locales, and it shouldn't have the burden of understanding the (possibly multibyte) charset encoding. But this line of defense conflicts with the (ISO C99) requirement that "%s" byte truncation should not result in "invalid" texts (truncated chars). Glibc even fails in that case (it prints nothing). A real mess. https://www.postgresql.org/message-id/000e0cd64822e8870604861d0168%40google.com — leonbloy, Mar 10 '17 at 21:12
@leonbloy, that might make sense for C's printf(3) (little sense after that C99 requirement you're mentioning, thanks for that), but not for the printf(1) utility, as every shell operator or other text utility deals with characters (or was modified to also deal with characters, like wc, which got a -m while -c stayed byte, or cut, which got a -b after -c could mean something other than bytes). — Stéphane Chazelas, Mar 10 '17 at 22:02
Even if it used characters rather than bytes, it still wouldn't be suitable for aligning columns. You need to know how many terminal cells each character occupies, which varies by character (0-2). — R.. GitHub STOP HELPING ICE, Mar 11 '17 at 01:04

Léa Gris · Answer 2 · 2020-06-14T00:22:17.423

${#var} has counted characters correctly since bash 3.0.

Try (with any version of bash):

bash -c "a="$'aáíóuúüoözu\u308\u1100'';printf "%s\n" "${a} ${#a}"'

That will give the correct count since bash 3.0.

Note, however, that $'u\u308' requires bash 4.2 or later.

This makes it possible to compute a proper padding:

#!/usr/bin/env bash
strings=(
  'Früchte und Gemüse'
  'Milchprodukte'
  '12345678901234567890'
)
# Initialize column width
cw=20
for str in "${strings[@]}"
do
  # Format column1 with computed padding
  printf -v col1string '%s%*s' "$str" $((cw-${#str})) ''
  # Print column1 with computed padding, followed by column2
  printf "%s %s\n" "$col1string" 'col2string'
done
done

Output:

Früchte und Gemüse   col2string
Milchprodukte        col2string
12345678901234567890 col2string
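
One case the loop above does not guard against is a string longer than the column: the computed width then goes negative, and printf treats a negative * width as a request for left-justification in a field of that absolute size rather than as "no padding". A minimal sketch of a guard, with pad_col as a hypothetical name, clamps the padding at zero:

```shell
# pad_col WIDTH STRING (hypothetical helper)
# Prints STRING followed by enough spaces to reach WIDTH, as counted by
# ${#var}; strings longer than WIDTH are printed unpadded.
pad_col() {
  pad=$(( $1 - ${#2} ))
  if [ "$pad" -lt 0 ]; then pad=0; fi
  printf '%s%*s' "$2" "$pad" ''
}

printf '%s|\n' "$(pad_col 20 'Milchprodukte')"
printf '%s|\n' "$(pad_col 20 'This string is much too long')"
```

Whether an overlong string should instead be truncated is a separate design choice; this sketch only avoids adding stray spaces after it.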

A fuller version with dedicated alignment functions:

#!/usr/bin/env bash
# Space pad align string to width
# @params
# $1: The alignment width
# $2: The string to align
# @stdout
# aligned string
# @return:
# 1: If a string exceeds alignment width
# 2: If missing arguments
align_left ()
{
  (($#==2)) || return 2
  ((${#2}>$1)) && return 1
  printf '%s%*s' "$2" $(($1-${#2})) ''
}
align_right ()
{
  (($#==2)) || return 2
  ((${#2}>$1)) && return 1
  printf '%*s%s' $(($1-${#2})) '' "$2"
}
align_center ()
{
  (($#==2)) || return 2
  ((${#2}>$1)) && return 1
  l=$((($1-${#2})/2))
  printf '%*s%s%*s' $l '' "$2" $(($1-${#2}-l)) ''
}
strings=(
  'Früchte und Gemüse'
  'Milchprodukte'
  '12345678901234567890'
)
echo 'Left-aligned:'
for str in "${strings[@]}"
do
  printf "| %s |\n" "$(align_left 20 "$str")"
done
echo
echo 'Right-aligned:'
for str in "${strings[@]}"
do
  printf "| %s |\n" "$(align_right 20 "$str")"
done
echo
echo 'Center-aligned:'
for str in "${strings[@]}"
do
  printf "| %s |\n" "$(align_center 20 "$str")"
done

Output:

Left-aligned:
| Früchte und Gemüse   |
| Milchprodukte        |
| 12345678901234567890 |
Right-aligned:
|   Früchte und Gemüse |
|        Milchprodukte |
| 12345678901234567890 |
Center-aligned:
|  Früchte und Gemüse  |
|    Milchprodukte     |
| 12345678901234567890 |

EDITS:

Add ksh-93 | POSIX implementation
More POSIXness with expr, now also tested working with:

ash (Busybox 1.x)
ksh93 Version A 2020.0.0
zsh 5.8

With advice from Stéphane Chazelas: replaced expr length "$2" by expr " $2" : '.*' - 1.
Updated introduction with isaac's comment.

This seems to work as well with ksh or POSIX syntax:

#!/usr/bin/env sh
# Space pad align or truncate string to width
# @params
# $1: The alignment width
# $2: The string to align
# @stdout
# The aligned string
# @return:
# 1: If the string was truncated to the alignment width
# 2: If missing arguments
__align_check ()
{
  if [ $# -ne 2 ]; then return 2; fi
  if [ "$(expr " $2" : '.*' - 1)" -gt "$1" ]; then
    printf '%s' "$(expr substr "$2" 1 $1)"
    return 1
  fi
}
align_left ()
{
  __align_check "$@" || return $?
  printf '%s%*s' "$2" $(($1-$(expr " $2" : '.*' - 1))) ''
}
align_right ()
{
  __align_check "$@" || return $?
  printf '%*s%s' $(($1-$(expr " $2" : '.*' - 1))) '' "$2"
}
align_center ()
{
  __align_check "$@" || return $?
  tpl=$(($1-$(expr " $2" : '.*' - 1)))
  lpl=$((tpl/2))
  rpl=$((tpl-lpl))
  printf '%*s%s%*s' $lpl '' "$2" $rpl ''
}
main ()
{
  hr="+----------------------+----------------------+----------------------+------+"
  echo "$hr"
  printf '| %s | %s | %s | %s |\n' \
    "$(align_left 20 'Left-aligned')" \
    "$(align_center 20 'Center-aligned')" \
    "$(align_right 20 'Right-aligned')" \
    "$(align_center 4 'RC')"
  echo "$hr"
  for str
  do
    printf '| %s | %s | %s | %s |\n' \
      "$(align_left 20 "$str")" \
      "$(align_center 20 "$str")" \
      "$(align_right 20 "$str")" \
      "$(align_right 4 "$?")"
  done
  echo "$hr"
}
main \
  'Früchte und Gemüse' \
  'Milchprodukte' \
  '12345678901234567890' \
  'This string is much too long'

Output:

+----------------------+----------------------+----------------------+------+
| Left-aligned         |    Center-aligned    |        Right-aligned |  RC  |
+----------------------+----------------------+----------------------+------+
| Früchte und Gemüse   |  Früchte und Gemüse  |   Früchte und Gemüse |    0 |
| Milchprodukte        |    Milchprodukte     |        Milchprodukte |    0 |
| 12345678901234567890 | 12345678901234567890 | 12345678901234567890 |    0 |
| This string is much  | This string is much  | This string is much  |    1 |
+----------------------+----------------------+----------------------+------+

${#var} is required by POSIX to expand to the number of characters in $var (at least as long as $var contains valid text in the locale), but there are still some shells like dash that don't support multibyte characters. head -c however is required to work with bytes, not characters so you can't use it there (you could use awk's substr(), though again not all awk implementations support multi-bytes, but then it would make more sense to do the whole thing in awk). — Stéphane Chazelas, Jun 12 '20 at 14:06
ksh93 has builtin support for properly padding text with %Ls as shown in my answer. — Stéphane Chazelas, Jun 12 '20 at 14:07
Well, following up on my first comment, there's no head -c in POSIX, but there's a tail -c which is required to work with bytes, so head implementations that support -c also work with bytes for consistency. — Stéphane Chazelas, Jun 12 '20 at 14:09
POSIX expr has no length. The expr API is also very broken. See for example expr length "length". POSIXly, you can do expr " $1" : '.*' - 1. Again ${#var} is POSIX to get the number of characters, so you don't need expr for that. — Stéphane Chazelas, Jun 12 '20 at 15:24
AFAICT, busybox expr's length and : count in bytes, not characters. — Stéphane Chazelas, Jun 12 '20 at 15:26
In fact, that ${#var} counts characters is true since bash3.0+. Try (with any version of bash) bash -c "a="$'aáíóuúüoözu\u308\u1100'';printf "%s\n" "${a} ${#a}"'. That will give the correct count since bash 3.0. Note however that $'u\u308' requires a bash to be 4.2+. Please edit the first line of your answer accordingly, thanks. — , Jun 13 '20 at 23:33

score 11 · Answer 3 · answered Mar 09 '17 at 12:36

"If I change its encoding to latin-1, the alignment is correct, but the umlauts are rendered wrong:"
Fr�chte und Gem�se   foo
Milchprodukte        bar
12345678901234567890 baz

Actually, no, but your terminal doesn't speak latin-1, and therefore you get junk rather than umlauts.

You can fix this by using iconv:

printf foo bar | iconv -f ISO8859-1 -t UTF-8

(or just run the whole shell script piped into iconv)
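
As a sketch of what that could look like for one row: the helper below (latin1_row is a hypothetical name) pads with a byte-counting printf while the text is still Latin-1, where every character is a single byte, and only then converts the finished line to UTF-8. It assumes the left-hand text is already ISO-8859-1 encoded; the ü is written as octal \374 so the example does not depend on the file's encoding.

```shell
# latin1_row LEFT RIGHT (hypothetical helper)
# LEFT and RIGHT must be ISO-8859-1 bytes; the padded line is emitted as UTF-8.
latin1_row() {
  printf '%-20s %s\n' "$1" "$2" | iconv -f ISO-8859-1 -t UTF-8
}

# \374 is ü in ISO-8859-1.
latin1_row "$(printf 'Fr\374chte und Gem\374se')" foo
latin1_row 'Milchprodukte' bar
```

This only helps for text that fits in a single-byte charset, and it assumes the shell's printf counts bytes, as bash's does.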

answered Mar 09 '17 at 12:36 by Wouter Verhelst

This is a useful comment but does not answer the core question. – gerrit Mar 09 '17 at 14:29
@gerrit how so? If printf does the right thing when printing in latin1, then have it print in latin1 and convert it to UTF-8 later? Seems like a proper fix for the core question to me. – Wouter Verhelst Mar 09 '17 at 14:52
The core question is "Why is it shrinking umlaut", the answer (as in other answers) is "because it doesn't support utf-8". It's not asking why are the umlauts rendered wrong or how can I fix the umlaut rendering. Either way, your suggestion is useful for the subset of utf-8 that can be represented as iso8859-1 (only). – gerrit Mar 09 '17 at 15:18
@WouterVerhelst, yes though that can only apply to text that can be encoded in a single-byte charset. – Stéphane Chazelas Mar 09 '17 at 16:08
I too read the question as "how can I get the output right" rather than "I don't mind the faulty output, as long as I know why". – Mr Lister Mar 11 '17 at 12:38
@MrLister so did I, although I (now) realize that the question isn't posed as such. Ah well :-) – Wouter Verhelst Mar 13 '17 at 16:09

score 1 · Answer 4 · answered Dec 07 '23 at 12:32

I would have been happy to find this answer:

You could work around it by telling the terminal to move the cursor to the desired position, instead of having printf count the characters:

$ printf "%s\033[10G-\n" "abc" "├─cd" "└──ef"
abc      -
├─cd     -
└──ef    -

Credit: https://unix.stackexchange.com/a/407135

answered Dec 07 '23 at 12:32 by cdanzmann

Or use $(tput hpa 10) to avoid hardcoding the escape sequence (which is not the same for every terminal). Using \r and $(tput cuf 10) as shown in my answer is slightly more portable. – Stéphane Chazelas Dec 07 '23 at 13:53
