printf: multibyte characters

Question

When trying to format printf output involving strings containing multi-byte characters, it became clear that printf counts bytes rather than characters, which makes it difficult to align text that mixes single-byte and multi-byte characters. For example:

$ cat script
#!/bin/bash
declare -a a b
a+=("0")
a+=("00")
a+=("000")
a+=("0000")
a+=("00000")
b+=("0")
b+=("├─00")
b+=("├─000")
b+=("├─0000")
b+=("└─00000")
printf "%-15s|\n" "${a[@]}" "${b[@]}"

$ ./script
0              |
00             |
000            |
0000           |
00000          |
0              |
├─00       |
├─000      |
├─0000     |
└─00000    |

I found various suggested work-arounds (mainly wrappers using another language or utility to print the text). Are there any native bash solutions? None of the documented printf format strings appear to help. Would the locale settings be relevant in this situation, e.g., to use a fixed-width character encoding like UTF-32?

Even UTF-32 wouldn’t fix everything, since you’re trying to calculate the displayed width (think of combining marks and grapheme clusters) :-(. — Stephen Kitt, Nov 16 '17 at 22:27
Use zsh or fish where that was fixed, or use ksh93's printf '%-15Ls' or use expand or some of the other solutions at Why is printf "shrinking" umlaut? which looks like a duplicate to me. — Stéphane Chazelas, Nov 26 '17 at 21:16
@StéphaneChazelas: Well, the answers may overlap, but the question is not a duplicate. The question in the linked post ("why does printf not handle certain unicode characters appropriately") is different from this one ("are there any native bash solutions for formatted printing of strings combining single- and multi-byte characters"). Based on the POSIX requirement that printf respect bytes rather than characters, the answer to this question appears to be "no" if one wants to be able to print both to a terminal and to a file. — user001, Nov 26 '17 at 22:02
bash is a command line interpreter, you can invoke any command within bash like zsh -c 'printf "$@"' zsh "%-15s|\n" "${a[@]}" "${b[@]}" for instance, or any of the commands mentioned there to align text. native solution makes little sense for a tool that is designed to run other tools. You could implement a solution in bash that doesn't invoke non-builtin utilities, but that's not how you do things in shells. — Stéphane Chazelas, Nov 26 '17 at 22:21
Anyway I've reopened in case anyone wants to have a go at hacking a bash-with-no-non-builtin-command solution if that's really what you want (I had missed that part from the question). — Stéphane Chazelas, Nov 26 '17 at 22:26

ilkkachu · Answer 1 · 2017-11-26T22:22:25.160

You could work around it by telling the terminal to move the cursor to the desired position, instead of having printf count the characters:

$ printf "%s\033[10G-\n" "abc" "├─cd" "└──ef"
abc      -
├─cd     -
└──ef    -

Well, assuming you're printing to a terminal, that is...

The control sequence there is <ESC>[nnG where nn is the column to move to, in decimal.

Of course, if the first column is longer than the allocated space, the result isn't too nice:

$ printf "%s\033[10G-\n" "abcdefghijkl"
abcdefghi-kl

To work around that, you could explicitly clear the rest of the line (<ESC>[K) before printing the following column.

$ printf "%s\033[10G\033[K-\n" "abcdefghijkl"
abcdefghi-
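
The two escape sequences can be wrapped in a small helper. This is only a sketch: the function name row and the tab-separated fallback for non-terminal output are assumptions made for this example, not something the commands above require.

```shell
#!/bin/bash
# row COLUMN LEFT RIGHT: print LEFT, then RIGHT starting at terminal column COLUMN.
# (Hypothetical helper; the name and the fallback behaviour are assumptions.)
row() {
    local col=$1 left=$2 right=$3
    if [ -t 1 ]; then
        # ESC [ <col> G moves the cursor to the column,
        # ESC [ K clears from the cursor to the end of the line.
        printf '%s\033[%dG\033[K%s\n' "$left" "$col" "$right"
    else
        # Not a terminal: escape sequences would only end up as junk in the
        # output, so fall back to a plain tab separator.
        printf '%s\t%s\n' "$left" "$right"
    fi
}

row 10 "abc" "-"
row 10 "├─cd" "-"
row 10 "abcdefghijkl" "-"
```

When the output is redirected to a file or a pipe, the alignment is lost, which is the limitation mentioned above.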

Another way would be to do the padding manually, assuming we have something that can determine the length of the string in characters. This seems to work in Bash for simple characters, but is of course a bit ugly. Zero-width and double-width characters will probably break it, and I didn't test combining characters either.

#!/bin/bash
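pad() {
    # parameters:
    #  1: name of variable to pad
    #  2: length to pad to
    local string=${!1}
    local len=${#string}
    local fill=$(($2 - len))
    # A negative field width would make printf left-justify and still add spaces.
    ((fill < 0)) && fill=0
    printf -v "$1" "%s%*s" "$string" "$fill" ""
}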
echo "1234567890"
for x in "abc" "├─cd" "└──ef" ; do
    pad x 9
    printf "%s-\n" "$x"
done

And the output is:

1234567890
abc      -
├─cd     -
└──ef    -

Your clever solution does not externalize printing to another agent, and thus addresses the stated requirement of intrinsicality. — user001, Nov 26 '17 at 21:50

score 3 · Answer 2 · answered Jan 15 '18 at 00:42

Here is a solution that uses wc -L:

for i in "${a[@]}" "${b[@]}"
do printf "%s%*s|\n" "$i" "$((15 - $(wc -L <<< "$i")))" ""
done

0              |
00             |
000            |
0000           |
00000          |
0              |
├─00           |
├─000          |
├─0000         |
└─00000        |

wc -L prints the display width of the input (strictly, of its longest line), so it also works for double-width characters and the like.
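
The same idea can be packaged as two functions. This is a sketch under assumptions: display_width and pad_right are names invented here, and it relies on a wc that supports -L (GNU coreutils does; the option is not specified by POSIX).

```shell
#!/bin/bash
# display_width STRING: print the display width of STRING as reported by wc -L.
# Assumes wc supports -L, which is not part of POSIX.
display_width() {
    printf '%s\n' "$1" | wc -L
}

# pad_right WIDTH STRING: print STRING followed by enough spaces
# to fill WIDTH display columns (no trailing newline).
pad_right() {
    local width=$1 string=$2 fill
    fill=$(( width - $(display_width "$string") ))
    # Never pass a negative width to printf: it would pad anyway.
    [ "$fill" -lt 0 ] && fill=0
    printf '%s%*s' "$string" "$fill" ""
}

for s in "0000" "├─00"; do
    pad_right 15 "$s"
    printf '|\n'
done
```

The width that wc -L reports depends on the current locale, so the multi-byte rows only line up when the script runs in a locale matching the encoding of the strings (UTF-8 here).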

taiyu

Thanks, I was not aware of the -L option to wc. – user001 Jan 15 '18 at 03:22
You can do pretty much the same thing with the -m option. – G-Man Says 'Reinstate Monica' Jan 15 '18 at 05:15
@G-Man For single-width chars, yes. I think it does not work for 0 or double-width chars. – xebeche Jun 24 '19 at 09:12
@xebeche: By gosh, you’re right: wc -L and wc -m behave differently in the presence of zero-width and double-width characters! The man page doesn’t even begin to make that clear. (But it doesn’t do a good job of explaining the difference between -c and -m, either.) You deserve a medal for pointing that out! But, let’s be fair: igal’s answer, ilkkachu’s (second) answer … (Cont’d) – G-Man Says 'Reinstate Monica' Jun 24 '19 at 19:14
(Cont’d) … and Stéphane Chazelas’s answer also fail for those inputs; you should probably comment on them, too.   (Although ilkkachu does acknowledge in their answer that it might not handle those cases correctly.)   ilkkachu’s (first) answer (which works only when displayed on a terminal) and taiyu’s answer (this one) seem to be the only ones to get it right.   This answer deserves more votes! – G-Man Says 'Reinstate Monica' Jun 24 '19 at 19:14

igal · Answer 3 · 2017-11-26T21:46:55.027

I did a little web-searching, but I wasn't able to find a resolution for your problem in pure Bash, and I think there may not be one. I came across the following StackOverflow post:

UTF-8 Width Display Issue of Chinese Characters

The top-voted answer there (posted by user tchrist) includes the following:

Yes, this is a problem with all versions of printf that I am aware of. I briefly discuss the matter in this answer and also in this one.

I also came across the following post on the Unix StackExchange:

Why is printf "shrinking" umlaut?

The accepted solution there includes the following explanation:

POSIX requires printf's %-20s to count those 20 in terms of bytes not characters even though that makes little sense as printf is to print text, formatted (see discussion at the Austin Group (POSIX) and bash mailing lists).

It seems that what you want to do may not be possible with printf and that you'll have to roll your own solution.

I was able to produce the desired output using a Python script. Maybe you'll find it useful:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""script.py"""

# Set the default character encoding to UTF-8
# (Python 2 only: reload() and sys.setdefaultencoding() are not available in Python 3)
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

# Array of ASCII characters
a=[("0")]
a+=[("00")]
a+=[("000")]
a+=[("0000")]
a+=[("00000")]

# Array of UTF-8 Characters
b=[("0")]
b+=[("├─00")]
b+=[("├─000")]
b+=[("├─0000")]
b+=[("└─00000")]

# Print the elements from both arrays
for x in a + b:
    print (u"%-15s|" % x).encode('utf-8')

Here is what I get when I run the script:

user@host:~$ python script.py

0              |
00             |
000            |
0000           |
00000          |
0              |
├─00           |
├─000          |
├─0000         |
└─00000        |

@ThomasDickey Thanks for the comment. I hope I'm not embarrassing myself too badly here, but it doesn't look like ASCII to me. When I apply hexdump to ├─ I get the following sequence: 94e2 e29c 8094 000a. What am I missing? — igal, Nov 26 '17 at 17:02
I was commenting about the 0's (which in the latter part of the example have more effect on the alignment than the unchanging line-drawing characters). — Thomas Dickey, Nov 26 '17 at 18:01
Note that it's fixed in the zsh and fish implementations of printf and ksh93 with an alternative syntax (that also addresses problems with zero-width and double-width characters). — Stéphane Chazelas, Nov 26 '17 at 21:25
Thanks, I figured that a width-aware printf implementation, as provided in other languages, might be necessary. — user001, Nov 26 '17 at 21:45

Stéphane Chazelas · Answer 4 · 2021-08-30T09:21:41.600

Why is printf "shrinking" umlaut? has a few proper solutions, which either invoke proper tools for the job (since bash lacks the capability internally) or switch to a different shell. But if you really want to implement it in bash with only builtin commands, there are ways to do it for single-width (potentially multi-byte) characters.

In bash, as in all POSIX shells, ${#string} gives the length of $string in characters; the same expansion evaluated in the C locale gives its length in bytes.
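
A minimal illustration of the two lengths, assuming the script file itself is saved as UTF-8. The helper names char_len and byte_len are made up for this example.

```shell
#!/bin/bash
# char_len STRING: length in characters under the current locale.
char_len() { printf '%s\n' "${#1}"; }

# byte_len STRING: length in bytes, obtained by switching to the C locale
# for the duration of the function only.
byte_len() { local LC_ALL=C; printf '%s\n' "${#1}"; }

s='├─00'
printf 'characters: %s, bytes: %s\n' "$(char_len "$s")" "$(byte_len "$s")"
```

In a UTF-8 locale this reports 4 characters and 8 bytes for ├─00 (each box-drawing character is 3 bytes in UTF-8); in the C locale both numbers are 8. The difference between the two is exactly the amount by which a byte-counting printf under-pads the string.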

So you can account for the discrepancy with something like:

clength() { clength=${#1}; }
blength() { local LC_ALL=C; blength=${#1}; }
align() {
  local format="$1" width="$2" arg blength clength
  shift 2
  for arg do
    clength "$arg"; blength "$arg"
    printf "$format" "$((width + blength - clength))" "$arg"
  done
}
a=(0 00 000 0000 00000)
b=(0 ├─00 ├─000 ├─0000 └─00000)
align '%-*s|\n' 12 "${a[@]}" "${b[@]}"

To account for zero-width characters (like combining marks) or double-width characters, there is no bash-only solution unless you're ready to hard-code the list of such characters in your script. Alternatively, you could use terminal escape sequences to have the terminal align the text (last example there, or there), but then you would have to hard-code the escape sequences for all supported terminals, as bash doesn't have a builtin interface to terminfo/termcap either. zsh and ksh93 are the only shells I know of that have built-in support for aligning characters of variable display width (example also in the linked Q&A).
