23

What would be the closest to a portable way to get the display width of a string of characters from a shell script (at least on a terminal that displays characters in the current locale with the correct width)?

I'm primarily interested in the width of non-control characters but solutions that take into account control characters like backspace, carriage return, horizontal tabulation are welcome as well.

In other words, I'm looking for a shell API around the wcswidth() POSIX function.

That command should return:

$ that-command 'ｕｎｉｘ' # 4 fullwidth characters
8
$ that-command 'Stéphane' # 9 characters, one of which zero-width
8
$ that-command 'もで 諤奯ゞ' # 5 double-width Japanese characters and a space
11

One could use ksh93's printf '%<n>Ls', which takes the character width into account when padding to <n> columns, or the col command (with, for instance, printf '++%s\b\b--\n' <character> | col -b) to try and derive that, and there's at least a Text::CharWidth perl module, but are there more direct or portable approaches?

That's more or less a follow-up on that other question which was about displaying text at the right of the screen for which you would need to have that information before displaying the text.

6 Answers

10

In a terminal emulator, one could use the cursor position report to get before/after positions, e.g., from

...record position
printf '%s' "$string"
...record position

and find how wide the characters printed on the terminal are. Since that's an ECMA-48 (as well as VT100) control sequence supported by almost any terminal you are likely to use, it's fairly portable.

For reference

    CSI Ps n  Device Status Report (DSR).
              ...
                Ps = 6  -> Report Cursor Position (CPR) [row;column].
              Result is CSI r ; c R

Ultimately, the terminal emulator determines the printable width, because of these factors:

  • locale settings affect the way a string may be formatted, but the series of bytes sent to the terminal is interpreted based on how the terminal is configured (noting that some people will argue that it has to be UTF-8, while on the other hand portability was the feature requested in the question).
  • wcswidth alone does not tell how combining characters are handled; POSIX does not mention this aspect in the description of that function.
  • some characters (line-drawing for instance) which one might take for granted as single-width are (in Unicode) "ambiguous width", undermining portability of an application using wcswidth alone (see for example Chapter 2. Setting Up Cygwin). xterm for instance has provision for selecting double-width characters for configurations needing this.
  • to handle anything other than printable characters, you would have to rely upon the terminal emulator (unless you want to simulate that).

Shell APIs calling wcswidth are supported to varying degrees:

Those are more or less direct: simulating wcswidth in the case of Perl, calling C runtime from Ruby and Python. You could even use curses, e.g., from Python (which would handle combining characters):

  • initialize the terminal using setupterm (no text is written to the screen)
  • use the filter function (for single lines)
  • draw the text at the beginning of the line with addstr, checking for error (in case it is too long), and then for the ending position
  • if there is room, adjust the starting position.
  • call endwin (which should not do a refresh)
  • write the resulting information about the starting position to standard output

Using curses for output (rather than feeding the information back to a script or directly calling tput) would clear the whole line (filter does limit it to a line).

Thomas Dickey
  • i think this must be the only way, really. if the terminal doesn't support double-width chars, then it doesn't much matter what wcswidth() has to say about anything at all. – mikeserv Nov 23 '15 at 23:22
  • In practice, the only problem I've had with this method is plink, which sets TERM=xterm even though it doesn't respond to any control sequence. But I don't use very exotic terminals. – Gilles 'SO- stop being evil' Nov 24 '15 at 00:15
  • Thanks. but the idea was to get that information prior to display the string on the terminal (to know where to display it, that's a follow-up on the recent question about displaying a string on the right of the terminal, maybe I should have mentioned that though my real question was really about how to get to wcswidth from the shell). @mikeserv, yes wcswidth() may be wrong about how a specific terminal would display a particular string, but that's as close as you can get to a terminal-independant solution and that's what col/ksh-printf use on my system. – Stéphane Chazelas Nov 24 '15 at 11:54
  • I'm aware of that, but wcswidth isn't directly accessible except via less-portable features (you could do this in perl, by making some assumptions - see http://search.cpan.org/dist/Text-CharWidth/CharWidth.pm). The right-alignment question by the way could be (perhaps) improved by writing the string to the lower-left and then using the cursor-position and insert-controls to shift it to the lower-right. – Thomas Dickey Nov 24 '15 at 11:59
  • @StéphaneChazelas - fold is apparently spec'd to handle multi-byte and extended width characters. Here's how it should handle backspace: The current count of line width shall be decremented by one, although the count never shall become negative. The fold utility shall not insert a <newline> immediately before or after any <backspace>, unless the following character has a width greater than 1 and would cause the line width to exceed width. Maybe fold -w[num] and pr +[num] could be teamed up somehow? – mikeserv Nov 24 '15 at 14:26
  • @Stéphane Chazelas It would be a horrible hack but you could set the foreground color to black, print the character, and then try to measure the width. Of course, this won't work for Emoji, which some terminals render in color. You could also scrape the display width for all unicode points for a range of terminals ahead of time, and store these in the program. – MRule Sep 29 '21 at 11:03
10

For one-line strings, the GNU implementation of wc has a -L (a.k.a. --max-line-length) option that does exactly what you're looking for (except the control chars).
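
A minimal wrapper around that option might look like the sketch below. It assumes the GNU implementation of wc (the -L option is not in POSIX), and the function name strwidth is made up for this example. For non-ASCII text the result depends on the current locale.

```sh
# strwidth STRING: print the display width of a one-line STRING, as computed
# by wc -L (assumes GNU wc; other implementations may count bytes or lack -L).
strwidth() {
  # The arithmetic expansion strips any padding wc may put around the number.
  echo "$(( $(printf '%s\n' "$1" | wc -L) ))"
}
```

The trailing newline is added by printf so that the input is a properly delimited line; it does not count towards the width.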

egmont
  • Thanks. I had no idea it would return the display width. Note that the FreeBSD implementation also has a -L option, the doc says it returns the number of characters in the longest line, but my test seems to indicate it's a number of bytes instead (not the display width in any case). OS/X has no -L even though I'd have expected it to derive from FreeBSD's. – Stéphane Chazelas Jan 29 '16 at 15:49
  • It seems to handle tab as well (assumes tab stops every 8 columns). – Stéphane Chazelas Jan 29 '16 at 15:53
  • Actually, for more-than-one-line strings, I would say it also does exactly what I'm looking for, as in it handles the LF control characters properly. – Stéphane Chazelas Apr 19 '16 at 18:11
  • @StéphaneChazelas: Are you still having the issue that this returns the number of bytes rather than the number of characters? I tested it on your data and get the results you wanted: wc -L <<< 'ｕｎｉｘ' → 8, wc -L <<< 'Stéphane' → 8, and wc -L <<< 'もで 諤奯ゞ' → 11. P.S. You consider “Stéphane” to be nine characters, one of which is zero-width? It looks to me like eight characters, one of which is multi-byte. – G-Man Says 'Reinstate Monica' Jun 24 '19 at 20:41
  • @G-Man, I was refering to the FreeBSD implementation, which in FreeBSD 12.0 and a UTF-8 locale still seems to be counting bytes. Note that é can be written using one U+00E9 character or a U+0065 (e) character followed by U+0301 (combining acute accent), the latter being the one showed in the question. – Stéphane Chazelas Jun 24 '19 at 20:51
  • @StéphaneChazelas – I get 0 out of printf $'\xe2\x80\xa6' |wc -L on FreeBSD 11.2 vs 1 with GNU coreutils 8.30 on Debian Testing (11, Bullseye). That prints a horizontal ellipsis (U+2026 ) which is one character wide in a fixed-width font. – Adam Katz Dec 05 '19 at 15:59
  • @AdamKatz, on FreeBSD 12.1-RELEASE-p6 and in a UTF-8 locale, I get 0 for printf $'\xe2\x80\xa6' |wc -L and printf abc | wc -L (I suppose those don't contain any line), but I get 3 for both printf $'\xe2\x80\xa6\n' |wc -L and printf 'abc\n' | wc -L, so still the number of bytes in the (properly delimited) line with the most bytes. – Stéphane Chazelas Jul 23 '20 at 18:20
6

In my .profile, I call a script to determine the width of a string on a terminal. I use this when logging in on the console of a machine where I don't trust the system-set LC_CTYPE, or when I log in remotely and can't trust LC_CTYPE to match the remote side. My script queries the terminal, rather than calling any library, because that was the whole point in my use case: determine the encoding of the terminal.

This is fragile in several ways:

  • it modifies the display, so it isn't very nice user experience;
  • there's a race condition if another program displays something at the wrong time;
  • it locks up if the terminal doesn't respond. (A few years ago I asked how to improve on this, but it hasn't been much of an issue in practice so I never got around to switching to that solution. The only case I encountered of a terminal that doesn't respond was a Windows Emacs accessing remote files from a Linux machine with the plink method, and I solved it by using the plinkx method instead.)

This may or may not match your use case.

#! /bin/sh

if [ z"$ZSH_VERSION" = z ]; then :; else
  emulate sh 2>/dev/null
fi
set -e

help_and_exit () {
  cat <<EOF
Usage: $0 {-NUMBER|TEXT}
Find out the width of TEXT on the terminal.

LIMITATION: this program has been designed to work in an xterm. Only
xterm and sufficiently compatible terminals will work. If you think
this program may be blocked waiting for input from the terminal,
try entering the characters "0n0n" (digit 0, lowercase letter n,
repeat).

Display TEXT and erase it. Find out the position of the cursor before
and after displaying TEXT so as to compute the width of TEXT. The width
is returned as the exit code of the program. A value of 100 is returned if
the text is wider than 100 columns.

TEXT may contain backslash-escapes: \\0DDD represents the byte whose numeric
value is DDD in octal. Use '\\\\' to include a single backslash character.

You may use -NUMBER instead of TEXT (if TEXT begins with a dash, use
"-- TEXT"). This selects one of the built-in texts that are designed
to discriminate between common encodings. The following table lists
supported values of NUMBER (leftmost column) and the widths of the
sample text in several encodings.

  1  ASCII=0 UTF-8=2 latinN=3 8bits=4
EOF
  exit
}

builtin_text () {
  case $1 in
    -*[!0-9]*)
      echo 1>&2 "$0: bad number: $1"
      exit 119;;
    -1) # UTF8: {\'E\'e}; latin1: {\~A\~A\copyright}; ASCII: {}
      text='\0303\0211\0303\0251';;
    *)
      echo 1>&2 "$0: there is no text number $1. Stop."
      exit 118;;
  esac
}

text=
if [ $# -eq 0 ]; then
  help_and_exit 1>&2
fi
case "$1" in
  --) shift;;
  -h|--help) help_and_exit;;
  -[0-9]) builtin_text "$1";;
  -*)
    echo 1>&2 "$0: unknown option: $1"
    exit 119
esac
if [ z"$text" = z ]; then
  text="$1"
fi

printf "" # test that it is there (abort on very old systems)

csi='\033['
dsr_cpr="${csi}6n" # Device Status Report --- Report Cursor Position
dsr_ok="${csi}5n" # Device Status Report --- Status Report

stty_save=`stty -g`
if [ z"$stty_save" = z ]; then
  echo 1>&2 "$0: \`stty -g' failed ($?)."
  exit 3
fi
initial_x=
final_x=
delta_x=

cleanup () {
  set +e
  # Restore terminal settings
  stty "$stty_save"
  # Restore cursor position (unless something unexpected happened)
  if [ z"$2" = z ]; then
    if [ z"$initial_report" = z ]; then :; else
      x=`expr "${initial_report}" : "\\(.*\\)0"`
      printf "%b" "${csi}${x}H"
    fi
  fi
  if [ z"$1" = z ]; then
    # cleanup was called explicitly, so don't exit.
    # We use `trap : 0' rather than `trap - 0' because the latter doesn't
    # work in older Bourne shells.
    trap : 0
    return
  fi
  exit $1
}
trap 'cleanup 120 no' 0
trap 'cleanup 129' 1
trap 'cleanup 130' 2
trap 'cleanup 131' 3
trap 'cleanup 143' 15

stty eol 0 eof n -echo
printf "%b" "$dsr_cpr$dsr_ok"
initial_report=`tr -dc \;0123456789`
# Get the initial cursor position. Time out if the terminal does not reply
# within 1 second. The trick of calling tr and sleep in a pipeline to put
# them in a process group, and using "kill 0" to kill the whole process
# group, was suggested by Stephane Gimenez at
# https://unix.stackexchange.com/questions/10698/timing-out-in-a-shell-script
#trap : 14
#set +e
#initial_report=`sh -c 'ps -t $(tty) -o pid,ppid,pgid,command >/tmp/p;
#                       { tr -dc \;0123456789 >&3; kill -14 0; } |
#                       { sleep 1; kill -14 0; }' 3>&1`
#set -e
#initial_report=`{ sleep 1; kill 0; } |
#                { tr -dc \;0123456789 </dev/tty; kill 0; }`
if [ z"$initial_report" = z"" ]; then
  # We couldn't read the initial cursor position, so abort.
  cleanup 120
fi
# Write some text and get the final cursor position.
printf "%b%b" "$text" "$dsr_cpr$dsr_ok"
final_report=`tr -dc \;0123456789`

initial_x=`expr "$initial_report" : "[0-9][0-9]*;\\([0-9][0-9]*\\)0" || test $? -eq 1`
final_x=`expr "$final_report" : "[0-9][0-9]*;\\([0-9][0-9]*\\)0" || test $? -eq 1`
delta_x=`expr "$final_x" - "$initial_x" || test $? -eq 1`

cleanup
# Zsh has function-local EXIT traps, even in sh emulation mode. This
# is a long-standing bug.
trap : 0

if [ $delta_x -gt 100 ]; then
  delta_x=100
fi
exit $delta_x

The script returns the width in its return status, clipped to 100. Sample usage:

widthof -1
case $? in
  0) export LC_CTYPE=C;; # 7-bit charset
  2) locale_search .utf8 .UTF-8;; # utf8
  3) locale_search .iso88591 .ISO8859-1 .latin1 '';; # 8-bit with nonprintable 128-159, we assume latin1
  4) locale_search .iso88591 .ISO8859-1 .latin1 '';; # some full 8-bit charset, we assume latin1
  *) export LC_CTYPE=C;; # weird charset
esac
  • This was helpful to me (though I mostly used your condensed version). I made its usage a little prettier by adding printf "\r%*s\r" $((${#text}+8)) " "; to the end of cleanup (adding 8 is arbitrary; it needs to be long enough to cover the wider output of older locales but narrow enough to avoid a line wrap). This makes the test invisible, though it also assumes nothing has been printed on the line (which is fine in a ~/.profile) – Adam Katz Dec 15 '19 at 22:08
  • Actually, it appears from a little experimentation that in zsh (5.7.1) you can just do text="Éé" and then ${#text} will give you the display width (I get 4 in a non-unicode terminal and 2 in a unicode-compliant terminal). This is not true for bash. – Adam Katz Dec 15 '19 at 23:23
  • @AdamKatz ${#text} doesn't give you the display width. It gives you the number of characters in the encoding used by current locale. Which is useless for my purpose since I want to determine the terminal's encoding. It is useful if you want the display width for some other reason, but it isn't accurate because not every character is one unit wide. For example combining accents have a width of 0, and Chinese ideograms have a width of 2. – Gilles 'SO- stop being evil' Dec 16 '19 at 13:28
  • Yeah, good point. It may satisfy Stéphane’s question but not your original intent (which is actually what I wanted to do too, thus my adapting your code). Hopefully my first comment was helpful to you, Gilles. – Adam Katz Dec 16 '19 at 16:38
5

Eric Pruitt wrote an impressive implementation of wcwidth() and wcswidth() in Awk available at wcwidth.awk. It mainly provides 4 functions

wcscolumns(), wcstruncate(), wcwidth(), wcswidth()

where wcscolumns() also tolerates non-printable characters.

$ cat wcscolumns.awk 
{ printf "%d\n", wcscolumns($0) }
$ awk -f wcwidth.awk -f wcscolumns.awk <<< 'ｕｎｉｘ'
8
$ awk -f wcwidth.awk -f wcscolumns.awk <<< 'Stéphane'
8
$ awk -f wcwidth.awk -f wcscolumns.awk <<< 'もで 諤奯ゞ'
11
$ awk -f wcwidth.awk -f wcscolumns.awk <<< $'My sign is\t鼠鼠'
14

I opened an issue asking about the handling of TABs since wcscolumns($'My sign is\t鼠鼠') should be greater than 14. Update: Eric added the function wcsexpand() to expand TABs to spaces:

$ cat >wcsexpand.awk 
{ printf "%d\n", wcscolumns( wcsexpand($0, 8) ) }
$ awk -f wcwidth.awk -f wcsexpand.awk <<< $'My sign is\t鼠鼠'
20
$ echo $'鼠\tone\n鼠鼠\ttwo'
鼠      one
鼠鼠    two
$ awk -f wcwidth.awk -f wcsexpand.awk <<< $'鼠\tone\n鼠鼠\ttwo'
11
11
xebeche
2

To expand on the hints at possible solutions using col and ksh93 in my question:

Using the col from bsdmainutils on Debian (may not work with other col implementations), to get the width of a single non-control character:

charwidth() {
  set "$(printf '...%s\b\b...\n' "$1" | col -b)"
  echo "$((${#1} - 4))"
}

Example:

$ charwidth x
1
$ charwidth $'\u301'
0
$ charwidth $'\u94f6'
2

Extended for a string:

stringwidth() {
   awk '
     BEGIN{
       s = ARGV[1]
       l = length(s)
       for (i=0; i<l; i++) {
         s1 = s1 ".."
         s2 = s2 "\b\b"
       }
       print s1 s s2 s1
       exit
     }' "$1" | col -b | awk '
        {print length - 2 * length(ARGV[2]); exit}' - "$1"
}

Using ksh93's printf '%Ls':

charwidth() {
  set "$(printf '.%2Ls.' "$1")"
  echo "$((5 - ${#1}))"
}

stringwidth() {
  # Pad to twice the character count in columns; the padding needed reveals the width.
  set "$(printf '.%*Ls.' "$((2*${#1}))" "$1")" "$1"
  echo "$((2 + 3 * ${#2} - ${#1}))"
}

Using perl's Text::CharWidth:

stringwidth() {
  perl -MText::CharWidth=mbswidth -le 'print mbswidth shift' -- "$@"
}
1

With zsh, ${#string} gives you the length of the string in characters, like in ksh (or in bytes if the multibyte option is turned off). Since version 4.3.7 (2008), when combined with the m parameter expansion flag, that becomes the display width (using the standard wcwidth() function underneath, with zsh even providing its own, currently based on data from Unicode 9, for systems that don't have it). So there, it's just:

width=${(m)#string}

Note that ASCII control characters (including TAB, BS, NL, CR, NUL) and bytes not forming part of valid characters are counted as 1.

With older versions of zsh, you can use the l left-padding parameter expansion flag.

It pads according to the display width of the character, by default before version 4.3.3 and if you add the m flag since 4.3.3.

So one can calculate the width of a string with something like:

width() print $(($#1 * 3 - ${#${(ml[$#1 * 2])1}}))

Examples (with some comparisons with GNU wc -L):

$ width 'ｕｎｉｘ'
8
$ width $'Ste\u0301phane'
8
$ width 'もで 諤奯ゞ'
11

$ print 'a\tb'
a       b
$ width $'a\tb'
3
$ print 'a\tb' | wc -L
9

$ print 'a\bb'
b
$ width $'a\bb'
3
$ print 'a\bb' | wc -L
2