
Our RHEL 7 machines have very long log files, and I asked about buffering of cut in this question. That question still stands, but a bit of experimentation revealed a different issue.

I decided to try cutting by bytes instead of by characters and discovered that the output buffering differs between the two on one machine but not the other:

On one machine, the two loops:

for ((ii=0;ii<5;ii++)); do  date; usleep 500000 ; done |  cut -b 1-99
for ((ii=0;ii<5;ii++)); do  date; usleep 500000 ; done |  cut -c 1-99

(note the -c vs. -b given to cut) both display the dates five times as the loops progress.

On the other machine, this loop doing a cut by byte:

for ((ii=0;ii<5;ii++)); do  date; usleep 500000 ; done |  cut -b 1-99

displays the times as the loop progresses, while this loop:

for ((ii=0;ii<5;ii++)); do  date; usleep 500000 ; done |  cut -c 1-99

holds the output until the loop is complete. If I let it run forever, it emits a batch of times for every 8192 bytes of output. There are two times per second, as expected, but the output is buffered.

Two questions:

  1. Why is one system different from the other?
  2. Why is the output buffering different for the two usages of cut?
  • As to your second question, -b cuts by byte count, while -c cuts by character count. In the modern Unicode and UTF-8 and UTF-16 era, not all characters are one byte. – DopeGhoti Oct 27 '22 at 00:26
  • Assuming that^ is the problem, what locales are the two systems using? – muru Oct 27 '22 at 06:06
  • I can't believe it but some of our machines are set LANG=en_US and others LANG=en_US.UTF-8. If I set the locale to en_US I stop having the problem with cut. – user1683793 Oct 27 '22 at 21:03

1 Answer


It's not stdout buffering this time, actually. (The default for stdout is to line-buffer only when output goes to a terminal, and to fully buffer otherwise.)

First, that's not an upstream coreutils feature, and you can't reproduce the issue on e.g. Debian. Whatever the man page and the --help output say, the actual upstream code treats -c and -b the same, see e.g.: https://github.com/coreutils/coreutils/blob/v9.1/src/cut.c#L483

However, there's an internationalization patch, coreutils-i18n which provides support for multi-byte characters based on the locale, and which Red Hat appears to carry.

The patch also provides a separate input buffering macro, used for cut -c, here:

+/* Refill the buffer BUF to get a multibyte character. */
+#define REFILL_BUFFER(BUF, BUFPOS, BUFLEN, STREAM)                        \
+  do                                                                        \
+    {                                                                        \
+      if (BUFLEN < MB_LEN_MAX && !feof (STREAM) && !ferror (STREAM))        \
+        {                                                                \
+          memmove (BUF, BUFPOS, BUFLEN);                                \
+          BUFLEN += fread (BUF + BUFLEN, sizeof(char), BUFSIZ, STREAM); \
+          BUFPOS = BUF;                                                        \
+        }                                                                \
+    }                                                                        \
+  while (0)

It's not a loop, but the fread() there blocks until EOF or until it has a full buffer. Running the program under ltrace (not strace) showed it blocking on fread_unlocked() on the CentOS system I tried.

There's nothing you can do about that: the implementation tells stdio it needs BUFLEN bytes, so that's that. No, disabling input buffering doesn't help, since that only affects whether stdio reads ahead more than the application asked for.

The i18n patch seems to have had other issues too, at least in the past, see e.g. https://lwn.net/Articles/535735/ and https://bugzilla.redhat.com/show_bug.cgi?id=499220


If you only have ASCII characters, you can switch to cut -b, which does the same thing you'd get with cut -c on some other Linux systems anyway. Alternatively, switch to some other tool, maybe something like perl -C -ne 'print substr($_, 0, 99)'.

ilkkachu
  • As I observed above, changing the locale made the problem go away. I wonder whether, without the i18n patch in place, I would have seen this in the first place. – user1683793 Oct 27 '22 at 21:06
  • @user1683793, I didn't look too closely at how it works with locales, but I guess it might be it detects that your locale doesn't have wide chars, and falls back to the original implementation, which doesn't have that input buffering issue – ilkkachu Oct 28 '22 at 06:52