Print out binary data as is without breaking the terminal

Question

I've written a Python module to dump objects. When I use it on objects that hold binary data (e.g. numpy.ndarray) in a Linux terminal (TERM=linux), however, printing the data breaks the terminal's font: apparently, some characters are treated as terminal control sequences. On Windows, printing works fine, even in Cygwin's mintty terminal (although it has TERM=xterm).

The same happens when I cat a binary file.

I can fix that with reset, of course, but at the cost of losing the output, and it's generally inconvenient. I do know, though, that most, if not all, control characters have alternative graphical representations in fonts (e.g. for CR, it's ♪).

So, is there some way to alter the raw stream to make the linux terminal treat special characters that were in it like literals? Basically, I wish to see something like this:

I'm primarily interested in a programmatic way (=what needs to be done from terminal's standpoint and an implementation in common system libraries if there is one); a way in shell would be a plus.

Python's repr() doesn't fit my needs: it expands every character outside printable ASCII, including national letters, into a variable-length escape sequence, while the module's design goal is for the dump printout to be concise and readable.

@drewbenn "the module's design goal is for the dump printout to be concise and readable". The image is PC's (and, consequently, DOS' and Windows') native way to represent binary data, and it's as close to that goal as possible: it always represents one byte with one character (so I readily see how many bytes everything takes, thus offsets), has almost 1-to-1 glyph-to-value mapping, and lets me quickly identify any embedded readable data (even UTF-8 in it, while not readable, has a very distinctive pattern). Basically, if it's there and it does the job, why invent anything else? — ivan_pozdeev, Dec 02 '16 at 21:12
@drewbenn I used that module in IronPython, Windows and Cygwin CPython for years, and only in Linux proper did I run into problems. It's only natural I wish to get around them with as little difference in behaviour as possible. — ivan_pozdeev, Dec 02 '16 at 21:19
@drewbenn I can show its usual output and let you decide for yourself if xxd or hexdump-style output would be out of place there. — ivan_pozdeev, Dec 02 '16 at 21:23

score 3 · Answer 1 · edited Apr 13 '17 at 12:36

The showconsolefont program can display 256 different (or 512 different...) glyphs at once on the Linux console. But it does this using a system call (which happens to only work for connections to the console device). Its manual page doesn't mention that.

But glyphs (which are used to display characters) aren't the same thing as characters. You would display a character by printing it on the terminal, and the terminal maps that to a glyph. There's no escape sequence which can tell the Linux console to treat control characters as printable.

For instance, showconsolefont doesn't actually write control characters for cells 0-31. It maps printable characters into the range 0-31 using (you guessed it) a system call.

Further reading:

Why does showconsolefont have different output in tmux?

answered Dec 02 '16 at 16:13 by Thomas Dickey (76,765) · edited Apr 13 '17 at 12:36 by Community

https://linux.die.net/man/4/console_codes is a reference of linux terminal control codes. – ivan_pozdeev Dec 02 '16 at 21:43
If there had been a suitable escape sequence, I would have mentioned it. – Thomas Dickey Dec 02 '16 at 21:44
Okay, I see that Linux doesn't use PC's native format due to its incompatible non-PC legacy, and, apparently, no one cares enough to suggest changing that. Despite having been developed for PC in the first place - that's what I cannot believe. How crazy one must be to throw that away without an adequate alternative?! Perhaps because since everything is open-source, the ability to readably and accurately represent unknown binary data isn't so critical here. – ivan_pozdeev Dec 02 '16 at 22:34
The canonical "concise" format appears to be what less and vim use: ^<character>. Now, all that leaves is "implementation in common system libraries" - which library functionality encodes/displays characters like that? Is highlighting them in blue (vim) or white background (less) a part of it or their private logic? – ivan_pozdeev Dec 02 '16 at 22:41
This is sounding like a new question. The original question was how to get the terminal to treat all characters as literals. Applications doing their own conversion is a different (and simpler) question. – Thomas Dickey Dec 02 '16 at 22:49
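
The ^<character> notation mentioned in the comments above can be produced by the application itself before anything reaches the terminal. The following is only a minimal sketch, assuming the conventions of cat -v (^X for control bytes, ^? for DEL, an M- prefix for bytes with the high bit set); the function name caret_notation is hypothetical, and unlike cat -v this version also converts tab and newline. It does not attempt the colouring that vim or less apply.

```python
def caret_notation(data: bytes) -> str:
    """Render bytes in caret/meta notation (hypothetical helper)."""
    out = []
    for b in data:
        prefix = ""
        if b >= 0x80:
            # High-bit bytes get an "M-" prefix and are then treated
            # like their 7-bit counterpart.
            prefix = "M-"
            b -= 0x80
        if b < 0x20:
            # Control byte: ^@ for 0x00, ^A for 0x01, ... ^_ for 0x1F.
            out.append(f"{prefix}^{chr(b + 0x40)}")
        elif b == 0x7F:
            out.append(f"{prefix}^?")
        else:
            out.append(prefix + chr(b))
    return "".join(out)


if __name__ == "__main__":
    print(caret_notation(b"\x1b[0mplain text\r\n"))
```

Note that this output is variable-length, so it does not preserve the one-byte-one-character property the question asks for.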

score 0 · Answer 2 · edited May 23 '17 at 12:40

I'm not aware of any way to change the terminal to accept all characters. The control characters are a feature of the terminal, and it's usually the duty of the program to pay attention to the terminal type, produce the right control characters for terminal features it wants to use, and escape any control characters it wants to print.

Information on how to change a Python program to do this can be found, for example, in this Stack Overflow question.

In the shell, you can use e.g. tr to convert control characters to other ASCII characters (though not Unicode characters). See this question for alternatives that can use Unicode characters.
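
Since the question asks primarily for a programmatic way, here is a rough Python analogue of the tr approach, using bytes.maketrans and bytes.translate. It is a sketch under assumptions: the replacement character "." and the name mask_controls are arbitrary choices, and bytes 0x80-0xFF are passed through untouched, which may not be safe on a terminal that interprets 8-bit control codes.

```python
# All C0 control bytes plus DEL.
CONTROL_BYTES = bytes(range(0x20)) + b"\x7f"

# Translation table mapping each control byte to "." (arbitrary choice).
_MASK_TABLE = bytes.maketrans(CONTROL_BYTES, b"." * len(CONTROL_BYTES))


def mask_controls(data: bytes) -> bytes:
    """Replace control bytes with '.', keeping one output byte per input byte."""
    return data.translate(_MASK_TABLE)


if __name__ == "__main__":
    import sys
    sys.stdout.buffer.write(mask_controls(b"a\x1b[0mb\r\n") + b"\n")
```

This keeps the one-byte-per-character layout, at the cost of no longer distinguishing individual control values.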

Unicode defines a Control Pictures block to display control characters; for example, carriage return is ␍. I've never heard about ♪ representing CR, and if it does, that's purely accidental in some font you happen to use.
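
As an illustration of the Control Pictures idea, the sketch below maps each C0 control byte to the code point U+2400 plus its value, and DEL to U+2421. The function name control_pictures is hypothetical, and showing bytes 0x80-0xFF as "." is an assumption made only to keep the example short. It requires a terminal and font that can display these Unicode characters.

```python
def control_pictures(data: bytes) -> str:
    """Show control bytes as Unicode Control Pictures, one character per byte."""
    out = []
    for b in data:
        if b < 0x20:
            # U+2400..U+241F are the pictures for NUL..US.
            out.append(chr(0x2400 + b))
        elif b == 0x7F:
            # U+2421 is the picture for DEL.
            out.append("\u2421")
        elif b < 0x7F:
            out.append(chr(b))
        else:
            # Assumption: high bytes are shown as a placeholder.
            out.append(".")
    return "".join(out)


if __name__ == "__main__":
    print(control_pictures(b"line one\r\nline two\x00\x7f"))
```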

And in the shell, you wouldn't just cat a binary file, but use hexdump -C or similar to examine it.

answered Nov 30 '16 at 07:32 by dirkt (32,309) · edited May 23 '17 at 12:40 by Community

Should mention cat -v (assuming that you have that option). – icarus Nov 30 '16 at 07:41
@icarus: I actually didn't know about cat -v. Yes, that's also an option. – dirkt Nov 30 '16 at 07:47
Really, I didn't think there's anyone out there who didn't see the signature "binary data" printout at least once to recognize it in description... Clarified the question. – ivan_pozdeev Nov 30 '16 at 11:33
The problem is that there's no "standard way" to do "binary data printout". ASCII control characters are called "control characters" for a reason: They don't have any standard printable representation. If your font happens to have one, that's an accident. Same goes for the characters with high bit set, that's highly dependent on the encoding (and funny things happen with utf8). So you first have to explain which characters you'd actually like to see for those values, in which encoding. Then you can try to add some conversion to actually print them (possibly via utf8). – dirkt Nov 30 '16 at 11:54
"ASCII control characters are called "control characters" for a reason: They don't have any standard printable representation" - oh, really? Built into and the same across all PCs in the world and throughout history - that's standard enough for me. – ivan_pozdeev Nov 30 '16 at 11:58
So you mean code page 437? That's supported by iconv to some degree, so you can convert the output to utf8, say, and properly display it. Code page 437 support under Linux isn't particularly good. – dirkt Nov 30 '16 at 15:02
Well, no such luck: python -c '"\1\2\3\4\5".encode("utf-8")' -> '\x01\x02\x03\x04\x05'. Control characters in UTF-8 are still control characters, and the terminal won't display them as is. – ivan_pozdeev Nov 30 '16 at 15:22
Well, of course that won't work. You must specify that you want to convert characters encoded using code page 437 into UTF-8, and you must also find some tool that does convert the control characters (iconv only converts the high-bit-set characters, for example, I just checked). But at least now we finally know what you want, the Wikipedia page gives you the corresponding Unicode code points, and you can write your own translation if it's really important to you. – dirkt Nov 30 '16 at 15:28
If all else fails, a little C program with the codes from the wikipedia page should take maybe 10 mins max to write. – dirkt Nov 30 '16 at 15:28
All cp437 glyphs are present in the PC hardware font, and the Linux terminal uses it by default. I wanted to hear about vt100 capabilities or something that would allow printing any character and leaving it to hardware or whatever to represent it. Okay, looks like you're in the dark about these matters and cannot help me here. – ivan_pozdeev Nov 30 '16 at 15:37
Linux stopped using the PC hardware font ages ago (when it switched to the framebuffer for the virtual consoles). Also, Linux terminals in any reasonably modern distribution use UTF-8, unless you explicitly set it otherwise (which you probably did). And no, I'm not aware of any vt100 capabilities which would completely disable control character processing, and I don't think they exist (because it's difficult to turn it back on if all characters are printable). – dirkt Nov 30 '16 at 15:56
However, what you can do is configure your system to use UTF-8 everywhere (like everyone else), and then treat your binary output as if it were encoded in code page 437, and convert it to Unicode. That will make all characters printable, including control characters, because they are now all UTF-8, so you'll get the effect that you want (CR is ♪, etc.). If you don't want that, be my guest. :-) – dirkt Nov 30 '16 at 15:59
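
The conversion suggested in the last comment (treat the bytes as code page 437 and print the result as Unicode) could look roughly like the sketch below. Python's built-in cp437 codec decodes bytes 0x00-0x1F and 0x7F to the corresponding control characters, much as iconv was reported to do above, so the graphical glyphs for that range are substituted by hand using the table from the Wikipedia page on code page 437. The names cp437_picture and CP437_LOW_GLYPHS are hypothetical, and rendering NUL as a space is an assumption. A UTF-8 terminal with a font covering these characters is assumed.

```python
# Graphical glyphs that code page 437 assigns to bytes 0x01..0x1F.
CP437_LOW_GLYPHS = "☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼"

# Map control code points (as produced by the cp437 codec) to glyphs.
_GLYPH_TABLE = {i + 1: ch for i, ch in enumerate(CP437_LOW_GLYPHS)}
_GLYPH_TABLE[0x00] = " "   # assumption: NUL is shown as a blank cell
_GLYPH_TABLE[0x7F] = "⌂"   # glyph for byte 0x7F in code page 437


def cp437_picture(data: bytes) -> str:
    """Render each byte as its code page 437 glyph, one character per byte."""
    # The codec handles 0x20..0x7E and 0x80..0xFF; translate() replaces
    # the remaining control code points with their graphical glyphs.
    return data.decode("cp437").translate(_GLYPH_TABLE)


if __name__ == "__main__":
    print(cp437_picture(b"\x01\x02\x03\x04\x05 end of line\r\n"))
```

The output keeps one character per input byte, and no control character reaches the terminal.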
