A large (and growing) number of utilities on Unix-like systems seem to be choosing JSON for data interchange¹, even though JSON strings cannot directly represent arbitrary file paths, process names, command-line arguments and, more generally, C strings², which may contain text encoded in a variety of charsets, or may not be meant to be text at all.
For instance, many util-linux, Linux LVM and systemd utilities, curl, GNU parallel, ripgrep, sqlite3, tree, many FreeBSD utilities with their --libxo=json option... can output data in JSON format, which can then allegedly be parsed programmatically "reliably".
But if some of the strings they're meant to output (like file names) contain text not encoded in UTF-8, that all seems to fall apart.
I see different types of behaviour across utilities in that case:
- those that transform the bytes that can't be decoded, either by replacing them with a replacement character such as ? (like exiftool) or U+FFFD (�), or by using some form of encoding, sometimes in a non-reversible way ("\\x80" for instance in column)
- those that switch to a different representation, like from "json-string" to an array of bytes [65, 234] in journalctl, or from {"text":"foo"} to {"bytes":"base64-encoded"} in rg
- those that handle it in a bogus way, like curl
- and a great majority that just dump those bytes that don't make up valid UTF-8 as-is, that is with JSON strings containing invalid UTF-8.
Most util-linux utilities are in the last category. For example, with lsfd:
$ sh -c 'lsfd -Joname -p "$$" --filter "(ASSOC == \"3\")"' 3> $'\x80' | sed -n l
{$
"lsfd": [$
{$
"name": "/home/chazelas/tmp/\200"$
}$
]$
}$
That means they output invalid UTF-8, and therefore invalid JSON.
Now, though strictly invalid, that output is still unambiguous and could in theory be post-processed³.
However, I've checked a lot of the JSON processing utilities and none of them were able to process that. They either:
- error out with a decoding error
- replace those bytes with U+FFFD
- fail in some miserable way or another
I feel like I'm missing something. Surely when that format was chosen, that must have been taken into account?
TL;DR
So my questions are:
- does that JSON format with strings not properly UTF-8 encoded (with some byte values >= 0x80 that don't form part of valid UTF-8-encoded characters) have a name?
- Are there any tools or programming language modules (preferably perl, but I'm open to others) that can process that format reliably?
- Or can that format be converted to/from valid JSON so it can be processed by JSON processing utilities such as jq, json_xs, mlr... preferably in a way that preserves valid JSON strings and without losing information? (See the sketch below for the kind of round-trip I have in mind.)
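For reference, here's a minimal sketch of such a conversion in perl, under the assumption that the input is otherwise well-formed JSON whose only problem is raw non-UTF-8 bytes inside strings, and that it contains no \uXXXX escapes above U+00FF (lsfd.json being a hypothetical file holding the lsfd output above): decoding the whole input as latin1 maps every byte to the code point of the same value, which makes it valid text that a JSON parser accepts, and encoding individual strings back to latin1 recovers the original bytes.
$ perl -MJSON::PP -MEncode -0777 -ne '
    # treat each input byte as the code point of the same value
    my $data = JSON::PP->new->decode(decode("latin1", $_));
    # strings now have one character per original byte; encoding
    # back to latin1 recovers the raw bytes
    print encode("latin1", $data->{lsfd}[0]{name}), "\n"
  ' lsfd.json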
Additional info
Below is the state of my own investigations, just supporting data you might find useful. It's a quick dump: commands are in zsh syntax and were run on a Debian unstable system (and FreeBSD 12.4-RELEASE-p5 for some). Sorry for the mess.
lsfd (and most util-linux utilities): outputs raw:
$ sh -c 'lsfd -Joname -p "$$" --filter "(ASSOC == \"3\")"' 3> $'\x80' | sed -n l
{$
"lsfd": [$
{$
"name": "/home/chazelas/\200"$
}$
]$
}$
column: Escapes ambiguously:
$ printf '%s\n' $'St\351phane' 'St\xe9phane' $'a\0b' | column -JC name=firstname
{
"table": [
{
"firstname": "St\\xe9phane"
},{
"firstname": "St\\xe9phane"
},{
"firstname": "a"
}
]
}
Switching to a locale using latin1 (or any single-byte charset covering the whole byte range) helps to get a raw format instead:
$ printf '%s\n' $'St\351phane' $'St\ue9phane' | LC_ALL=C.iso88591 column -JC name=firstname | sed -n l
{$
"table": [$
{$
"firstname": "St\351phane"$
},{$
"firstname": "St\303\251phane"$
}$
]$
}$
journalctl: array of bytes:
$ logger $'St\xe9phane'
$ journalctl -r -o json | jq 'select(._COMM == "logger").MESSAGE'
[
83,
116,
233,
112,
104,
97,
110,
101
]
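Reassembling the original byte string from that array-of-bytes form is at least straightforward, for instance in perl with pack (a sketch, using the byte values from the output above):
$ perl -e 'print pack "C*", 83, 116, 233, 112, 104, 97, 110, 101' | sed -n l
St\351phane$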
curl: bogus
$ printf '%s\r\n' 'HTTP/1.0 200' $'Test: St\xe9phane' '' | socat -u - tcp-listen:8000,reuseaddr &
$ curl -w '%{header_json}' http://localhost:8000
{"test":["St\uffffffe9phane"]
}
Could have made sense with \U, except Unicode is now restricted to code points up to \U0010FFFF only.
cvtsudoers: raw
$ printf 'Defaults secure_path="/home/St\351phane/bin"' | cvtsudoers -f json | sed -n l
{$
"Defaults": [$
{$
"Options": [$
{ "secure_path": "/home/St\351phane/bin" }$
]$
}$
]$
}$
dmesg: raw
$ printf 'St\351phane\n' | sudo tee /dev/kmsg
$ sudo dmesg -J | sed -n /phane/l
"msg": "St\351phane"$
iproute2: raw and buggy
For ip link at least, even control characters 0x1 .. 0x1f (only some of which are not allowed in interface names) are output raw, which is invalid in JSON.
$ ifname=$'\1\xe9'
$ sudo ip link add name $ifname type dummy
$ sudo ip link add name $ifname type dummy
(added twice! The first time it got renamed to __).
$ ip l
[...]
14: __: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 12:22:77:40:6f:8c brd ff:ff:ff:ff:ff:ff
15: �: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 12:22:77:40:6f:8c brd ff:ff:ff:ff:ff:ff
$ ip -j l | sed -n l
[...]
dcast":"ff:ff:ff:ff:ff:ff"},{"ifindex":14,"ifname":"__","flags":["BRO\
ADCAST","NOARP"],"mtu":1500,"qdisc":"noop","operstate":"DOWN","linkmo\
de":"DEFAULT","group":"default","txqlen":1000,"link_type":"ether","ad\
dress":"12:22:77:40:6f:8c","broadcast":"ff:ff:ff:ff:ff:ff"},{"ifindex\
":15,"ifname":"\001\351","flags":["BROADCAST","NOARP"],"mtu":1500,"qd\
isc":"noop","operstate":"DOWN","linkmode":"DEFAULT","group":"default"\
,"txqlen":1000,"link_type":"ether","address":"12:22:77:40:6f:8c","bro\
adcast":"ff:ff:ff:ff:ff:ff"}]$
$ ip -V
ip utility, iproute2-6.5.0, libbpf 1.2.2
exiftool: changes bytes to ?
$ exiftool -j $'St\xe9phane.txt'
[{
"SourceFile": "St?phane.txt",
"ExifToolVersion": 12.65,
"FileName": "St?phane.txt",
"Directory": ".",
"FileSize": "0 bytes",
"FileModifyDate": "2023:09:30 10:04:21+01:00",
"FileAccessDate": "2023:09:30 10:04:26+01:00",
"FileInodeChangeDate": "2023:09:30 10:04:21+01:00",
"FilePermissions": "-rw-r--r--",
"Error": "File is empty"
}]
lsar: for tar, interprets byte values as if they were Unicode code points:
$ tar cf f.tar $'St\xe9phane.txt' $'St\ue9phane.txt'
$ lsar --json f.tar| grep FileNa
"XADFileName": "Stéphane.txt",
"XADFileName": "Stéphane.txt",
For zip: URI-encoding
$ bsdtar --format=zip -cf a.zip St$'\351'phane.txt Stéphane.txt
$ lsar --json a.zip | grep FileNa
"XADFileName": "St%e9phane.txt",
"XADFileName": "Stéphane.txt",
lsipc: raw
$ ln -s /usr/lib/firefox-esr/firefox-esr $'St\xe9phane'
$ ./$'St\xe9phane' -new-instance
$ lsipc -mJ | grep -a phane | sed -n l
"command": "./St\351phane -new-instance"$
"command": "./St\351phane -new-instance"$
GNU parallel: raw
$ parallel --results -.json echo {} ::: $'\xe9' | sed -n l
{ "Seq": 1, "Host": ":", "Starttime": 1696068481.231, "JobRuntime": 0\
.001, "Send": 0, "Receive": 2, "Exitval": 0, "Signal": 0, "Command": \
"echo '\351'", "V": [ "\351" ], "Stdout": "\351\\u000a", "Stderr": ""\
}$
rg: switches from "text":"..." to "bytes":"base64..."
$ echo $'St\ue9phane' | rg --json '.*'
{"type":"begin","data":{"path":{"text":"<stdin>"}}}
{"type":"match","data":{"path":{"text":"<stdin>"},"lines":{"text":"Stéphane\n"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"Stéphane"},"start":0,"end":9}]}}
{"type":"end","data":{"path":{"text":"<stdin>"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":137546,"human":"0.000138s"},"searches":1,"searches_with_match":1,"bytes_searched":10,"bytes_printed":235,"matched_lines":1,"matches":1}}}
{"data":{"elapsed_total":{"human":"0.002445s","nanos":2445402,"secs":0},"stats":{"bytes_printed":235,"bytes_searched":10,"elapsed":{"human":"0.000138s","nanos":137546,"secs":0},"matched_lines":1,"matches":1,"searches":1,"searches_with_match":1}},"type":"summary"}
$ echo $'St\xe9phane' | LC_ALL=C rg --json '.*'
{"type":"begin","data":{"path":{"text":"<stdin>"}}}
{"type":"match","data":{"path":{"text":"<stdin>"},"lines":{"bytes":"U3TpcGhhbmUK"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"St"},"start":0,"end":2},{"match":{"text":"phane"},"start":3,"end":8}]}}
{"type":"end","data":{"path":{"text":"<stdin>"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":121361,"human":"0.000121s"},"searches":1,"searches_with_match":1,"bytes_searched":9,"bytes_printed":275,"matched_lines":1,"matches":2}}}
{"data":{"elapsed_total":{"human":"0.002471s","nanos":2471435,"secs":0},"stats":{"bytes_printed":275,"bytes_searched":9,"elapsed":{"human":"0.000121s","nanos":121361,"secs":0},"matched_lines":1,"matches":2,"searches":1,"searches_with_match":1}},"type":"summary"}
Interesting "x-user-defined" encoding:
$ echo $'St\xe9\xeaphane' | rg -E x-user-defined --json '.*' | jq -a .data.lines.text
null
"St\uf7e9\uf7eaphane\n"
null
null
With characters in the private-use area for the non-ASCII bytes. See https://www.w3.org/International/docs/encoding/#x-user-defined
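That mapping (byte 0x80+n ↔ code point U+F780+n, per the WHATWG encoding spec) is reversible with simple arithmetic. A sketch in perl, mapping the \uf7e9 and \uf7ea from the output above back to the original bytes:
$ perl -e '
    my $s = "St\x{f7e9}\x{f7ea}phane\n";
    # undo the x-user-defined mapping: U+F780+n -> byte 0x80+n
    $s =~ s/([\x{f780}-\x{f7ff}])/chr(ord($1) - 0xf780 + 0x80)/ge;
    print $s' | sed -n l
St\351\352phane$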
sqlite3: raw
$ sqlite3 -json a.sqlite3 'select * from a' | sed -n l
[{"a":"a"},$
{"a":"\351"}]$
tree: raw
$ tree -J | sed -n l
[$
{"type":"directory","name":".","contents":[$
{"type":"file","name":"\355\240\200\355\260\200"},$
{"type":"file","name":"a.zip"},$
{"type":"file","name":"f.tar"},$
{"type":"file","name":"St\303\251phane.txt"},$
{"type":"link","name":"St\351phane","target":"/usr/lib/firefox-es\
r/firefox-esr"},$
{"type":"file","name":"St\351phane.txt"}$
]}$
,$
{"type":"report","directories":1,"files":6}$
]$
lslocks: raw
$ lslocks --json | sed -n /phane/l
"path": "/home/chazelas/1/St\351phane.txt"$
@raf's rawhide: raw
$ rh -j | sed -n l
[...]
{"path":"./St\351phane", "name":"St\351phane", "start":".", "depth":1\
[...]
FreeBSD ps --libxo=json: escape:
$ sh -c 'sleep 1000; exit' $'\xe9' &
$ ps --libxo=json -o args -p $!
{"process-information": {"process": [{"arguments":"sh -c sleep 1000; exit \\M-i"}]}
}
$ sh -c 'sleep 1000; exit' '\M-i' &
$ ps --libxo=json -o args -p $!
{"process-information": {"process": [{"arguments":"sh -c sleep 1000; exit \\\\M-i"}]}
}
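That \M- notation looks like vis(3)-style encoding (assumption: \M-X stands for the byte 0x80 | ord("X"), i.e. X with the 8th bit set), and since literal backslashes are doubled it appears reversible. A rough decoding sketch in perl, ignoring the other vis escapes:
$ perl -e '
    my $s = "sh -c sleep 1000; exit \\M-i";  # as extracted from the JSON
    # undo meta escapes: \M-X -> byte (0x80 | ord "X")
    $s =~ s/\\M-(.)/chr(0x80 | ord $1)/ge;
    print "$s\n"' | sed -n l
sh -c sleep 1000; exit \351$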
FreeBSD wc --libxo=json: raw
$ wc --libxo=json $'\xe9' | LC_ALL=C sed -n l
{"wc": {"file": [{"lines":10,"words":10,"characters":21,"filename":"\351"}]}$
}$
See also that bug report about sesutil map --libxo, where both reporter and developers expect the output to be UTF-8, and that discussion introducing libxo, where the question of encoding was discussed but with no real conclusion.
JSON processing tools
jsesc: accepts but transforms to U+FFFD
$ jsesc -j $'\xe9'
"\uFFFD"
jq: accepts, transforms to U+FFFD but bogus:
$ print '"a\351b"' | jq -a .
"a\ufffd"
$ print '"a\351bc"' | jq -a .
"a\ufffdbc"
gojq: same without the bug
$ echo '"\xe9ab"' | gojq -j . | uconv -x hex
\uFFFD\u0061\u0062
json_pp: accepts, transforms to U+FFFD
$ print '"a\351b"' | json_pp -json_opt ascii,pretty
"a\ufffdb"
json_xs: same
$ print '"a\351b"' | json_xs | uconv -x hex
\u0022\u0061\uFFFD\u0062\u0022\u000A
Same with -e:
$ print '"\351"' | PERL_UNICODE= json_xs -t none -e 'printf "%x\n", ord($_)'
fffd
jshon: error
$ printf '{"file":"St\351phane"}' | jshon -e file -u
json read error: line 1 column 11: unable to decode byte 0xe9 near '"St'
json5: accepts, transforms to U+FFFD
$ echo '"\xe9"' | json5 | uconv -x hex
\u0022\uFFFD\u0022
jc: error
$ echo 'St\xe9phane' | jc --ls
jc: Error - ls parser could not parse the input data.
If this is the correct parser, try setting the locale to C (LC_ALL=C).
For details use the -d or -dd option. Use "jc -h --ls" for help.
mlr: accepts, converts to U+FFFD
$ echo '{"f":"St\xe9phane"}' | mlr --json cat | sed -n l
[$
{$
"f": "St\357\277\275phane"$
}$
]$
vd: error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte
JSON::Parse: error
$ echo '"\xe9"'| perl -MJSON::Parse=parse_json -l -0777 -ne 'print parse_json($_)'
JSON error at line 1, byte 3/4: Unexpected character '"' parsing string starting from byte 1: expecting bytes in range 80-bf: 'x80-\xbf' at -e line 1, <> chunk 1.
jo: error
$ echo '\xe9' | jo -a
jo: json.c:1209: emit_string: Assertion `utf8_validate(str)' failed.
zsh: done echo '\xe9' |
zsh: IOT instruction jo -a
Can use base64:
$ echo '\xe9' | jo a=@-
jo: json.c:1209: emit_string: Assertion `utf8_validate(str)' failed.
zsh: done echo '\xe9' |
zsh: IOT instruction jo a=@-
$ echo '\xe9' | jo a=%-
{"a":"6Qo="}
jsed: accepts and transforms to U+FFFD
$ echo '{"a":"\xe9"}' | ./jsed get --path a | uconv -x hex
\uFFFD%
¹ See zgrep -li json ${(s[:])^"$(man -w)"}/man[18]*/*(N) for a list of commands that may be processing JSON.
² And, conversely, C strings cannot represent arbitrary JSON strings, as C strings, contrary to JSON strings, cannot contain NULs.
³ Though its handling could become problematic, as concatenating two such strings could end up forming valid characters and break some assumptions (for instance, a string ending in the lone byte 0xc3 followed by one starting with byte 0xa9 would form the valid UTF-8 encoding of é).