
A large (and growing) number of utilities on Unix-like systems seem to be choosing the JSON format for data interchange¹, even though JSON strings cannot directly represent arbitrary file paths, process names, command line arguments and more generally C strings² which may contain text encoded in a variety of charsets or not meant to be text at all.

For instance, many util-linux, Linux LVM, systemd utilities, curl, GNU parallel, ripgrep, sqlite3, tree, many FreeBSD utilities with their --libxo=json option ... can output data in JSON format which can then allegedly be parsed programmatically "reliably".

But if some of the strings they're meant to output (like file names) contain text not encoded in UTF-8, that seems to all fall apart.

I see different types of behaviour across utilities in that case:

  • those that transform the bytes that can't be decoded, either by replacing them with a replacement character such as ? (like exiftool) or U+FFFD (�), or by using some form of encoding, sometimes in a non-reversible way ("\\x80" for instance in column)
  • those that switch to a different representation, like from a "json-string" to a [65, 234] array of byte values in journalctl, or from {"text":"foo"} to {"bytes":"base64-encoded"} in rg.
  • those that handle it in a bogus way like curl
  • and a great majority that just dump those bytes that don't make up valid UTF-8 as-is, that is with JSON strings containing invalid UTF-8.

Most util-linux utilities are in the last category. For example, with lsfd:

$ sh -c 'lsfd -Joname -p "$$" --filter "(ASSOC == \"3\")"' 3> $'\x80' | sed -n l
{$
   "lsfd": [$
      {$
         "name": "/home/chazelas/tmp/\200"$
      }$
   ]$
}$

That means they output invalid UTF-8, and therefore invalid JSON.

Now, though strictly invalid, that output is still unambiguous and could in theory be post-processed³.

However, I've checked a lot of the JSON processing utilities and none of them were able to process that. They either:

  • error out with a decoding error
  • replace those bytes with U+FFFD
  • fail in some miserable way or another

I feel like I'm missing something. Surely when that format was chosen, that must have been taken into account?

TL;DR

So my questions are:

  • does that JSON format with strings not properly UTF-8 encoded (with some byte values >= 0x80 that don't form part of valid UTF-8-encoded characters) have a name?
  • Are there any tools or programming language modules (preferably perl, but I'm open to others) that can process that format reliably?
  • Or can that format be converted to/from valid JSON so it can be processed by JSON processing utilities such as jq, json_xs, mlr... Preferably in a way that preserves valid JSON strings and without losing information?

Additional info

Below is the state of my own investigations, just supporting data you might find useful. It's a quick dump: commands are in zsh syntax and were run on a Debian unstable system (and FreeBSD 12.4-RELEASE-p5 for some). Sorry for the mess.

lsfd (and most util-linux utilities): outputs raw:

$ sh -c 'lsfd -Joname -p "$$" --filter "(ASSOC == \"3\")"' 3> $'\x80' | sed -n l
{$
   "lsfd": [$
      {$
         "name": "/home/chazelas/\200"$
      }$
   ]$
}$

column: Escapes ambiguously:

$ printf '%s\n' $'St\351phane' 'St\xe9phane' $'a\0b' | column -JC name=firstname
{
   "table": [
      {
         "firstname": "St\\xe9phane"
      },{
         "firstname": "St\\xe9phane"
      },{
         "firstname": "a"
      }
   ]
}

Switching to a locale using latin1 (or any single-byte charset covering the whole byte range) helps to get a raw format instead:

$ printf '%s\n' $'St\351phane' $'St\ue9phane' | LC_ALL=C.iso88591 column -JC name=firstname  | sed -n l
{$
   "table": [$
      {$
         "firstname": "St\351phane"$
      },{$
         "firstname": "St\303\251phane"$
      }$
   ]$
}$

journalctl: array of bytes:

$ logger $'St\xe9phane'
$ journalctl -r -o json | jq 'select(._COMM == "logger").MESSAGE'
[
  83,
  116,
  233,
  112,
  104,
  97,
  110,
  101
]

curl: bogus

$ printf '%s\r\n' 'HTTP/1.0 200' $'Test: St\xe9phane' '' |  socat -u - tcp-listen:8000,reuseaddr &
$ curl -w '%{header_json}' http://localhost:8000
{"test":["St\uffffffe9phane"]
}

That \uffffffe9 could have made sense as a \U escape, except that Unicode is now restricted to code points up to \U0010FFFF only.


cvtsudoers: raw

$ printf 'Defaults secure_path="/home/St\351phane/bin"' | cvtsudoers -f json  | sed -n l
{$
    "Defaults": [$
        {$
            "Options": [$
                { "secure_path": "/home/St\351phane/bin" }$
            ]$
        }$
    ]$
}$

dmesg: raw

$ printf 'St\351phane\n' | sudo tee /dev/kmsg
$ sudo dmesg -J | sed -n /phane/l
         "msg": "St\351phane"$

iproute2: raw and buggy

For ip link at least, even control characters 0x1 .. 0x1f (only some of which are not allowed in interface names) are output raw, which is invalid in JSON.

$ ifname=$'\1\xe9'
$ sudo ip link add name $ifname type dummy
$ sudo ip link add name $ifname type dummy

(added twice! First time got renamed to __).

$ ip l
[...]
14: __: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 12:22:77:40:6f:8c brd ff:ff:ff:ff:ff:ff
15: �: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 12:22:77:40:6f:8c brd ff:ff:ff:ff:ff:ff
$ ip -j l | sed -n l
[...]
dcast":"ff:ff:ff:ff:ff:ff"},{"ifindex":14,"ifname":"__","flags":["BRO\
ADCAST","NOARP"],"mtu":1500,"qdisc":"noop","operstate":"DOWN","linkmo\
de":"DEFAULT","group":"default","txqlen":1000,"link_type":"ether","ad\
dress":"12:22:77:40:6f:8c","broadcast":"ff:ff:ff:ff:ff:ff"},{"ifindex\
":15,"ifname":"\001\351","flags":["BROADCAST","NOARP"],"mtu":1500,"qd\
isc":"noop","operstate":"DOWN","linkmode":"DEFAULT","group":"default"\
,"txqlen":1000,"link_type":"ether","address":"12:22:77:40:6f:8c","bro\
adcast":"ff:ff:ff:ff:ff:ff"}]$
$ ip -V
ip utility, iproute2-6.5.0, libbpf 1.2.2

exiftool: changes bytes to ?

$ exiftool -j $'St\xe9phane.txt'
[{
  "SourceFile": "St?phane.txt",
  "ExifToolVersion": 12.65,
  "FileName": "St?phane.txt",
  "Directory": ".",
  "FileSize": "0 bytes",
  "FileModifyDate": "2023:09:30 10:04:21+01:00",
  "FileAccessDate": "2023:09:30 10:04:26+01:00",
  "FileInodeChangeDate": "2023:09:30 10:04:21+01:00",
  "FilePermissions": "-rw-r--r--",
  "Error": "File is empty"
}]

lsar: for tar, interprets byte values as if they were Unicode code points:

$ tar cf f.tar $'St\xe9phane.txt' $'St\ue9phane.txt'
$ lsar --json f.tar| grep FileNa
      "XADFileName": "Stéphane.txt",
      "XADFileName": "Stéphane.txt",

For zip: URI-encoding

$ bsdtar --format=zip -cf a.zip St$'\351'phane.txt Stéphane.txt
$ lsar --json a.zip | grep FileNa
      "XADFileName": "St%e9phane.txt",
      "XADFileName": "Stéphane.txt",

lsipc: raw

$ ln -s /usr/lib/firefox-esr/firefox-esr $'St\xe9phane'
$ ./$'St\xe9phane' -new-instance
$ lsipc -mJ | grep -a phane | sed -n l
         "command": "./St\351phane -new-instance"$
         "command": "./St\351phane -new-instance"$

GNU parallel: raw

$ parallel --results -.json echo {} ::: $'\xe9' | sed -n l
{ "Seq": 1, "Host": ":", "Starttime": 1696068481.231, "JobRuntime": 0\
.001, "Send": 0, "Receive": 2, "Exitval": 0, "Signal": 0, "Command": \
"echo '\351'", "V": [ "\351" ], "Stdout": "\351\\u000a", "Stderr": ""\
 }$

rg: switches from "text":"..." to "bytes":"base64..."

$ echo $'St\ue9phane' | rg --json '.*'
{"type":"begin","data":{"path":{"text":"<stdin>"}}}
{"type":"match","data":{"path":{"text":"<stdin>"},"lines":{"text":"Stéphane\n"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"Stéphane"},"start":0,"end":9}]}}
{"type":"end","data":{"path":{"text":"<stdin>"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":137546,"human":"0.000138s"},"searches":1,"searches_with_match":1,"bytes_searched":10,"bytes_printed":235,"matched_lines":1,"matches":1}}}
{"data":{"elapsed_total":{"human":"0.002445s","nanos":2445402,"secs":0},"stats":{"bytes_printed":235,"bytes_searched":10,"elapsed":{"human":"0.000138s","nanos":137546,"secs":0},"matched_lines":1,"matches":1,"searches":1,"searches_with_match":1}},"type":"summary"}
$ echo $'St\xe9phane' | LC_ALL=C rg --json '.*'
{"type":"begin","data":{"path":{"text":"<stdin>"}}}
{"type":"match","data":{"path":{"text":"<stdin>"},"lines":{"bytes":"U3TpcGhhbmUK"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"St"},"start":0,"end":2},{"match":{"text":"phane"},"start":3,"end":8}]}}
{"type":"end","data":{"path":{"text":"<stdin>"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":121361,"human":"0.000121s"},"searches":1,"searches_with_match":1,"bytes_searched":9,"bytes_printed":275,"matched_lines":1,"matches":2}}}
{"data":{"elapsed_total":{"human":"0.002471s","nanos":2471435,"secs":0},"stats":{"bytes_printed":275,"bytes_searched":9,"elapsed":{"human":"0.000121s","nanos":121361,"secs":0},"matched_lines":1,"matches":2,"searches":1,"searches_with_match":1}},"type":"summary"}

Interesting "x-user-defined" encoding:

$ echo $'St\xe9\xeaphane' | rg -E x-user-defined --json '.*'  | jq -a .data.lines.text
null
"St\uf7e9\uf7eaphane\n"
null
null

That maps non-ASCII bytes to characters in a private-use area. See https://www.w3.org/International/docs/encoding/#x-user-defined


sqlite3: raw

$ sqlite3 -json a.sqlite3 'select * from a' | sed -n l
[{"a":"a"},$
{"a":"\351"}]$

tree: raw

$ tree -J | sed -n l
[$
  {"type":"directory","name":".","contents":[$
    {"type":"file","name":"\355\240\200\355\260\200"},$
    {"type":"file","name":"a.zip"},$
    {"type":"file","name":"f.tar"},$
    {"type":"file","name":"St\303\251phane.txt"},$
    {"type":"link","name":"St\351phane","target":"/usr/lib/firefox-es\
r/firefox-esr"},$
    {"type":"file","name":"St\351phane.txt"}$
  ]}$
,$
  {"type":"report","directories":1,"files":6}$
]$

lslocks: raw

$ lslocks --json | sed -n /phane/l
         "path": "/home/chazelas/1/St\351phane.txt"$

@raf's rawhide: raw

$ rh -j | sed -n l
[...]
{"path":"./St\351phane", "name":"St\351phane", "start":".", "depth":1\
[...]

FreeBSD ps --libxo=json: escapes:

$ sh -c 'sleep 1000; exit' $'\xe9' &
$ ps --libxo=json -o args -p $!
{"process-information": {"process": [{"arguments":"sh -c sleep 1000; exit \\M-i"}]}
}
$ sh -c 'sleep 1000; exit' '\M-i' &
$ ps --libxo=json -o args -p $!
{"process-information": {"process": [{"arguments":"sh -c sleep 1000; exit \\\\M-i"}]}
}

FreeBSD wc --libxo=json: raw

$ wc --libxo=json  $'\xe9' | LC_ALL=C sed -n l
{"wc": {"file": [{"lines":10,"words":10,"characters":21,"filename":"\351"}]}$
}$

See also that bug report about sesutil map --libxo where both reporter and developers expect the output to be UTF-8, and that discussion introducing libxo where the question of encoding was raised but with no real conclusion.


JSON processing tools

jsesc: accepts but transforms to U+FFFD

$ jsesc  -j $'\xe9'
"\uFFFD"

jq: accepts, transforms to U+FFFD but bogus:

$ print '"a\351b"' | jq -a .
"a\ufffd"
$ print '"a\351bc"' | jq -a .
"a\ufffdbc"

gojq: same without the bug

$ echo '"\xe9ab"' | gojq -j . | uconv -x hex
\uFFFD\u0061\u0062

json_pp: accepts, transforms to U+FFFD

$ print '"a\351b"' | json_pp -json_opt ascii,pretty
"a\ufffdb"

json_xs: same

$ print '"a\351b"' | json_xs | uconv -x hex
\u0022\u0061\uFFFD\u0062\u0022\u000A

Same with -e:

$ print '"\351"' | PERL_UNICODE= json_xs -t none -e 'printf "%x\n", ord($_)'
fffd

jshon: error

$ printf '{"file":"St\351phane"}' | jshon -e file -u
json read error: line 1 column 11: unable to decode byte 0xe9 near '"St'

json5: accepts, transforms to U+FFFD

$ echo '"\xe9"' | json5 | uconv -x hex
\u0022\uFFFD\u0022

jc: error

$ echo 'St\xe9phane' | jc --ls
jc:  Error - ls parser could not parse the input data.
             If this is the correct parser, try setting the locale to C (LC_ALL=C).
             For details use the -d or -dd option. Use "jc -h --ls" for help.

mlr: accepts, converts to U+FFFD

$ echo '{"f":"St\xe9phane"}' | mlr --json cat | sed -n l
[$
{$
  "f": "St\357\277\275phane"$
}$
]$

vd: error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte

JSON::Parse: error

$ echo '"\xe9"'| perl -MJSON::Parse=parse_json -l -0777 -ne 'print parse_json($_)'
JSON error at line 1, byte 3/4: Unexpected character '"' parsing string starting from byte 1: expecting bytes in range 80-bf: 'x80-\xbf' at -e line 1, <> chunk 1.

jo: error

$ echo '\xe9' | jo -a
jo: json.c:1209: emit_string: Assertion `utf8_validate(str)' failed.
zsh: done             echo '\xe9' |
zsh: IOT instruction  jo -a

Can use base64:

$ echo '\xe9' | jo a=@-
jo: json.c:1209: emit_string: Assertion `utf8_validate(str)' failed.
zsh: done             echo '\xe9' |
zsh: IOT instruction  jo a=@-
$ echo '\xe9' | jo a=%-
{"a":"6Qo="}

jsed: accepts and transforms to U+FFFD

$ echo '{"a":"\xe9"}' | ./jsed get --path a | uconv -x hex
\uFFFD%

¹ See zgrep -li json ${(s[:])^"$(man -w)"}/man[18]*/*(N) for a list of commands that may be processing JSON.

² And C strings cannot represent arbitrary JSON strings either, as C strings, contrary to JSON strings, cannot contain NULs

³ Though its handling could become problematic as concatenating two such strings could end up forming valid characters and break some assumptions.

  • Does this behaviour change if you set LANG=C or to another locale? I've found many tools work "better" in C/POSIX locale. – Stephen Harris Sep 30 '23 at 14:56
  • @StephenHarris, JSON is meant to be in UTF-8 regardless of the locale. There are probably tools that take into account the locale when decoding strings before producing the JSON, but using the C locale, is more likely to harm than help as C locales usually don't specify characters other than the ASCII ones. – Stéphane Chazelas Sep 30 '23 at 14:59
  • @StephenHarris changing the locale is a good point though, one that I've not tested much. For instance, with column, switching to a locale using latin1 as charset is a way to get the same raw (and invalid as JSON) format as with util-linux utilities, instead of the ambiguous non-reversible escaping you get in UTF-8 locales. – Stéphane Chazelas Sep 30 '23 at 15:17

3 Answers


A possible (not fully satisfactory) approach, if one doesn't need to consider any of the strings in the JSON as text, is to pre-process the input to the JSON-processing tool (jq, mlr...) with iconv -f latin1 -t utf-8 and post-process its output with iconv -f utf-8 -t latin1, that is, convert all bytes >= 0x80 to the character with the corresponding Unicode code point, or in other words, consider the input as if it was encoded in latin1.

$ exec 3> $'\x80\xff'
$ ls -ld "$(lsfd -Jp "$$" | jq -r '.lsfd[]|select(.assoc=="3").name')"
ls: cannot access '/home/chazelas/1/��': No such file or directory

Doesn't work because jq transformed those bytes to U+FFFD, but:

$ ls -ld "$(lsfd -Jp "$$" | iconv -fl1  | jq -r '.lsfd[]|select(.assoc=="3").name' | iconv -tl1)"
-rw-r--r-- 1 chazelas chazelas 0 Sep 30 15:51 '/home/chazelas/tmp/'$'\200\377'

Works. Now there are many ways that can fall apart:

  • the length of strings in number of bytes and characters does change in the process, so any check on length you're going to make will likely be inaccurate (though the length in characters of the JSON strings will correspond to the length in bytes of the file name).
  • you need to make sure the JSON processing tool does not escape characters as \uxxxx (don't use -a in jq for instance; see the illustration after this list) or the characters won't be converted back to bytes afterwards
  • the JSON-processing tool must also not produce new strings with characters with codepoints >= 0x80, or if it does, you need to do the double encoding, like jq -r '"Fichier trouvÃ© : " + .file' (with the é pre-encoded as its two UTF-8 bytes seen as latin1) instead of jq -r '"Fichier trouvé : " + .file', if you want them to end up encoded in UTF-8 after going through iconv -f utf-8 -t latin1.
  • Any text-based check or operation such as tests of character class, sorting, etc. won't be valid.
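
As a quick illustration of that second caveat, with a JSON string containing the single 0xe9 byte (assuming GNU iconv and a jq that supports -a / --ascii-output): with -a, the final iconv step has nothing left to convert back to the original byte.

$ printf '"\351"' | iconv -f latin1 -t utf-8 | jq -a . | iconv -f utf-8 -t latin1 | sed -n l
"\\u00e9"$
$ printf '"\351"' | iconv -f latin1 -t utf-8 | jq . | iconv -f utf-8 -t latin1 | sed -n l
"\351"$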

Using the x-user-defined charset as can be used in HTML instead of latin1 would avoid some of those problems, because all the bytes >= 0x80 would be mapped to contiguous characters in the private-use area (so they would not be classified by mistake as alpha/blank and would not be included in some [a-z]/[0-9]... ranges), but AFAICT, none of iconv/uconv/recode support that charset.
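
The mapping is simple enough, though (in the WHATWG definition linked above, bytes 0x80 to 0xff map to U+F780 to U+F7FF), that it can be emulated, for instance in perl. A rough sketch of the idea (file names are just placeholders, and note that, like rg -E x-user-defined, this also remaps the bytes of well-formed UTF-8 text):

# "decode" from x-user-defined to UTF-8: map every byte >= 0x80 to the
# corresponding private-use character
perl -CO -pe 's/([\x80-\xff])/chr(0xF780 + ord($1) - 0x80)/ge' < raw.json > utf8.json
# and back: map U+F780..U+F7FF to the original byte values
perl -CI -pe 's/([\x{F780}-\x{F7FF}])/chr(ord($1) - 0xF780 + 0x80)/ge' < processed.json > raw-again.json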

Using latin1 has the advantage that you can check for byte values based on codepoint. For example, to find the open files whose name contains byte 0x80:

$ lsfd -Jp "$$" | iconv -fl1 -tu8 | jq -r '.lsfd[]|select(.name|contains("\u0080"))' | iconv -fu8 -tl1
{
  "command": "zsh",
  "pid": 8127,
  "user": "chazelas",
  "assoc": "3",
  "mode": "-w-",
  "type": "REG",
  "source": "0:38",
  "mntid": 42,
  "inode": 2501864,
  "name": "/home/chazelas/tmp/��"
}

(�� is how my UTF-8 terminal emulator renders those bytes here; u8 and l1 are abbreviations of UTF-8 and Latin1 aka ISO-8859-1 respectively and may not be supported by all iconv implementations).

You could define a binary helper script in ksh (or any shell supporting the soon-to-be-standard pipefail option, that is most shells except dash) such as:

#! /bin/ksh -
set -o pipefail
iconv -f latin1 -t utf-8 |
  "$@" |
  iconv -f utf-8 -t latin1

And then use things like:

lsfd -J |
  binary jq -j '
    .lsfd[] |
    select(
      .assoc=="1" and
      .type=="REG" and
      (.name|match("[^\u0000-\u007f]"))
    ) | (.name + "\u0000")' |
  LC_ALL=C sort -zu |
  xargs -r0 ls -ldU --

To list the regular files opened on the stdout of any process and whose path contains a byte with the 8th bit set (greater than 0x7f / 127).


In the same vein, the JSON perl module (and its underlying JSON::XS and JSON::PP implementations), with its object-oriented interface, doesn't do the text decoding/encoding by itself; it works on already decoded text. By default, as long as the PERL_UNICODE environment variable is not set, input/output is decoded/encoded in latin1.

Utilities such as json_xs/json_pp that expose those modules as command line tools explicitly decode/encode as UTF-8, but if you use those modules directly, you can skip that step and work in latin1:

$ exec 3> $'\x80\xff'
$ lsfd -Jp "$$" | perl -MJSON -l -0777 -ne '
   $_ = JSON->new->decode($_);
   print $_->{name} for grep {$_->{assoc} == 3} @{$_->{lsfd}}' |
   sed -n l
/home/chazelas/tmp/\200\377$

They even have an explicit latin1 flag, similar to the ascii one, to make sure the JSON they produce, when encoded in latin1, can represent characters outside the U+0000 .. U+00FF range by expressing them as \uxxxx. Without that flag, those characters would end up encoded in UTF-8 with a warning message.
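
For instance, a quick check of that flag (the sample array and string are made up for the example):

$ perl -MJSON -le 'print JSON->new->latin1->encode(["St\x{e9}phane \x{20ac}"])' | sed -n l
["St\351phane \\u20ac"]$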

Using latin1 also makes it relatively easy to deal with journalctl's representation of messages as [1, 2, 3] arrays, where we just need to convert those byte values to the character with the corresponding Unicode codepoint (and when encoded as latin1, you get the right byte back).

Some of the restrictions mentioned above also apply here; it's just that the equivalent of the iconv commands is done internally in perl, or more precisely we go straight from byte to character with the same value, without going through byte-to-UTF-8 and UTF-8-to-character steps.

$ logger $'St\xe9phane'
$ journalctl --since today -o json | perl -MJSON -MData::Dumper -lne '
   BEGIN{$j = JSON->new}
   $j->incr_parse($_);
   while ($obj = $j->incr_parse) {
     $msg = $obj->{MESSAGE};
     # handle array of integer representation
     $msg = join "", map(chr, @$msg) if ref $msg eq "ARRAY";
     print $msg
   }' |
   sed -n '/phane/l'
St\351phane$

Through that lens, we can answer all the questions:

  • What's the name of that format? That's latin1-encoded JSON instead of UTF-8 encoded JSON, or whatever single-byte charset that is a superset of ASCII and has a Unicode mapping covering the whole byte range we decide to use (to interpret the input as and produce output as).

    The advantage of those over UTF-8 is that every byte sequence is valid text in those encodings so they can be used to represent as text any Unix file name, command argument, C string as produced by those utilities.

    The JSON RFC does not strictly forbid using encodings other than UTF-8 as long as it's within a closed ecosystem. It would only be invalid for interoperability. The previous version of the RFC was even more lax about that. If those tools that produce that format properly documented that they did, that could be considered as not a bug.

  • What tool can process that format? Any that can decode/encode JSON in arbitrary charsets and not only UTF-8. As seen above, current versions of jello do. The JSON/JSON::XS/JSON::PP perl modules explicitly support latin1.

  • How to preprocess that format so it can be processed by regular JSON utilities? Pre-process input by recoding from Latin1 (or other single byte charset) to UTF-8 and post-process output by recoding back.

  • The more I research this, the more I think these contexts don't qualify as a "closed ecosystem" as per RFC 8259 8.1. Ultimately the problem is that multiple tools are generating invalid JSON. That's not to say it would be easy for them given the arbitrary binary data they need to encode. It does feel like there ought to be bug reports to the tools generating this JSON. This solution only really works when you are not concerned with the meaning of the data (just passing it through) or when the chosen character set supports it. Latin 1 may be an exception. – Philip Couling Oct 02 '23 at 17:20
  • @PhilipCouling, yes I agree those tools have at least a documentation bug. They should at least point out that they don't produce UTF-8 encoded JSON and that they should be processed by tools/code (IOW, in an ecosystem) that expect the data to be encoded in a single-byte charset. The main problem was the choice of JSON in the first place. – Stéphane Chazelas Oct 03 '23 at 05:11
  • Maybe/maybe not. Most things are headed for Unicode these days, including file paths by way of terminal encoding, and most interchange generally is JSON or its derivative YAML. XML is on the way out (thank god). The mistake was to forget the unhappy path and therefore fail to build in some escape sequence, since the data they encode can have malicious or otherwise undesirable input. – Philip Couling Oct 03 '23 at 07:20
  • FWIW, rawhide (rh) does explicitly document the fact that JSON is only supported in locales that use utf8 encoding. But I'll add a statement that non-utf8 "JSON" can be piped through iconv -t utf-8 but that even that won't help unless every user on the system uses the same charset as the current user. If that's not the case, surely the real solution is for them to start using utf8 :-). It wouldn't be possible to recode possibly multiple unknown charsets into utf8. String-or-byte-array fields sound like the only solution but it damages human readability on single-charset non-utf8 systems. – raf Nov 18 '23 at 01:42
  • @raf, I saw that statement in rh.1 but I found it rather confusing. Here it's not really about the user's locale (the current value of the LC_* variables), but about the fact that file paths are arbitrary arrays of non-0 bytes, even if generally meant to be text which as you say could be encoded in any number of encoding (even different encoding for different directory components of a path). iconv -t utf-8 would only help if all strings happened to be encoded in the user's locale charset. – Stéphane Chazelas Nov 18 '23 at 07:50
  • @raf, maybe something like "File and user names are represented as JSON strings, but in their original encoding (which might not be UTF-8) as it's impossible to determine which it is automatically, you may need to convert into UTF-8 manually if the output is to be consumed by a JSON processor that can only cope with UTF-8 encoded JSON". – Stéphane Chazelas Nov 18 '23 at 07:55
  • The reference to the user's locale is about the extreme likelihood that the user's own files at least will be encoded in the charset of the locale that they use. Thanks for the suggested text. I've already updated the caveat, but this has some more detail I will use. – raf Nov 19 '23 at 22:49

The more I research this topic, the more convinced I am that the behaviour of lsfd etc. is incorrect; RFC 8259 8.1 says:

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8

The fact that you are hitting this as a problem demonstrates that these outputs are not encapsulated in a closed ecosystem, and therefore the JSON text is in violation of RFC 8259.

To my mind it would be good practice here to ensure that bug reports have been opened against respective projects to alert them to the problem. It's then for the project maintainers to decide if and how to deal with the problem.

I think from the project maintainers' point of view this should be solvable: lsfd could honor the LC_CTYPE / LANG environment variables, assume the input is from that locale and translate it to UTF-8.


does that JSON format with strings not properly UTF-8 encoded (with some byte values >= 0x80 that don't form part of valid UTF-8-encoded character) have a name?

Answer: "Broken"

I'm joking, but only slightly. In fact what's happening here is that the JSON is being written in UTF-8, but no check is performed to ensure all input is also UTF-8. So technically what you are seeing is a mix of character sets, not a JSON file encoded in a non-standard character set.

Are there any tools or programming language modules (preferably perl, but I'm open to others) that can process that format reliably?

A few might get pleasing results in specific circumstances, such as specifically processing based on the (false) assumption that the input is entirely LATIN-1. This works because all special characters in JSON are single UTF-8 byte values identical to ASCII character codes lower than 128. Many single-byte character sets have the same meaning for those first 128 byte codes.

But let's be clear here, what we are discussing is processing output that was intended to be UTF-8 but isn't. So working solutions here work mostly by luck and not design! This is akin to "undefined behaviour".


For some character sets there might be workarounds. For those workarounds to succeed, the character set either needs every byte code to be mapped into Unicode, or you need to be confident that unmapped byte codes will never be used in practice. The character set also needs to share single-byte character codes with UTF-8, specifically the characters []{}:""''\.

LATIN-1 is the only one I know that will work, and this only works specifically because Unicode has a special block named Latin-1 Supplement. This allows LATIN-1 to be translated to Unicode by simply copying the byte value as a unicode code point.
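
A quick way to see that the byte value is simply copied as the code point (assuming GNU iconv; UCS-2BE output makes the code points visible):

$ printf '\x80\xe9\xff' | iconv -f latin1 -t UCS-2BE | od -An -tx1
 00 80 00 e9 00 ff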

However the similar cp1252 has gaps which cannot be mapped into unicode, and the solutions break down very quickly.


The way I'd propose dealing with such broken behaviour would be to work with Python 3, which specifically understands the difference between a sequence of bytes and a string intended to represent text.

You can read raw bytes in Python3 and then decode into a string assuming the encoding of your choice:

import sys
import json

data = sys.stdin.buffer.read()
string_data = data.decode("LATIN1")
decoded_structure = json.loads(string_data)

You can then manipulate the JSON mostly with [] operators. E.g. for the JSON with a latin-1 Ç:

{
   "lsfd": [
      {
         "name": "/home/chazelas/tmp/Ç"
      }
   ]
}

You can print the name with:

import sys
import json

data = sys.stdin.buffer.read()
string_data = data.decode("LATIN1")
decoded_structure = json.loads(string_data)
print(decoded_structure["lsfd"][0]["name"].encode("LATIN1"))


This approach also lets you deal with the data as bytes before thinking about it as a string. That's useful when things get really messy, for example when the input is supposed to be encoded in cp1252 but contains bytes that are invalid in cp1252.

import sys
import json

data = sys.stdin.buffer.read()
# 0x90 and 0x9D have no mapping in cp1252, so substitute them with a
# placeholder of your choice (here "?") before decoding
data = data.replace(b'\x90', b'?')
data = data.replace(b'\x9D', b'?')
string_data = data.decode("cp1252")
decoded_structure = json.loads(string_data)
print(decoded_structure["lsfd"][0]["name"].encode("cp1252"))

  • "I think from the project maintainers point of view this should be solvable: lsfd could honor the LC_CTYPE / LANG environment variables, assume the input is from that locale and translate it to UTF-8": that would only work for things that deal with text. Filenames are not text. They can be interpreted as text by users and can be text encoded in different charset by different users in different charsets. With lsfd -Jp "$pid" | jq -j '.lsfd[]|.name'' for instance, I want to get a list of raw file paths whether they're meant to represent text in one or several different charsets or not. – Stéphane Chazelas Oct 03 '23 at 11:46
  • "lsfd could honor the LC_CTYPE / LANG environment variables, assume the input is from that locale and translate it to UTF-8". With the way lsfd behave now, you can get that outcome with lsfd | iconv -t utf-8 – Stéphane Chazelas Oct 03 '23 at 11:50
  • @StéphaneChazelas no, that's not really true at all. You've not considered the issue of a character set having character codes overlapping ASCII values for JSON special characters. I'm talking about pre-processing the input; iconv post-processes the output. That can have a different result. – Philip Couling Oct 03 '23 at 11:53
  • @StéphaneChazelas And I disagree with your characterisation of filenames as "not text". They are generated by humans at keyboards and are by and large there for humans to read. Otherwise we'd just have numerical abstract identifiers on them all. The fact that nobody is preventing you writing invalid byte sequences into the file name isn't proof that they are not text, just the lack of a safeguard. The whole point of what I'm saying here is that, in light of that lack of a safeguard in the kernel, lsof should have its own. – Philip Couling Oct 03 '23 at 11:57
  • In the case of lsfd (not lsof), that's the same, the text that lsfd outputs in its JSON (object key names) is ASCII (not ASKII) only so invariant across locales on a system (if we ignore the bogus ms-kanji still found on some BSDs). The only bytes >= 0x80 it outputs comes from input (process names, file names...) which don't have to be text, so that's the same ones iconv -t utf-8 will recode and that lsfd would recode if it was doing the recoding internally (typically with iconv()). – Stéphane Chazelas Oct 03 '23 at 12:44
  • File names are generally intended to be text, but are not guaranteed to be, let alone text encoded in a given charset by everything and every user on a system. find . -print0 gives a faithful representation of a list of files; a find . -print-json could not if it was printing the list of files as UTF-8-encoded JSON strings. It could if it was printing them as arrays of integer byte values; printing them as [{"encoding":"latin1", "path":"user/st\u00e9phane"},{"encoding":"utf-8","path":"user/st\u00e9phane"}] would fall apart for paths with directory components encoded in different charsets. – Stéphane Chazelas Oct 03 '23 at 13:01
  • Note that treating input/output as latin1-encoded is something perl users are used to, as that's what perl does by default. It can get in the way when processing UTF-8 as text (/\h/ matching on byte 0xa0 comes to mind), but in many other cases it avoids all the problems caused by decoding errors. – Stéphane Chazelas Oct 03 '23 at 13:05
  • byte-wise truncation is a typical case where you can end up with improper UTF-8. Process names on Linux are truncated to 15 bytes, which makes it easy for process names to contain invalid UTF-8 even when that results from executing files with names properly encoded in UTF-8. – Stéphane Chazelas Oct 03 '23 at 13:16
  • Assuming file names are text when the system gives you no such guarantee is a common way to introduce vulnerabilities (see the Bytes vs characters section at Why is looping over find's output bad practice? for instance). Here, I'm questioning the choice of using JSON which kind of forces that assumption. – Stéphane Chazelas Oct 03 '23 at 13:21
  • @StéphaneChazelas no, assuming file names are text isn't the problem. Assuming they are valid text is the problem. It's a subtle, yet desperately important difference. If they are not text, then the "right" approach for tools like lsof would be to base64 or hex encode filenames and process names as abstract sequences of bytes. Hopefully you can see why that idea is unlikely to be popular. The mistake is forgetting to guard against invalid text, which is a problem of any text processing where you don't control the input. – Philip Couling Oct 03 '23 at 15:16

In python3 (at least version 3.11.5 where I'm testing this) and its json module, the behaviour is similar to that of perl and its JSON modules. The input/output is decoded/encoded outside of the json module, in this case as per the locale's charset, though the character encoding can be overridden with the PYTHONIOENCODING environment variable.

The C and C.UTF-8 locales (contrary to other locales using UTF-8 as the charset) seem to be a special case, where input/output is decoded/encoded in UTF-8 (even though the charset of the C locale is invariably ASCII), but bytes that don't form part of valid UTF-8 are decoded to code points in the range 0xDC80 to 0xDCFF (those code points land among the ones used for the second half of UTF-16 surrogate pairs, so are not valid character code points, which makes them safe to use here).

The same can be achieved without changing the locale by setting

PYTHONIOENCODING=utf-8:surrogateescape

Then we can process JSON meant to be overall encoded in UTF-8 but that may contain strings that are not UTF-8.

$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape  python3 -c '
import json, sys; _ = json.load(sys.stdin); print(hex(ord(_)))'
0xdce9

0xe9 byte decoded as character 0xdce9.

$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape  python3 -c '
import json, sys; _ = json.load(sys.stdin); print(_)' | od -An -vtx1
 e9 0a

0xdce9 is encoded back to the 0xe9 byte on output.

Example processing the output of lsfd:

$ exec 3> $'\x80\xff'
$ lsfd -Jp "$$" | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json, sys
_ = json.load(sys.stdin)
for e in _["lsfd"]:
  if e["assoc"] == "3":
    print(e["name"])' | sed -n l
/home/chazelas/tmp/\200\377$

Note: if generating some JSON on output, you'll want to pass ensure_ascii=False, as otherwise, for the bytes that couldn't be decoded as UTF-8, you'd get:

$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json, sys; _ = json.load(sys.stdin); print(json.dumps(_))'
"\udce9"

Which most things outside of python would reject.

$ printf '"\xe9"' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json, sys
_ = json.load(sys.stdin)
print(json.dumps(_, ensure_ascii=False))' | sed -n l
"\351"$

Also, as noted in the question, if you have two JSON strings that are the result of an UTF-8 encoded string split in the middle of a character, concatenating them in JSON will not merge those byte sequences into a character until they're encoded back to UTF-8:

$ printf '{"a":"St\xc3","b":"\xa9phane"}' | PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json, sys
_ = json.load(sys.stdin)
myname = _["a"] + _["b"]; print(len(myname), myname)'
9 Stéphane

My name has been reconstituted OK on output, but note how the length is incorrect, as myname contains the \udcc3 and \udca9 surrogate escape characters rather than a reconstituted \u00e9 character.

You can force that merging by going through encode and decode steps using the IO encoding:

$ printf '{"a":"St\xc3","b":"\xa9phane"}' |
   PYTHONIOENCODING=utf-8:surrogateescape python3 -c '
import json,sys
_ = json.load(sys.stdin)
myname = (_["a"] + _["b"]).encode(sys.stdout.encoding,sys.stdout.errors).decode(sys.stdout.encoding,sys.stdout.errors)
print(len(myname), myname)'
8 Stéphane

In any case, it's also possible to encode/decode in latin1 like in perl so that the character values match the byte values in the strings by calling it in a locale that uses that charset or using PYTHONIOENCODING=latin1.
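
For example, mirroring the earlier surrogateescape check, with latin1 the 0xe9 byte comes out as the character with that same value:

$ printf '"\xe9"' | PYTHONIOENCODING=latin1 python3 -c '
import json, sys; print(hex(ord(json.load(sys.stdin))))'
0xe9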

vd (visidata), though written in python3, doesn't seem to honour $PYTHONIOENCODING when input is coming from stdin, and in the C or C.UTF-8 locales doesn't seem to do that surrogate escaping (see this issue), but calling it with --encoding=latin1 with version 2.5 or newer (where that issue was fixed) or in a locale that uses the latin1 charset seems to work, so you can do:

lsfd -J | binary jq .lsfd | LC_CTYPE=C.iso88591 vd -f json

For a visual lsfd that doesn't crash if there are command or file names in the output of lsfd -J that are not UTF-8 encoded text.

When passing the JSON as the path of a file argument, it seems to decode the input as per the --encoding and --encoding-errors options, which default to utf-8 and surrogateescape respectively, and to honour the locale's charset for output.

So, in a shell with process substitution support such as ksh, zsh, bash (or rc, es, akanga with a different syntax), you can just do:

vd -f json <(lsfd -J | binary jq .lsfd)

However, I find it sometimes fails randomly for non-regular files such as those pipes (see that other issue). Using a format with one JSON object per line (jsonl) works better:

vd -f jsonl <(lsfd -J | binary jq -c '.lsfd[]')

Or use the =(...) form of process substitution in zsh (or (...|psub -f) in fish, same as (...|psub) in current versions) that uses a temp file instead of a pipe:

vd -f json =(lsfd -J | binary jq .lsfd)