JSON cannot directly represent arbitrary file paths (which are sequences of non-zero bytes) in its strings, which are sequences of Unicode characters. Also note that the output of `find` is not reliably post-processable unless you use `-print0`.
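A quick illustration of why `-print0` matters, using a hypothetical `/tmp/demo` directory (the path and file name below are made up for the demonstration):

```shell
# Create a file whose name contains a newline (hypothetical example).
mkdir -p /tmp/demo
: > $'/tmp/demo/a\nb'

# Without -print0, the newline inside the name is indistinguishable
# from the newline separating records, so one file shows up as two
# bogus paths:
find /tmp/demo -type f

# With -print0, records are NUL-delimited, and NUL is the one byte a
# path can never contain:
find /tmp/demo -type f -print0 | od -c
```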
For instance a file path might be `$'/home/St\xc3\xa9phane\nChazelas/ISO-8859-1/R\xe9sum\xe9'` (here using `ksh93`-style `$'...'` notation to represent byte values), with the `é` UTF-8 encoded in `Stéphane`, and ISO-8859-1 encoded in `Résumé`.
JSON cannot represent that file path unless you use some encoding. That could be for instance URI encoding:
{ "path": "/home/St%C3%A9phane\nChazelas/ISO-8859-1/R%E9sum%E9" }
Another approach could be to interpret the path as if it were ISO-8859-1 encoded (or any single-byte charset where any byte value can make a valid character¹):
{ "path": "/home/Stéphane\nChazelas/ISO-8859-1/Résumé" }
`jq` has some support for doing URI encoding, but AFAIK it cannot be fed non-UTF-8 input, nor does it have any support for encoding conversion.
On a GNU system, for the second approach where file paths are considered to be ISO-8859-1 encoded you may however be able to do something like:
find ~ -type d -print0 |         # NUL-delimited: the only safe separator
  iconv -f iso-8859-1 -t utf-8 | # reinterpret each byte as an ISO-8859-1 character
  tr '\0\n' '\n\0' |             # swap NULs and newlines: one record per line
  jq -Rc '{"path": gsub("\u0000"; "\n"), "type": "directory"}'
Which on our example above gives:
{"path":"/home/Stéphane\nChazelas/ISO-8859-1/Résumé","type":"directory"}
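To recover the original byte string from that JSON, the steps can be reversed: `jq -j` emits the decoded string as raw UTF-8 without a trailing newline, and `iconv` maps it back to the original bytes (a sketch; `od` is only there to make the bytes visible):

```shell
# Round trip: JSON string -> UTF-8 text -> original ISO-8859-1 bytes.
jq -j '.path' <<'EOF' | iconv -f utf-8 -t iso-8859-1 | od -An -c
{"path":"/home/Stéphane\nChazelas/ISO-8859-1/Résumé","type":"directory"}
EOF
```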
¹ though ISO-8859-1 specifically is an obvious choice as its code points match those of Unicode. So if your JSON string contains a U+00E9 character for instance, you know it corresponds to the 0xE9 byte. You could add the `-a` option to `jq` for non-ASCII characters to be represented as `\uXXXX` instead.
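For instance, feeding `jq` a UTF-8-encoded `é`:

```shell
# --ascii-output (-a) escapes every non-ASCII character, so the
# resulting JSON is pure ASCII regardless of the terminal's charset:
printf '"\303\251"' | jq -a .
# → "\u00e9"
```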