1

I have a text file that contains Unicode escape sequences mixed in with regular text.

cat a.txt

{"relationship":{"type:Memberkey","id""824-\u0001\u0019BFGHDICA2166-01-01","source"}

Here \u0001 and \u0019 are Unicode escape sequences, and they are causing our program to fail.

Is there a generic command to replace any such string?

Neo
  • Provide an example of replacement text, please. – Piotr P. Karwasz Nov 19 '19 at 21:14
  • Current text: {"relationship":{"type:Memberkey","id""824-\u0001\u0019BFGHDICA2166-01-01","source"} Replacement text: {"relationship":{"type:Memberkey","id""824-BFGHDICA2166-01-01","source"} I need a generic command to remove the \u000* Unicode strings. – Neo Nov 19 '19 at 21:20
  • Your input appears to be JSON, but it is not well formed. If you can provide a well formed MWE then you may get a better answer—possibly including JSON pre-processing with a robust JSON tool that supports Unicode (as per-spec). Also, the Unicode output "\u0001" may depend on your cat, locale and terminal; or may be the literal file contents, hexdump -C on the input is one way to check that. – mr.spuratic Nov 20 '19 at 10:02

2 Answers

0

Unfortunately, it depends what you mean by "generic command" and what you mean by "replace."

I take you to mean that you wish to reduce the Unicode text to its closest equivalent in a simpler encoding such as ASCII, in which case you want to look at iconv.

You may find this guide helpful.

You might instead be wanting to replace such strings with some arbitrary text of your own, in which case you want to look at regex. You may find this guide helpful.

Edit: If you aren't quite sure what you want, then you should probably start by playing around with (a copy of) the file itself. One thing you should know about (if you don't already) is the printf command. You may find this guide helpful.
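As a rough sketch of that kind of experiment, printf can build two small scratch files: one holding the literal text \u0001 and one holding the actual control byte. The file names and sample strings here are made up for illustration; od -c is used only to show the raw bytes.

```shell
# Work in a scratch directory so the real file is never touched.
tmpdir=$(mktemp -d)

# Case 1: the file holds the six literal characters \ u 0 0 0 1.
# '%s' prints the argument as-is, so the backslash is not interpreted.
printf '%s\n' 'id-\u0001-end' > "$tmpdir/literal.txt"

# Case 2: the file holds the real control byte U+0001 (octal escape \001).
printf 'id-\001-end\n' > "$tmpdir/control.txt"

# The byte counts differ: six characters of escape text versus one byte.
literal_size=$(wc -c < "$tmpdir/literal.txt" | tr -d ' ')
control_size=$(wc -c < "$tmpdir/control.txt" | tr -d ' ')
echo "literal: $literal_size bytes, control: $control_size bytes"

# od -c prints each byte, which makes the difference visible.
od -c "$tmpdir/literal.txt"
od -c "$tmpdir/control.txt"

# Clean up the scratch files.
rm -rf "$tmpdir"
```

Which of the two cases your file falls into decides the tool: literal escape text can be handled with a plain text substitution, while real control bytes call for something like tr or iconv.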

JCD
0

If you just want to get rid of those control characters, you can use sed:

sed -i 's/\\u001[[:xdigit:]]//g;s/\\u000[0-9bBcCeEfF]//g' your_file

I assume you want to keep the LF and CR characters, even if they are encoded as \u000a and \u000d.
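To try this without touching a file, here is a small sketch that pipes the line from the question through sed without -i. It assumes the file contains the literal backslash-u text rather than real control bytes; the variable names are only for illustration. The g flag is used so that every match on a line is removed, and the second bracket list covers both f and F.

```shell
# Sample line copied from the question (literal backslash-u text).
sample='{"relationship":{"type:Memberkey","id""824-\u0001\u0019BFGHDICA2166-01-01","source"}'

# No -i here, so nothing on disk is modified.
# First expression: \u0010 through \u001f.
# Second expression: \u0000 through \u000f, except \u000a (LF) and \u000d (CR).
# Note that \u0009 (tab) is removed as well.
cleaned=$(printf '%s\n' "$sample" |
  sed 's/\\u001[[:xdigit:]]//g;s/\\u000[0-9bBcCeEfF]//g')

printf '%s\n' "$cleaned"

# A second check: \u000a is kept, \u0007 is removed.
kept=$(printf '%s\n' 'a\u000ab\u0007c' |
  sed 's/\\u001[[:xdigit:]]//g;s/\\u000[0-9bBcCeEfF]//g')

printf '%s\n' "$kept"
```

Once the output looks right, the same expressions can be run with -i on a copy of the real file.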