1

How do I remove codes of this type: \u003c/p\u003e\n\u003cp\u003e from my text file? I tried sed, but it does not work well because of the backslash.

Tran
  • Hi Tran. Please show the contents of a sample file, the expected output file and the exact command you tried. Did you try escaping the backslash? – Quasímodo May 13 '20 at 19:18
  • Do you want to remove them or decode them? If remove, do you want to remove the HTML tag names as well? What about the encoded newlines? – JdeBP May 13 '20 at 19:50
  • How did you get to see those \u003c things? I would like to see them through od (octal dump). I believe they are just a character representation of UTF-16 codes, a figment of your editor: in which case, any attempt to match them with text in sed or other text tools is futile. – Paul_Pedant May 14 '20 at 09:21

2 Answers

2

In most syntaxes with quoted strings, a backslash before a punctuation character stands for that punctuation character, instead of letting the punctuation character have its usual special effect. In particular, two backslashes stand for one backslash. Backslash followed by a letter or digit usually works the opposite way: it causes the character to have a special effect.

Put the sed code in single quotes '…' to protect it from shell expansion. If you need a single quote inside the sed code, use '\'' (quote-backslash-quote-quote: the first quote terminates the single-quoted segment, then there's a quote character which is interpreted literally because there's a backslash just before it, and the last quote starts a new single-quoted segment).
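To make the '\'' idiom concrete, here is a minimal sketch. The helper name expand_its and the sample text are made up for illustration; only the quoting pattern matters.

```shell
# Hypothetical helper: rewrite "it's" as "it is".
# The sed script is single-quoted; '\'' closes the quoted segment,
# adds one literal quote character, and reopens the quoted segment.
expand_its() {
    sed 's/it'\''s/it is/g'
}

# Sample input containing a single quote.
printf '%s\n' "it's here" | expand_its
```

After the shell has processed the quoting, sed receives the script s/it's/it is/g.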

Sed is a good tool if there's a small number of backslash sequences to replace. In the sed s command, use double-backslash to stand for a backslash. Use successive s commands for each backslash sequence. Put the transformation that converts double-backslash to a backslash last so that the resulting backslash is not itself replaced. Here, in the last command, I use . to stand for any character in the regex, \(.\) makes it a numbered group (note that here, the backslashes cause the parentheses to become special: this is a quirk of the basic regular expression syntax that sed uses), and \1 stands for that group in the replacement text.

sed -e 's/\\u003c/</g; s/\\u003e/>/g; s/\\n/\n/g; s/\\\(.\)/\1/g'
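As a quick check of the sed approach, here is a minimal sketch run on a made-up sample line. The function name decode_escapes_sed and the sample text are assumptions for illustration. Every substitution carries the g flag so that all occurrences on a line are handled, and the newline rule assumes a sed (such as GNU sed) that treats \n in the replacement as a newline.

```shell
# Hypothetical wrapper around the sed command discussed above.
decode_escapes_sed() {
    # Order matters: specific sequences first, the generic
    # "backslash + any character" rule last.
    sed -e 's/\\u003c/</g; s/\\u003e/>/g; s/\\n/\n/g; s/\\\(.\)/\1/g'
}

# printf with %s passes the backslashes through untouched.
printf '%s\n' '\u003cp\u003eHello \"world\"\u003c/p\u003e' | decode_escapes_sed
```

For this sample the result should be <p>Hello "world"</p>.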

Alternatively, to convert backslash sequences with an arbitrary number after \u, you can use Perl. Perl has an s operator that's similar to sed's s command, but with a slightly different regular expression syntax and the replacement allows writing Perl code.

perl -pe 's/\\u([0-9a-fA-F]{4})/chr(hex($1))/eg; s/\\n/\n/g; s/\\(.)/$1/g'
0

Those sequences encode the < and > characters of HTML (or similar) tags. You could remove them, but I suggest you decode them first, to keep the file structure, and then remove the tags if they are not needed.

Depending on the size of your input you could do this:

$ echo -e "$(cat encodedfile.txt)" > decodedfile.txt

For bigger files, this should do it:

$ while IFS= read -r a; do echo -e "$a"; done < encodedfile.txt > decodedfile.txt