how to remove ï»¿ from Shell script output

Question

I'm getting ï»¿ in my shell script output. can someone please assist as how to remove these ï»¿ special characters from the output

NOTE : I'm storing my output in txt file and sending the output through mail

stray UTF-8 byte order mark? set your text editor to UTF-8 without BOM encoding. Other suggestions for example here https://stackoverflow.com/questions/3255993/how-do-i-remove-%C3%AF-from-the-beginning-of-a-file — it's a common issue, you'll have to locate the source of these BOMs yourself — frostschutz, Jan 29 '23 at 09:00
Does the answer(s) to this other question answer your question? bash replacing special characters in a variable/ — Kusalananda, Jan 29 '23 at 09:53

Stéphane Chazelas · Answer 1 · 2023-01-29T14:55:24.297

ï»¿ in ISO8859-1 is encoded as 0xef 0xbb 0xbf.

$ printf %s 'ï»¿' | iconv -t ISO8859-1 | hexdump -C
00000000  ef bb bf                                          |...|
00000003

That sequence of three bytes also happens to be the UTF-8 encoding of the U+FEFF ZERO WIDTH NO-BREAK SPACE no break space character.

$ printf %s $'\ufeff' | iconv -t UTF-8 | hexdump -C
00000000  ef bb bf                                          |...|
00000003

That U+FEFF character is also used as the byte order mark (BOM) in UTF-16. It doesn't make sense in UTF-8, but you still see those in Microsoft text files for instance (see also: How can I remove the BOM from a UTF-8 file?).

It can also happen when some UTF-16 encoded text is decoded to UTF-8 as UCS-2LE/BE or UTF-16LE/BE instead of UTF-16:

$ printf X | iconv -t UTF-16 | iconv -f UTF-16LE -t UTF-8 | hexdump -C
00000000  ef bb bf 58                                       |...X|
00000004

Here, it's either that there is a UTF-8 encoded BOM, and your display device assumes text encoded in ISO8859-1 so renders it as ï»¿, or there's been double UTF-8 encoding. That is, text was already UTF-8 encoded, but something thought that was ISO8859-1 and re-encoded it in UTF-8.

In the first case, you should be able to remove the BOM by piping to dos2unix which should remove that BOM and fix other problems with Microsoft files. Or you could delete occurrences of that three byte sequence with sed $'s/\xef\xbb\xbf//g' (assuming a shell with support for those ksh-style $'...').

In the second case, while you could remove those ï»¿ with sed 's/ï»¿//g', it's possible that the corruption is not limited to that BOM character, and best would be to reverse the double-UTF-8 encoding by doing:

<your-file iconv -f UTF-8 -t ISO8859-1 | one-of-the-above

how to remove ï»¿ from Shell script output

1 Answers1