I have a file in UTF-8 encoding with BOM and want to remove the BOM. Are there any linux command-line tools to remove the BOM from the file?
$ file test.xml
test.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines
If you're not sure whether the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn't:
sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt
You can also overwrite the existing file with the -i option:
sed -i '1s/^\xEF\xBB\xBF//' orig.txt
If you are using the BSD version of sed (e.g. macOS), then you need to have bash do the escaping:
sed $'1s/\xef\xbb\xbf//' < orig.txt > new.txt
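A quick round-trip check of the command above (a sketch; the filenames are placeholders):

```shell
# Create a sample file that starts with the UTF-8 BOM (EF BB BF)
printf '\xef\xbb\xbfhello\n' > orig.txt

# Strip the BOM (GNU sed syntax; on BSD sed use the $'...' form above)
sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt

# new.txt is 3 bytes shorter and starts with 'h'
wc -c orig.txt new.txt
```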
I tried this in the en_US.UTF-8 locale and it worked. When will it fail?
– m13r
Jul 24 '17 at 06:55
Before: -<U+FEFF>\chapter{xxx}
After: +\chapter{xxx}^M
Explanation: I used MS Word to fix typos in a LaTeX file; LaTeX under Linux then showed the errors mentioned. The output above is from a git diff. How could I alter the expression to catch this special case too?
– Cutton Eye
Feb 20 '18 at 15:55
sed -i '1s/^\xEF\xBB\xBF//' orig.txt fails for me, but sed '1s/^\xEF\xBB\xBF//' < orig.txt > orig.txt works. I am in the en_US.UTF-8 locale. Any idea why?
– alpha_989
Jul 15 '18 at 01:52
> …it doesn't necessarily write exactly the bytes of the original file. This was the reason why the two methods were different for me: there were some invalid UTF-8 characters which were being removed when I didn't use the in-place change option.
– alpha_989
Jul 15 '18 at 21:21
1s/ means only search the first line; other lines are unaffected. The ^ means only match at the start of the (first) line. \xEF\xBB\xBF is the UTF-8 BOM (escaped hex string). // means replace with nothing. I could have added 1 to the end (for 1s/^\xEF\xBB\xBF//1), which would mean only match the first occurrence of the pattern on the line. But as the search is anchored with ^, this won't make any difference. If the file doesn't have the BOM at the start of the first line, the pattern won't match, and thus no change is made.
– CSM
Oct 27 '19 at 18:47
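As a sanity check of that explanation, a file without a BOM passes through the anchored pattern untouched (a minimal sketch; filenames are placeholders):

```shell
# A file with no BOM at all
printf 'no bom here\n' > plain.txt

# The anchored pattern finds nothing to replace, so the output is identical
sed '1s/^\xEF\xBB\xBF//' plain.txt > out.txt
cmp -s plain.txt out.txt && echo "unchanged"
```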
sed -i 's/\xEF\xBB\xBF//g' orig.txt
– wnasich
Dec 12 '20 at 22:24
A BOM doesn't make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.
dos2unix will remove it and also take care of other idiosyncrasies of Windows text files:
dos2unix test.xml
Does dos2unix have a -r option for this?
– m13r
Jul 25 '17 at 07:55
dos2unix does not have the option -r. I am using version 6.0.4 (2013-12-30).
– m13r
Jul 26 '17 at 05:40
"A BOM doesn't make sense in UTF-8"
, at least in Persian language it make sense, since without BOM they don't appear as Persian as I'm always adding BOM bytes to the beginning of a Persian context file in *nix env to be able to correctly shown its Persian content in Windows env like excel or notepad, etc.
– αғsнιη
Nov 10 '21 at 14:41
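The commenter's workflow of adding a BOM can be sketched as follows (the filenames and sample text are hypothetical):

```shell
# A file saved without a BOM
printf 'salam\n' > persian.txt

# Prepend the UTF-8 BOM so Windows tools (Excel, Notepad) detect the encoding
printf '\xef\xbb\xbf' | cat - persian.txt > persian-bom.txt

# The result starts with the three BOM bytes EF BB BF
head -c 3 persian-bom.txt | od -An -tx1
```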
Open file in VIM:
vi text.xml
Remove BOM encoding:
:set nobomb
Save the file and quit:
:x
For a non-interactive solution, try the following command line:
vi -c ":set nobomb" -c ":wq" text.xml
That should remove the BOM, save the file and quit, all from the command line.
My file shows <feff>, yet :set nobomb doesn't modify or remove it.
– dlamblin
Oct 09 '19 at 21:11
vim -c ":bufdo set nobomb|update" -c "q" *
– Dennis Williamson
Sep 07 '21 at 13:41
It is possible to remove the BOM from a file with the tail command:
tail -c +4 withBOM.txt > withoutBOM.txt
Be aware that this chops the first 3 bytes (-c +N makes the output start at byte number N, so it cuts the first N-1 bytes) from the file, so be sure that the file really contains the BOM before running tail.
tail -c -1 or tail -c 1 (what tail is generally used for) outputs the content starting with the last byte; tail -c +1 starts with the first byte. Using tail -c 0/tail -c +0 for that would be a lot more unintuitive.
– Stéphane Chazelas
Jul 23 '17 at 23:05
(dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU, (head -c3 >/dev/null; cat) -- even in UTF-8 or other non-single-byte locales; GNU head treats 'char' as byte.
– dave_thompson_085
Jul 24 '17 at 06:16
You need the tail command for this to work... bash has nothing to do with it and only calls the tail binary... What error do you get? Maybe have a look at this: https://stackoverflow.com/questions/187587/a-windows-equivalent-of-the-unix-tail-command I have tested it with "Git Bash for Windows" and it worked for me.
– m13r
Jan 03 '18 at 10:18
The tail command works only when the file actually has a 3-byte BOM; otherwise it's destructive, as it blindly chops the first 3 bytes of the input file regardless of what they contain.
– Cyril Chaboisseau
Oct 13 '21 at 07:01
I use a vim one-liner on the regular for this:
vim --clean -c 'se nobomb|wq' filename
vim --clean -c 'bufdo se nobomb|wqa' filename1 filename2 ...
You can use
LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- filename
to remove the byte order mark from the beginning of the file, if it has one, as well as convert any CR LF newlines to LF only. The LANG=C LC_ALL=C prefix tells the shell you want the command to run in the default C locale (also known as the default POSIX locale), where the three bytes forming the byte order mark are treated as plain bytes. The -i option to sed means in-place. If you use -i.old, then sed saves the original file as filename.old and the new file (with the modifications, if any) as filename.
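A minimal demonstration of that command on a hypothetical Windows-style file (assumes GNU sed):

```shell
# A Windows-style file: UTF-8 BOM plus CRLF line endings
printf '\xef\xbb\xbfline1\r\nline2\r\n' > win.txt

# Strip the carriage returns everywhere and the BOM on line 1, in place
LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- win.txt

# The file now contains plain LF-terminated lines with no BOM
od -c win.txt
```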
I personally like to have this as ~/bin/fix-ms; for example:
#!/bin/dash
export LANG=C LC_ALL=C
if [ $# -gt 0 ]; then
    for FILE in "$@" ; do
        sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$FILE" || exit 1
    done
else
    exec sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
fi
so that if I need to apply this to say all C source files and headers (my old code from the MS-DOS era, for example!), I just run
find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/fix-ms
or, if I just want to look at such a file, without modifying it, I can run
~/bin/fix-ms < filename | less
and not see the ugly <U+FEFF>
in my UTF-8 terminal.
Why not just sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@"?
– Stéphane Chazelas
Jul 24 '17 at 14:02
That is what sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@" does not do; it does return an exit code, but it processes all files listed in the argument list before exiting.
– Nominal Animal
Jul 24 '17 at 14:24
The -- before the file name(s) is, of course, important: without it, file names beginning with a dash may be treated as options by sed. I edited those into my answer; thank you for the reminder!
– Nominal Animal
Jul 24 '17 at 14:27
I have a slightly different problem, and am putting this here for anyone who, like me, ends up here with data full of ZERO WIDTH NO-BREAK SPACE characters (which are known as a Byte Order Mark when they appear as the first character of a file).
I got this data by copying out of a Grafana query metrics field, and it had multiple (17) \xef\xbb\xbf sequences (which show up in vim as rate<feff>(<feff>node<feff>{<feff>job<feff>) in a single line with only 81 actual characters.
I modified Nominal Animal's code just slightly:
LANG=C LC_ALL=C sed -e 's/\xef\xbb\xbf//g'
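A small reproduction of that global removal (the sample line is a hypothetical stand-in for the Grafana output; assumes GNU sed for -i):

```shell
# A line with several embedded BOM (ZERO WIDTH NO-BREAK SPACE) sequences
printf 'rate\xef\xbb\xbf(\xef\xbb\xbfnode\xef\xbb\xbf)\n' > metrics.txt

# Remove every occurrence, not just one at the start of the file
LANG=C LC_ALL=C sed -i 's/\xef\xbb\xbf//g' metrics.txt

cat metrics.txt   # rate(node)
```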
And the :set nobomb trick in vim only removes the very first one in the file.
I also tried this:
LANG=C vim b
Then vim doesn't show them, but they are still there (even after a write...).
I had the same question and ended up writing a dedicated utility bom(1)
for this. It's available here.
Here's the man page:
NAME
bom -- Decode Unicode byte order mark
SYNOPSIS
bom --strip [--expect types] [--lenient] [--prefer32] [--utf8] [file]
bom --detect [--expect types] [--prefer32] [file]
bom --print type
bom --list
bom --help
bom --version
DESCRIPTION
bom decodes, verifies, reports, and/or strips the byte order mark (BOM) at the
start of the specified file, if any.
When no file is specified, or when file is -, read standard input.
OPTIONS
-d, --detect
Report the detected BOM type to standard output and then exit.
See SUPPORTED BOM TYPES for possible values.
-e, --expect types
Expect to find one of the specified BOM types, otherwise exit with an
error.
Multiple types may be specified, separated by commas.
Specifying NONE is acceptable and matches when the file has no (sup-
ported) BOM.
-h, --help
Output command line usage help.
-l, --lenient
Silently ignore any illegal byte sequences encountered when converting
the remainder of the file to UTF-8.
Without this flag, bom will exit immediately with an error if an ille-
gal byte sequence is encountered.
This flag has no effect unless the --utf8 flag is given.
--list List the supported BOM types and exit.
-p, --print type
Output the byte sequence corresponding to the type byte order mark.
--prefer32
Used to disambiguate the byte sequence FF FE 00 00, which can be
either a UTF-32LE BOM or a UTF-16LE BOM followed by a NUL character.
Without this flag, UTF-16LE is assumed; with this flag, UTF-32LE is
assumed.
-s, --strip
Strip the BOM, if any, from the beginning of the file and output the
remainder of the file.
-u, --utf8
Convert the remainder of the file to UTF-8, assuming the character
encoding implied by the detected BOM.
For files with no (supported) BOM, this flag has no effect and the
remainder of the file is copied unmodified.
For files with a UTF-8 BOM, the identity transformation is still
applied, so (for example) illegal byte sequences will be detected.
-v, --version
Output program version and exit.
SUPPORTED BOM TYPES
The supported BOM types are:
NONE No supported BOM was detected.
UTF-7 A UTF-7 BOM was detected.
UTF-8 A UTF-8 BOM was detected.
UTF-16BE
A UTF-16 (Big Endian) BOM was detected.
UTF-16LE
A UTF-16 (Little Endian) BOM was detected.
UTF-32BE
A UTF-32 (Big Endian) BOM was detected.
UTF-32LE
A UTF-32 (Little Endian) BOM was detected.
GB18030
A GB18030 (Chinese National Standard) BOM was detected.
EXAMPLES
To tell what kind of byte order mark a file has:
$ bom --detect
To normalize files with byte order marks into UTF-8, and pass other files
through unchanged:
$ bom --strip --utf8
Same as previous example, but discard illegal byte sequences instead of gener-
ating an error:
$ bom --strip --utf8 --lenient
To verify a properly encoded UTF-8 or UTF-16 file with a byte-order-mark and
output it as UTF-8:
$ bom --strip --utf8 --expect UTF-8,UTF-16LE,UTF-16BE
To just remove any byte order mark and get on with your life:
$ bom --strip file
RETURN VALUES
bom exits with one of the following values:
0 Success.
1 A general error occurred.
2 The --expect flag was given but the detected BOM did not match.
3 An illegal byte sequence was detected (and --lenient was not speci-
fied).
SEE ALSO
iconv(1)
bom: Decode Unicode byte order mark, https://github.com/archiecobbs/bom.
Recently I found this tiny command-line tool which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at github)
A small drawback: you can download only the plain C++ source code. You have to create the makefile (with CMake, for example) and compile it yourself; binaries are not provided on that page.
I know it's been a while, but since I had a slightly different issue, I'm posting so others may benefit.
My text file was randomly haunted by \xfe\xff characters; luckily for me they appeared at the start of lines, and the set of allowed characters was limited to alphanumerics.
The command below in vim cuts the first non-alphanumeric character from each line, but use it with caution, as your set of allowed characters might vary:
:%s/^[^a-zA-Z0-9]//g
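A non-interactive equivalent can be sketched with GNU sed (an assumption; the original answer uses vim, and the same caution about the allowed-character set applies). Here '*' is used so that both leading bytes are dropped in one pass:

```shell
# Lines haunted by leading FE FF bytes, as in the answer above
printf '\xfe\xffabc123\n\xfe\xffdef\n' > haunted.txt

# Remove all leading non-alphanumeric bytes from each line, in place
LANG=C LC_ALL=C sed -i 's/^[^a-zA-Z0-9]*//' haunted.txt
```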
The answer posted by Smirk was a great hint about how to do this on a VERY OLD UNIX system that has ancient versions of vim, ex, iconv, piconv, etc. I did not want to restrict treatment to only alphanumeric non-BOM characters, so these patterns assume that two or three leading non-printable ASCII bytes on the first line only are the BOM characters to remove. A non-interactive method was also desired.
An ex commands file was created as follows:
" UTF-8 Byte-Order-Mark (BOM) characters
1,1g/^[^ -~][^ -~][^ -~][ -~]/s/^...//
" UTF-16LE, UTF-16 (Big Endian) BOM
" ex happens to strip unwanted NULs
1,1g/^[^ -~][^ -~][ -~]/s/^..//
To remove the BOM characters:
ex - file-w-BOM <excommands
To use interactively, just enter as a colon command in vim. For example:
:1,1g/^[^ -~][^ -~][^ -~][ -~]/s/^...//
NOTE: For some reason, the ex on my VERY OLD UNIX system just happened to remove the unwanted NUL bytes from UTF-16LE files in a way that didn't garble data that all cleanly corresponded with ASCII characters. This was fortunate since both iconv and piconv on the VERY OLD UNIX system were also unable to properly re-encode UTF-16LE as something else.
CAVEAT: The above is sure to BREAK files that contain multi-byte characters that do not map to plain ASCII, so the solution must only be used with this in mind.
Since the UTF-8 BOM is EF BB BF, one thing you could do is combine xxd and xxd -r to change those first three bytes to something within the printable ASCII range, like 41 41 41, so that "AAA" will appear in the BOM's place, which you can then simply delete and save with a regular text editor. Bit of a roundabout way, but it works. – Braden Best Aug 11 '21 at 23:29
xxd -p and xxd -p -r handily allow removal (or addition) of characters in the hex dump. On my system, however, I then had to reformat to the standard xxd -p line length for all but the last line in order to get xxd -p -r to work properly, so this made the process much less handy. – kbulgrien Aug 23 '23 at 23:10