I have a file in UTF-8 encoding with BOM and want to remove the BOM. Are there any linux command-line tools to remove the BOM from the file?
$ file test.xml
test.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines
If you're not sure whether the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn't:
sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt
You can also overwrite the existing file with the -i option:
sed -i '1s/^\xEF\xBB\xBF//' orig.txt
If you are using the BSD version of sed (e.g. macOS), then you need to have bash do the escaping:
sed $'1s/\xef\xbb\xbf//' < orig.txt > new.txt
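A quick round-trip check of the command above (a sketch; the filenames are placeholders):

```shell
# Create a sample file that starts with the UTF-8 BOM (EF BB BF)
printf '\xef\xbb\xbfhello\n' > orig.txt

# Strip the BOM (GNU sed syntax; on BSD sed use the $'...' form above)
sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt

# new.txt is 3 bytes shorter and starts with 'h'
wc -c orig.txt new.txt
```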
I tried this in the en_US.UTF-8 locale and it worked. When will it fail?
– m13r
Jul 24 '17 at 06:55
Before: -<U+FEFF>\chapter{xxx}
After: +\chapter{xxx}^M
Explanation: I used MS Word to fix typos in a LaTeX file; LaTeX under Linux then showed the errors mentioned. The output above is from a git diff. How could I alter the expression to catch this special case too?
– Cutton Eye
Feb 20 '18 at 15:55
sed -i '1s/^\xEF\xBB\xBF//' orig.txt fails for me, but sed '1s/^\xEF\xBB\xBF//' < orig.txt > orig.txt works. I am in the en_US.UTF-8 locale. Any idea why?
– alpha_989
Jul 15 '18 at 01:52
> …it doesn't necessarily write exactly the bytes of the original file. This was the reason why the two methods were different for me: there were some invalid UTF-8 characters which were being removed when I didn't use the in-place change option.
– alpha_989
Jul 15 '18 at 21:21
1s/ means only search the first line; other lines are unaffected. The ^ means only match at the start of the (first) line. \xEF\xBB\xBF is the UTF-8 BOM (escaped hex string). // means replace with nothing. I could have added 1 to the end (for 1s/^\xEF\xBB\xBF//1), which would mean only match the first occurrence of the pattern on the line. But as the search is anchored with ^, this won't make any difference. If the file doesn't have the BOM at the start of the first line, the pattern won't match, and thus no change is made.
– CSM
Oct 27 '19 at 18:47
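As a sanity check of that explanation, a file without a BOM passes through the anchored pattern untouched (a minimal sketch; filenames are placeholders):

```shell
# A file with no BOM at all
printf 'no bom here\n' > plain.txt

# The anchored pattern finds nothing to replace, so the output is identical
sed '1s/^\xEF\xBB\xBF//' plain.txt > out.txt
cmp -s plain.txt out.txt && echo "unchanged"
```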
sed -i 's/\xEF\xBB\xBF//g' orig.txt
– wnasich
Dec 12 '20 at 22:24
A BOM doesn't make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.
dos2unix will remove it and also take care of other idiosyncrasies of Windows text files:
dos2unix test.xml
Does dos2unix have a -r option for this?
– m13r
Jul 25 '17 at 07:55
dos2unix does not have the option -r. I am using version 6.0.4 (2013-12-30).
– m13r
Jul 26 '17 at 05:40
"A BOM doesn't make sense in UTF-8"
, at least in Persian language it make sense, since without BOM they don't appear as Persian as I'm always adding BOM bytes to the beginning of a Persian context file in *nix env to be able to correctly shown its Persian content in Windows env like excel or notepad, etc.
– αғsнιη
Nov 10 '21 at 14:41
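The commenter's workflow of adding a BOM can be sketched as follows (the filenames and sample text are hypothetical):

```shell
# A file saved without a BOM
printf 'salam\n' > persian.txt

# Prepend the UTF-8 BOM so Windows tools (Excel, Notepad) detect the encoding
printf '\xef\xbb\xbf' | cat - persian.txt > persian-bom.txt

# The result starts with the three BOM bytes EF BB BF
head -c 3 persian-bom.txt | od -An -tx1
```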
Open file in VIM:
vi text.xml
Remove BOM encoding:
:set nobomb
Save the file and quit:
:x
For a non-interactive solution, try the following command line:
vi -c ":set nobomb" -c ":wq" text.xml
That should remove the BOM, save the file and quit, all from the command line.
My file shows <feff>, yet :set nobomb doesn't modify or remove it.
– dlamblin
Oct 09 '19 at 21:11
vim -c ":bufdo set nobomb|update" -c "q" *
– Dennis Williamson
Sep 07 '21 at 13:41
It is possible to remove the BOM from a file with the tail command:
tail -c +4 withBOM.txt > withoutBOM.txt
Be aware that this chops the first 3 bytes (-c +N makes the output start at byte number N, so it cuts the first N-1 bytes) from the file, so be sure that the file really contains the BOM before running tail.
tail -c -1 or tail -c 1 (what tail is generally used for) outputs the content starting with the last byte; tail -c +1 starts with the first byte. Using tail -c 0/tail -c +0 for that would be a lot more unintuitive.
– Stéphane Chazelas
Jul 23 '17 at 23:05
(dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU, (head -c3 >/dev/null; cat) -- even in UTF-8 or other non-single-byte locales; GNU head treats 'char' as byte.
– dave_thompson_085
Jul 24 '17 at 06:16
You need the tail command for this to work... bash has nothing to do with it and only calls the tail binary... What error do you get? Maybe have a look at this: https://stackoverflow.com/questions/187587/a-windows-equivalent-of-the-unix-tail-command I have tested it with "Git Bash for Windows" and it worked for me.
– m13r
Jan 03 '18 at 10:18
The tail command works only when the file actually has a 3-byte BOM; otherwise it's destructive, as it blindly chops the first 3 bytes of the input file regardless of what they contain.
– Cyril Chaboisseau
Oct 13 '21 at 07:01
I use a vim one-liner on the regular for this:
vim --clean -c 'se nobomb|wq' filename
vim --clean -c 'bufdo se nobomb|wqa' filename1 filename2 ...
You can use
LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- filename
to remove the byte order mark from the beginning of the file, if it has one, as well as convert any CR LF newlines to LF only. The LANG=C LC_ALL=C prefix tells the shell you want the command to run in the default C locale (also known as the default POSIX locale), where the three bytes forming the byte order mark are treated as plain bytes. The -i option to sed means in-place. If you use -i.old, then sed saves the original file as filename.old and the new file (with the modifications, if any) as filename.
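A minimal demonstration of that command on a hypothetical Windows-style file (assumes GNU sed):

```shell
# A Windows-style file: UTF-8 BOM plus CRLF line endings
printf '\xef\xbb\xbfline1\r\nline2\r\n' > win.txt

# Strip the carriage returns everywhere and the BOM on line 1, in place
LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- win.txt

# The file now contains plain LF-terminated lines with no BOM
od -c win.txt
```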
I personally like to have this as ~/bin/fix-ms; for example:
#!/bin/dash
export LANG=C LC_ALL=C
if [ $# -gt 0 ]; then
    for FILE in "$@" ; do
        sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$FILE" || exit 1
    done
else
    exec sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
fi
so that if I need to apply this to say all C source files and headers (my old code from the MS-DOS era, for example!), I just run
find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/fix-ms
or, if I just want to look at such a file, without modifying it, I can run
~/bin/fix-ms < filename | less
and not see the ugly <U+FEFF>
in my UTF-8 terminal.
Why not just sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@"?
– Stéphane Chazelas
Jul 24 '17 at 14:02
That is what sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@" does not do; it does return an exit code, but it processes all files listed in the argument list before exiting.
– Nominal Animal
Jul 24 '17 at 14:24
The -- before the file name(s) is, of course, important: without it, file names beginning with a dash may be treated as options by sed. I edited those into my answer; thank you for the reminder!
– Nominal Animal
Jul 24 '17 at 14:27
I have a slightly different problem, and am putting this here for anyone who, like me, ends up here with data full of ZERO WIDTH NO-BREAK SPACE characters (which are known as a Byte Order Mark when they appear as the first character of a file).
I got this data by copying out of a Grafana query metrics field, and it had multiple (17) \xef\xbb\xbf sequences (which show up in vim as rate<feff>(<feff>node<feff>{<feff>job<feff>) in a single line with only 81 actual characters.
I modified Nominal Animal's code just slightly:
LANG=C LC_ALL=C sed -e 's/\xef\xbb\xbf//g'
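A small reproduction of that global removal (the sample line is a hypothetical stand-in for the Grafana output; assumes GNU sed for -i):

```shell
# A line with several embedded BOM (ZERO WIDTH NO-BREAK SPACE) sequences
printf 'rate\xef\xbb\xbf(\xef\xbb\xbfnode\xef\xbb\xbf)\n' > metrics.txt

# Remove every occurrence, not just one at the start of the file
LANG=C LC_ALL=C sed -i 's/\xef\xbb\xbf//g' metrics.txt

cat metrics.txt   # rate(node)
```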
And the :set nobomb trick in vim only removes the very first one in the file.
I also tried this:
LANG=C vim b
Then vim doesn't show them, but they are still there (even after a write...).
I had the same question and ended up writing a dedicated utility bom(1)
for this. It's available here.
Here's the man page:
NAME
bom -- Decode Unicode byte order mark
SYNOPSIS
bom --strip [--expect types] [--lenient] [--prefer32] [--utf8] [file]
bom --detect [--expect types] [--prefer32] [file]
bom --print type
bom --list
bom --help
bom --version
DESCRIPTION
bom decodes, verifies, reports, and/or strips the byte order mark (BOM) at the
start of the specified file, if any.
When no file is specified, or when file is -, read standard input.
OPTIONS
-d, --detect
Report the detected BOM type to standard output and then exit.
See SUPPORTED BOM TYPES for possible values.
-e, --expect types
Expect to find one of the specified BOM types, otherwise exit with an
error.
Multiple types may be specified, separated by commas.
Specifying NONE is acceptable and matches when the file has no (sup-
ported) BOM.
-h, --help
Output command line usage help.
-l, --lenient
Silently ignore any illegal byte sequences encountered when converting
the remainder of the file to UTF-8.
Without this flag, bom will exit immediately with an error if an ille-
gal byte sequence is encountered.
This flag has no effect unless the --utf8 flag is given.
--list List the supported BOM types and exit.
-p, --print type
Output the byte sequence corresponding to the type byte order mark.
--prefer32
Used to disambiguate the byte sequence FF FE 00 00, which can be
either a UTF-32LE BOM or a UTF-16LE BOM followed by a NUL character.
Without this flag, UTF-16LE is assumed; with this flag, UTF-32LE is
assumed.
-s, --strip
Strip the BOM, if any, from the beginning of the file and output the
remainder of the file.
-u, --utf8
Convert the remainder of the file to UTF-8, assuming the character
encoding implied by the detected BOM.
For files with no (supported) BOM, this flag has no effect and the
remainder of the file is copied unmodified.
For files with a UTF-8 BOM, the identity transformation is still
applied, so (for example) illegal byte sequences will be detected.
-v, --version
Output program version and exit.
SUPPORTED BOM TYPES
The supported BOM types are:
NONE No supported BOM was detected.
UTF-7 A UTF-7 BOM was detected.
UTF-8 A UTF-8 BOM was detected.
UTF-16BE
A UTF-16 (Big Endian) BOM was detected.
UTF-16LE
A UTF-16 (Little Endian) BOM was detected.
UTF-32BE
A UTF-32 (Big Endian) BOM was detected.
UTF-32LE
A UTF-32 (Little Endian) BOM was detected.
GB18030
A GB18030 (Chinese National Standard) BOM was detected.
EXAMPLES
To tell what kind of byte order mark a file has:
$ bom --detect
To normalize files with byte order marks into UTF-8, and pass other files
through unchanged:
$ bom --strip --utf8
Same as previous example, but discard illegal byte sequences instead of gener-
ating an error:
$ bom --strip --utf8 --lenient
To verify a properly encoded UTF-8 or UTF-16 file with a byte-order-mark and
output it as UTF-8:
$ bom --strip --utf8 --expect UTF-8,UTF-16LE,UTF-16BE
To just remove any byte order mark and get on with your life:
$ bom --strip file
RETURN VALUES
bom exits with one of the following values:
0 Success.
1 A general error occurred.
2 The --expect flag was given but the detected BOM did not match.
3 An illegal byte sequence was detected (and --lenient was not speci-
fied).
SEE ALSO
iconv(1)
bom: Decode Unicode byte order mark, https://github.com/archiecobbs/bom.
Recently I found this tiny command-line tool which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at github)
A small drawback: you can download only the plain C++ source code. You have to create the makefile (with CMake, for example) and compile it yourself; binaries are not provided on that page.
I know it's been a while, but since I had a slightly different issue, I'm posting so others may benefit.
My text file was randomly haunted by \xfe\xff characters; luckily for me they appeared at the start of lines, and the set of allowed characters was limited to alphanumerics.
The command below in vim cuts the first non-alphanumeric character from each line, but use it with caution, as your set of allowed characters might vary:
:%s/^[^a-zA-Z0-9]//g
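A non-interactive equivalent can be sketched with GNU sed (an assumption; the original answer uses vim, and the same caution about the allowed-character set applies). Here '*' is used so that both leading bytes are dropped in one pass:

```shell
# Lines haunted by leading FE FF bytes, as in the answer above
printf '\xfe\xffabc123\n\xfe\xffdef\n' > haunted.txt

# Remove all leading non-alphanumeric bytes from each line, in place
LANG=C LC_ALL=C sed -i 's/^[^a-zA-Z0-9]*//' haunted.txt
```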
The answer posted by Smirk was a great hint about how to do this on a VERY OLD UNIX system that has ancient versions of vim, ex, iconv, piconv, etc. I did not want to restrict treatment to only alphanumeric non-BOM characters, so these patterns assume that two or three leading non-printable ASCII bytes on the first line only are the BOM characters to remove. A non-interactive method was also desired.
An ex commands file was created as follows:
" UTF-8 Byte-Order-Mark (BOM) characters
1,1g/^[^ -~][^ -~][^ -~][ -~]/s/^...//
" UTF-16LE, UTF-16 (Big Endian) BOM
" ex happens to strip unwanted NULs
1,1g/^[^ -~][^ -~][ -~]/s/^..//
To remove the BOM characters:
ex - file-w-BOM <excommands
To use interactively, just enter as a colon command in vim. For example:
:1,1g/^[^ -~][^ -~][^ -~][ -~]/s/^...//
NOTE: For some reason, the ex on my VERY OLD UNIX system just happened to remove the unwanted NUL bytes from UTF-16LE files in a way that didn't garble data that all cleanly corresponded with ASCII characters. This was fortunate since both iconv and piconv on the VERY OLD UNIX system were also unable to properly re-encode UTF-16LE as something else.
CAVEAT: The above is sure to BREAK files that contain multi-byte characters that do not map to plain ASCII, so the solution must only be used with this in mind.
Since the UTF-8 BOM is EF BB BF, one thing you could do is combine xxd and xxd -r to change those first three bytes to something within the printable ASCII range, like 41 41 41, so that "AAA" will appear in the BOM's place, which you can then simply delete and save with a regular text editor. Bit of a roundabout way, but it works. – Braden Best Aug 11 '21 at 23:29
xxd -p and xxd -p -r handily allow removal (or addition) of characters in the hex dump. On my system, however, I then had to reformat to the standard xxd -p line length for all but the last line in order to get xxd -p -r to work properly, so this made the process much less handy. – kbulgrien Aug 23 '23 at 23:10