How can I check if a UTF-8 text file has a BOM from command line?
The file command shows me:
UTF-8 Unicode text
but I don't know whether that means the file has no BOM.
I'm using Ubuntu 12.04.
file will tell you if there is a BOM. You can simply test it with:
printf '\ufeff...\n' | file -
/dev/stdin: UTF-8 Unicode (with BOM) text
Some shells such as ash or dash have a printf builtin that does not support \u, in which case you need to use printf from the GNU coreutils, e.g. /usr/bin/printf.
Note: according to the file changelog, this feature has existed since 2007, so it should work on any current machine.
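To check existing files rather than a test string, the same idea can be scripted by grepping file's output for the "(with BOM)" annotation. A sketch (assuming a reasonably recent version of file that prints this annotation):

```shell
# Report for every .txt file in the current directory whether
# file(1) detects a UTF-8 BOM.
for f in *.txt; do
    if file "$f" | grep -q 'with BOM'; then
        printf '%s: has BOM\n' "$f"
    else
        printf '%s: no BOM\n' "$f"
    fi
done
```

This only inspects file's description, so it inherits whatever heuristics your installed version of file uses.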
My file version is file-5.09 and the result was /dev/stdin: ASCII text. Does it depend on the version of file?
– ironsand Dec 01 '14 at 03:55
POSIX does not specify the \u escape sequence (at least, not yet). It specifies \ddd with a 3-digit octal number, so a portable version would be: printf '\357\273\277...\n' | file - (but it is rather difficult to remember).
– vinc17 Feb 20 '22 at 19:55
If you execute stat fileName, it reports the exact file size. When I opened the file in an editor, I was unable to see anything, so noticing that the file size was 3 bytes made it clear that it contains (only) a BOM.
Also, the post here was helpful in my case:
hexdump -n 3 -C 2.txt
00000000 ef bb bf
If the first three bytes are ef bb bf, the answer is yes: the file has a BOM.
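If hexdump isn't installed, od (which is part of POSIX and the coreutils) can do the same check; a sketch, assuming the same 2.txt file:

```shell
# Dump the first three bytes as hex and strip the spacing.
first3=$(od -An -tx1 -N3 2.txt | tr -d ' ')

if [ "$first3" = "efbbbf" ]; then
    echo 'YES: BOM present'
else
    echo 'no BOM'
fi
```

The exit-status-friendly test ([ "$first3" = "efbbbf" ]) also makes this easy to drop into scripts.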
Another variant is dos2unix:
$ dos2unix -ib *.txt
no-bom f1.txt # this file has no BOM
utf-8 f2.txt # this file has BOM + UTF-8
This command also has options to change the file format, such as adding or removing BOMs.
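For example (a sketch; check the --remove-bom and --add-bom options against your dos2unix man page, and note that dos2unix also converts CRLF line endings to LF as a side effect):

```shell
# Strip the UTF-8 BOM from f2.txt in place.
dos2unix --remove-bom f2.txt

# -ib should now report no-bom for it.
dos2unix -ib f2.txt

# Add the BOM back.
dos2unix --add-bom f2.txt
```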
Using file, as indicated in vinc17's answer, did not work on my machine1. Based on a previous answer by akshita007, and also identified as a solution to a similar question, I recommend checking the first three bytes of your file:
head -c3 [file] | hexdump -C
If you have a file with a BOM, the output should look something like this:
head -c3 file-with-bom | hexdump -C
00000000 ef bb bf |...|
00000003
Without a BOM, you won't see the EF BB BF bytes but the actual content of the file:
head -c3 file-without-bom | hexdump -C
00000000 22 49 64 |"Id|
00000003
Note: my test file has CSV content and starts with a quoted header, so it shows "Id as the first three bytes without a BOM.
1 Using file-5.41 in Ubuntu 22.04.3, I simply get CSV text for a file regardless of the BOM.
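The head/hexdump check can also be wrapped in a small shell function that compares the bytes directly, instead of eyeballing hex output (has_bom is a name I made up):

```shell
# Succeed (exit 0) if the file starts with the UTF-8 BOM ef bb bf.
has_bom() {
    [ "$(head -c3 "$1")" = "$(printf '\357\273\277')" ]
}

# Usage:
if has_bom file-with-bom; then echo 'BOM'; else echo 'no BOM'; fi
```

This works because command substitution preserves the three BOM bytes (none of them is NUL or a trailing newline), so a plain string comparison is enough.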
A UTF-8 file with a BOM starts with the 3 bytes 0xef 0xbb 0xbf, the UTF-8 encoding of the U+FEFF character.
You can find those files efficiently in bash by reading the first 3 bytes of the files:
find . -type f -size +2c -print0 |
while IFS= read -rd '' file; do
IFS= LC_ALL=C read -rd '' -n3 first3 < "$file" &&
[[ $first3 = $'\xef\xbb\xbf' ]] &&
printf '%s\n' "$file"
done
That doesn't check whether the rest of the file is valid UTF-8, but then neither does file; it's just a heuristic.
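A variant of the same loop (my own sketch, still bash-specific because of the process substitution) lets cmp do the byte comparison instead of the read builtin:

```shell
# List files whose first three bytes equal the UTF-8 BOM.
find . -type f -size +2c -print0 |
while IFS= read -rd '' file; do
    head -c3 "$file" | cmp -s - <(printf '\357\273\277') &&
        printf '%s\n' "$file"
done
```

cmp -s compares the two streams silently and returns success only on an exact match, so the loop prints exactly the BOM-prefixed files.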