
How can I check if a UTF-8 text file has a BOM from command line?

file command shows me:

UTF-8 Unicode text

But I don't know whether that means the file has no BOM.

I'm using Ubuntu 12.04.

ironsand

5 Answers


file will tell you if there is a BOM. You can simply test it with:

printf '\ufeff...\n' | file -
/dev/stdin: UTF-8 Unicode (with BOM) text

Some shells such as ash or dash have a printf builtin that does not support \u, in which case you need to use printf from the GNU coreutils, e.g. /usr/bin/printf.


Note: according to the file changelog, this feature existed already in 2007. So, this should work on any current machine.
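Putting this together as a self-contained check (a sketch: the file name is made up, and the octal escapes \357\273\277 — the UTF-8 encoding of U+FEFF — are used so it also works with printf implementations that lack \u):

```shell
# Write a throwaway file that starts with the UTF-8 BOM.
# \357\273\277 is the octal form of the bytes 0xef 0xbb 0xbf (U+FEFF in UTF-8).
printf '\357\273\277hello\n' > /tmp/bom-check.txt

# A sufficiently recent file (the feature dates from 2007) should mention
# "(with BOM)" in its output:
if command -v file >/dev/null 2>&1; then
  file /tmp/bom-check.txt
fi
```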

vinc17
  • Thanks for the answer. My file version is file-5.09 and the result was /dev/stdin: ASCII text. Does it depend on the version of file? – ironsand Dec 01 '14 at 03:55
  • Thank you for that addition! I am using POSIX printf and completely missed that, sorry. Cheers. – Vlastimil Burián Feb 20 '22 at 11:57
  • 2
    @LinuxSecurityFreak POSIX does not specify the \u escape sequence (at least, not yet). It specifies \ddd with a 3-digit octal number, so that a portable version could be: printf '\357\273\277...\n' | file - (but it is rather difficult to remember). – vinc17 Feb 20 '22 at 19:55

If you execute stat fileName, it shows the file's size. When I opened the file in an editor I was unable to see anything, so noticing that the file size was exactly 3 bytes made it clear that the file contained only a BOM.

Also, the post here was helpful in my case.

hexdump -n 3 -C 2.txt
00000000 ef bb bf
If the first three bytes are ef bb bf, the file has a BOM.
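The same byte check can be wrapped into a small yes/no function (a sketch using POSIX od in place of hexdump, with made-up file names under /tmp):

```shell
# Succeeds iff the file's first three bytes are the UTF-8 BOM (ef bb bf)
has_bom() {
  [ "$(head -c3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Throwaway test files: one with a BOM, one without
printf '\357\273\277data\n' > /tmp/with-bom.txt
printf 'data\n'             > /tmp/without-bom.txt

has_bom /tmp/with-bom.txt    && echo 'with-bom.txt: YES'
has_bom /tmp/without-bom.txt || echo 'without-bom.txt: NO'
```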

Another variant -- dos2unix:

$ dos2unix -ib   *.txt
  no-bom f1.txt                 # this file has no BOM
  utf-8  f2.txt                 # this file has BOM + UTF-8

This command has options to change the file format, such as adding or removing BOMs.
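For example (a sketch: the file name is made up, and the -m/--add-bom and -r/--remove-bom flags exist in dos2unix 6.x and later — check your version):

```shell
printf 'hello\n' > /tmp/dos2unix-demo.txt

if command -v dos2unix >/dev/null 2>&1; then
  dos2unix -m  /tmp/dos2unix-demo.txt   # --add-bom: prepend the UTF-8 BOM
  dos2unix -ib /tmp/dos2unix-demo.txt   # should now report: utf-8
  dos2unix -r  /tmp/dos2unix-demo.txt   # --remove-bom: strip it again
  dos2unix -ib /tmp/dos2unix-demo.txt   # should now report: no-bom
fi
```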

JJoao

Using file, as indicated in vinc17's answer, did not work on my machine1. Based on a previous answer by akshita007, and also identified as a solution to a similar question, I recommend checking the first three bytes of your file:

head -c3 [file] | hexdump -C

If you have a file with a BOM, the output should look something like this:

head -c3 file-with-bom | hexdump -C
00000000  ef bb bf                                          |...|
00000003

Without a BOM, you won't see the EF BB BF bytes but the actual content of the file:

head -c3 file-without-bom | hexdump -C
00000000  22 49 64                                          |"Id|
00000003

Note: My test file has CSV content and starts with a quoted header, so it shows "Id as the first three bytes without a BOM.


1 Using file-5.41 in Ubuntu 22.04.3, I simply get CSV text for a file regardless of the BOM.

Kariem

A UTF-8 file with a BOM starts with the 3 bytes 0xef 0xbb 0xbf, the UTF-8 encoding of the U+FEFF character.

You can find those files efficiently in bash by reading the first 3 bytes of the files:

find . -type f -size +2c -print0 |
  while IFS= read -rd '' file; do
    IFS= LC_ALL=C read -rd '' -n3 first3 < "$file" &&
      [[ $first3 = $'\xef\xbb\xbf' ]] &&
      printf '%s\n' "$file"
  done

That doesn't check whether the rest of the file is valid UTF-8, but then neither does file; it's just a heuristic.
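To try the loop out, a throwaway directory can be used (a sketch; only the BOM'd file should be printed):

```shell
#!/usr/bin/env bash
# Throwaway test directory with one BOM'd file and one plain file
dir=$(mktemp -d)
printf '\357\273\277with bom\n' > "$dir/a.txt"
printf 'plain text here\n'      > "$dir/b.txt"

# The loop from the answer, run against the test directory
find "$dir" -type f -size +2c -print0 |
  while IFS= read -rd '' file; do
    IFS= LC_ALL=C read -rd '' -n3 first3 < "$file" &&
      [[ $first3 = $'\xef\xbb\xbf' ]] &&
      printf '%s\n' "$file"
  done
```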