
How can I check if a UTF-8 text file has a BOM from command line?

file command shows me:

UTF-8 Unicode text

But I don't know whether that means the file has no BOM.

I'm using Ubuntu 12.04.

ironsand

5 Answers


file will tell you if there is a BOM. You can simply test it with:

printf '\ufeff...\n' | file -
/dev/stdin: UTF-8 Unicode (with BOM) text

Some shells such as ash or dash have a printf builtin that does not support \u, in which case you need to use printf from the GNU coreutils, e.g. /usr/bin/printf.


Note: according to the file changelog, this feature existed already in 2007. So, this should work on any current machine.
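Putting this together as a self-contained check (a sketch: the file name is made up, and the octal escapes \357\273\277 — the UTF-8 encoding of U+FEFF — are used so it also works with printf implementations that lack \u):

```shell
# Write a throwaway file that starts with the UTF-8 BOM.
# \357\273\277 is the octal form of the bytes 0xef 0xbb 0xbf (U+FEFF in UTF-8).
printf '\357\273\277hello\n' > /tmp/bom-check.txt

# A sufficiently recent file (the feature dates from 2007) should mention
# "(with BOM)" in its output:
if command -v file >/dev/null 2>&1; then
  file /tmp/bom-check.txt
fi
```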

vinc17
  • Thanks for the answer. My file version is file-5.09 and the result was /dev/stdin: ASCII text. Does it depend on the version of file? – ironsand Dec 01 '14 at 03:55
  • Thank you for that addition! I am using POSIX printf and completely missed that, sorry. Cheers. – Vlastimil Burián Feb 20 '22 at 11:57
  • 2
    @LinuxSecurityFreak POSIX does not specify the \u escape sequence (at least, not yet). It specifies \ddd with a 3-digit octal number, so that a portable version could be: printf '\357\273\277...\n' | file - (but it is rather difficult to remember). – vinc17 Feb 20 '22 at 19:55

If you execute stat fileName, it shows the file's size. When I opened the file in an editor I was unable to see anything, so noticing that the file size was exactly 3 bytes made it clear that the file contained only a BOM.

Also, the post here was helpful in my case.

hexdump -n 3 -C 2.txt
00000000 ef bb bf
If the first three bytes are ef bb bf, the file has a BOM.
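The same byte check can be wrapped into a small yes/no function (a sketch using POSIX od in place of hexdump, with made-up file names under /tmp):

```shell
# Succeeds iff the file's first three bytes are the UTF-8 BOM (ef bb bf)
has_bom() {
  [ "$(head -c3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Throwaway test files: one with a BOM, one without
printf '\357\273\277data\n' > /tmp/with-bom.txt
printf 'data\n'             > /tmp/without-bom.txt

has_bom /tmp/with-bom.txt    && echo 'with-bom.txt: YES'
has_bom /tmp/without-bom.txt || echo 'without-bom.txt: NO'
```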

Another variant -- dos2unix:

$ dos2unix -ib   *.txt
  no-bom f1.txt                 # this file has no BOM
  utf-8  f2.txt                 # this file has BOM + UTF-8

This command has options to change the file format, such as adding or removing BOMs.
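For example (a sketch: the file name is made up, and the -m/--add-bom and -r/--remove-bom flags exist in dos2unix 6.x and later — check your version):

```shell
printf 'hello\n' > /tmp/dos2unix-demo.txt

if command -v dos2unix >/dev/null 2>&1; then
  dos2unix -m  /tmp/dos2unix-demo.txt   # --add-bom: prepend the UTF-8 BOM
  dos2unix -ib /tmp/dos2unix-demo.txt   # should now report: utf-8
  dos2unix -r  /tmp/dos2unix-demo.txt   # --remove-bom: strip it again
  dos2unix -ib /tmp/dos2unix-demo.txt   # should now report: no-bom
fi
```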

JJoao

Using file, as indicated in vinc17's answer, did not work on my machine1. Based on a previous answer by akshita007, and also identified as a solution to a similar question, I recommend checking the first three bytes of your file:

head -c3 [file] | hexdump -C

If you have a file with a BOM, the output should look something like this:

head -c3 file-with-bom | hexdump -C
00000000  ef bb bf                                          |...|
00000003

Without a BOM, you won't see the EF BB BF bytes but the actual content of the file:

head -c3 file-without-bom | hexdump -C
00000000  22 49 64                                          |"Id|
00000003

Note: My test file has CSV content and starts with a quoted header, so it shows "Id as the first three bytes without a BOM.


1 Using file-5.41 in Ubuntu 22.04.3, I simply get CSV text for a file regardless of the BOM.

Kariem

A UTF-8 file with a BOM starts with the 3 bytes 0xef 0xbb 0xbf, the UTF-8 encoding of the U+FEFF character.

You can find those files efficiently in bash by reading the first 3 bytes of the files:

find . -type f -size +2c -print0 |
  while IFS= read -rd '' file; do
    IFS= LC_ALL=C read -rd '' -n3 first3 < "$file" &&
      [[ $first3 = $'\xef\xbb\xbf' ]] &&
      printf '%s\n' "$file"
  done

That doesn't check whether the rest of the file is valid UTF-8, but then neither does file; it's just a heuristic.
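To try the loop out, a throwaway directory can be used (a sketch; only the BOM'd file should be printed):

```shell
#!/usr/bin/env bash
# Throwaway test directory with one BOM'd file and one plain file
dir=$(mktemp -d)
printf '\357\273\277with bom\n' > "$dir/a.txt"
printf 'plain text here\n'      > "$dir/b.txt"

# The loop from the answer, run against the test directory
find "$dir" -type f -size +2c -print0 |
  while IFS= read -rd '' file; do
    IFS= LC_ALL=C read -rd '' -n3 first3 < "$file" &&
      [[ $first3 = $'\xef\xbb\xbf' ]] &&
      printf '%s\n' "$file"
  done
```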