
I have two files encoded in UTF-8 with/without BOM:

/tmp/bom$ ls
list.bom.txt  list.nobom.txt
/tmp/bom$ cat list.nobom.txt 
apple
banana
avocado
寿司
melon
/tmp/bom$ diff list.nobom.txt list.bom.txt 
1c1
< apple
---
> apple
/tmp/bom$ file list.nobom.txt list.bom.txt 
list.nobom.txt: UTF-8 Unicode text
list.bom.txt:   UTF-8 Unicode (with BOM) text

The only difference between the two files is the leading BOM, the three bytes EF BB BF.
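
For example, dumping the first three bytes of each file (e.g. with head and od) should make the difference visible:

    head -c 3 list.nobom.txt | od -An -tx1    # 61 70 70  ("app")
    head -c 3 list.bom.txt   | od -An -tx1    # ef bb bf  (the BOM)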

Then, in order to filter the lines that begin with 'a', I write a short awk script using a caret.

/tmp/bom$ gawk '/^a.*/' list.nobom.txt
apple
avocado
/tmp/bom$ gawk '/^a.*/' list.bom.txt
avocado

Unfortunately, with the leading BOM, 'apple' on the first line is not matched.

Therefore, my question is: Is there any way to handle this?

I consider three solutions:

  1. Write BOM bytes directly. For example,

    gawk 'BEGIN { pat = "^(\xef\xbb\xbf)?a.*" } $0 ~ pat { print }'
    

    works in UTF-8. However, it doesn't handle other encodings. Moreover, if U+FEFF is used as a Zero Width No-Break Space (see the comments), the script above fails in some cases.

  2. Delete BOM bytes by re-encoding with nkf. For example,

    nkf --oc=UTF-8 list.bom.txt | gawk '/^a.*/'
    

    works. However, I wonder if there is a more sophisticated way.

  3. [ADDED] This is an improvement of the first one, using a bash feature.

    gawk -v bom="$(echo -e '\uFEFF')" '
        NR == 1 {
            pat = "^" bom;
            sub(pat, "")
        }
        /^a.*/ {
            print
        }
    '
    

    This works for UTF-8 both with and without a BOM. However, it doesn't work for UTF-16 in my environment (a possible workaround is sketched after this list), so the second solution is better.
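
For the UTF-16 case mentioned in the third solution, one possible workaround is to normalize the input to UTF-8 first and only then filter. This is just a sketch: list.utf16.txt is a hypothetical file name, and it assumes an iconv (such as glibc's) that uses the BOM of plain "UTF-16" input for byte-order detection and drops it; if your iconv keeps U+FEFF, the first-line sub() from the third solution can still be applied after the conversion.

    # Sketch: convert (possibly UTF-16) input to UTF-8, then filter.
    # "list.utf16.txt" is a hypothetical example file, not one of the files above.
    iconv -f UTF-16 -t UTF-8 list.utf16.txt | gawk '/^a.*/'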

Moreover, I think this is also a problem for grep, sed, and other tools that use regular-expression matching, so a general solution would be even more appreciated.
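
One generic pattern along these lines is to strip a leading UTF-8 BOM once, up front, and then pipe the cleaned stream to whatever tool does the matching. The following is only a sketch: it assumes GNU sed (for the \xEF-style hex escapes), and strip_bom is just a name chosen here for illustration.

    # Sketch of a generic front end: delete a UTF-8 BOM from the start of the
    # first line, then hand the stream to any regex-based tool unchanged.
    strip_bom() {
        sed '1s/^\xEF\xBB\xBF//' "$@"
    }

    strip_bom list.bom.txt | gawk '/^a.*/'
    strip_bom list.bom.txt | grep '^a'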

  • The file with the BOM does not start with 'a', it starts with the BOM (also known as the Unicode character Zero Width No-Break Space). The files do not have the same content, just as a file starting with a regular space character differs from a file starting with the character 'a'. – Johan Myréen Jan 20 '17 at 11:57
  • @JohanMyréen Thanks for the comment! However, U+FEFF (BOM, or Zero Width No-Break Space) is now neither a control nor a graphic character, according to the Unicode Standard, Version 9.0.0, Section 23.8 "Specials". This section also says "(Except for compatibility,) U+FEFF is not used with the semantics of zero width no-break space." And I think this time it's not a compatibility matter. (Is this correct?) – nekketsuuu Jan 20 '17 at 12:30
  • Anyway, I know backward compatibility is very important, so it's OK that the beginning 'a' of the file with BOM is not matched in the intuitive way. But I want to know an option or something for this problem, if it exists. – nekketsuuu Jan 20 '17 at 12:30
  • Yes, I am aware that the definition changed and U+FEFF is now only used for the BOM. I find it unfortunate that the BOM has found its way to UTF-8, where it is not needed for its original purpose (because UTF-8 does not have a byte order). Using a BOM for UTF-8 files is just asking for trouble, as this question has shown. My recommendation is not to use a BOM for UTF-8 at all. You can't use it to decide whether a file is encoded in UTF-8 or not, because there are UTF-8 files without a BOM anyway. The Unicode standard recommends not using a BOM for UTF-8 files. – Johan Myréen Jan 20 '17 at 14:39
  • I think so too.... (But then why does Visual Studio use UTF-8 with BOM by default?) So my second solution (converting everything to UTF-8 without BOM) is recommended? – nekketsuuu Jan 20 '17 at 16:20
  • @JohanMyréen If you think so, could you write it as an answer please? – nekketsuuu Jan 20 '17 at 16:57
  • Visual Studio uses a BOM in UTF-8 because it is Microsoft, i.e. because it is non-standard and breaks things. – ctrl-alt-delor Jan 20 '17 at 17:13
  • Use a BOM stripper. – ctrl-alt-delor Jan 20 '17 at 17:15
  • With just awk, since it was already your tool, This answer is on top: https://stackoverflow.com/questions/1068650/using-awk-to-remove-the-byte-order-mark/1068700#1068700 – Sandburg Oct 19 '22 at 15:00

1 Answer


A BOM doesn't make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.

dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.

dos2unix < file.win.txt | awk ...
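
As a quick check, assuming a dos2unix recent enough to strip the UTF-8 BOM by default, the result can be verified with file and the original filter; on the sample files above this should report plain UTF-8 text and match both lines starting with 'a':

    dos2unix < list.bom.txt | file -          # should report "UTF-8 Unicode text", no BOM
    dos2unix < list.bom.txt | gawk '/^a.*/'   # should now print both "apple" and "avocado"
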
  • It may be that the BOM is not official UTF-8, but there are some use cases where it is handy. For example, if you are creating a CSV file to be read by Excel, getting Excel to import the file as UTF-8 requires jumping through a few hoops that you might not realize until you see some malformed text, perhaps after having done some other work on the file. If the file has a BOM, Excel behaves as you'd like. – Mike Gleen Apr 06 '20 at 15:10
  • The first sentence is a point of view. Two encodings of UTF-8 exist, one with a BOM and one without; that's a fact. awk should "resist" and adapt to both cases. Adding a BOM in UTF-8 is a way to force interpreters that are blind to the encoding, or that default to code page 1250, to change their behaviour. (For example, VLC interprets subtitles as 1250 by default... maybe because it's French.) – Sandburg Oct 19 '22 at 13:22