grep's idea of a character is locale-dependent. If you're in a non-Unicode locale and you grep a file with Unicode characters in it, the character counts won't match up, because each byte of a multi-byte character is counted separately. If you echo $LANG then you'll see the locale you're in.
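LANG is only one of the variables involved, so here is a minimal sketch of how to see which one actually decides character handling. It assumes the usual POSIX precedence (LC_ALL over LC_CTYPE over LANG); effective_ctype is a hypothetical helper name, not a standard command.

```shell
# Hypothetical helper: print the value that decides character handling.
# Assumed POSIX precedence: LC_ALL beats LC_CTYPE, which beats LANG.
# Empty values count as unset. The final "C" fallback is an assumption;
# the real default is implementation-defined, though usually C/POSIX.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

effective=$(effective_ctype)
printf 'character locale in effect: %s\n' "$effective"

# If the locale utility is available, ask it for the codeset in use.
if command -v locale >/dev/null 2>&1; then
    charmap=$(locale charmap 2>/dev/null) || charmap=
    printf 'codeset: %s\n' "${charmap:-unknown}"
fi
```

If the codeset line does not say UTF-8, grep will most likely be counting bytes rather than characters.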
If you set the LC_CTYPE and/or LANG environment variables to a value ending with ".UTF-8" then you will get the right behaviour:
$ cat data
étuis
letter
éééééé
$ LANG=C grep -E '^.{6}$' data
étuis
letter
$ LANG=en_US.UTF-8 grep -E '^.{6}$' data
letter
éééééé
$
You can change your locale for just a single command by assigning the variable on the same line as the command.
With this configuration, multi-byte characters are considered as single characters. If you want to exclude non-ASCII characters entirely, some of the other answers have solutions for you.
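To illustrate the per-command assignment mentioned above, this small sketch checks that a variable set on the same line as a command reaches only that command and leaves the calling shell untouched. The variable names before, inside and after are just for illustration.

```shell
# Remember what LANG looks like in the current shell (or "unset").
before=${LANG-unset}

# The assignment on the same line is exported to this one child process only.
inside=$(LANG=C sh -c 'printf %s "$LANG"')

# Back in the calling shell, LANG should be exactly what it was.
after=${LANG-unset}

printf 'child saw LANG=%s\n' "$inside"
if [ "$before" = "$after" ]; then
    echo "calling shell's LANG is unchanged"
fi
```

The same form works with LC_CTYPE or LC_ALL in place of LANG.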
Note that it's still possible for things to break, or at least not do exactly what you expect, in the presence of combining characters. Your grep may treat LATIN SMALL LETTER E + COMBINING ACUTE ACCENT differently than LATIN SMALL LETTER E WITH ACUTE.
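As a rough sketch of why the two forms can behave differently, the snippet below builds both spellings from raw UTF-8 bytes and compares them. It assumes at least one UTF-8 locale is installed for the grep part and skips that part otherwise. What grep reports there depends on your implementation, so no expected match counts are shown.

```shell
# Precomposed é (U+00E9) versus e followed by U+0301 (combining acute),
# written as octal byte escapes so the script itself stays pure ASCII.
composed=$(printf '\303\251')
decomposed=$(printf 'e\314\201')

# The two strings look alike on screen but differ in length in bytes.
composed_bytes=$(printf %s "$composed" | wc -c | tr -d ' ')
decomposed_bytes=$(printf %s "$decomposed" | wc -c | tr -d ' ')
echo "composed: $composed_bytes bytes, decomposed: $decomposed_bytes bytes"

# Assumption: some UTF-8 locale is installed; pick the first one listed.
utf8=$(locale -a 2>/dev/null | grep -i -E 'utf-?8' | head -n 1) || utf8=
if [ -n "$utf8" ]; then
    # Count lines consisting of exactly one character in that locale.
    printf '%s\n' "$composed"   | LC_ALL=$utf8 grep -c -E '^.$' || true
    printf '%s\n' "$decomposed" | LC_ALL=$utf8 grep -c -E '^.$' || true
fi
```

If the two counts differ on your system, your grep is counting the combining mark as a character of its own, and a fixed-width pattern like '^.{6}$' will see the decomposed form as longer.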
[...], something like wăsd's will match. – cuonglm Jul 31 '14 at 05:07
' is a character that can reasonably be part of a "string with a fixed number of characters". – Michael Homer Jul 31 '14 at 05:19
[...] LC_CTYPE and LANG, something like LC_CTYPE=en_US.UTF-8 LANG=en_US will fail. Use LC_ALL for safety. – cuonglm Jul 31 '14 at 05:41