
Possible Duplicate:
Match word containing characters beyond a-zA-Z

I do not understand vim's definition of a word. From the help for the motion w (:h w):

w [count] words forward. |exclusive| motion. These commands move over words or WORDS.

   *word*

A word consists of a sequence of letters, digits and underscores, or a sequence of other non-blank characters, separated with white space (spaces, tabs, <EOL>). This can be changed with the 'iskeyword' option.

This means that when I invoke the w motion, vim uses the iskeyword option to decide which characters can make up a word. So let's check which characters a word may consist of:

:set iskeyword?
iskeyword=@,48-57,_,192-255

Let's test this with a character that is not listed in the iskeyword option, e.g. U+015B LATIN SMALL LETTER S WITH ACUTE. Pressing ga on ś tells us that it has the decimal value 347, which is larger than 255 and thus outside the ranges in iskeyword. The cursor is placed on the t of treść and I press w:

treść bar
^ (cursor)

The result:

treść bar
      ^ (cursor)

If a word can be comprised of letters, digits, underscores and other characters, the only possibility is that vim treats the ś as a letter, since it's obviously not a digit or an underscore. Let's check how to find out if a character is a letter. From :h :alpha::

The following character classes are supported: [:alpha:] letters

A test with

/[[:alpha:]]

shows that ś is not considered to be a letter.
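
As a cross-check outside vim, the following minimal Python sketch (not part of the original question) reproduces the value reported by ga and asks Python's bundled Unicode database how it classifies ś. It says nothing about vim's internals. It only suggests that the result above comes from how vim evaluates [:alpha:], not from the character itself.

```python
import unicodedata

ch = "\u015b"  # ś, LATIN SMALL LETTER S WITH ACUTE

code_point = ord(ch)                 # decimal value, comparable to what ga shows
name = unicodedata.name(ch)          # official Unicode character name
category = unicodedata.category(ch)  # "Ll" means "Letter, lowercase"
utf8_len = len(ch.encode("utf-8"))   # number of bytes in the UTF-8 encoding

print(code_point, name, category, utf8_len)
# prints: 347 LATIN SMALL LETTER S WITH ACUTE Ll 2
```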

Why did the cursor jump to the b if ś is neither a letter, nor a digit, nor an underscore and not listed in iskeyword?

Tested on VIM - Vi IMproved 7.3 (2010 Aug 15, compiled Dec 27 2012 21:21:18) Included patches: 1-762 on Debian GNU/Linux with locale set to en_GB.UTF-8.

Marco
  • See http://stackoverflow.com/a/5972345/832273. In short, all multibyte characters are included by default. – Ulrich Dangel Jan 08 '13 at 11:34
  • I don't see why this deserves to be a separate question, Marco. It covers the same ground as in your previous question. Granted, it broadens the discussion, but the core fact remains that Vim doesn't yet fully support Unicode, so the concepts of "letter" and "word" are incomplete. – Warren Young Jan 08 '13 at 17:48
  • I agree this question developed from the former one. However, I think that why a regex does not match and what counts as a word are two distinct questions with two different answers. For the first question the answer is “Vim has trouble with Unicode when it comes to regular expressions”; the answer to the second one is “multi-byte characters by default belong to a word”. – Marco Jan 08 '13 at 18:47

1 Answer


As Ulrich mentioned in his comment, the reason is that multi-byte characters are always considered to be part of a word. They don't need to be specified in iskeyword. To quote :h isfname, to which :h iskeyword points:

Multi-byte characters 256 and above are always included, only the characters up to 255 are specified with this option. For UTF-8 the characters 0xa0 to 0xff are included as well.
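
The documented rule can be sketched as a small Python model. This is an illustration of the help text only, not vim's actual implementation. The function names are hypothetical, '@' is assumed to mean ASCII letters, only numeric ranges are parsed, and only runs of keyword characters are collected as words. The comments below report that real behaviour differs from the documentation for some characters.

```python
def parse_iskeyword(value):
    """Return the code points below 256 listed in an 'iskeyword' value.

    Simplified: handles '@', single numbers, numeric 'N-M' ranges and
    literal characters; '^' exclusions and '@-@' are not handled.
    """
    allowed = set()
    for part in value.split(","):
        if part == "@":
            # Assumption: '@' is modelled as ASCII letters only.
            allowed.update(range(ord("A"), ord("Z") + 1))
            allowed.update(range(ord("a"), ord("z") + 1))
        elif "-" in part and len(part) > 1:
            low, high = part.split("-")
            allowed.update(range(int(low), int(high) + 1))
        elif part.isdigit():
            allowed.add(int(part))
        elif len(part) == 1:
            allowed.add(ord(part))
    return allowed


def is_keyword(ch, allowed):
    """Apply the documented rule to a single character."""
    code = ord(ch)
    # Documented rule: characters 256 and above are always included.
    return code >= 256 or code in allowed


def split_words(text, allowed):
    """Group consecutive keyword characters into words (simplified)."""
    words, current = [], ""
    for ch in text:
        if is_keyword(ch, allowed):
            current += ch
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


allowed = parse_iskeyword("@,48-57,_,192-255")
print(is_keyword("ś", allowed))           # True: 347 >= 256
print(split_words("treść bar", allowed))  # ['treść', 'bar']
```

Under this model treść is a single word even though 347 is not in any listed range, which matches the w motion jumping straight to the b of bar.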

Marco
  • Experimentally, the documentation is incorrect. I haven't checked the source. – Gilles 'SO- stop being evil' Jan 08 '13 at 22:28
  • @Gilles I'm still figuring out how vim treats different characters and why some characters are treated differently from others while seemingly being in the same range. Incorrect documentation makes understanding even harder. For me the behaviour is inconsistent and often surprising, but that might just be my lack of understanding so far. What exactly is incorrect, that × and ÷ are matched with \I or \k? That's of limited use, but it's what the documentation states, they are multi-byte characters. – Marco Jan 08 '13 at 22:51
  • In UTF-8 on my system (with iskeyword=@,48-57,_,192-255, the default value): µ is a word constituent, but no other character in the range 160–191. Characters in the range U+00c0..U+037d are word constituents, but not ; (U+037e), and so on: evidently vim has some knowledge of letterhood in Unicode. – Gilles 'SO- stop being evil' Jan 08 '13 at 23:07
  • @Gilles Same here, that's what I meant by my comment above. For the w motion this behaviour is practical for my purposes, but when it comes to regular expressions it's unpredictable. Run \k, \I, \w and [:alpha:] on that range and you'll see what I mean. Thanks anyway for taking the time to look into that. – Marco Jan 08 '13 at 23:32