How can an unquoted metacharacter be part of a token?

Question

I was looking through the bash man page after reading through some of @Tim's questions about shell grammar, and I came up with a (simple) question of my own.

Here is an excerpt from man bash (see it yourself with LESS=+/^DEFINITIONS man bash):

   word   A  sequence  of  characters  considered  as a single unit by the
          shell.  Also known as a token.
   ...
   metacharacter
          A character that, when unquoted, separates words.   One  of  the
          following:
          |  & ; ( ) < > space tab
   control operator
          A token that performs a control function.  It is one of the fol-
          lowing symbols:
          || & && ; ;; ( ) | <newline>

But here's the circle I'm not getting:

; is a "metacharacter."
A "metacharacter", when unquoted, separates "words."
"Token" is another term we can use for "word."
;; is a token.
Therefore ;; is a word.

But, that means it's a word consisting of two word separators. Given that it's unquoted and doesn't require blanks around it to be recognized (or does it??), how is this possible?

If you're curious, the other questions about shell grammar I was reading through are these:

I wouldn't read too much into the manpage. In general, the parsing is done in such a way that interpreting ;; as a single token (SEMI_SEMI) takes precedence over interpreting it as two. If you're really curious about how it's done, it's all in http://www.opensource.apple.com/source/bash/bash-44.5/bash/parse.y . My mildly damaged brain doesn't find it to be a pleasant read. – Petr Skocik, Mar 18 '16 at 22:06

Greg Tarsa · Answer 1 · 2016-03-18T22:11:22.130

2

bash parses tokens that are generated by a lexical analyzer. When bash is breaking lines into words, it is probably working with characters; when it is parsing command syntax, it is working with tokens. In that context, ;; is not two separate ";" characters but a single token made up of two ";" (semicolon) characters. The lexical analyzer reads the incoming character stream in a way that lets it identify groups of characters as tokens, so the parsing code doesn't actually see semicolons; it sees token codes.
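One rough way to watch that grouping from the outside is to feed bash a deliberately malformed line and look at which token the syntax error names. This is only a sketch: it assumes bash is on the PATH, forces the C locale so the diagnostic is in English, and uses the made-up variable names adjacent and spaced.

```shell
# Ask a child bash to parse two malformed command lines and keep the
# diagnostics. Only the token named in each message matters here.
adjacent=$(LC_ALL=C bash -c 'echo a;;' 2>&1)   # semicolons side by side
spaced=$(LC_ALL=C bash -c 'echo a; ;' 2>&1)    # semicolons split by a blank

printf '%s\n' "$adjacent"   # the error names the single token ;;
printf '%s\n' "$spaced"     # the error names the single token ;
```

With the two semicolons adjacent, the complaint is about one ;; token; with a blank between them, it is about a lone ; token. Neither line runs echo, because the whole line fails to parse first.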

See the flex and bison tools for a glimpse into this. I don't believe bash uses these, but they are tools used for similar parsing applications and there is overview material on how parsing is typically done.

The GNU Bash Reference Manual is a good document to read: well written, and a bit more explanatory than the man page.

edited Mar 18 '16 at 22:11

answered Mar 18 '16 at 22:06


Bash is actually one of the few widespread languages that does use bison. – Petr Skocik Mar 18 '16 at 22:09
@PSkocik: what do most languages use? – Tim Mar 18 '16 at 22:42
@Tim Most common languages use hand-written parsers. Or so I heard. Being able to use yacc/bison is probably a good sign, as it implies your grammar has characteristics that make it parseable fast. – Petr Skocik Mar 18 '16 at 22:56

Thomas Dickey · Accepted Answer · 2016-03-18T22:12:52.847

Bash is using the same terminology as POSIX (no surprise). Use that for comparison (and occasionally clarification).

Quoting from Definitions

3.113 Control Operator

In the shell command language, a token that performs a control function. It is one of the following symbols:

&   &&   (   )   ;   ;;   newline   |   ||

The end-of-input indicator used internally by the shell is also considered a control operator.

Note: Token Recognition is defined in detail in XCU Token Recognition.

3.407 Token

In the shell command language, a sequence of characters that the shell considers as a single unit when reading input. A token is either an operator or a word.

Note: The rules for reading input are defined in detail in XCU Token Recognition.

3.440 Word

In the shell command language, a token other than an operator. In some cases a word is also a portion of a word token: in the various forms of parameter expansion, such as ${name-word}, and variable assignment, such as name=word, the word is the portion of the token depicted by word. The concept of a word is no longer applicable following word expansions-only fields remain.

Note: For further information, see XCU Parameter Expansion and wordexp.

So you see, there is a distinction between "word" and "token"; they are not synonymous, as the question implies. Moreover, the standard does not treat the two semicolons as separate characters, but as a single unit.
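The operator/word split is easy to poke at from a prompt. As a small sketch (assuming bash is on the PATH; the variable names quoted and escaped are just for illustration), quoting the same two characters stops them from being recognized as an operator, and they reach echo as an ordinary word:

```shell
# Quoted or backslash-escaped, the semicolons are plain characters inside
# a word, so echo receives them as its argument.
quoted=$(bash -c "echo ';;'")
escaped=$(bash -c 'echo \;\;')

printf '%s\n' "$quoted" "$escaped"   # each line printed is ;;
```

Unquoted, the same characters would be read as the ;; operator and, outside a case statement, rejected as a syntax error.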

I am pretty sure that is the other way around: "POSIX is using the same terminology as Bash (no surprise)". As POSIX builds on existing implementations (no surprise there also). — , Mar 18 '16 at 22:12
It says: "A token is either an operator or a word", or, again: "A token is ... a word", or, if you want to be more precise: "A word (is always) a token" (my words) confirmed by: "word: a token other than an operator". – , Mar 18 '16 at 22:16
The distinction between "operator", "word" and "token" is crucial and that's what really answers it. Thank you. — Wildcard, Mar 18 '16 at 22:53

score 1 · Answer 3 · answered Mar 18 '16 at 22:10

Yes, ;; is a word. But it is not two metacharacters together.
It ends a clause of a case statement:

case a in 
    [a-z]) echo "yes" ;;
esac

Or as a one-liner:

case a in [a-z]) echo "yes" ;; esac

And, yes, it is separated from the "yes" by a space, so it is a word.
It doesn't have to be though:

case a in [a-z]) echo "yes";; esac

Yes, the wording of very specific issues may be confusing sometimes.
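If you want to check that the blanks really are optional, here is a quick sketch that hands each spelling to a child bash and collects what it prints. It assumes bash is on the PATH; the variable names script, result and all are made up for the example, and the last spelling is the compact one from the comment below.

```shell
# Run each spelling of the same case statement in its own child bash.
all=""
for script in \
    'case a in [a-z]) echo "yes" ;; esac' \
    'case a in [a-z]) echo "yes";; esac' \
    'case a in([a-z])echo "yes";;esac'
do
    result=$(bash -c "$script")
    printf '%-40s -> %s\n' "$script" "$result"
    all="$all$result,"   # keep every result for a final comparison
done
```

Each spelling should print yes, with or without blanks around the ;; operator.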

You can take it further, actually: case a in([a-z])echo "yes";;esac – Wildcard Mar 18 '16 at 22:51
