How to match a particular form of optional sequence from a manual synopsis, including variations?

Question

In this Q&A there is a reference to the manpage synopses being based "loosely" on the Extended Backus–Naur Form of metasyntax notation. It's interesting and serves as background. That said, in related terminology, one of the most common types of element you will find in a command synopsis is the optional sequence: a definitions-list enclosed between a start-option-symbol and an end-option-symbol. In other words, something we usually write as [ option ], where the option could for instance be a single-dash or a longer double-dash form followed by one or more characters, as in ps --help.

So I'd like to match a common optional sequence pattern we often see in the manuals which indeed:

Starts with [ and ends with ]
Contains an optional sequence in the form of -option or --option
Is not necessarily centered inside the brackets, e.g. [-a], [ -ab], [-abc ] all match
Allows for a list containing an option and its optional element/specifier, e.g. [-a foo -b bar -c=biz end]
Allows other brackets to appear inside the outer brackets, e.g. [--a [-b[-c]] -d foo] (would match the whole input here)

... but doesn't allow:

Three dashes --- under any circumstance
To be clearer, things like [option] (no dash), [], [-], [--] or [foo-bar=a] alone shouldn't match.

The data doesn't contain many unusual cases like the examples above (I wouldn't know how to deal with unmatched brackets either, but that's beyond the scope of this question). Trying to address the requirements with grep was maybe not the best idea in hindsight, but I tried:

grep -E '\[{1,}([[:space:]]{0,}[[:punct:]]{0,}[[:alnum:]]{0,}){0,}(-{1,2}[[:alpha:]]{1,}){1,}([[:alnum:]]{0,}[[:punct:]]{0,}[[:space:]]{0,}){0,}\]{1,}'

It matches some patterns¹ along the lines of what I want, but it has shortcomings and is hard to manage and reuse. Using three arbitrary sets of parentheses to group items into "blocks" and manage repetitions doesn't help in that regard either (though it helps with debugging). Playing with character classes to cater to the input seems quite unpredictable.
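As an illustration of the manageability problem (a sketch, not something from the original thread), the same expression can be assembled from named pieces so that each "block" can be read and changed on its own. The variable names are hypothetical; the concatenated pattern is character-for-character the one shown above, so it keeps the same shortcomings.

```shell
# Same ERE as in the question, split into named pieces.
# All variable names here are made up for illustration.
open='\[{1,}'                                                  # one or more opening brackets
lead='([[:space:]]{0,}[[:punct:]]{0,}[[:alnum:]]{0,}){0,}'     # anything before the option
option='(-{1,2}[[:alpha:]]{1,}){1,}'                           # -x or --long, repeated
trail='([[:alnum:]]{0,}[[:punct:]]{0,}[[:space:]]{0,}){0,}'    # anything after the option
close='\]{1,}'                                                 # one or more closing brackets

pattern="$open$lead$option$trail$close"

# Quick check against one wanted and one unwanted input.
printf '%s\n' '[-a foo -b bar]' '[option]' | grep -E -- "$pattern"
```

Splitting the pattern this way does not make it more correct; it only makes it easier to swap one piece at a time while testing.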

So how do you do this using either a better expression and/or a different tool/approach? How do you manage such long regular expressions if you use them - in this case should you have to use a command many times over to filter down the content? Do I need to manipulate the content differently beforehand to help me with that?

¹ The output from iterating through the manpage files affords a good opportunity for testing. With grep here I used: for i in /usr/share/man/man1/*.gz; do basename "${i//.1.gz}"; my_grep_command_above <<< "$(man -l "$i")"; done using the entirety of the manpages output. Otherwise man man or man as provides a good variation of optional sequences for testing.

Beware that how "loosely" synopses deploy BNF varies from quite strictly to not at all -- i.e., man pages do not actually have to obey any standard at all in this regard and some of them don't. — goldilocks, Jun 19 '14 at 19:28
Also, BNF is used explicitly to describe context-free grammar, which is also explicitly what regular expressions aren't for. So while you might be able to get this to work, it could also be impossible. — goldilocks, Jun 19 '14 at 19:32
@goldilocks Thanks for the info! Indeed it's not uniform. Actually I wish I had all synopses in pure EBNF form but my scope is quite small here. Exploration mostly. Thanks for the heads up on context-free grammar vs. RE. I had never considered the possible antinomy. — , Jun 19 '14 at 20:44
I get it. That's a perfectly valid explanation. And, what's more, if you pipe your script as is through another parser you can target those and change them more easily - or replace them with argument values. I meant no disrespect, only a comment in case you didn't know. — mikeserv, Jun 19 '14 at 20:49
sed is what I had in mind as the second parser, but if your grep were a function you could grepfn() { grep -E "\[{${1},}..." ; } ; grepfn 4 or something. It's a lot harder to do that with *. — mikeserv, Jun 19 '14 at 21:06

Stéphane Chazelas · Accepted Answer · 2014-06-19T21:05:52.700

You could do (with GNU grep):

grep -Po '\[\s*--?(?!-)((?>[^][]+)|\[(?1)*\])+\]'

Which on the text of your question gives:

[-a]
[ -ab]
[-abc ]
[-a foo -b bar -c=biz end]
[--a [-b[-c]] -d foo]

The idea being to use PCRE and its recursive matching operators as described in pcrepattern(3) for matching nested [...].
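The pieces of that expression can be read as follows. This is a commented sketch that wraps the same pattern in a shell function; the function name match_optseq is made up for illustration, and it assumes a GNU grep built with PCRE support (-P).

```shell
# Hypothetical wrapper around the pattern from the answer.
# Assumes GNU grep with -P (PCRE) available.
match_optseq() {
    # \[\s*         opening bracket, then optional whitespace
    # --?(?!-)      one or two dashes, not followed by a third dash
    # ( ... )+      group 1, one or more times; each time either
    #   (?>[^][]+)    an atomic run of characters that are not brackets, or
    #   \[(?1)*\]     a nested [...] whose content recurses into group 1
    # \]            closing bracket
    grep -Po '\[\s*--?(?!-)((?>[^][]+)|\[(?1)*\])+\]'
}

# Wanted and unwanted cases taken from the question.
printf '%s\n' '[-a]' '[ -ab]' '[--a [-b[-c]] -d foo]' \
              '[option]' '[]' '[-]' '[--]' '[---a]' '[foo-bar=a]' |
    match_optseq
```

Only the first three inputs should be printed. The dash check applies only right after the opening bracket, so this sketch does not test for three dashes appearing later inside the brackets. To try it on a real page, something like MANWIDTH=1000 man man | match_optseq keeps man from wrapping long synopsis lines.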

edited Jun 19 '14 at 21:05

answered Jun 19 '14 at 20:40


@illuminÉ, see the edit. The extra (?>...) should help with the backtracking limit issue. – Stéphane Chazelas Jun 19 '14 at 21:06
Thank you very much! On a quick run on the entirety of man1 I couldn't pick up one single item I didn't want from the output! I will study your expression carefully. Even seems to take care of the uneven brackets which are nested. Magical RE. – Jun 19 '14 at 21:19
Thank you! The line break issue I had is about man and its behavior in my terminal, not your RE. MANWIDTH=1000 man man ... is needed. Even man --nh --nj won't do, as it gets upstaged by automatic hyphenation from the terminal based on $COLUMNS, a terminal-dependent element. 1000 is arbitrary. I tried resorting to groff, but with that or "inside" man -P there is that encoding issue (see extraction Q.). Ty! – Jun 20 '14 at 09:19
