In this Q&A there is a reference to manpage synopses being based "loosely" on the Extended Backus–Naur Form of metasyntax notation. It's interesting and serves as background. Using related terminology, one of the most common types of element you will find in a command synopsis from a manual is the optional sequence: a definitions-list enclosed between a start-option-symbol and an end-option-symbol. In other words, something we usually write as [ option ], where the option could, for instance, be a single dash or a longer double-dash form followed by one or more characters, such as in ps --help.
So I'd like to match a common optional sequence pattern we often see in the manuals, which:
- Starts with [ and ends with ]
- Contains an optional sequence in the form of -option or --option
- Is not necessarily centered inside a bracket, i.e. [-a], [ -ab], [-abc ] all match
- Allows for a list containing an option and its optional element/specifier, i.e. [-a foo -b bar -c=biz end]
- Allows other brackets to appear inside the outside brackets, i.e. [--a [-b[-c]] -d foo] (would match the whole input here)
... but doesn't allow:
- Three dashes --- under any circumstance
- To be more clear, things like [option] (no dash) and [], [-], [--] or [foo-bar=a] alone shouldn't match.
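The rules above can be sketched as a two-pass, line-based filter. This is only a minimal sketch under stated assumptions, not a definitive answer: the function name match_optional_sequences is hypothetical, an option is assumed to be one or two dashes followed by a letter, and it has to sit right after a [ or after whitespace. It does not balance brackets, so an option that only appears after a closed inner bracket (as in [foo [bar] -a]) is missed, and any ] later on the line is accepted as the closing one.

```shell
# Hypothetical helper: reads text on stdin, prints the lines that qualify.
match_optional_sequences() {
    # Pass 1: a "[" followed, before any "]", by an option token, i.e. one or
    #         two dashes plus a letter, placed right after "[" or whitespace,
    #         with a "]" somewhere later on the line.
    # Pass 2: drop every line that contains three consecutive dashes.
    grep -E '\[([^]]*[[:space:][])?-{1,2}[[:alpha:]].*\]' |
        grep -v -e '---'
}

# Try it on a few of the examples listed above.
printf '%s\n' '[-a]' '[ -ab]' '[option]' '[--]' '[foo-bar=a]' '[-a ---]' |
    match_optional_sequences
```

With this input only [-a] and [ -ab] should be printed. Splitting the "no three dashes" rule into its own grep -v pass keeps the first expression short, at the cost of rejecting a whole line even when the --- sits outside the brackets.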
The data doesn't contain too many unusual cases such as the examples presented above (I wouldn't know how to deal with unmatched brackets either, but that's beyond the scope of this question). Trying to address the requirements with grep like I did was maybe not the best idea in hindsight, but I tried:
grep -E '\[{1,}([[:space:]]{0,}[[:punct:]]{0,}[[:alnum:]]{0,}){0,}(-{1,2}[[:alpha:]]{1,}){1,}([[:alnum:]]{0,}[[:punct:]]{0,}[[:space:]]{0,}){0,}\]{1,}'
It's matching some patterns (see footnote 1) along the lines of what I want, but it has shortcomings and is hard to manage and reuse. Using arbitrary sets (3) of parentheses to group items into "blocks", in order to manage the matching repetitions, doesn't help in that regard either (but it does help with debugging). Playing with character classes to cater to the input seems quite unpredictable.
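One possible way to make those "blocks" easier to manage, sketched here as an assumption rather than something taken from the original attempt, is to give each block a name in a shell variable and assemble the expression from them. The variable names (re_open, re_lead, re_option, re_trail, re_close) are made up for illustration; the assembled pattern is character for character the same as the expression above, so it keeps the same shortcomings.

```shell
# One variable per "block" of the expression (names are hypothetical).
re_open='\[{1,}'
re_lead='([[:space:]]{0,}[[:punct:]]{0,}[[:alnum:]]{0,}){0,}'
re_option='(-{1,2}[[:alpha:]]{1,}){1,}'
re_trail='([[:alnum:]]{0,}[[:punct:]]{0,}[[:space:]]{0,}){0,}'
re_close='\]{1,}'

# Concatenate the blocks into the full extended regular expression.
pattern="${re_open}${re_lead}${re_option}${re_trail}${re_close}"

# Only the first line contains a bracketed option, so only it is printed.
printf '%s\n' 'ps [-a] [--help]' 'no options here' | grep -E "$pattern"
```

Each block can then be changed or tested on its own, for example by running grep -E "$re_option" against a sample line.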
So how do you do this using a better expression and/or a different tool or approach? How do you manage such long regular expressions if you use them? In this case, should you run a command several times over to filter down the content? Do I need to manipulate the content beforehand to help with that?
1. The output from iterating through the manpage files affords a good opportunity for testing. With grep here I used:
for i in /usr/share/man/man1/*.gz; do basename "${i//.1.gz}"; my_grep_command_above <<< "$(man -l "$i")"; done
using the entirety of the manpages output. Otherwise man man or man as provides a good variety of optional sequences for testing.
{0,} is equivalent to * – mikeserv Jun 19 '14 at 19:46
sed is what I had in mind as the second parser, but if your grep were a function you could grepfn() { grep -E "\[{${1},}..." ; } ; grepfn 4 or something. It's a lot harder to do that with *. – mikeserv Jun 19 '14 at 21:06
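Both comments can be illustrated with a small sketch. It is not the commenter's full command: the elided part of grepfn is left out, and grep_min_brackets is a hypothetical function that only takes the minimum number of consecutive [ characters as its first argument.

```shell
# {0,} and * mean the same thing in an ERE, so these two commands
# select exactly the same lines from the same input.
printf '%s\n' 'ab' 'aab' 'b' 'ac' | grep -E '^a{0,}b$'
printf '%s\n' 'ab' 'aab' 'b' 'ac' | grep -E '^a*b$'

# Hypothetical function: $1 is the minimum number of consecutive "["
# a line must contain. An interval such as {N,} can take a parameter,
# which "*" and "+" cannot.
grep_min_brackets() {
    grep -E "\[{${1},}"
}

# Keeps the lines with at least two "[" in a row.
printf '%s\n' '[a]' '[[b]]' '[[[c]]]' | grep_min_brackets 2
```

The last command should print [[b]] and [[[c]]] and skip [a].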