0

I have a text file infile.txt that contains the following string:

[ A ]
1
2
[ B ]
3
[ C
4
5 
[ D ]

I wish to use both grep and sed to print the lines that start with [ and end with ]. Thus my desired output from grep and sed is:

[ A ]
[ B ]
[ D ]

As a reality check, I will first try to print the lines that contain [:

grep "\[" infile.txt
grep -E "\[" infile.txt
sed -n '/\[/p' infile.txt
sed -nE '/\[/p' infile.txt

Each of the preceding commands gives this output:

[ A ]
[ B ]
[ C
[ D ]

Now I need to specify that the printed lines should start with [ and end with ]. This answer to this question suggests using the regular expression \[[^\]]*\]. However, all of the following commands give no output (empty string):

grep "\[[^\]]*\]" infile.txt
grep -E "\[[^\]]*\]" infile.txt
sed -n '/\[[^\]]*\]/p' infile.txt
sed -nE '/\[[^\]]*\]/p' infile.txt

But each of the following commands...

grep "\[*\]" infile.txt
grep -E "\[*\]" infile.txt
sed -n '/\[*\]/p' infile.txt
sed -nE '/\[*\]/p' infile.txt

...gives the desired output:

[ A ]
[ B ]
[ D ]

Why doesn't the regular expression \[[^\]]*\] -- again, from this answer to this question -- work for my text?

Andrew

2 Answers

1
grep -x '\[.*\]'

This should be enough to match lines that start with [ and end with ], with any number (*) of characters (.) in between.

-x in effect adds an implicit ^ at the start and $ at the end so that would be the same as:

grep '^\[.*\]$'

Same with ERE or sed:

grep -xE '\[.*\]'
grep -E '^\[.*\]$'
sed '/^\[.*\]$/!d'
sed -n '/^\[.*\]$/p'
sed -E '/^\[.*\]$/!d'
sed -En '/^\[.*\]$/p'
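
As a quick check, the sketch below recreates the sample file from the question in a scratch directory (the mktemp location and the variable names are only illustrative) and runs one grep form and one sed form from the list above. It assumes a POSIX-style grep and sed such as the GNU ones.

```shell
# Recreate the sample file from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' '[ A ]' 1 2 '[ B ]' 3 '[ C' 4 '5 ' '[ D ]' > "$tmpdir/infile.txt"

# Whole-line match with grep -x, and the explicitly anchored sed equivalent.
grep_out=$(grep -x '\[.*\]' "$tmpdir/infile.txt")
sed_out=$(sed -n '/^\[.*\]$/p' "$tmpdir/infile.txt")

# Print the lines grep selected, then report whether sed selected the same ones.
printf '%s\n' "$grep_out"
if [ "$grep_out" = "$sed_out" ]; then
    echo "grep and sed agree"
fi

rm -r "$tmpdir"
```

On the sample data this should print the three bracketed lines [ A ], [ B ] and [ D ], followed by the agreement message.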

Your:

\[[^\]]*\]

Matches on a [ followed by one character other than backslash ([^\]) followed by any number of ] characters followed by ]. In [ A ], the [ is followed by a space and then by A rather than ], so nothing matches.

To match on [ followed by any number of characters other than ], followed by ], the syntax is \[[^]]*\] or \[[^]]*], as the closing ] doesn't need to be escaped outside a bracket expression. I'd still recommend escaping it, though, as there are regex or glob flavours where it's necessary.
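
To see the difference between the two bracket expressions, here is a small sketch that feeds the bracketed lines from the question, plus one extra line [x] that I made up for illustration, to both patterns. It assumes a grep that follows the POSIX rule that \ is literal inside [...] (GNU grep does).

```shell
# Bracketed lines from the question plus a hypothetical extra line "[x]".
sample=$(printf '%s\n' '[ A ]' '[ B ]' '[ C' '[ D ]' '[x]')

# Pattern from the question: "[", ONE character other than backslash,
# zero or more "]", then "]". Only "[x]" has "]" right after that one character.
with_backslash=$(printf '%s\n' "$sample" | grep '\[[^\]]*\]')

# Corrected pattern: "[", any run of characters other than "]", then "]".
without_backslash=$(printf '%s\n' "$sample" | grep '\[[^]]*\]')

printf 'with backslash:\n%s\n' "$with_backslash"
printf 'without backslash:\n%s\n' "$without_backslash"
```

The first pattern should select only [x]; the second should select every line that has a closing ], that is, all lines except [ C.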

Inside [...] in standard BRE or ERE (except in awk), \ is not special¹. Though again, there are regexp variants where it is special, so I'd still recommend using [\\x] instead of [\x], for instance, to match on either \ or x.
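
A small sketch of that point, using three made-up input lines (a\b, x and y). It assumes a grep where \ is literal inside bracket expressions, as POSIX requires; with such a grep both spellings select the same lines.

```shell
# Three illustrative lines: one containing a backslash, one "x", one "y".
# printf '%s\n' prints its arguments verbatim, so a\b keeps its backslash.
sample=$(printf '%s\n' 'a\b' 'x' 'y')

# In a POSIX bracket expression the backslash is literal: [\x] is "\ or x".
plain=$(printf '%s\n' "$sample" | grep '[\x]')

# Doubling the backslash is redundant under POSIX rules, but it also works
# in flavours where \ is an escape character inside brackets.
doubled=$(printf '%s\n' "$sample" | grep '[\\x]')

printf 'plain:\n%s\n' "$plain"
printf 'doubled:\n%s\n' "$doubled"
```

Both variables should end up holding the lines a\b and x, while y is selected by neither.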

There are many different flavours of regexps. The ones at https://regexr.com/, as in your linked answer, seem to be (some version of) PCRE (Perl-compatible regular expressions), which some implementations of grep or sed support with -P or -R or -x perl, and where \ can be used for escaping ] inside bracket expressions.

See also: Why does my regular expression work in X but not in Y?


¹ That is currently guaranteed by current versions of POSIX, though it might change in the future as it's hindering progress for no good reason. You'll find that some implementations of sed ignore that requirement, for instance when $POSIXLY_CORRECT is not in the environment: there, [\t] matches on TAB instead of \ or t as POSIX requires. To match on either \ or t, use [\\t], which is portable.

0

Let's decode the RE \[[^\]]*\]

  • \[ - Literal [ character
  • [^\] - Not \
  • ] - Literal ] character
  • * - Previous item repeated zero or more times, i.e. ] zero or more times
  • \] - Another literal ] character (the backslash is ignored here)

Applying this to [ A ] we can see it will not match. I suspect the question you're asking is why [^\]] does what it does. Inside a bracket expression, ] is treated literally only when it comes first, that is, directly after the opening [ or after the ^ negation symbol; anywhere else it ends the [...] construct. In [^\]] the character after ^ is \, so the first ] closes the bracket expression.

Instead you could use this RE, \[[^]]*], or even anchor the front and back of the string: ^\[.*]$
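
The rule that ] is literal when it comes first in a bracket expression can be checked on its own. This is only a sketch with three made-up input lines (a], b and ]), assuming a POSIX-style grep.

```shell
# Illustrative input: a line ending in "]", a line without one, and a bare "]".
sample=$(printf '%s\n' 'a]' 'b' ']')

# "]" directly after the opening "[" is a literal member of the set.
has_bracket=$(printf '%s\n' "$sample" | grep '[]]')

# "]" directly after "^" is literal too: any character other than "]".
has_other=$(printf '%s\n' "$sample" | grep '[^]]')

printf 'contains ]:\n%s\n' "$has_bracket"
printf 'contains something else:\n%s\n' "$has_other"
```

The first grep should select a] and the bare ]; the second should select a] and b, since the bare ] line has no other character in it.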

Chris Davies
    the point here being that the suggestion they got assumed Perl regexes (or similar), where the backslash does work within brackets – ilkkachu Jun 15 '23 at 19:34