Finding text between two specific characters or strings

Question

Say I have lines like this:

*[234]*
*[23]*
*[1453]*

where * represents any string (except a string of the form [number]). How can I parse these lines with a command line utility and extract the number between brackets?

More generally, which of these tools (cut, sed, grep or awk) would be appropriate for such a task?

score 16 · Accepted Answer · edited Apr 13 '17 at 12:36

If you have GNU grep, you can use its -o option to search for a regex and output only the matching part. (Other grep implementations can only show the whole line.) If there are several matches on one line, they are printed on separate lines.

grep -o '\[[0-9]*\]'

If you only want the digits and not the brackets, it's a little harder; you need to use a zero-width assertion: a regexp that matches the empty string, but only if it is preceded, or followed as the case may be, by a bracket. Zero-width assertions are only available in Perl syntax.

grep -P -o '(?<=\[)[0-9]*(?=\])'
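Where grep was built without PCRE support and -P is unavailable, a rough alternative is to keep the plain -o match and strip the brackets afterwards. This is only a sketch, assuming a grep that supports -o; the function name digits_in_brackets and the sample lines are made up for illustration:

```shell
# Hypothetical helper: print each bracketed number without its brackets.
digits_in_brackets() {
    # grep -o prints every [digits] match on its own line;
    # tr -d then deletes the literal [ and ] characters.
    grep -o '\[[0-9]*\]' | tr -d '[]'
}

# Made-up sample input in the shape given in the question.
printf 'foo[234]bar\nbaz[23]\n[1453]qux\n' | digits_in_brackets
```

On that sample this should print 234, 23 and 1453 on separate lines.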

With sed, you need to turn off printing with -n, and match the whole line and retain only the matching part. If there are several possible matches on one line, only the last match is printed. See Extracting a regex matched with 'sed' without printing the surrounding characters for more details on using sed here.

sed -n 's/^.*\(\[[0-9]*\]\).*/\1/p'

or if you only want the digits and not the brackets:

sed -n 's/^.*\[\([0-9]*\)\].*/\1/p'
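To make the "only the last match" behaviour concrete, here is a small sketch on a made-up line with two bracketed numbers. The second helper is a variation, not part of the answer above: it replaces the greedy ^.* with [^[]*, which cannot cross a "[", so it picks the first bracket pair instead. Both function names are hypothetical.

```shell
# Hypothetical helpers; both read lines on standard input.
last_bracket_number() {
    # Greedy ^.* swallows as much as it can, so the last [number] wins.
    sed -n 's/^.*\[\([0-9]*\)\].*/\1/p'
}
first_bracket_number() {
    # [^[]* cannot cross a "[", so only the first bracket pair is tried.
    sed -n 's/^[^[]*\[\([0-9]*\)\].*/\1/p'
}

line='a[1]b[22]c'   # made-up sample
printf '%s\n' "$line" | last_bracket_number    # prints 22
printf '%s\n' "$line" | first_bracket_number   # prints 1
```

Note that the first-match variant prints nothing if the first "[" on the line does not open a number, so it assumes the surrounding text contains no stray "[".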

Without grep -o, Perl is the tool of choice here if you want something that's both simple and comprehensible. On every line (-n), if the line contains a match for \[[0-9]*\], then print that match ($&) and a newline (-l).

perl -l -ne '/\[[0-9]*\]/ and print $&'

If you only want the digits, put parentheses in the regex to delimit a group, and print only that group.

perl -l -ne '/\[([0-9]*)\]/ and print $1'

P.S. If you only want to require one or more digits between the brackets, change [0-9]* to [0-9][0-9]*, or to [0-9]+ in Perl.
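A quick way to see the difference is to feed both patterns a made-up line that contains an empty pair of brackets. The helper names here are hypothetical:

```shell
# Zero or more digits: an empty [] also counts as a match.
brackets_any() { grep -o '\[[0-9]*\]'; }
# One or more digits: [] is skipped.
brackets_nonempty() { grep -o '\[[0-9][0-9]*\]'; }

printf 'x[]y[7]z\n' | brackets_any        # prints [] and [7]
printf 'x[]y[7]z\n' | brackets_nonempty   # prints [7]
```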

All good, other than that he wants to "extract the number between brackets". I think "except [number]" means except [0-9] — Peter.O, Mar 08 '12 at 00:06
@Peter.O I understood “except [number]” to mean that there aren't other parts of the line of that form. But I edited my answer to show how to print only the digits, just in case. — Gilles 'SO- stop being evil', Mar 08 '12 at 00:39
Those perl regex asserts look really useful! I've been reading about them after seeing you use both backward and forward assertions, even in grep (I'd switched off to the fact you can choose a regex engine). I'll be devoting a bit more time to perl's regex from here on. Thanks... PS.. I just read in man grep... "This is highly experimental and grep -P may warn of unimplemented features." ... I hope that doesn't mean unstable(?) ... — Peter.O, Mar 08 '12 at 06:16

score 5 · Answer 2 · Kyle Jones

You can't do it with cut.

tr -c -d '0123456789\012'
sed 's/[^0-9]*//g'
awk -F'[^0-9]+' '{ print $1$2$3 }'
grep -o -E '[0-9]+'

tr is the most natural fit for the problem and would probably run the fastest, but I think you would need gigantic inputs to separate any of these options in terms of speed.
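One caveat worth illustrating: the commands above keep every digit on the line, not only the ones between brackets. The sketch below uses a made-up line with a digit outside the brackets and compares tr with a bracket-aware sed expression (the helper names are hypothetical):

```shell
# Keep only digits and newlines, wherever they occur on the line.
strip_non_digits() { tr -c -d '0123456789\012'; }
# Keep only the digits enclosed in the last [ ] pair of each line.
bracket_digits() { sed -n 's/^.*\[\([0-9]*\)\].*/\1/p'; }

line='v2[234]'   # made-up input with a digit outside the brackets
printf '%s\n' "$line" | strip_non_digits   # prints 2234
printf '%s\n' "$line" | bracket_digits     # prints 234
```

So the simpler commands are fine as long as the surrounding text is known to contain no digits.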

answered Mar 07 '12 at 21:53 · edited Mar 08 '12 at 00:56

For sed, ^.* is greedy and consumes all but the last digit, and + needs to be \+, or else use the POSIX \([0-9][0-9]*\)... and in any case 's/[^0-9]*//g' works just as well. Thanks for the tr -c example, but isn't that trailing \012 superfluous? – Peter.O Mar 08 '12 at 00:48
@Peter Thanks for catching that. I'd have sworn I tested the sed example. :( I've changed it to your version. Regarding \012: it is needed otherwise tr will eat the newlines. – Kyle Jones Mar 08 '12 at 00:58
Aha... I was seeing it as \0, 1, 2 (or even \, 0, 1, 2). I'm not well enough attuned to octal, it seems. Thanks. – Peter.O Mar 08 '12 at 06:48

score 4 · Answer 3 · answered Mar 07 '12 at 20:23

If you mean extract a set of consecutive digits between non-digit characters, I guess sed and awk are the best (although grep is also able to give you the matched characters):

sed: you can of course match the digits, but it's perhaps interesting to do the opposite and remove the non-digits (this works as long as there is only one number per line):

$ echo nn3334nn | sed -e 's/[^[:digit:]]*//g'
3334

grep: you can match consecutive digits

$ echo nn3334nn | grep -o '[[:digit:]]*'
3334

I don't give an example for awk because I have no experience with it. It is interesting to note that, although sed is a Swiss Army knife, grep gives you a simpler, more readable way to do this, which also works for more than one number on each input line (the -o option prints only the matching parts of the input, each one on its own line):

$ echo dna42dna54dna | grep -o '[[:digit:]]*'
42
54
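For completeness, an awk version of the same idea might look like the sketch below. It is not from the answer above: it relies on awk's match() function and the RSTART and RLENGTH variables, and the function name all_numbers is made up.

```shell
# Hypothetical helper: print each run of digits on its own line.
all_numbers() {
    awk '{
        rest = $0
        # match() sets RSTART and RLENGTH for the leftmost run of digits.
        while (match(rest, /[0-9]+/)) {
            print substr(rest, RSTART, RLENGTH)
            # Continue searching just after the run that was printed.
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

echo dna42dna54dna | all_numbers   # prints 42, then 54
```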

Just as a comparison, here is a sed equivalent of the "more than one number per line" example grep -o '[[:digit:]]*' . . . sed -nr '/[0-9]/{ s/^[^0-9]*|[^0-9]*$//g; s/[^0-9]+/\n/g; p}' ... (+1) — Peter.O, Mar 07 '12 at 21:10

score 2 · Answer 4 · answered Jan 24 '13 at 17:04

Since it has been said that this cannot be done with cut, I will show that it is easily possible to produce a solution that is at least not worse than some of the others, even though I do not endorse the use of cut as the "best" (or even a particularly good) solution. It should be said that any solution not looking specifically for *[ and ]* around the digits makes simplifying assumptions and is therefore prone to failure on examples more complex than the one given by the asker (e.g. digits outside *[ and ]*, which should not be shown). This solution checks at least for the brackets, and it could be extended to check the asterisks as well (left as an exercise to the reader):

cut -f 2 -d '[' myfile.txt | cut -f 1 -d ']'

This makes use of the -d option, which specifies a delimiter. Obviously you could also pipe into the cut expression instead of reading from a file. While cut is probably pretty fast, since it is simple (no regex engine), you have to invoke it at least twice (or a few more times to check for *), which creates some process overhead. The one real advantage of this solution is that it is rather readable, especially for casual users not well versed in regex constructs.
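One detail to be aware of: by default cut passes through, unchanged, any line that does not contain the delimiter. The sketch below, on made-up input, shows that behaviour and a variant using the -s option, which suppresses such lines. The helper names are hypothetical.

```shell
# The pipeline from the answer, reading standard input.
cut_brackets() { cut -f 2 -d '[' | cut -f 1 -d ']'; }
# Same, but -s drops lines that lack the delimiter.
cut_brackets_strict() { cut -s -f 2 -d '[' | cut -s -f 1 -d ']'; }

printf 'foo[234]bar\nno brackets here\n' | cut_brackets
# prints 234, then "no brackets here"
printf 'foo[234]bar\nno brackets here\n' | cut_brackets_strict
# prints 234 only
```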
