Sed and capturing groups struggle

Question

I have a text file that looks like this

(111)1111111
(111)-111-1111
(111)111-1111
111.111.1111

that I'm using to practice group capturing with regex and sed. The command I am running on the file (called test) is

sed 's/(?\(\d(3}\)[-.]?\(\d{3}\)[-.]?\(\d{4}\)/\1\2\3' test > output

I expect the output to be just 1's on every line. However, what I get is the entire file with no changes. What's going wrong?

Thanks! That did not fix the problem however but now I know that was part of the problem. — RhythmInk, Apr 24 '18 at 23:23

score 9 · Accepted Answer · answered Apr 24 '18 at 23:30

In standard basic regex, (?\(\d(3}\)[-.]? means:

a literal left parenthesis
a literal question mark
(start of a group)
a literal character 'd'
a literal left parenthesis 
the number '3'
a literal closing brace
(end of group)
a dash or a dot
a question mark

i.e., this will print x:

echo '(?d(3}-?' |sed 's/(?\(\d(3}\)[-.]?/x/'

You're very likely to want sed -E to enable extended regular expressions (ERE), and to then use ( and ) for grouping, and \( and \) for literal parentheses.

Also note that \d is part of Perl regexes, not standard ones, and while GNU sed supports some \X escapes, they're not standard (and I don't think it supports \d). Same for \?, GNU sed supports it in BRE to mean what ? means in ERE, but it's not standard.

With all that in mind:

$ echo '(123)-456-7890' | sed -E 's/\(?([0-9]{3})\)?[-.]?([0-9]{3})[-.]?([0-9]{4})/\1\2\3/'
1234567890
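For comparison, here is a rough sketch of the same substitution in portable BRE, without -E. It assumes only POSIX features: \{0,1\} stands in for ?, \( and \) do the grouping, and unescaped ( and ) are literal. The function name strip_phone_bre is made up for illustration.

```shell
# Hypothetical helper: BRE version of the ERE command above.
# \{0,1\} is the portable BRE spelling of "zero or one".
strip_phone_bre() {
  sed 's/(\{0,1\}\([0-9]\{3\}\))\{0,1\}[-.]\{0,1\}\([0-9]\{3\}\)[-.]\{0,1\}\([0-9]\{4\}\)/\1\2\3/'
}

# The four sample lines from the question.
printf '%s\n' '(111)1111111' '(111)-111-1111' '(111)111-1111' '111.111.1111' |
  strip_phone_bre
```

Each of the four sample lines should come out as 1111111111.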

Though you might just as well brute-force it and remove everything but the digits:

$ echo '(123)-456-7890' | sed -e 's/[^0-9]//g'
1234567890

(that would of course also accept stuff like (123)-4.5-6-7a8b9c0...)
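If that leniency matters, one possible middle ground (a sketch, not the only way) is to anchor the grouped pattern with ^ and $ and print only the lines that match completely, using sed -n together with the p flag. The function name normalize_strict is hypothetical.

```shell
# Hypothetical helper: print the bare digits only for lines that are
# entirely a phone number in one of the expected shapes; drop the rest.
normalize_strict() {
  sed -nE 's/^\(?([0-9]{3})\)?[-.]?([0-9]{3})[-.]?([0-9]{4})$/\1\2\3/p'
}

# The first line is a valid number, the second is the junk example.
printf '%s\n' '(123)-456-7890' '(123)-4.5-6-7a8b9c0' | normalize_strict
```

Only the first input line should produce output here; the malformed one is silently skipped.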

Good answer, just a tiny remark: According to the POSIX standard, the interpretation of an ordinary character preceded by an unescaped backslash is undefined, so while \d is likely to match a literal d, you can't rely on that. — Philippos, Apr 25 '18 at 06:31

score 1 · Answer 2 · answered Apr 25 '18 at 04:34

We can also do this with the awk command below:

echo "123-45-6789-10101"| awk '{gsub("[^0-9]","",$1);print }'

Output

12345678910101
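One caveat: gsub with $1 as its third argument only edits the first whitespace-separated field, so digits after a space would keep their surrounding punctuation. A variant sketch that leaves out the third argument, so that gsub works on the whole line, could look like this (the function name digits_only is made up):

```shell
# Hypothetical helper: strip every non-digit from each whole input line.
# Without a third argument, gsub operates on $0 (the entire record).
digits_only() {
  awk '{ gsub(/[^0-9]/, ""); print }'
}

echo "123 45-6789" | digits_only
```

With the input above, the space and the dash are both removed, which the $1 version would not do for the second field.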

Praveen Kumar BS

score 0 · Answer 3 · answered Apr 25 '18 at 06:24

ilkkachu described very well why your regular expression does not work with sed (it is written in a regex dialect that sed does not support).

Here is an alternate way that just deletes the characters that are not 1:

sed 's/[^1]//g' file

To use groups, you may do something like

sed -E 's/([^1]*)(1+)([^1]*)/\2/g' file

That is, match a non-empty string of ones delimited on either side by a possibly empty string of non-ones, and replace all of that with the matched string of ones.

Change 1 to [0-9] and [^1] to [^0-9] to handle all digits.
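Spelled out, that all-digits variant might look like the sketch below. The function name keep_digits_grouped is only for illustration.

```shell
# Hypothetical helper: the grouped command with 1 replaced by [0-9].
# Each run of digits, plus any non-digits around it, is replaced by
# the digits alone (group 2).
keep_digits_grouped() {
  sed -E 's/([^0-9]*)([0-9]+)([^0-9]*)/\2/g'
}

echo '(123)-456-7890' | keep_digits_grouped
```

Unlike the plain deletion approach, a line that contains no digits at all is left unchanged, because the pattern needs at least one digit to match.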
