Explain sed expression that deletes lines with repeated fields

Question

I was doing research on how to generate all non-repeating permutations of a set of a numbers without using recursion in Bash and I found this answer that worked, but I want to understand why.

Say you have three numbers: 1, 2, 3.

The following command will generate all possible non-repeating permutations:

printf "%s\n" {1,2,3}{1,2,3}{1,2,3} | sort -u | sed '/\(.\).*\1/d'
123
132
213
231
312
321

I understand what the printf with %s does when the argument is the brace expansions of the set {1, 2, 3} three times (which would print every single possible outcome).

I know that sort -u will output only unique lines.

I know that sed /<pattern>/d is used to delete any lines that match <pattern>.

Reading the pattern within the sed, I am somewhat confused. I know how to read regex but I don't see how this pattern works within the sed command.

\( = literal '('
.  = any character, once
\) = literal ')'
.* = any character, zero or more times
\1 = reference to first captured group

How does then the sed command remove non-unique values from this regex pattern? I don't understand how there's a reference to a captured group, when there's not really one? The parentheses are being used in the pattern to be matched literally? Everything about this execution makes sense to me until the sed command.

Sed uses basic regular expressions (BRE) by default - so $.$ is in fact a capture group (otherwise \1 would result in a Invalid back reference error) — steeldriver, Jul 10 '20 at 22:30
The sort -u doesn't actually change the piped input to sed at all. Compare the output of printf "%s\n" {1,2,3}{1,2,3}{1,2,3} | sort -u with the output of just printf "%s\n" {1,2,3}{1,2,3}{1,2,3} The outputs are exactly the same (at least on OSX). — Greenonline, Jul 11 '20 at 17:11

ilkkachu · Accepted Answer · 2020-07-11T00:04:36.340

4

That's basic regular expressions (BRE) for sed by default, so $.$ is a capture group containing any one character. Then the .* just skips everything, and \1 matches whatever the group matched. If the whole lot can be made to match, then some character showed up twice, once for the group, and once for the backreference.

In fact, if I'm not mistaken, that wouldn't even work with standard extended regular expressions, since (for whatever reasons) backreferences aren't supported in them. Backreferences are only mentioned under "BREs matching multiple characters", not under EREs, and in fact the same thing with ERE doesn't work on my macOS (it takes the \1 as meaning a literal number 1):

$ printf "%s\n" 122 321 | sed -E -e '/(.).*\1/d'
122

GNU tools do support backreferences in ERE, though.

(I don't think sort -u is necessary here, the combination of brace expansions should produce all combinations without duplicates.)

edited Jul 11 '20 at 00:04

answered Jul 10 '20 at 22:38

ilkkachu

138,973

1

Ohhhhh, so if the parentheses indeed are capturing a pattern and then there's a back reference, then the whole thing then becomes, if any one character, followed by zero or more characters, followed by the first captured character matches, delete. Therefore, permutations like 111 112 122 221 would be deleted since (1)1(1) would be matched, (1)-(1)2 [* = zero or more, so zero cases in this permutation], etc, etc, etc. That makes more sense now. – AnthonyBB Jul 10 '20 at 22:50
@AnthonyBB, yes, exactly – ilkkachu Jul 10 '20 at 22:52
I'm so used to ERE that forgot BRE was the default for sed and special regex characters only take a special meaning in BRE if prefixed with a \, while in ERE, they are special until they are escaped with \. Thank you! – AnthonyBB Jul 10 '20 at 23:01
@AnthonyBB, ERE are often much nicer, just because you don't need to use so many backslashes. But note that standard BRE doesn't have all the ERE features, even with backslashes. In particular, \+, \?, and \| are also gnuisms. – ilkkachu Jul 11 '20 at 00:03
1

Yes, back-references seem to be UN-supported in POSIX EREs, but have been historically supported in sed all the time on the right side of an s/// replacement and, by extension, in most implementations, directly on a simple regex. Search for [2addr]s/BRE/replacement/flags or simply for back-references. In practice it is rare that back-references are not available. – Jul 11 '20 at 02:43
What do you get from echo $(printf "%s\n" {1..3}{1..3}| sed -E '/1/d') ? only 22 23 32 33 ? That should be the same if you add the \ and Apple sed doesn't understand back-references. – Jul 11 '20 at 02:49
@Isaac, yes 22 23 32 33, with -E and either /1/ or /\1/. sed -e '/\1/d' of course gives "RE error: invalid backreference number". Given that the macOS tools are (AFAIU) some old BSD versions, one could look at how e.g. current FreeBSD deals with backrefs and BRE. The re_format(7) man page says "Obsolete ("basic") regular expressions differ in several respects. ... Finally, there is one new type of atom, a back reference...", so they probably don't work. – ilkkachu Jul 11 '20 at 11:55
1

@Isaac, I've also understood that regexes with backreferences aren't an actual regular language any more, and that supporting backreferences precludes some efficient RE implementations, but it might be that capture groups have the same effect. I'm not sure, though, I'm an engineer, not a comp.sci major. ;) – ilkkachu Jul 11 '20 at 11:59
1

Well, yes, on freeBSD ERE doesn't accept backreferences. Yes sed -E '/\1/d' erase all lines with a 1. And sed -E '/(.)\1/d', instead, remove lines that have a one after some other character. I am not a comp.sci major either. I am an Electronic engineer by training (yes an IEEE). And yes, "The language of squares is not regular", a (.+)\1 (in pcre language) is an square. – Jul 11 '20 at 12:24

Quasímodo · Answer 2 · 2020-07-11T20:22:13.690

As a first step, you need to understand $.$ correctly. In basic regular expressions, it is a capture group capturing any character, that must be reproduced by \1. Those are not literal parenthesis.

Now, for the extremely cool part! What does each element of the regex match in each case?

     Left  \(.\)  .*  \1  Right  Result
111        1      1   1          Deleted!
112        1          1   2      Deleted!
113        1          1   3      Deleted!
121        1      2   1          Deleted!
122  1     2          2          Deleted!
123        ?      ?   ?          NoMatch
131        1      3   1          Deleted!
132        ?      ?   ?          NoMatch
133  1     3          3          Deleted!

On the 122, if not clear: Since the expression is not anchored, 1 goes away to the left, the middle 2 matches the capture group $.$ and the last 2 matches the backreference \1. .* (the zero-or-more characters matching regex) will do its best to fit the string, so in this case it contracts to the null string.

If you doubt it, try

echo 122 | grep --color=always '\(.\).*\1'

You will see that only the 22 has been colored.

Compare it with the anchored version of the regex:

$ printf "%s\n" {1,2,3}{1,2,3}{1,2,3} | sort -u | sed '/^\(.\).*\1$/d'
112
113
122
123
132
133
...

Now there are not "Left" and "Right" slots:

     ^\(.\)  .*  \1$  Result
111  1       1   1    Deleted!
112  ?       ?   ?    NoMatch
113  ?       ?   ?    NoMatch
121  1       2   1    Deleted!
122  ?       ?   ?    NoMatch
123  ?       ?   ?    NoMatch
131  1       3   1    Deleted!
132  ?       ?   ?    NoMatch
133  ?       ?   ?    NoMatch

The 1st digit must be the last digit in this version, so there are less matches.

Explain sed expression that deletes lines with repeated fields

2 Answers2

Linked