Is there a way to do multiple replacements with sed, without chaining the replacements?

Question

I want to write a shell script that replaces A by BB and B by AA. For example, AYB becomes BBYAA. My usual solution to this kind of thing is to chain multiple calls to sed, like this:

sed 's/A/BB/g;s/B/AA/g'

However, that doesn't work in this situation because the A ends up being translated into AAAA instead of BB. tr also doesn't seem to be an option, because the replacement text is longer than one character. Is there something else I can do? It's OK if it uses something other than sed or tr.

That's generally beyond the capabilities of sed, but for your simple case you can use y/AB/BA/;s/[AB]/&&/g — , Jan 31 '21 at 04:14

glenn jackman · Accepted Answer · 2021-02-02T20:40:53.330

This is the kind of problem where you need a loop so you can search for both patterns simultaneously.

awk '
    BEGIN {
        regex = "A|B"
        map["A"] = "BB"
        map["B"] = "AA"
    }
    {
        str = $0
        result = ""
        while (match(str, regex)) {
            found = substr(str, RSTART, RLENGTH)
            result = result substr(str, 1, RSTART-1) map[found]
            str = substr(str, RSTART+RLENGTH)
        }
        print result str
    }
'

Of course, if Perl is available there's an equivalent oneliner:

perl -pe '
    BEGIN { %map = ("A" => "BB", "B" => "AA"); }
    s/(A|B)/$map{$1}/g;
'

If none of the patterns contain special characters, you can also build the regex dynamically:

perl -pe '
    BEGIN {
        %map = ("A" => "BB", "B" => "AA");
        $regex = join "|", keys %map;
    }
    s/($regex)/$map{$1}/g;
'

By the way, Tcl has a builtin command for this called string map, but it's not easy to write Tcl oneliners.

Demonstrating the effect that sorting the keys by length has:

without sorting

$ echo ABBA | perl -pe '
    BEGIN {
        %map = (A => "X", BB => "Y", AB => "Z");
        $regex = join "|", map {quotemeta} keys %map;
        print $regex, "\n";
    }
    s/($regex)/$map{$1}/g
'

A|AB|BB
XYX

with sorting

$ echo ABBA | perl -pe '
      BEGIN {
          %map = (A => "X", BB => "Y", AB => "Z");
          $regex = join "|", map {quotemeta $_->[1]}
                             reverse sort {$a->[0] <=> $b->[0]}
                             map {[length, $_]}
                             keys %map;
          print $regex, "\n";
      }
      s/($regex)/$map{$1}/g
  '

BB|AB|A
ZBX

Benchmarking "plain" sort versus Schwartzian in perl: The code in the subroutines is lifted directly from the sort documentation

#!perl
use Benchmark   qw/ timethese cmpthese /;
make up some key=value data
my $key='a';
for $x (1..10000) {
    push @unsorted,   $key++ . "=" . int(rand(32767));
}
plain sorting: first by value then by key
sub nonSchwartzian {
    my @sorted = 
        sort { ($b =~ /=(\d+)/)[0] <=> ($a =~ /=(\d+)/)[0] || uc($a) cmp uc($b) } 
        @unsorted
}
using the Schwartzian transform
sub schwartzian {
    my @sorted =
        map  { $->[0] }
        sort { $b->[1] <=> $a->[1] || $a->[2] cmp $b->[2] }
        map  { [$, /=(\d+)/, uc($_)] } 
        @unsorted
}
ensure the subs sort the same way
die "different" unless join(",", nonSchwartzian()) eq join(",", schwartzian());
benchmark
cmpthese(
    timethese(-10, {
        nonSchwartzian => 'nonSchwartzian()',
        schwartzian    => 'schwartzian()',
    })
);

running it:

$ perl benchmark.pl
Benchmark: running nonSchwartzian, schwartzian for at least 10 CPU seconds...
nonSchwartzian: 11 wallclock secs (10.43 usr +  0.05 sys = 10.48 CPU) @  9.73/s (n=102)
schwartzian: 11 wallclock secs (10.13 usr +  0.03 sys = 10.16 CPU) @ 49.11/s (n=499)
                 Rate nonSchwartzian    schwartzian
nonSchwartzian 9.73/s             --           -80%
schwartzian    49.1/s           405%             --

The code using the Schwartzian tranform is 4 times faster.

Where the comparison function is only length of each element:

Benchmark: running nonSchwartzian, schwartzian for at least 10 CPU seconds...
nonSchwartzian: 11 wallclock secs (10.06 usr +  0.03 sys = 10.09 CPU) @ 542.52/s (n=5474)
schwartzian: 10 wallclock secs (10.21 usr +  0.02 sys = 10.23 CPU) @ 191.50/s (n=1959)
                Rate    schwartzian nonSchwartzian
schwartzian    191/s             --           -65%
nonSchwartzian 543/s           183%             --

Schwartzian is much slower with this inexpensive sort function.

Can we get past the abusive commentary now?

I like this solution because it's clear and robust even for even trickier substitutions such as A→AB and B→BA. — hugomg, Jan 31 '21 at 15:34
for dynamically building the regexp, you'll have to sort by longest key first, otherwise overlapping keys like A and AB can cause issues — Sundeep, Feb 01 '21 at 03:44
Even if the patterns have special characters you could use \Q ... \E or quotemeta to escape them: $regex = join "|", map {quotemeta} keys %map (not tested) — jcaron, Feb 01 '21 at 12:48
@Sundeep that's because perl is using recursive "regexes" instead of "regular expressions"; the latter don't have this problem and always return the longest match. Compare echo boot | awk -F 'o|oo' '{print$2}' vs perl -le 'print((split /o|oo/,"boot")[1])' — , Feb 01 '21 at 16:28
@user414777, I rolled-back your edit. The technique I use here is a Schwartzian transform that is more efficient, caching the string length instead of having to perform it for every iteration in the sort algorithm. — glenn jackman, Feb 01 '21 at 17:54
No, your scare-quotes "technique" is NOT more efficient, and would NOT be more efficient no matter what input data is used, and even if perl wasn't already caching the lengths of the strings -- as it does. You won't be able to give a single solitary example where your map {} reverse sort {} map {} @list would be faster than the sort {} @list doing the same thing. — , Feb 02 '21 at 18:31
You're doing a totally different thing there -- you're not sorting strings by their length -- but by the result of a heavy-weight regex match. If you just wanted to prove me wrong no matter what, then sort { 1 while 1 } @list would've done just as well. — , Feb 02 '21 at 21:51
Did you read the part where I acknowledge you're correct? Why so angry? — glenn jackman, Feb 02 '21 at 22:32
tcl string map: https://stackoverflow.com/a/26574319/10440128 — milahu, Jan 27 '22 at 16:31

Kusalananda · Answer 2 · 2021-01-31T11:11:16.990

You can't do the whole operation with a single substitution in sed, but you can do it correctly in different ways depending on whether the two substrings A and B are single characters or longer strings.

Assuming the two substrings A and B are single characters...

You want to transform AYB into BBYAA. To do this,

Change each A to B and B to A using y/AB/BA/.
Substitute each A in the new string with AA using s/A/AA/g.
Substitute each B in the new string with BB using s/B/BB/g.

$ echo AYB | sed 'y/AB/BA/; s/B/BB/g; s/A/AA/g'
BBYAA

Combine the two last steps to get

$ echo AYB | sed 'y/AB/BA/; s/[AB]/&&/g'
BBYAA

In fact, the ordering of the operations here does not really matter:

$ echo AYB | sed 's/[AB]/&&/g; y/AB/BA/'
BBYAA

The sed editing command y/// translates the characters in its first argument to the corresponding characters in its second argument, a bit like the tr utility does. This is done in a single operation, so you don't need to use a temporary for the swap of A and B in y/AB/BA/. In general, y/// is much faster in translating single characters than what e.g. s///g is (since no regular expressions are involved), and it's also able to insert newlines into strings with \n, which the standard s/// command can't do (s/// in GNU sed can obviously do this as a non-portable convenience extension).

The & character in the replacement part of the s/// command will be replaced by whatever the expression in the first argument matched, so s/[AB]/&&/g would double up any A or B character in the input data.

For multi-character substrings, assuming the substrings are distinct (i.e. one substring is not found in the other, as in the case of oo and foo), use something like

$ echo fooxbar | sed 's/foo/@/g; s/bar/foofoo/g; s/@/barbar/g'
barbarxfoofoo

I.e., swap the two strings via an intermediate string not otherwise found in the data. Note that the intermediate string could be any string not found in the data, not just a single character.

score 8 · Answer 3 · edited Feb 01 '21 at 10:37

8

With awk, you can use pattern1 as the field separator FS and the replacement1 as output field separator OFS. Then, loop over each field and replace pattern2 by replacement2:

awk '{for (f=1;f<=NF;f++){gsub(p,r,$f)} $1=$1}1' FS=A OFS=BB p=B r=AA file

The point of $1=$1 is to force a rebuild of the record, else it would fail for 0A for example.

This is POSIX compliant and involves no intermediary string so it is foolproof.

edited Feb 01 '21 at 10:37

AdminBee

22,803

answered Jan 31 '21 at 10:18

Quasímodo

18,865
4
36
73

I appreciate that this solution also avoids chaining multiple substitutions. – hugomg Jan 31 '21 at 15:43

score 6 · Answer 4 · answered Jan 31 '21 at 01:52

6

I suppose one solution is to replace first A or B for another character not present in the string, then replace that character. This way the switching between A and B is avoided. Although a chain of sed's is needed:

$ echo AYB | sed -e 's/A/#/g' -e 's/B/AA/g' -e 's/#/BB/g'
BBYAA

answered Jan 31 '21 at 01:52

schrodingerscatcuriosity

12,396

1

Interesting. That sounds like a good workaround if the third character is something that never appears in the original input. However, if the input can contain #s then this would basically fall into the same problem as before... – hugomg Jan 31 '21 at 01:55
@hugomg indeed, I clarified that in the answer. It all depends I guess if you know beforehand what characters compose the string. Is it known, or could be any string? – schrodingerscatcuriosity Jan 31 '21 at 01:58
In my case the replacement strings are known ahead of time but the input text could be anything. – hugomg Jan 31 '21 at 02:09
3

Your temporary character doesn't need to be a printable one, I like using CTRL-G in these cases. This has the additional advantage of literally ringing a bell if something goes wrong and you cat a file which still has a replacement character. – Guntram Blohm Jan 31 '21 at 10:36
1

you can use easy single "special ascii chars" to do the intemediary replacements with: sed -e "s/A/$(printf '\001')/g" -e 's/B/AA/g' -e "s/$(printf '\001')/A/g" # this makes it easy to have several such intemediary and be relatively "sure" that they are not present in the input file ( \001, \002, etc .. .just be careful as some values are : newline, linefeed, tab, backspace, etc, that may also be present in the original file) – Olivier Dulac Feb 01 '21 at 03:31

score 6 · Answer 5 · answered Jan 31 '21 at 03:31

6

You can do this with GNU sed by changing all As to the record separator \n, which surely won't be present.

echo AYB |
sed -e '
  y/A/\n/
  s/[\nB]/&&/g
  y/\nB/BA/
'
BBYAA

answered Jan 31 '21 at 03:31

guest_7

5,728
1
7
13

Makyen · Answer 6 · 2021-02-01T19:10:55.227

You can accomplish this for arbitrary input text with a sequence of straight substitutions using sed:

sed 's/Q/Qz/g; s/A/Qa/g; s/B/AA/g; s/Qa/BB/g; s/Qz/Q/g;'

Open a token-space to hold or represent intermediate/arbitrary values

There's a general solution to this type of issue for situations where the only thing you can do is straight substitutions, and which works for arbitrary input text.

The trick is that you first create a token-space (i.e. variable space) within the intermediate copy of the text which can be used to represent arbitrary values such that the tokens you use in later substitutions can't exist in the intermediate copy of the text as you're making further substitutions. For example:

s/Q/Qz/g

This makes it such that the text can no longer contain any Q followed by anything other than z, and every Qz actually represents a Q. That means you are free to use Q followed by any character other than z to represent whatever you desire.

In this specific case, you also can not use QA and QB, because you want to make substitutions of the single A and B characters.

So, to produce the overall substitution which you desire, you would perform the following sequence of substitutions:

s/Q/Qz/g  # Open your Q* token-space.
          # You can now use any Q* other than Qz, QA, and QB to represent another value.
          # The restriction to not use QA and QB is only because this specific case requires
          # substituting for the single A and B characters.
s/A/Qa/g  # Temporarily represent all A as Qa.
s/B/AA/g  # Change all B to AA, as desired in the question.
s/Qa/BB/g # Change all Qa placeholders with BB.
s/Qz/Q/g  # Restore all Q, closing the token-space.

So, overall, this could be written as:

sed 's/Q/Qz/g; s/A/Qa/g; s/B/AA/g; s/Qa/BB/g; s/Qz/Q/g;'

This produces the output:

 $ echo AYB | sed 's/Q/Qz/g; s/A/Qa/g; s/B/AA/g; s/Qa/BB/g; s/Qz/Q/g;'
 BBYAA

What's happening:

Substitution	Text in buffer	Comment
`s/Q/Qz/g`	`AYB`	Open your `Q*` token-space. This does nothing with this example input, but assures this will work with arbitrary input text.
`s/A/Qa/g`	`QaYB`	Temporarily represent all `A` as `Qa`.
`s/B/AA/g`	`QaYAA`	Change all `B` to `AA`, as desired in the question.
`s/Qa/BB/g`	`BBYAA`	Change all `Qa` placeholders with `BB`.
`s/Qz/Q/g`	`BBYAA`	Restore all `Q`, closing the token-space. Again, does nothing with this example input text, but needed in order to work with arbitrary input text.

You can use any character for the first character in your token-space

For the first character of your token-space, you can use any single character. I generally use Q, because it has a low frequency of occurrence in text, is a letter in the ASCII character set, and is easy to remember. Z would also fit. If you are using a wider character set, it is advantageous to choose a character, perhaps a symbol, which has an even lower frequency of use. For performance reasons, the ideal would be to pick a character which doesn't exist in the text which you will be using, such that the first and last substitutions don't actually do anything. However, that's only a performance and space concern, not a functional concern. In other words, it will function with whatever character you choose, but it will be faster if it has to do less work.

Concerns

Any patterns which might specifically rely on the character you use as the start of your token-space must account for it being represented by the token you've chosen. In the above case, if you are wanting to do some other substitutions with Q characters, then you need to either account for all Q being represented in the intermediate text as Qz or do those substitutions before opening your token-space or after closing it. In general, you can resolve this by just choosing a different character to be the starting character for your token-space.
This is neither the most CPU efficient nor space efficient way to accomplish this. Real programming languages can accomplish this type of task substantially more efficiently using various other constructs. This is, however, a good technique to have in your back pocket when making substitutions is the tool you have available.

Adám · Answer 7 · 2021-02-01T13:56:13.757

4

It's OK if it uses something other than sed or tr.

Dyalog APL's ⎕R ("quad-R") PCRE Replacement operator (full documentation) does parallel matching:

patterns ← ⎕JSON'["A","B"]'
substitutions ← ⎕JSON'["BB","AA"]'
Transform ← patterns ⎕R substitutions
⎕← Transform ⍞

Try it online!

In the above, I've spelled out everything for clarity, but to create a function that does this, you only need:

(,¨'AB')⎕R'BB' 'AA'

Try it on TryAPL!

Full disclosure: I work for Dyalog Ltd., the vendor of Dyalog APL.

edited Feb 01 '21 at 13:56

answered Jan 31 '21 at 20:29

Adám

181

1

@terdon OK, edited in (only here, for now ― I potentially have have close to a thousand posts to edit) while awaiting an authoritative answer on Meta. – Adám Feb 01 '21 at 13:59
Good to see an APL solution! – fpmurphy Feb 01 '21 at 16:38
looks familiar. looks like ... The J Programming Language, which is based on APL – milahu Jan 27 '22 at 16:01

score 1 · Answer 8 · answered Feb 01 '21 at 17:39

Here is another solution based on a loop:

#!/bin/bash
InString="AYB"
OutString=""
for (( i=1; i <= ${#InString}; i++ ))
do
   char=$(expr substr "$InString" $i 1)
   case $char in
     'A') OutString+="BB";;
     'B') OutString+="AA";;
     *)   OutString+="$char";;
   esac
done
echo $OutString

guest_7 · Answer 9 · 2021-02-02T09:04:02.893

#stuff match=>replacement pairs and export them
export vA="X"
export vBB="Y"
export vAB="Z"


#GNU sed used here

# dynamically build the regex
re=
for i in A BB AB; do
  re=${re-}${re:+\|}${i}
done
echo "regex = $re" 

#give it a spin... 
echo ABBA |
sed -E \
  -e 'p' \
  -e "s/($re)/\n&\n/g;:a" \
  -e "s/^([^\n]*)\n($re)\n(.*)/echo \"\1\${v\2}\3\"/e;ta" \
;
regex = A|BB|AB
ABBA
ZBX

As we notice that sed uses the longest match when matching at a given position in the input string, unlike in perl which picks up the leftmost match. So the onus of sorting by length is not on the user.

Is there a way to do multiple replacements with sed, without chaining the replacements?

9 Answers9

make up some key=value data

plain sorting: first by value then by key

using the Schwartzian transform

ensure the subs sort the same way

benchmark

Open a token-space to hold or represent intermediate/arbitrary values

You can use any character for the first character in your token-space

Concerns