22

I want to write a shell script that replaces A by BB and B by AA. For example, AYB becomes BBYAA. My usual solution to this kind of thing is to chain multiple calls to sed, like this:

sed 's/A/BB/g;s/B/AA/g'

However, that doesn't work in this situation because the A ends up being translated into AAAA instead of BB. tr also doesn't seem to be an option, because the replacement text is longer than one character. Is there something else I can do? It's OK if it uses something other than sed or tr.

Braiam
  • 35,991
hugomg
  • 5,747
  • 4
  • 39
  • 54
  • 1
    That's generally beyond the capabilities of sed, but for your simple case you can use y/AB/BA/;s/[AB]/&&/g –  Jan 31 '21 at 04:14

9 Answers9

16

This is the kind of problem where you need a loop so you can search for both patterns simultaneously.

awk '
    BEGIN {
        regex = "A|B"
        map["A"] = "BB"
        map["B"] = "AA"
    }
    {
        str = $0
        result = ""
        while (match(str, regex)) {
            found = substr(str, RSTART, RLENGTH)
            result = result substr(str, 1, RSTART-1) map[found]
            str = substr(str, RSTART+RLENGTH)
        }
        print result str
    }
'

Of course, if Perl is available there's an equivalent oneliner:

perl -pe '
    BEGIN { %map = ("A" => "BB", "B" => "AA"); }
    s/(A|B)/$map{$1}/g;
'

If none of the patterns contain special characters, you can also build the regex dynamically:

perl -pe '
    BEGIN {
        %map = ("A" => "BB", "B" => "AA");
        $regex = join "|", keys %map;
    }
    s/($regex)/$map{$1}/g;
'

By the way, Tcl has a builtin command for this called string map, but it's not easy to write Tcl oneliners.


Demonstrating the effect that sorting the keys by length has:

  1. without sorting

    $ echo ABBA | perl -pe '
        BEGIN {
            %map = (A => "X", BB => "Y", AB => "Z");
            $regex = join "|", map {quotemeta} keys %map;
            print $regex, "\n";
        }
        s/($regex)/$map{$1}/g
    '
    
    A|AB|BB
    XYX
    
  2. with sorting

    $ echo ABBA | perl -pe '
          BEGIN {
              %map = (A => "X", BB => "Y", AB => "Z");
              $regex = join "|", map {quotemeta $_->[1]}
                                 reverse sort {$a->[0] <=> $b->[0]}
                                 map {[length, $_]}
                                 keys %map;
              print $regex, "\n";
          }
          s/($regex)/$map{$1}/g
      '
    
    BB|AB|A
    ZBX
    

Benchmarking "plain" sort versus Schwartzian in perl: The code in the subroutines is lifted directly from the sort documentation

#!perl
use Benchmark   qw/ timethese cmpthese /;

make up some key=value data

my $key='a'; for $x (1..10000) { push @unsorted, $key++ . "=" . int(rand(32767)); }

plain sorting: first by value then by key

sub nonSchwartzian { my @sorted = sort { ($b =~ /=(\d+)/)[0] <=> ($a =~ /=(\d+)/)[0] || uc($a) cmp uc($b) } @unsorted }

using the Schwartzian transform

sub schwartzian { my @sorted = map { $->[0] } sort { $b->[1] <=> $a->[1] || $a->[2] cmp $b->[2] } map { [$, /=(\d+)/, uc($_)] } @unsorted }

ensure the subs sort the same way

die "different" unless join(",", nonSchwartzian()) eq join(",", schwartzian());

benchmark

cmpthese( timethese(-10, { nonSchwartzian => 'nonSchwartzian()', schwartzian => 'schwartzian()', }) );

running it:

$ perl benchmark.pl
Benchmark: running nonSchwartzian, schwartzian for at least 10 CPU seconds...
nonSchwartzian: 11 wallclock secs (10.43 usr +  0.05 sys = 10.48 CPU) @  9.73/s (n=102)
schwartzian: 11 wallclock secs (10.13 usr +  0.03 sys = 10.16 CPU) @ 49.11/s (n=499)
                 Rate nonSchwartzian    schwartzian
nonSchwartzian 9.73/s             --           -80%
schwartzian    49.1/s           405%             --

The code using the Schwartzian tranform is 4 times faster.

Where the comparison function is only length of each element:

Benchmark: running nonSchwartzian, schwartzian for at least 10 CPU seconds...
nonSchwartzian: 11 wallclock secs (10.06 usr +  0.03 sys = 10.09 CPU) @ 542.52/s (n=5474)
schwartzian: 10 wallclock secs (10.21 usr +  0.02 sys = 10.23 CPU) @ 191.50/s (n=1959)
                Rate    schwartzian nonSchwartzian
schwartzian    191/s             --           -65%
nonSchwartzian 543/s           183%             --

Schwartzian is much slower with this inexpensive sort function.

Can we get past the abusive commentary now?

glenn jackman
  • 85,964
  • I like this solution because it's clear and robust even for even trickier substitutions such as A→AB and B→BA. – hugomg Jan 31 '21 at 15:34
  • I really like your 3 solutions. – Olivier Dulac Feb 01 '21 at 03:26
  • 1
    for dynamically building the regexp, you'll have to sort by longest key first, otherwise overlapping keys like A and AB can cause issues – Sundeep Feb 01 '21 at 03:44
  • Even if the patterns have special characters you could use \Q ... \E or quotemeta to escape them: $regex = join "|", map {quotemeta} keys %map (not tested) – jcaron Feb 01 '21 at 12:48
  • @Sundeep, great point. – glenn jackman Feb 01 '21 at 14:25
  • @Sundeep that's because perl is using recursive "regexes" instead of "regular expressions"; the latter don't have this problem and always return the longest match. Compare echo boot | awk -F 'o|oo' '{print$2}' vs perl -le 'print((split /o|oo/,"boot")[1])' –  Feb 01 '21 at 16:28
  • 3
    @user414777, I rolled-back your edit. The technique I use here is a Schwartzian transform that is more efficient, caching the string length instead of having to perform it for every iteration in the sort algorithm. – glenn jackman Feb 01 '21 at 17:54
  • No, your scare-quotes "technique" is NOT more efficient, and would NOT be more efficient no matter what input data is used, and even if perl wasn't already caching the lengths of the strings -- as it does. You won't be able to give a single solitary example where your map {} reverse sort {} map {} @list would be faster than the sort {} @list doing the same thing. –  Feb 02 '21 at 18:31
  • @user414777, please consider my update. – glenn jackman Feb 02 '21 at 20:20
  • You're doing a totally different thing there -- you're not sorting strings by their length -- but by the result of a heavy-weight regex match. If you just wanted to prove me wrong no matter what, then sort { 1 while 1 } @list would've done just as well. –  Feb 02 '21 at 21:51
  • 1
    Did you read the part where I acknowledge you're correct? Why so angry? – glenn jackman Feb 02 '21 at 22:32
  • tcl string map: https://stackoverflow.com/a/26574319/10440128 – milahu Jan 27 '22 at 16:31
14

You can't do the whole operation with a single substitution in sed, but you can do it correctly in different ways depending on whether the two substrings A and B are single characters or longer strings.

Assuming the two substrings A and B are single characters...

You want to transform AYB into BBYAA. To do this,

  1. Change each A to B and B to A using y/AB/BA/.
  2. Substitute each A in the new string with AA using s/A/AA/g.
  3. Substitute each B in the new string with BB using s/B/BB/g.
$ echo AYB | sed 'y/AB/BA/; s/B/BB/g; s/A/AA/g'
BBYAA

Combine the two last steps to get

$ echo AYB | sed 'y/AB/BA/; s/[AB]/&&/g'
BBYAA

In fact, the ordering of the operations here does not really matter:

$ echo AYB | sed 's/[AB]/&&/g; y/AB/BA/'
BBYAA

The sed editing command y/// translates the characters in its first argument to the corresponding characters in its second argument, a bit like the tr utility does. This is done in a single operation, so you don't need to use a temporary for the swap of A and B in y/AB/BA/. In general, y/// is much faster in translating single characters than what e.g. s///g is (since no regular expressions are involved), and it's also able to insert newlines into strings with \n, which the standard s/// command can't do (s/// in GNU sed can obviously do this as a non-portable convenience extension).

The & character in the replacement part of the s/// command will be replaced by whatever the expression in the first argument matched, so s/[AB]/&&/g would double up any A or B character in the input data.


For multi-character substrings, assuming the substrings are distinct (i.e. one substring is not found in the other, as in the case of oo and foo), use something like

$ echo fooxbar | sed 's/foo/@/g; s/bar/foofoo/g; s/@/barbar/g'
barbarxfoofoo

I.e., swap the two strings via an intermediate string not otherwise found in the data. Note that the intermediate string could be any string not found in the data, not just a single character.

Kusalananda
  • 333,661
8

With awk, you can use pattern1 as the field separator FS and the replacement1 as output field separator OFS. Then, loop over each field and replace pattern2 by replacement2:

awk '{for (f=1;f<=NF;f++){gsub(p,r,$f)} $1=$1}1' FS=A OFS=BB p=B r=AA file

The point of $1=$1 is to force a rebuild of the record, else it would fail for 0A for example.

This is POSIX compliant and involves no intermediary string so it is foolproof.

AdminBee
  • 22,803
Quasímodo
  • 18,865
  • 4
  • 36
  • 73
6

I suppose one solution is to replace first A or B for another character not present in the string, then replace that character. This way the switching between A and B is avoided. Although a chain of sed's is needed:

$ echo AYB | sed -e 's/A/#/g' -e 's/B/AA/g' -e 's/#/BB/g'
BBYAA
  • 1
    Interesting. That sounds like a good workaround if the third character is something that never appears in the original input. However, if the input can contain #s then this would basically fall into the same problem as before... – hugomg Jan 31 '21 at 01:55
  • @hugomg indeed, I clarified that in the answer. It all depends I guess if you know beforehand what characters compose the string. Is it known, or could be any string? – schrodingerscatcuriosity Jan 31 '21 at 01:58
  • In my case the replacement strings are known ahead of time but the input text could be anything. – hugomg Jan 31 '21 at 02:09
  • 3
    Your temporary character doesn't need to be a printable one, I like using CTRL-G in these cases. This has the additional advantage of literally ringing a bell if something goes wrong and you cat a file which still has a replacement character. – Guntram Blohm Jan 31 '21 at 10:36
  • 1
    you can use easy single "special ascii chars" to do the intemediary replacements with: sed -e "s/A/$(printf '\001')/g" -e 's/B/AA/g' -e "s/$(printf '\001')/A/g" # this makes it easy to have several such intemediary and be relatively "sure" that they are not present in the input file ( \001, \002, etc .. .just be careful as some values are : newline, linefeed, tab, backspace, etc, that may also be present in the original file) – Olivier Dulac Feb 01 '21 at 03:31
6

You can do this with GNU sed by changing all As to the record separator \n, which surely won't be present.

echo AYB |
sed -e '
  y/A/\n/
  s/[\nB]/&&/g
  y/\nB/BA/
'
BBYAA
guest_7
  • 5,728
  • 1
  • 7
  • 13
6

You can accomplish this for arbitrary input text with a sequence of straight substitutions using sed:

sed 's/Q/Qz/g; s/A/Qa/g; s/B/AA/g; s/Qa/BB/g; s/Qz/Q/g;'

Open a token-space to hold or represent intermediate/arbitrary values

There's a general solution to this type of issue for situations where the only thing you can do is straight substitutions, and which works for arbitrary input text.

The trick is that you first create a token-space (i.e. variable space) within the intermediate copy of the text which can be used to represent arbitrary values such that the tokens you use in later substitutions can't exist in the intermediate copy of the text as you're making further substitutions. For example:

s/Q/Qz/g

This makes it such that the text can no longer contain any Q followed by anything other than z, and every Qz actually represents a Q. That means you are free to use Q followed by any character other than z to represent whatever you desire.

In this specific case, you also can not use QA and QB, because you want to make substitutions of the single A and B characters.

So, to produce the overall substitution which you desire, you would perform the following sequence of substitutions:

s/Q/Qz/g  # Open your Q* token-space.
          # You can now use any Q* other than Qz, QA, and QB to represent another value.
          # The restriction to not use QA and QB is only because this specific case requires
          # substituting for the single A and B characters.
s/A/Qa/g  # Temporarily represent all A as Qa.
s/B/AA/g  # Change all B to AA, as desired in the question.
s/Qa/BB/g # Change all Qa placeholders with BB.
s/Qz/Q/g  # Restore all Q, closing the token-space.

So, overall, this could be written as:

sed 's/Q/Qz/g; s/A/Qa/g; s/B/AA/g; s/Qa/BB/g; s/Qz/Q/g;'

This produces the output:

 $ echo AYB | sed 's/Q/Qz/g; s/A/Qa/g; s/B/AA/g; s/Qa/BB/g; s/Qz/Q/g;'
 BBYAA

What's happening:

Substitution Text in buffer Comment
s/Q/Qz/g AYB Open your Q* token-space. This does nothing with this example input, but assures this will work with arbitrary input text.
s/A/Qa/g QaYB Temporarily represent all A as Qa.
s/B/AA/g QaYAA Change all B to AA, as desired in the question.
s/Qa/BB/g BBYAA Change all Qa placeholders with BB.
s/Qz/Q/g BBYAA Restore all Q, closing the token-space. Again, does nothing with this example input text, but needed in order to work with arbitrary input text.

You can use any character for the first character in your token-space

For the first character of your token-space, you can use any single character. I generally use Q, because it has a low frequency of occurrence in text, is a letter in the ASCII character set, and is easy to remember. Z would also fit. If you are using a wider character set, it is advantageous to choose a character, perhaps a symbol, which has an even lower frequency of use. For performance reasons, the ideal would be to pick a character which doesn't exist in the text which you will be using, such that the first and last substitutions don't actually do anything. However, that's only a performance and space concern, not a functional concern. In other words, it will function with whatever character you choose, but it will be faster if it has to do less work.

Concerns

  1. Any patterns which might specifically rely on the character you use as the start of your token-space must account for it being represented by the token you've chosen. In the above case, if you are wanting to do some other substitutions with Q characters, then you need to either account for all Q being represented in the intermediate text as Qz or do those substitutions before opening your token-space or after closing it. In general, you can resolve this by just choosing a different character to be the starting character for your token-space.
  2. This is neither the most CPU efficient nor space efficient way to accomplish this. Real programming languages can accomplish this type of task substantially more efficiently using various other constructs. This is, however, a good technique to have in your back pocket when making substitutions is the tool you have available.
Makyen
  • 160
  • 1
  • 2
  • 7
4

It's OK if it uses something other than sed or tr.

Dyalog APL's ⎕R ("quad-R") PCRE Replacement operator (full documentation) does parallel matching:

patterns ← ⎕JSON'["A","B"]'
substitutions ← ⎕JSON'["BB","AA"]'
Transform ← patterns ⎕R substitutions
⎕← Transform ⍞

Try it online!

In the above, I've spelled out everything for clarity, but to create a function that does this, you only need:

(,¨'AB')⎕R'BB' 'AA'

Try it on TryAPL!

Full disclosure: I work for Dyalog Ltd., the vendor of Dyalog APL.

Adám
  • 181
1

Here is another solution based on a loop:

#!/bin/bash

InString="AYB" OutString=""

for (( i=1; i <= ${#InString}; i++ )) do char=$(expr substr "$InString" $i 1) case $char in 'A') OutString+="BB";; 'B') OutString+="AA";; *) OutString+="$char";; esac done echo $OutString

fpmurphy
  • 4,636
0
#stuff match=>replacement pairs and export them
export vA="X"
export vBB="Y"
export vAB="Z"


#GNU sed used here

# dynamically build the regex
re=
for i in A BB AB; do
  re=${re-}${re:+\|}${i}
done
echo "regex = $re" 

#give it a spin... 
echo ABBA |
sed -E \
  -e 'p' \
  -e "s/($re)/\n&\n/g;:a" \
  -e "s/^([^\n]*)\n($re)\n(.*)/echo \"\1\${v\2}\3\"/e;ta" \
;
regex = A|BB|AB
ABBA
ZBX

As we notice that sed uses the longest match when matching at a given position in the input string, unlike in perl which picks up the leftmost match. So the onus of sorting by length is not on the user.

guest_7
  • 5,728
  • 1
  • 7
  • 13