Sed: modify every non-first word repetition for every word in text

Question

I need to do something like this using sed:

qq    ab xyz     ab qq aa ab

Becomes:

qq    ab xyz     +ab+ +qq+ aa +ab+

Do you want to do that for each line independently, or globally across the file? That is, if qq is found on one line, should the other qqs be replaced on that line only or in the whole file? (comment by Stéphane Chazelas, Dec 03 '14 at 15:17)

Stéphane Chazelas · Accepted Answer · 2014-12-03T15:44:54.560

If your input contains no <, > or + characters, you could do:

sed '
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g'
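To see what the loop is working on, it can help to print the intermediate state. This is only a sketch, assuming a POSIX shell and the sample line from the question; the function names mark_words and mark_repeats are made up for illustration:

```shell
# Step 1 only: wrap every alphanumeric word in <...>.
mark_words() {
  sed 's/[[:alnum:]]\{1,\}/<&>/g'
}

# The full per-line script from above, wrapped in a function.
mark_repeats() {
  sed '
    s/[[:alnum:]]\{1,\}/<&>/g;:1
    s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
    s/[<>]//g'
}

line='qq    ab xyz     ab qq aa ab'
printf '%s\n' "$line" | mark_words
# <qq>    <ab> <xyz>     <ab> <qq> <aa> <ab>
printf '%s\n' "$line" | mark_repeats
# qq    ab xyz     +ab+ +qq+ aa +ab+
```

Each pass of the loop turns one later <word> into +word+, and t1 jumps back to :1 for as long as a substitution succeeded. Whatever is still wrapped in <> when the loop ends is a first occurrence, so the last s command just strips the brackets.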

If the input may contain them, you can always escape them first:

sed '
  s/:/::/g;s/</:{/g;s/>/:}/g
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g
  s/:}/>/g;s/:{/</g;s/::/:/g'
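A quick check of the escaping variant on input that already contains <, > and :. Again just a sketch; the function name mark_repeats_escaped and the sample line are made up for illustration:

```shell
# The escaping variant from above, wrapped in a function.
mark_repeats_escaped() {
  sed '
    s/:/::/g;s/</:{/g;s/>/:}/g
    s/[[:alnum:]]\{1,\}/<&>/g;:1
    s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
    s/[<>]//g
    s/:}/>/g;s/:{/</g;s/::/:/g'
}

printf '%s\n' 'a <a> a:b a' | mark_repeats_escaped
# a <+a+> +a+:b +a+
```

The first line of the script rewrites the literal < and > as :{ and :} (doubling any : first), so the only angle brackets the loop ever sees are the ones the script added itself. The last line undoes the escaping.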

Those assume you want to do that independently on each line. If you want to do it on the whole file, you'd need to load the whole file in memory first (note that some sed implementations have size limitations there):

sed '
  :2
  $!{N;b2
  }
  s/:/::/g;s/</:{/g;s/>/:}/g
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g
  s/:}/>/g;s/:{/</g;s/::/:/g'
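To illustrate the difference from the per-line versions, here is the whole-file variant run on two lines that share their words. This is a sketch with a made-up function name (mark_repeats_whole) and made-up input:

```shell
# The whole-file variant from above, wrapped in a function.
mark_repeats_whole() {
  sed '
    :2
    $!{N;b2
    }
    s/:/::/g;s/</:{/g;s/>/:}/g
    s/[[:alnum:]]\{1,\}/<&>/g;:1
    s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
    s/[<>]//g
    s/:}/>/g;s/:{/</g;s/::/:/g'
}

printf 'qq ab\nab qq\n' | mark_repeats_whole
# qq ab
# +ab+ +qq+
```

The :2 loop appends every following line to the pattern space with N until the last line is reached, so the rest of the script sees the file as one string with embedded newlines. The per-line versions would print both lines unchanged here, since neither line repeats a word on its own.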

That's going to be pretty inefficient though, and it would be a lot easier with perl. Across the whole input:

perl -pe 's/\w+/$seen{$&}++ ? "+$&+" : $&/ge'

Line-based:

perl -pe 'my %seen;s/\w+/$seen{$&}++ ? "+$&+" : $&/ge'

mikeserv · Answer 2 · 2014-12-03T20:37:20.883

Here is another approach, which chains a few seds:

an='[:alnum:]' esc=$(printf '\033\[')
sed "/[${an}]/!d;=;a\ }
    s/.*/ & /;s/[^${an}]\{1,\}/   /g
    s| \([${an}"']\{1,\}\) | \
    s/\\([^+'"${an}"']\\)\\(\1\\)\\([^+'"${an}"']\\)/\\1+\\2+\\3/2|g
' <text |
sed '/^ /!N;s/\n */{/' |
sed -e 's/.*/ & /;s/+/ & /g' \
    -f - \
    -e "s/ //;s/ $//
        s/+[^+ ]\{1,\}+/${esc}38;5;35m&${esc}0m/g
        s/ + /+/g" text

Basically, the first two seds team up to write a script for the third. The first sed clears anything but alphanumeric characters from every line - and skips entirely any line without one. For all of the groups of characters that remain, it writes a substitution statement that the third will eventually (almost instantaneously) read and interpret as its script.

The second sed is necessary because the first writes its script by line - and each line can have several s/// statements. The first prints its line number for each line which contains an alphanumeric, but that needs to be paired up in a function context for the third sed - and so the second does this.

Here is a sample of what the script looks like:

...
43{
    s/\([^+[:alnum:]]\)\(n\)\([^+[:alnum:]]\)/\1+\2+\3/2  
    s/\([^+[:alnum:]]\)\(N\)\([^+[:alnum:]]\)/\1+\2+\3/2  
    s/\([^+[:alnum:]]\)\(G\)\([^+[:alnum:]]\)/\1+\2+\3/2  
 }
44{
    s/\([^+[:alnum:]]\)\(b\)\([^+[:alnum:]]\)/\1+\2+\3/2  
    s/\([^+[:alnum:]]\)\(block\)\([^+[:alnum:]]\)/\1+\2+\3/2  
 }
45{
    s/\([^+[:alnum:]]\)\(END\)\([^+[:alnum:]]\)/\1+\2+\3/2  
    s/\([^+[:alnum:]]\)\(SEDSCRIPT\)\([^+[:alnum:]]\)/\1+\2+\3/2  
 }

There is a trailing 2 for each s/// - this is because each substitution is directed at the second occurrence of each pattern - if a second occurrence does not exist nothing is substituted. The above is the result of running it on another of my sed scripts - it seems pretty immune to special characters or similar.
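The effect of the trailing 2 can be tried in isolation. The sketch below takes one generated statement (the one for ab) and applies it by hand to the question's sample line; the variable names stmt and line are only for illustration, and the padding and unpadding imitate what the third sed does around the generated script:

```shell
# One generated statement for the word "ab"; the trailing 2 means
# "replace only the second match on the line".
stmt='s/\([^+[:alnum:]]\)\(ab\)\([^+[:alnum:]]\)/\1+\2+\3/2'
line='qq    ab xyz     ab qq aa ab'

# Pad the line with a space on each side so the first and last words
# have a neighbour to match, run the statement once, then unpad.
printf '%s\n' "$line" | sed -e 's/.*/ & /' -e "$stmt" -e 's/ //;s/ $//'
# qq    ab xyz     +ab+ qq aa ab

# Run it twice: the already marked +ab+ no longer matches (its
# neighbours are now +), so the next unmarked repeat becomes match 2.
printf '%s\n' "$line" | sed -e 's/.*/ & /' -e "$stmt" -e "$stmt" -e 's/ //;s/ $//'
# qq    ab xyz     +ab+ qq aa +ab+
```

A third run would change nothing, because only one unmarked ab is left and there is no second match to replace. That is why the generated script can simply emit one statement per word occurrence.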

As I was writing I found it was easier to tell what was going on if I colored its selection - which is what...

s/+[^+ ]\{1,\}+/${esc}38;5;35m&${esc}0m/g

... that line does. You can comment it out or remove it if you ever use this and don't want or need it.

Here's the script it writes for your example data:

1{
    s/\([^+[:alnum:]]\)\(qq\)\([^+[:alnum:]]\)/\1+\2+\3/2 
    s/\([^+[:alnum:]]\)\(ab\)\([^+[:alnum:]]\)/\1+\2+\3/2 
    s/\([^+[:alnum:]]\)\(xyz\)\([^+[:alnum:]]\)/\1+\2+\3/2 
    s/\([^+[:alnum:]]\)\(ab\)\([^+[:alnum:]]\)/\1+\2+\3/2 
    s/\([^+[:alnum:]]\)\(qq\)\([^+[:alnum:]]\)/\1+\2+\3/2 
    s/\([^+[:alnum:]]\)\(aa\)\([^+[:alnum:]]\)/\1+\2+\3/2 
    s/\([^+[:alnum:]]\)\(ab\)\([^+[:alnum:]]\)/\1+\2+\3/2
 }

Here's what it prints:

qq    ab xyz     +ab+ +qq+ aa +ab+

And here is what it prints for part of my sed script from before:

        s/\(\(.\)${bs}\2\)\{1,\}/${esc}38;5;35m&${+esc+}0m/g
        s/\(_${bs}[^_]\)\{1,\}/${esc}38;5;75m&${+esc+}0m/g
        s/.${bs}//g
        s/\(\(${esc}\)0m\2[^m]*+m+[_ ]\{,+2+\}\)\{+2+\}/_/g
n;      /./!N;G
