String replacement using a dictionary

Question

What is a good way to do string replacements in a file using a dictionary with a lot of substituend-substituent pairs? By a lot, I actually mean about 20: not many, but enough that I want to organize them neatly.

I would like to collect all substituend-substituent pairs in a file dictionary.txt in an easy-to-manage way, since I need to replace a lot of stuff, for example:

"yes"      : "no"
"stop"     : "go, go, go!"
"wee-ooo"  : "ooooh nooo!"
"gooodbye" : "hello"

"high"     : "low"
"why?"     : "i don't know"

Now I want to apply these substitutions in some file novel.txt.

Then I want to run magiccommand --magicflags dictionary.txt novel.txt so that all instances of yes in novel.txt are replaced by no (so even Bayesian would become Banoian), all instances of gooodbye are replaced by hello, and so forth.

So far, the strings I need to replace (and replace with) do not have any quotes (neither single nor double) in them. (It would be nice, though, to see a solution working well with strings containing quotes, of course.)

I know sed and awk / gawk can do such things in principle, but can they also work with such dictionary files? It seems gawk would be the right candidate for magiccommand; what are the right magicflags? How do I need to format my dictionary.txt?

Is perl acceptable? There's a nice trick for compiling up a set of search and replace patterns. — Sobrique, Mar 12 '16 at 23:14

score 3 · Accepted Answer · answered Mar 12 '16 at 19:00

Here's one way with sed:

sed '
s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|
h
s|.*\n||
s|[\&/]|\\&|g
x
s|\n.*||
s|[[\.*^$/]|\\&|g
G
s|\(.*\)\n\(.*\)|s/\1/\2/g|
' dictionary.txt | sed -f - novel.txt

How it works:
The 1st sed turns dictionary.txt into a script-file (editing commands, one per line). This is piped to the 2nd sed (note the -f - which means read commands from stdin) that executes those commands, editing novel.txt.
This requires translating your format

"STRING"   :   "REPLACEMENT"

into a sed command and escaping any special characters in the process for both LHS and RHS:

s/ESCAPED_STRING/ESCAPED_REPLACEMENT/g

So the first substitution

s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|

turns "STRING" : "REPLACEMENT" into STRING\nREPLACEMENT (\n is a newline character). The h command then copies the result to the hold space.
s|.*\n|| deletes the first part, keeping only REPLACEMENT; then s|[\&/]|\\&|g escapes the characters that are special on the RHS (backslash, & and the / delimiter).
x then exchanges the hold space with the pattern space, s|\n.*|| deletes the second part, keeping only STRING, and s|[[\.*^$/]|\\&|g escapes the characters that are special on the LHS.
The content of the hold space is then appended to the pattern space via G, so the pattern space now contains ESCAPED_STRING\nESCAPED_REPLACEMENT.
The final substitution

s|\(.*\)\n\(.*\)|s/\1/\2/g|

transforms it into s/ESCAPED_STRING/ESCAPED_REPLACEMENT/g
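
To make the translation step concrete, the first sed can be run on its own and its output inspected before it is applied. The following is a minimal sketch, assuming a POSIX-like shell with mktemp available; the sample dictionary entries (including "a/b" : "x&y") and the input line are made up for illustration. The grep -v filter is an addition for this sketch, not part of the answer: a blank dictionary line does not match the first substitution and would otherwise come out as the command s///g. The generated script is passed with -e here instead of -f -, purely to keep the sketch independent of how a given sed handles -f -.

```shell
# Hypothetical sample dictionary; the blank line mimics the layout in the question.
dict=$(mktemp)
cat > "$dict" <<'EOF'
"yes"      : "no"

"why?"     : "i don't know"
"a/b"      : "x&y"
EOF

# Drop blank lines (an addition for this sketch), then run the
# translation commands from the answer to build one s command per entry.
generated=$(grep -v '^[[:blank:]]*$' "$dict" | sed '
s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|
h
s|.*\n||
s|[\&/]|\\&|g
x
s|\n.*||
s|[[\.*^$/]|\\&|g
G
s|\(.*\)\n\(.*\)|s/\1/\2/g|
')
rm -f "$dict"

# Show the generated sed script.
printf '%s\n' "$generated"

# Apply the generated script to a made-up line of text.
result=$(printf '%s\n' 'A Bayesian says yes. why? a/b' | sed -e "$generated")
printf '%s\n' "$result"
```

With this sample input the generated script should be s/yes/no/g, s/why?/i don't know/g and s/a\/b/x\&y/g (one per line), and the sample text should come out as "A Banoian says no. i don't know x&y". Note that the commands run in dictionary order, so a later entry can rewrite text produced by an earlier one.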

score 1 · Answer 2 · answered Mar 13 '16 at 02:04

Here's a perl version. It creates a hash containing pre-compiled regular expressions and then loops through each line of input applying all the regexes to each line. perl's -i is used for "in place editing" of the input file. You can easily add or change any of the regexes or replacement strings.

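Pre-compiling the regexes using qr// can improve the speed of the script, which is most noticeable if there are a lot of regexes and/or a lot of input lines to process. (Note that perl hash keys are always strings, so the qr// objects used as keys below are stringified and are compiled again when they are interpolated into s///.)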

#! /usr/bin/perl -i

use strict;

# the dictionary is embedded in the code itself.
# see 2nd version below for how to read dict in
# from a file.
my %regex = (
    qr/yes/      => 'no',
    qr/stop/     => 'go, go, go!',
    qr/wee-ooo/  => 'ooooh nooo!',
    qr/gooodbye/ => 'hello',
    qr/high/     => 'low',
    qr/why\?/    => 'i don\'t know',
);

while (<>) {
      foreach my $key (keys %regex) {
            s/$key/$regex{$key}/g;
      }
      print;  # with -i, the printed line is what gets written back to the file
}

Here's another version which reads in the dictionary from the first filename on the command line, while still processing the second (and optional subsequent) filenames:

#! /usr/bin/perl -i

use strict;

# the dictionary is read from a file.
#
# file format is "searchpattern replacestring", with any
# number of whitespace characters (space or tab) separating
# the two fields.  You can add comments or comment out dictionary
# entries with a '#' character.
#
# NOTE: if you want to use any regex-special characters as a
# literal in either $searchpattern or $replacestring, you WILL
# need to escape them with `\`.  e.g. for a literal '?', use '\?'.
#
# this is very basic and could be improved.  a lot.

my %regex = ();

my $dictfile = shift ;
open(DICT,'<',$dictfile) || die "couldn't open $dictfile: $!\n";
while(<DICT>) {
    s/#.*// unless (m/\\#/); # remove comments, unless escaped.
                             # easily fooled if there is an escaped 
                             # '#' and a comment on the same line.

    s/^\s*|\s*$//g ;         # remove leading & trailing spaces
    next if (/^$/) ;         # skip empty lines

    # split into at most 2 fields so the replacement may contain spaces
    my($search, $replace) = split ' ', $_, 2;
    $regex{qr/$search/} = $replace;
};
close(DICT);


# now read in the input file(s) and modify them.
while (<>) {
      foreach my $key (keys %regex) {
            s/$key/$regex{$key}/g;
      }
      print;  # with -i, the printed line is what gets written back to the file
}

score 1 · Answer 3 · answered Mar 13 '16 at 22:27

Started writing this as a comment, but it got too complicated, hence a second perl answer. Given your source file, you can use a neat perl trick to build a regex:

#!/usr/bin/env perl

use strict;
use warnings; 
use Data::Dumper;

#build key-value pairs
my %replace = map { /"(.+)"\s*:\s*"(.+)"/ } <DATA>;
print Dumper \%replace; 

#take the keys of your hash, then build into capturing regex
my $search = join ( "|", map {quotemeta} keys %replace ); 
$search = qr/($search)/;

print "Using match regex of: $search\n";

#read stdin or files on command line, line by line
while ( <> ) { 
    #match regex repeatedly, replace with contents of hash. 
    s/$search/$replace{$1}/g;
    print;
}

__DATA__
"yes"      : "no"
"stop"     : "go, go, go!"
"wee-ooo"  : "ooooh nooo!"
"gooodbye" : "hello"

"high"     : "low"
"why?"     : "i don't know"

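We generate the hash by applying the pattern match to each line of DATA: in list context the match returns its two captures, so map yields a list of key-value pairs (lines that do not match, such as the blank one, contribute nothing).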

We build a search regex, and use the values captured in it to replace.

<> is perl's magic filehandle: it reads STDIN or the files specified on the command line, much like sed does. (You can open a file and read it 'normally' for the dictionary; the use of DATA is purely illustrative.)
