Replace exact strings in a file based on a list of strings and a list of corresponding replacements

Question

I'm trying to do a dictionary-based search and replace, but I'm having trouble figuring out how to make it a case-sensitive, exact match.

I have three files: fileA is the text to be edited, fileB is the list of words to search for, and fileC is the list of words that will be the replacements.

paste -ds///g /dev/null /dev/null <(sed 's|[[\.*^\b$\b/]|\\&|g' fileB) <(sed 's|[\&/]|\\\b&\b|g' fileC) /dev/null /dev/null | sed -f - fileA

As far as I can understand, for sed to search and replace exact matches I need to do something like sed 's/\<exact_word_to_replace\>/exact_replacement/g' filename

But I really cannot figure out where in my code above the \< and \> are supposed to go!

Would \b be better? And if so, where would that go?

Hope someone can push me in the right direction here...

Cheers, Nb

It's based on this: https://unix.stackexchange.com/a/271108

Does this answer your question? How to perform replacements defined in one file on another file — Philippos, May 02 '22 at 07:09
You have not included any details about your files. Adding a sample test, with input and output file, would be enlightening. — thanasisp, May 02 '22 at 08:20
Oh, sorry... FileA is just a text of many words (sentences), FileB and FileC are one word per line.... Problem that occurs is that when I search to replace 'house' with 'dwelling', it would also replace the 'house' in 'houseboat', so I end up with nonsense like 'dwellingboat'... For example. — noisebob, May 02 '22 at 08:35

cas · Accepted Answer · 2022-05-03T09:29:25.480

I would not use paste and sed for this at all. I'd use awk or perl. For example:

First, some sample input files. Note that (for my own convenience) I have changed the meanings of File[ABC] - Files A and B are the search patterns and corresponding replacements. FileC is the input text file to be modified.

The important thing is that the file containing search terms is the first argument to the script, and the file containing replacement strings is the second. The actual input to be modified comes from the third (and subsequent, if any) argument(s) and/or from stdin.

$ cat FileA
house
$ cat FileB
dwelling
$ cat FileC
Mr House does not live in a land-based house, his house is a houseboat.

And a perl script. Save this as, say, replace.pl and make it executable with chmod +x replace.pl:

$ cat replace.pl 
#!/usr/bin/perl
use strict;
# Variables to hold the first two filenames.
my $FileA = shift;
my $FileB = shift;
# An associative array ("hash") called %RE. The keys are the search
# regexes and the values are the replacements.
my %RE;
# Read both FileA and FileB at the same time, to build a
# hash of pre-compiled regular expressions (%RE) and their
# replacements.
open(my $A,'<',$FileA) || die "Couldn't open $FileA for read: $!\n";
open(my $B,'<',$FileB) || die "Couldn't open $FileB for read: $!\n";
while(my $a = <$A>) { # loop reading lines from first file
  die "$FileA is longer than $FileB" if (eof $B);
  my $b = <$B>; # read in a line from 2nd file
  die "$FileB is longer than $FileA" if (eof $A && ! eof $B);
  chomp($a,$b);
  # Uncomment only ONE of the following four lines:
  $RE{qr/\b$a\b/} = $b;                 # regular expression match
  #$RE{qr/\b\Q$a\E\b/} = $b;            # exact-match version.
  #$RE{qr/(?<!-)\b$a\b(?!-)/} = $b;     # regexp match, no hyphen allowed
  #$RE{qr/(?<!-)\b\Q$a\E\b(?!-)/} = $b; # exact match, no hyphen allowed.
}
close($A);
close($B);
# process stdin and/or any remaining filename argument(s) on
# the command line (e.g. FileC).
while (<>) {
  foreach my $a (keys %RE) {
    s/$a/$RE{$a}/g;
  };
  print;
}

Notes:

perl's chomp function removes the trailing input record separator ($/ - the end-of-line character(s), e.g. newline or CR+LF, depending on text file type and operating system) from a variable or list of variables. See perldoc -f chomp.
perl's qr quoting operator returns a compiled regular expression. See perldoc -f qr for details.
pre-compiling the regular expressions will make little or no difference if the search, replacement, and text files are all small. It will make a huge difference in performance if the list of searches & replaces (Files A and B) is long or the input (FileC) is large: the overhead of compiling the same regular expressions over and over adds up to a significant amount of CPU time.
the regular expression is compiled from \b$a\b, so contains the zero-width word-boundary markers around the value from FileA. See man perlre and search for word boundary. "Zero-width" means that \b just asserts what we expect to see there without actually matching and consuming any of the input text. Other examples of zero-width assertions include ^ (start of line anchor) and $ (end of line anchor). Search for Assertions in the same man page.
If you want the patterns from FileA to be treated as fixed strings (i.e. treat all regex metacharacters like * or ? as literal strings with no special meaning), then surround the pattern with \Q and \E to disable (quote) metacharacters. It is important that the \bs be outside of the \Q and \E. I've added a commented out example. This is also documented in man perlre.
the script will break if any pattern in FileA ends with an un-escaped \ character. Also, any pattern containing \E may cause it to break if you use the fixed-string version, and \Q in the non-fixed-string version will also cause problems. Garbage in, garbage out: sanitise your input.
also in man perlre: perl defines word characters (\w) as: alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks
hyphens and most other punctuation characters terminate words. houseboat in FileC will be unchanged, but house-boat would be changed to dwelling-boat, and share-house would be changed to share-dwelling. This is less than ideal.

This can be fixed by changing the script to use zero-width negative lookahead and lookbehind assertions ((?!pattern) and (?<!pattern) respectively) for hyphen chars in the RE - e.g. $RE{qr/(?<!-)\b$a\b(?!-)/} = $b; or $RE{qr/(?<!-)\b\Q$a\E\b(?!-)/} = $b;. In short, these tell perl's regexp engine "DO NOT MATCH if there is a - before or after the pattern we're looking for".

Using zero-width assertions here (rather than just negated character classes like [^-]) is important: it prevents the RE from gobbling up the characters next to the match (for the same reason that the zero-width assertion \b doesn't actually match or consume any character in the input). Again, this is documented in man perlre, search for Lookaround Assertions.

I've added examples for these to the script too.
The /i modifier is not used, so regex matches will be case-sensitive.
This script has very primitive argument processing. If you need something better, use one of perl's many command-line argument/option processing modules, e.g. Getopt::Std or Getopt::Long. These are both core perl modules and are included with perl.

And finally, some sample output:

$ ./replace.pl FileA FileB FileC
Mr House does not live in a land-based dwelling, his dwelling is a houseboat.

If you want the script to actually change each individual input file (instead of just printing it/them to stdout), change the first line from:

#!/usr/bin/perl

to

#!/usr/bin/perl -i

or (if you want the original file(s) saved as .bak):

#!/usr/bin/perl -i.bak

BTW, even with the -i in-place edit option, this script will still work if the input comes from stdin rather than a file.

Omg, thank you so much... I need to read that a couple of times more and try it out, but thank you for the EDUCATIONAL reply. — noisebob, May 03 '22 at 08:38
No problem. I fixed a problem with the lookahead (using [[:punct:]] instead of just - made it match commas too) and added both lookbehind to the search regexp and some more explanation. So you probably want to copy-paste another copy of the new & improved script. — cas, May 03 '22 at 09:32
