I would not use paste and sed for this at all. I'd use awk or perl. For example:
First, some sample input files. Note that (for my own convenience) I have changed the meanings of File[ABC]: FileA and FileB are the search patterns and corresponding replacements, and FileC is the input text file to be modified.
The important thing is that the file containing search terms is the first argument to the script, and the file containing replacement strings is the second. The actual input to be modified comes from the third (and subsequent, if any) argument(s) and/or from stdin.
$ cat FileA
house
$ cat FileB
dwelling
$ cat FileC
Mr House does not live in a land-based house, his house is a houseboat.
And a perl script. Save this as, say, replace.pl and make it executable with chmod +x replace.pl:
$ cat replace.pl
#!/usr/bin/perl
use strict;

# Variables to hold the first two filenames.
my $FileA = shift;
my $FileB = shift;

# An associative array ("hash") called %RE. The keys are the search
# regexes and the values are the replacements.
my %RE;

# Read both FileA and FileB at the same time, to build a
# hash of pre-compiled regular expressions (%RE) and their
# replacements.
open(my $A,'<',$FileA) || die "Couldn't open $FileA for read: $!\n";
open(my $B,'<',$FileB) || die "Couldn't open $FileB for read: $!\n";

while(my $a = <$A>) {   # loop reading lines from first file
  die "$FileA is longer than $FileB" if (eof $B);
  my $b = <$B>;         # read in a line from 2nd file
  die "$FileB is longer than $FileA" if (eof $A && ! eof $B);
  chomp($a,$b);

  # Uncomment only ONE of the following four lines:
  $RE{qr/\b$a\b/} = $b;                    # regular expression match
  #$RE{qr/\b\Q$a\E\b/} = $b;               # exact-match version.
  #$RE{qr/(?<!-)\b$a\b(?!-)/} = $b;        # regexp match, no hyphen allowed
  #$RE{qr/(?<!-)\b\Q$a\E\b(?!-)/} = $b;    # exact match, no hyphen allowed.
}
close($A);
close($B);

# process stdin and/or any remaining filename argument(s) on
# the command line (e.g. FileC).
while (<>) {
  foreach my $a (keys %RE) {
    s/$a/$RE{$a}/g;
  };
  print;
}
Notes:
Perl's chomp function removes the trailing input record separator ($/, the end-of-line character(s), e.g. newline or CR+LF, depending on text file type and operating system) from a variable or list of variables. See perldoc -f chomp.
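To illustrate what chomp does, here is a minimal sketch. The helper name strip_eol is hypothetical and not part of replace.pl. It assumes $/ is at its default value of "\n", in which case a carriage return left over from a CR+LF line is not removed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of replace.pl): chomp a copy of the line
# and report how many characters were removed.
sub strip_eol {
    my ($line) = @_;
    my $removed = chomp($line);   # chomp returns the number of characters removed
    return ($line, $removed);
}

my ($text, $n) = strip_eol("house\n");
print "[$text] removed=$n\n";     # prints "[house] removed=1"

# With the default $/ ("\n"), only the newline is stripped from a
# "\r\n" ending; the carriage return stays in the string.
my ($crlf) = strip_eol("house\r\n");
print "CR still present\n" if $crlf =~ /\r\z/;
```

If the pattern or replacement files may have CR+LF endings and are read on a Unix-like system, something like s/\r?\n\z// instead of chomp is one way to handle both cases.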
Perl's qr quoting operator returns a compiled regular expression. See perldoc -f qr for details.
Pre-compiling the regular expressions will make little or no difference if the search, replacement, and text files are all small. It can make a large difference in performance if the list of searches and replacements (Files A and B) is long and/or the input (FileC) is large, because the overhead of repeatedly compiling the same regular expressions adds up to a significant amount of CPU time.
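One caveat worth knowing: perl hash keys are always plain strings, so a qr// object used as a hash key is stored in its stringified form rather than as a compiled object. A possible variant, sketched below, keeps the compiled objects in a list of [regex, replacement] pairs, which also applies the rules in the same order as the lines of FileA. The sub names build_rules and apply_rules are hypothetical and not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of the %RE hash: an ordered list of
# [compiled regex, replacement] pairs.
sub build_rules {
    my ($patterns, $replacements) = @_;   # two array refs of equal length
    my @rules;
    for my $i (0 .. $#$patterns) {
        # qr// compiles each pattern once, when the rule list is built
        push @rules, [ qr/\b$patterns->[$i]\b/, $replacements->[$i] ];
    }
    return @rules;
}

# Apply every rule, in order, to one line of text.
sub apply_rules {
    my ($line, @rules) = @_;
    for my $rule (@rules) {
        my ($re, $replacement) = @$rule;
        $line =~ s/$re/$replacement/g;
    }
    return $line;
}

my @rules = build_rules([ 'house', 'boat' ], [ 'dwelling', 'ship' ]);
print apply_rules("his house is a houseboat, not a boat.\n", @rules);
# prints "his dwelling is a houseboat, not a ship."
```

Because the list is ordered, an earlier replacement can be picked up by a later rule, so the order of lines in FileA matters in this variant.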
The regular expression is compiled from \b$a\b, so it contains the zero-width word-boundary markers around the value from FileA. See man perlre and search for word boundary. "Zero-width" means that \b just asserts what we expect to see there without actually matching and consuming any of the input text. Other examples of zero-width assertions include ^ (start of line anchor) and $ (end of line anchor). Search for Assertions in the same man page.
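A small sketch of what "zero-width" means in practice. The helper matched_text is hypothetical. It returns exactly the text a pattern matched, so you can see that \b adds nothing to the match, while matching the neighbouring characters with \W does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the exact text matched by a pattern,
# or undef if there is no match.
sub matched_text {
    my ($re, $text) = @_;
    # @- and @+ hold the start and end offsets of the last successful match
    return $text =~ $re ? substr($text, $-[0], $+[0] - $-[0]) : undef;
}

my $text = "a house, a houseboat";

# \b asserts a boundary but consumes nothing: the match is just "house".
print "[", matched_text(qr/\bhouse\b/, $text), "]\n";   # prints "[house]"

# Matching the neighbouring characters instead consumes them.
print "[", matched_text(qr/\Whouse\W/, $text), "]\n";   # prints "[ house,]"
```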
If you want the patterns from FileA to be treated as fixed strings (i.e. treat all regex metacharacters like * or ? as literal characters with no special meaning), then surround the pattern with \Q and \E to disable (quote) metacharacters. It is important that the \b assertions be outside of the \Q and \E. I've added a commented-out example. This is also documented in man perlre.
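To see the difference between the regex and fixed-string versions, here is a minimal sketch. The sub count_matches and its $fixed flag are hypothetical. The two patterns it builds mirror the first two alternatives in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the matches of a FileA-style pattern in a
# text, treating the pattern either as a regex or as a fixed string.
sub count_matches {
    my ($pattern, $text, $fixed) = @_;
    my $re = $fixed ? qr/\b\Q$pattern\E\b/   # metacharacters quoted
                    : qr/\b$pattern\b/;      # metacharacters active
    my $count = () = $text =~ /$re/g;        # count all matches
    return $count;
}

my $text = "a.b and axb";
print count_matches('a.b', $text, 0), "\n";   # prints 2: "." matches any character
print count_matches('a.b', $text, 1), "\n";   # prints 1: "." is a literal dot
```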
The script will break if any pattern in FileA ends with an un-escaped \ character. Also, any pattern containing \E may cause it to break if you use the fixed-string version. And \Q in the non-fixed string version will also cause problems. Garbage in, garbage out. Sanitise your input.
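As one possible way of sanitising the regex (non-fixed-string) patterns, the sketch below rejects a pattern that ends in an un-escaped backslash or that perl refuses to compile. The sub check_pattern is hypothetical, and it only covers those two cases, not everything that could go wrong.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a description of the problem with a
# pattern, or undef if none of the checked problems was found.
sub check_pattern {
    my ($pattern) = @_;

    # An odd number of trailing backslashes would escape the first
    # character of the "\b" that the script appends to the pattern.
    return "ends with an un-escaped backslash"
        if $pattern =~ /(\\+)\z/ && length($1) % 2;

    # Compile errors in an interpolated pattern are raised at run time,
    # so eval can trap them.
    my $re = eval { qr/\b$pattern\b/ };
    return "does not compile" unless defined $re;

    return;
}

for my $pattern ('house', 'hou(se', 'house\\') {
    my $problem = check_pattern($pattern);
    print defined $problem ? "rejected ($problem): $pattern\n"
                           : "ok: $pattern\n";
}
```

In the main script, a check like this could be run on $a right after the chomp, with a die or a warning for rejected lines.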
Also in man perlre: perl defines word characters (\w) as alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks.
Hyphens and most other punctuation characters terminate words. houseboat in FileC will be unchanged, but house-boat would be changed to dwelling-boat, and share-house would be changed to share-dwelling. This is less than ideal.
This can be fixed by changing the script to use zero-width negative lookahead and lookbehind assertions ((?!pattern) and (?<!pattern) respectively) for hyphen chars in the RE, e.g. $RE{qr/(?<!-)\b$a\b(?!-)/} = $b; or $RE{qr/(?<!-)\b\Q$a\E\b(?!-)/} = $b;. In short, these tell perl's regexp engine "DO NOT MATCH if there is a - before or after the pattern we're looking for".
Using zero-width assertions here (rather than just negated character classes like [^-]) is important: it prevents the RE from gobbling up the neighbouring characters (for the same reason that the zero-width assertion \b doesn't actually match or consume any character in the input). Again, this is documented in man perlre, search for Lookaround Assertions.
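The three approaches can be compared side by side. This is a minimal sketch with a hypothetical helper, replace_with, and a made-up sample string. The first two patterns are the ones from the script, and the third is the negated-character-class version that the note above warns against.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply one global substitution to a copy of the text.
sub replace_with {
    my ($re, $replacement, $text) = @_;
    $text =~ s/$re/$replacement/g;
    return $text;
}

my $text = "share-house, house-boat, house house";

# Plain word boundaries: hyphenated words are changed too.
print replace_with(qr/\bhouse\b/, 'dwelling', $text), "\n";
# prints "share-dwelling, dwelling-boat, dwelling dwelling"

# Lookarounds: hyphenated words are left alone, nothing extra is consumed.
print replace_with(qr/(?<!-)\bhouse\b(?!-)/, 'dwelling', $text), "\n";
# prints "share-house, house-boat, dwelling dwelling"

# Negated character classes: the neighbouring spaces are consumed and
# replaced, and the last "house" is missed because nothing follows it.
print replace_with(qr/[^-]\bhouse\b[^-]/, 'dwelling', $text), "\n";
# prints "share-house, house-boat,dwellinghouse"
```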
I've added examples for these to the script too.
The /i modifier is not used, so regex matches will be case-sensitive.
This script has very primitive argument processing. If you need something better, use one of perl's many command-line argument/option processing modules, e.g. Getopt::Std or Getopt::Long. These are both core perl modules and are included with perl.
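As a rough sketch of what that could look like with Getopt::Long: the option names --search, --replace and --fixed below are my own invention for illustration, as is the sub parse_args. The original script simply takes FileA and FileB as its first two positional arguments.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical option names; anything left over after option parsing is
# treated as an input file (like FileC).
sub parse_args {
    my @args = @_;
    my %opt = (fixed => 0);
    GetOptionsFromArray(\@args, \%opt, 'search=s', 'replace=s', 'fixed!')
        or die "Usage: $0 --search FileA --replace FileB [--fixed] [file ...]\n";
    die "Both --search and --replace are required\n"
        unless defined $opt{search} && defined $opt{replace};
    return (\%opt, @args);
}

my ($opt, @files) = parse_args(qw(--search FileA --replace FileB --fixed FileC));
print "search=$opt->{search} replace=$opt->{replace} ",
      "fixed=$opt->{fixed} files=@files\n";
# prints "search=FileA replace=FileB fixed=1 files=FileC"
```

In a real script you would call parse_args(@ARGV) and then assign the leftover file names back to @ARGV so that the while (<>) loop still reads them.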
And finally, some sample output:
$ ./replace.pl FileA FileB FileC
Mr House does not live in a land-based dwelling, his dwelling is a houseboat.
If you want the script to actually change each individual input file (instead of just printing it/them to stdout), change the first line from:
#!/usr/bin/perl
to
#!/usr/bin/perl -i
or (if you want the original file(s) saved as .bak):
#!/usr/bin/perl -i.bak
BTW, even with the -i in-place edit option, this script will still work if the input comes from stdin rather than a file.