I'm trying to do a dictionary-based search and replace, but I'm having trouble figuring out how to make it case-sensitive and match whole words only.

I have three files: fileA is the text to be edited, fileB is the list of words to search for, and fileC is the list of words that will be the replacements.

paste -ds///g /dev/null /dev/null <(sed 's|[[\.*^\b$\b/]|\\&|g' fileB) <(sed 's|[\&/]|\\\b&\b|g' fileC) /dev/null /dev/null | sed -f - fileA

As far as I understand, for sed to search and replace exact matches I need to do something like sed 's/\<exact_word_to_replace\>/exact_replacement/g' filename

But I really cannot figure out where, in my code above, the \< and \> are supposed to go!

Would \b be better? And if so, where would that go?

Hope someone can push me in the right direction here...

Cheers, Nb

It's based on this: https://unix.stackexchange.com/a/271108

1 Answer

I would not use paste and sed for this at all. I'd use awk or perl. For example:

First, some sample input files. Note that (for my own convenience) I have changed the meanings of File[ABC] - Files A and B are the search patterns and corresponding replacements. FileC is the input text file to be modified.

The important thing is that the file containing search terms is the first argument to the script, and the file containing replacement strings is the second. The actual input to be modified comes from the third (and subsequent, if any) argument(s) and/or from stdin.

$ cat FileA
house

$ cat FileB
dwelling

$ cat FileC
Mr House does not live in a land-based house, his house is a houseboat.

And a perl script. Save this as, say, replace.pl and make it executable with chmod +x replace.pl:

$ cat replace.pl
#!/usr/bin/perl

use strict;

# Variables to hold the first two filenames.
my $FileA = shift;
my $FileB = shift;

# An associative array ("hash") called %RE. The keys are the search
# regexes and the values are the replacements.
my %RE;

# Read both FileA and FileB at the same time, to build a
# hash of pre-compiled regular expressions (%RE) and their
# replacements.
open(my $A,'<',$FileA) || die "Couldn't open $FileA for read: $!\n";
open(my $B,'<',$FileB) || die "Couldn't open $FileB for read: $!\n";
while(my $a = <$A>) {    # loop reading lines from first file
  die "$FileA is longer than $FileB" if (eof $B);
  my $b = <$B>;          # read in a line from 2nd file
  die "$FileB is longer than $FileA" if (eof $A && ! eof $B);

  chomp($a,$b);

  # Uncomment only ONE of the following four lines:
  $RE{qr/\b$a\b/} = $b;                   # regular expression match
  #$RE{qr/\b\Q$a\E\b/} = $b;              # exact-match version.
  #$RE{qr/(?<!-)\b$a\b(?!-)/} = $b;       # regexp match, no hyphen allowed
  #$RE{qr/(?<!-)\b\Q$a\E\b(?!-)/} = $b;   # exact match, no hyphen allowed.
}
close($A);
close($B);

# process stdin and/or any remaining filename argument(s) on
# the command line (e.g. FileC).
while (<>) {
  foreach my $a (keys %RE) {
    s/$a/$RE{$a}/g;
  };
  print;
}

Notes:

  • perl's chomp function removes the trailing input record separator ($/ - the end-of-line character(s), e.g. newline or CR+LF, depending on text file type and operating system) from a variable or list of variables. See perldoc -f chomp.

  • perl's qr quoting operator returns a compiled regular expression. See perldoc -f qr for details.

  • pre-compiling the regular expressions will make little or no difference if the search, replacement, and text files are all small. It will make a huge difference in performance if the list of searches and replacements (Files A and B) is long or the input (FileC) is large, because the overhead of compiling the same regular expressions over and over adds up to a significant amount of CPU time.

  • the regular expression is compiled from \b$a\b, so contains the zero-width word-boundary markers around the value from FileA. See man perlre and search for word boundary. "Zero-width" means that \b just asserts what we expect to see there without actually matching and consuming any of the input text. Other examples of zero-width assertions include ^ (start of line anchor) and $ (end of line anchor). Search for Assertions in the same man page.

  • If you want the patterns from FileA to be treated as fixed strings (i.e. treat all regex metacharacters like * or ? as literal strings with no special meaning), then surround the pattern with \Q and \E to disable (quote) metacharacters. It is important that the \bs be outside of the \Q and \E. I've added a commented out example. This is also documented in man perlre.

  • the script will break if any pattern in FileA ends with an un-escaped \ character. Also, any pattern containing \E may cause it to break if you use the fixed-string version, and \Q in the non-fixed-string version will also cause problems. Garbage in, garbage out. Sanitise your input.

  • also in man perlre: perl defines word characters (\w) as alphanumeric plus "_", plus other connector punctuation chars, plus Unicode marks.

  • hyphens and most other punctuation characters terminate words. houseboat in FileC will be unchanged, but house-boat would be changed to dwelling-boat, and share-house would be changed to share-dwelling. This is less than ideal.

    This can be fixed by changing the script to use zero-width negative lookahead and lookbehind assertions ((?!pattern) and (?<!pattern) respectively) for hyphen chars in the RE - e.g. $RE{qr/(?<!-)\b$a\b(?!-)/} = $b; or $RE{qr/(?<!-)\b\Q$a\E\b(?!-)/} = $b;. In short, these tell perl's regexp engine "DO NOT MATCH if there is a - before or after the pattern we're looking for".

    Using zero-width assertions here (rather than just negated character classes like [^-]) is important because it prevents the RE from gobbling up the neighbouring character (for the same reason that the zero-width assertion \b doesn't actually match or consume any character in the input). Again, this is documented in man perlre, search for Lookaround Assertions.

    I've added examples for these to the script too.

  • The /i modifier is not used, so regex matches will be case-sensitive.

  • This script has very primitive argument processing. If you need something better, use one of perl's many command-line argument/option processing modules, e.g. Getopt::Std or Getopt::Long. These are both core perl modules and are included with perl.

And finally, some sample output:

$ ./replace.pl FileA FileB FileC
Mr House does not live in a land-based dwelling, his dwelling is a houseboat.

If you want the script to actually change each individual input file (instead of just printing it/them to stdout), change the first line from:

#!/usr/bin/perl

to

#!/usr/bin/perl -i

or (if you want the original file(s) saved as .bak):

#!/usr/bin/perl -i.bak

BTW, even with the -i in-place edit option, this script will still work if the input comes from stdin rather than a file.

cas
  • Omg, thank you so much... I need to read that a couple of times more and try it out, but thank you for the EDUCATIONAL reply. – noisebob May 03 '22 at 08:38
  • No problem. I fixed a problem with the lookahead (using [[:punct:]] instead of just - made it match commas too) and added both lookbehind to the search regexp and some more explanation. So you probably want to copy-paste another copy of the new & improved script. – cas May 03 '22 at 09:32