Using Raku (formerly known as Perl_6)
~$ raku -pe 'BEGIN my %h = ( \
"One" => "Ten", \
"Two" => "Twenty", \
"Three" => "Thirty", \
"Four" => "Forty", \
q[$$MOON$$] => "SUN", \
q[#LATER] => "SNOW"); \
s:g/ [ ^ | <punct>+ | <blank>+] <( @(%h.keys) )> [ <punct>+ | <blank>+ | $ ] /%h{$/}/;' file
Here's an answer written in Raku, a programming language in the Perl family. Above, the sed-like autoprinting command-line flags -pe are used, and a hash %h is declared inline. Note that $ must be escaped inside double quotes; however, "\$\$MOON\$\$" can be written q[$$MOON$$] as above, reducing the need for backslashes.
The meat of the substitution is s///, which uses the :g (global) modifier. Within the match domain (left half), the hash keys are coerced to an @-sigiled array with @(%h.keys), and these are understood as literal strings within the match domain. In the substitution domain (right half), the $/ match variable is used to look up the corresponding key's value, which is substituted in.
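As a minimal sketch of the interpolation described above, the same idea can be written as a short script. The two-entry hash and the $line/$out variable names are assumptions made up for illustration; only the s:g/// form comes from the one-liner.

```raku
# Hypothetical two-entry lookup table, just for illustration.
my %h = "One" => "Ten", q[$$MOON$$] => "SUN";

my $line = 'Record One, $$MOON$$';
my $out  = $line;

# @(%h.keys) interpolates the keys as an alternation of literal strings,
# so the "$" characters in a key are not treated as regex anchors.
# On the right-hand side, $/ stringifies to the matched key,
# which is then used to look up the replacement value.
$out ~~ s:g/ @(%h.keys) /%h{$/}/;

say $out;   # Record Ten, SUN
```

No boundary checks are applied here, so this version would also rewrite a key that occurs inside a longer word.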
The problem here is that "words" are usually defined as alphanumeric-plus-_ (underscore). In that case you would use Raku's << (left) and >> (right) zero-width regex anchors, as they represent left and right word boundaries, respectively. Without these boundary markers, something like Fourteen will get incorrectly substituted to Fortyteen. (See the last line of the Sample Input file below: Sample Output shows the correct result.)
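To make the Fourteen/Fortyteen point concrete, here is a small sketch comparing an unanchored substitution with one that uses the << and >> anchors. The one-entry hash and the variable names ($loose, $strict) are hypothetical.

```raku
# Hypothetical one-entry table; only "Four" is mapped.
my %h = "Four" => "Forty";
my $line = 'Four, Fourteen';

# Without anchors, the key also matches inside a longer word.
my $loose = $line;
$loose ~~ s:g/ @(%h.keys) /%h{$/}/;

# With << and >>, the key must sit between word boundaries.
my $strict = $line;
$strict ~~ s:g/ << @(%h.keys) >> /%h{$/}/;

say $loose;    # Forty, Fortyteen
say $strict;   # Forty, Fourteen
```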
Since the OP has requested a solution using keys starting/ending with NON-alphanumeric-plus-_ characters (thus precluding the use of zero-width word-boundary anchors), one way is to try to delineate the possibilities, like so:
s:g/ [ ^ | <punct>+ | <blank>+] <( @(%h.keys) )> [ <punct>+ | <blank>+ | $ ] /%h{$/}/;
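In the regex above, the <( and )> capture markers restrict the match (and therefore the text that gets replaced) to the key itself, while the bracketed groups on either side only have to be present. A sketch of this, run as a script against the last line of the Sample Input with a reduced, hypothetical two-key hash:

```raku
# Reduced table for illustration; single quotes avoid any interpolation.
my %h = 'Four' => 'Forty', '#LATER' => 'SNOW';

my $out = '5, This is a Record Fourteen, Value14, Dummy_val14 Fourteen, #LATER';

# A key is replaced only if it is preceded by start-of-line, punctuation
# or blanks, and followed by punctuation, blanks or end-of-line.
# Only the text between <( and )> is part of the match that is replaced.
$out ~~ s:g/ [ ^ | <punct>+ | <blank>+ ] <( @(%h.keys) )> [ <punct>+ | <blank>+ | $ ] /%h{$/}/;

say $out;
# 5, This is a Record Fourteen, Value14, Dummy_val14 Fourteen, SNOW
```

Fourteen is left alone because the character following Four is a letter, which is neither punctuation, a blank, nor end-of-line.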
Sample Input:
1, This is a Record One, Value1, Dummy_val1 One, $$MOON$$
2, This is a Record Two, Value2, Dummy_val2 Two, #LATER
3, This is a Record Three, Value3, Dummy_val3 Three, #LATER
4, This is a Record Four, Value4, Dummy_val4 Four, $$MOON$$
5, This is a Record Fourteen, Value14, Dummy_val14 Fourteen, #LATER
Sample Output:
1, This is a Record Ten, Value1, Dummy_val1 Ten, SUN
2, This is a Record Twenty, Value2, Dummy_val2 Twenty, SNOW
3, This is a Record Thirty, Value3, Dummy_val3 Thirty, SNOW
4, This is a Record Forty, Value4, Dummy_val4 Forty, SUN
5, This is a Record Fourteen, Value14, Dummy_val14 Fourteen, SNOW
Probably a better (more reliable) way is to choose the NON-word keys more carefully, e.g. making sure that they start/end with NON-word characters (e.g. #LATER# instead of #LATER). Then use two hashes, like so:
~$ raku -pe 'BEGIN my %words = ("One" => "Ten", "Two" => "Twenty", "Three" => "Thirty", "Four" => "Forty") \
andthen my %non-words = (q[$$MOON$$] => "SUN", q[#LATER#] => "SNOW"); \
s:g/ << @(%words.keys) >> /%words{$/}/; \
s:g/ [ ^ | <punct>+ | <blank>+] <( @(%non-words.keys) )> [ <punct>+ | <blank>+ | $ ] /%non-words{$/}/;' file
This code takes the same Sample Input file (after updating #LATER to #LATER#), and produces Sample Output identical to that above.
https://docs.raku.org/language/regexes#Regex_interpolation
https://docs.raku.org/language/regexes
https://docs.raku.org
https://raku.org
[...] I assume you do not want to do partial word matches, only full word matches (e.g. Four should not match the start of Fourteen), right? What characters are word-constituent for your purposes ([[:alnum:]_] is the common set for word-constituent characters but YMMV)? – Ed Morton Dec 16 '21 at 17:21
"if there is a mapping of the new word to another word in the mapping file, it can still be replaced" - I assume that means it's also OK if it's not replaced. Doing such recursive mappings leads to having to handle potentially infinite recursion (e.g. foo maps to bar but then bar maps to foo) so it gets ugly. – Ed Morton Dec 16 '21 at 21:14
[...] ([[:alnum:]_]) characters, and those start/end terminated with non-word characters? – jubilatious1 Jan 23 '24 at 05:19