How to read every word from a file and replace it with the substitute word (if found) from another file

Question

I need to read every full word from each line of a file and replace every occurrence of that word (if found anywhere in the file) using sed or awk with the word from another file.

Contents of fileA.txt:

1, This is a Record One, Value1, Dummy_val1 One, $$MOON$$
2, This is a Record Two, Value2, Dummy_val2 Two, #LATER
3, This is a Record Three, Value3, Dummy_val3 Three, #LATER
4, This is a Record Four, Value4, Dummy_val4 Four, $$MOON$$

and then Search_Replace_File.txt gives the info about what word needs to be replaced with which:

One=Ten
Two=Twenty
Three=Thirty
Four=Forty
$$MOON$$=SUN
#LATER=SNOW

The expected output is as below.

1, This is a Record Ten, Value1, Dummy_val1 Ten, SUN
2, This is a Record Twenty, Value2, Dummy_val2 Twenty, SNOW
3, This is a Record Thirty, Value3, Dummy_val3 Thirty, SNOW
4, This is a Record Forty, Value4, Dummy_val4 Forty, SUN

Note:

If an old word is replaced with a new word from the list, and that new word itself has a mapping in the mapping file, it may be replaced again.
The words to be replaced may also include symbols, as in $$MOON$$=SUN and #LATER=SNOW.

I have tried the code below so far, but it doesn't replace the words.

#!/bin/bash
while read var
do
search_string=`echo "$var"|awk -F= '{print $1}'`
replace_string=`echo "$var"|awk -F= '{print $2}'`
sed "s/$searchstring/$replacestring/g" fileA.csv > fileB.csv
done < Search_Replace_File.txt
mv fileB.csv fileA.csv

Do you want the word replaced everywhere it's found on each line or only in the specific positions shown in your example? Can the mapping ever replace an old word with a new word that ALSO has to be mapped to some other word and, if so, how should that be handled? Please don't respond in comments, just [edit] your question to provide all of the missing requirements, update the example to be more truly representative if necessary, and add your code attempt to solve the problem yourself. — Ed Morton, Dec 16 '21 at 17:14
Some other missing info - can your original or replacement "words" ever contain regexp metachars or backreferences (e.g. &)? I assume you do not want to do partial word matches, only full word matches (e.g. Four should not match the start of match Fourteen), right? What characters are word-constituent for your purposes ([[:alnum:]_] is the common set for word-constituent characters but YMMV)? — Ed Morton, Dec 16 '21 at 17:21
Edited the question - please let me know if you need more clarity. — Dhruuv, Dec 16 '21 at 18:52
Regarding if there is a mapping of the new word to another word in the mapping file, it can still be replaced - I assume that means it's also OK if it's not replaced. Doing such recursive mappings leads to having to handle potentially infinite recursion (e.g. foo maps to bar but then bar maps to foo) so it gets ugly. — Ed Morton, Dec 16 '21 at 21:14
Does this answer your question? How to perform replacements defined in one file on another file — Philippos, Dec 17 '21 at 11:57
Can replacement "tokens" be broken out into those start/end terminated with word (e.g. [[:alnum:]_]) characters, and those start/end terminated with non-word characters? — jubilatious1, Jan 23 '24 at 05:19

score 2 · Accepted Answer · answered Dec 16 '21 at 21:12

Using any awk in any shell on every Unix box:

$ cat tst.awk
BEGIN { FS="=" }
NR==FNR {
    map[$1] = $2
    next
}
{
    head = ""
    tail = $0
    while ( match(tail,/[^,= ]+/) ) {
        old = substr(tail,RSTART,RLENGTH)
        new = (old in map ? map[old] : old)
        head = head substr(tail,1,RSTART-1) new
        tail = substr(tail,RSTART+RLENGTH)
    }
    print head tail
}

$ awk -f tst.awk Search_Replace_File.txt fileA.txt
1, This is a Record Ten, Value1, Dummy_val1 Ten, SUN
2, This is a Record Twenty, Value2, Dummy_val2 Twenty, SNOW
3, This is a Record Thirty, Value3, Dummy_val3 Thirty, SNOW
4, This is a Record Forty, Value4, Dummy_val4 Forty, SUN

My assumption above is that none of your input words contain ,, =, or blanks but any other characters are fine.

Also, if an old word maps to a new word and that new word can itself be mapped to another word, the above code will not apply the second mapping, since following such chains can lead to infinite recursion. Only the first mapping is applied.
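If chained mappings were wanted after all, one possible way to keep them from looping forever is to remember which words have already been visited while following a chain. The following is only a sketch under that assumption, reusing the tokenization from tst.awk above. The resolve() function, the temporary file names and the sample mappings (One=Ten, Ten=Hundred, and the foo/bar cycle) are made up for illustration and are not part of the original answer.

```shell
#!/bin/sh
# Hypothetical demo files: the mappings are invented to show a chain and a cycle.
tmp=$(mktemp -d)
cat > "$tmp/map.txt" <<'EOF'
One=Ten
Ten=Hundred
foo=bar
bar=foo
EOF
printf '%s\n' '1, Record One, foo' > "$tmp/in.txt"

chained=$(awk '
BEGIN { FS = "=" }
NR == FNR { map[$1] = $2; next }
# Follow word -> map[word] -> ... until no mapping is left or a word repeats.
# "seen" is a local array, so it starts out empty on every call.
function resolve(word,    seen) {
    while ( (word in map) && !(word in seen) ) {
        seen[word]
        word = map[word]
    }
    return word
}
{
    head = ""
    tail = $0
    while ( match(tail, /[^,= ]+/) ) {
        old  = substr(tail, RSTART, RLENGTH)
        head = head substr(tail, 1, RSTART-1) resolve(old)
        tail = substr(tail, RSTART+RLENGTH)
    }
    print head tail
}' "$tmp/map.txt" "$tmp/in.txt")

printf '%s\n' "$chained"    # prints: 1, Record Hundred, foo
rm -rf "$tmp"
```

With this sketch One is followed through Ten to Hundred, while the foo/bar cycle stops as soon as foo comes around again, so foo ends up unchanged. Whether a cycle should resolve to its starting word, or be reported as an error, is a design choice the question does not settle.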

Maybe I am missing something: why not just do a {for (word in map) {gsub(word,map[word])}} 1 after reading the map? Is it about speed or are there any dangers with the string functions? — FelixJN, Dec 17 '21 at 08:24
@FelixJN It's dangers with the non-string functions. for (word in map) {gsub(word,map[word])} would fail with a false match if word contained any regexp metachar (., *, ?, [, ], +, (, ), ^, $, etc.), would fail with a partial match if word contained a substring of an existing one (e.g. if word was Four and the input contained Fourteen), would fail with a bad replacement if map[word] contained a backreference char like &, etc. — Ed Morton, Dec 17 '21 at 13:18
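The failure modes listed in that comment can be reproduced with a small, made-up input. The sketch below is illustrative only: the two-entry map and the sample line are assumptions, and the second command uses plain whitespace-separated fields rather than the tokenization from tst.awk.

```shell
#!/bin/sh
# Naive approach: every key is used as a regexp by gsub().
#   "Four" also matches inside "Fourteen"      (partial match)
#   "a.c"  also matches "abc"                  ("." is a regexp metachar)
#   "&" in the replacement expands to the matched text
naive=$(printf '%s\n' 'Fourteen a.c abc' |
    awk 'BEGIN { map["Four"] = "Forty"; map["a.c"] = "X&Y" }
         { for (word in map) gsub(word, map[word]); print }')
printf '%s\n' "$naive"    # prints: Fortyteen Xa.cY XabcY

# Whole-token lookup: each field is compared as a literal string,
# and the replacement is assigned literally.
exact=$(printf '%s\n' 'Fourteen a.c abc' |
    awk 'BEGIN { map["Four"] = "Forty"; map["a.c"] = "X&Y" }
         { for (i = 1; i <= NF; i++) if ($i in map) $i = map[$i]; print }')
printf '%s\n' "$exact"    # prints: Fourteen X&Y abc
```

The naive result does not depend on the order in which awk walks the map here, because neither replacement creates text that the other key could match.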

guest_7 · Answer 2 · 2021-12-24T04:02:48.507

We can do this using awk as shown:

awk '
BEGIN {
  d = "[$]{2}"
  w = "[[:alpha:]][_[:alnum:]]*"
  re = d w d "|" "[#]?" w
}
FS == "="{a[$1]=$2;next}
{
  z = ""
  t = $0
  gsub(re, RS "&" RS, t)
  nf = split(t, x, RS)
  for (i=1; i<=nf; i++)
    z = z ((i%2) ? x[i] : ((x[i] in a) ? a[x[i]] : x[i]))
  print z
}
' FS="=" Search_Replace_File.txt FS=" " fileA.txt
1, This is a Record Ten, Value1, Dummy_val1 Ten, SUN
2, This is a Record Twenty, Value2, Dummy_val2 Twenty, SNOW
3, This is a Record Thirty, Value3, Dummy_val3 Thirty, SNOW
4, This is a Record Forty, Value4, Dummy_val4 Forty, SUN

Define a regex for a word.
Demarcate the words in the current line by wrapping each match in RS (newline) characters.
Then split the current line on newline.
All words end up in the even-numbered fields.
Look up each word in array a and replace it if it is found there.
Print the modified line.
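To see why the words land on the even-numbered pieces, it can help to print what split() returns for a single line. This is a sketch for inspection only, not part of the answer: the sample line is a shortened record from fileA.txt, and [$][$] is written out in place of [$]{2} on the assumption that some older awks lack interval expressions.

```shell
#!/bin/sh
# Print each piece produced by the gsub/split step together with its index.
pieces=$(printf '%s\n' 'Record One, $$MOON$$' | awk '
BEGIN {
    d  = "[$][$]"                       # two literal dollar signs
    w  = "[[:alpha:]][_[:alnum:]]*"     # a word: letter, then letters/digits/_
    re = d w d "|" "[#]?" w
}
{
    t = $0
    gsub(re, RS "&" RS, t)              # wrap every match in newlines
    nf = split(t, x, RS)                # odd pieces: separators, even: words
    for (i = 1; i <= nf; i++)
        printf "%d:[%s]\n", i, x[i]
}')
printf '%s\n' "$pieces"
```

For this line the pieces are 1:[], 2:[Record], 3:[ ], 4:[One], 5:[, ], 6:[$$MOON$$] and 7:[]. The first and last pieces are empty because the line both starts and ends with a match.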

@Ed Morton thanks for taking the time to critique the solution. I have gone ahead & Incorporated the improvements you suggested. — guest_7, Dec 24 '21 at 04:01

jubilatious1 · Answer 3 · 2024-01-23T09:17:02.177

Using Raku (formerly known as Perl_6)

~$ raku -pe 'BEGIN my %h = (          \ 
               "One" => "Ten",        \ 
               "Two" => "Twenty",     \
               "Three" => "Thirty",   \
               "Four" => "Forty",     \
               q[$$MOON$$] => "SUN",  \
               q[#LATER] => "SNOW");  \ 
             s:g/ [ ^ | <punct>+ | <blank>+] <( @(%h.keys) )> [ <punct>+ | <blank>+ | $ ] /%h{$/}/;'  file

Here's an answer written in Raku, a programming language in the Perl family. Above, the sed-like autoprinting command-line flags -pe are used. A %h hash is declared inline. Note that $ must be escaped inside double quotes; however, "\$\$MOON\$\$" can be written as q[$$MOON$$], as above, reducing the need for backslashes.

The meat of the substitution is s///, which uses the :g global modifier. Within the match-domain (left-half), the @(%h.keys) hash keys are coerced to an @-sigiled array, and these are understood as literal strings within the match-domain. The <( and )> capture markers restrict the match variable $/ to the key itself, excluding the surrounding punctuation or blanks. In the substitution-domain (right-half) the $/ match variable is used to recover the corresponding key's value, which is substituted in.

The problem here is that a "word" is usually defined as a run of alphanumeric-plus-_ (underscore) characters. In that case you would use Raku's << (left) and >> (right) zero-width regex anchors, which represent left and right word boundaries, respectively. Without these boundary markers, something like Fourteen would be incorrectly substituted to Fortyteen. (See the last line of the Sample Input below; the Sample Output shows the correct result.)

Since the OP has requested a solution using keys that start or end with non-alphanumeric-plus-_ characters (thus precluding the use of zero-width word-boundary anchors), one way is to try to delineate the possibilities, like so:

s:g/ [ ^ | <punct>+ | <blank>+] <( @(%h.keys) )> [ <punct>+ | <blank>+ | $ ] /%h{$/}/;

Sample Input:

1, This is a Record One, Value1, Dummy_val1 One, $$MOON$$
2, This is a Record Two, Value2, Dummy_val2 Two, #LATER
3, This is a Record Three, Value3, Dummy_val3 Three, #LATER
4, This is a Record Four, Value4, Dummy_val4 Four, $$MOON$$
5, This is a Record Fourteen, Value14, Dummy_val14 Fourteen, #LATER

Sample Output:

1, This is a Record Ten, Value1, Dummy_val1 Ten, SUN
2, This is a Record Twenty, Value2, Dummy_val2 Twenty, SNOW
3, This is a Record Thirty, Value3, Dummy_val3 Thirty, SNOW
4, This is a Record Forty, Value4, Dummy_val4 Forty, SUN
5, This is a Record Fourteen, Value14, Dummy_val14 Fourteen, SNOW

Probably a better (more reliable) way is to choose the non-word keys more carefully, e.g. making sure that they both start and end with non-word characters (e.g. #LATER# instead of #LATER). Then use two hashes, like so:

~$ raku -pe 'BEGIN    my %words = ("One" => "Ten", "Two" => "Twenty", "Three" => "Thirty", "Four" => "Forty")  \
             andthen  my %non-words = (q[$$MOON$$] => "SUN", q[#LATER#] => "SNOW");  \
             s:g/ << @(%words.keys) >> /%words{$/}/;  \
             s:g/ [ ^ | <punct>+ | <blank>+] <( @(%non-words.keys) )> [ <punct>+ | <blank>+ | $ ] /%non-words{$/}/;'  file

This code takes the same Sample Input file (after updating #LATER to #LATER#), and produces Sample Output identical to that above.

https://docs.raku.org/language/regexes#Regex_interpolation
https://docs.raku.org/language/regexes
https://docs.raku.org
https://raku.org
