
In a large text file, I need to replace several words (let's call them "keys") with different replacement texts (let's call them "values"). Currently, I use sed for this purpose, as in

sed -i -e 's/\bkey\b/value/' file

The file is big, and the process takes a few minutes. There are over 1,000 "key-value" pairs, and currently I repeat the sed process for each and every "key-value" pair. Obviously, it takes a long time.
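Roughly, the per-pair loop I run now looks like this (a sketch only; the assumption that the pairs sit one per line in a tab-separated file called map.tsv is mine):

while IFS=$'\t' read -r key value; do
    sed -i -e "s/\b$key\b/$value/" file     # one full pass over the file per pair
done < map.tsv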

I wonder if there is a way to feed an array of "key-value" (pattern-replacement) pairs into sed/awk or similar utilities to make the replacement in one run (or faster). The "key-value" pairs can be constructed in any format.

An example case is replacing names with their acronyms (say, in TSV format):

Key                                               Value
United Nations                                    UN
United States Environmental Protection Agency     EPA
International Atomic Energy Agency                IAEA
World Health Organization                         WHO

and the input text is:

This has been covered by both the United Nations and World Health Organization. This is the main domain of the International Atomic Energy Agency. United States Environmental Protection Agency is a federal agency supervising this matter.

  • Please edit your post to include example input along with information on input file formatting and desired output. This will help us test proposed solutions before posting them. – AdminBee Aug 18 '22 at 10:07
  • @AdminBee Sorry, since my outputs are big, I did not post an example. I will add a small example. – Googlebot Aug 18 '22 at 10:32
  • Yes, but we can't help you parse data you don't show us. How are keys and values defined? Is it key=value? Or key\tvalue? Or maybe key value? Can you have spaces in either key or value? Just show us a few lines and the output you want from those lines so we can understand. – terdon Aug 18 '22 at 11:41
  • If the file is in a specific format, such as JSON, YAML or some other structured format, then using sed is the wrong way to go about it (partly due to the risk of updating the wrong thing, and partly due to the fact that values in these formats are encoded). – Kusalananda Aug 18 '22 at 12:20
  • @terdon The format of the key-value pairs is not important. I can re-arrange them in any possible format. – Googlebot Aug 18 '22 at 12:28
  • @Kusalananda No, it is plain text. – Googlebot Aug 18 '22 at 12:29
  • @Googlebot JSON and YAML are plain text, but structured plain text. Please add a representative example to your question. – Kusalananda Aug 18 '22 at 12:30
  • @Googlebot It is extremely important, actually. The solutions will depend 100% on the format you need to parse. The answer you have gotten, for example, assumes that you can provide a manual list of the keys. It also assumes that they are separated by non-word characters. You can choose to ignore all of us who are asking for an example, but that means we won't be able to help you. Your call. – terdon Aug 18 '22 at 12:34
  • @terdon I added an example. I never ignore people who ask for further information. When others are trying to help me, it would be stupid of me not to follow their advice. – Googlebot Aug 18 '22 at 12:38
  • Thanks for the added information. Now, what does the file to modify look like? Can only one of the keys occur per line? Can any single key occur more than once per line? Do any associated values contain one of the other keys that are to be replaced, and if so, how should this situation be handled? – AdminBee Aug 18 '22 at 13:02
  • @AdminBee The input text is unformatted text (like the text on the current page), a book, an article, etc. Several keys can occur in a single line. – Googlebot Aug 18 '22 at 13:18
  • Don't describe the input text; edit your question to SHOW us sample input text and the associated output you want after the script you're asking for help with runs against that input. We need you to provide sample input and expected output to solidify your needs and so we have something to test a potential solution against. – Ed Morton Aug 18 '22 at 13:45
  • If you are happy to store the keys and values in any format, you may just as well store them as sed substitution instructions: s/World Health Organization/WHO/g. You may then use that list of instructions with sed -f to apply the instructions to whatever file you have. – Kusalananda Aug 18 '22 at 16:19
  • I see you added the input text, that's a big help, thanks; now add the expected output text and you'll have a complete question. Having said that, the sample input only contains the most basic sunny-day cases, not cases where one string is a subset of another or contains regexp metachars, etc., so all we'd be able to test would be the trivial cases, and so YMMV if/when you hit more interesting cases in your real use of the tool. – Ed Morton Aug 18 '22 at 18:49

4 Answers


Note that -i and \b are both non-standard extensions that some sed implementations have copied from perl. So why not use perl in the first place:

perl -i -pe '
  BEGIN {
    %map = (
      "key1"  => "value1",
      "key 2" => "value2"
    );
    $re = join "|", map {qr{\Q$_\E}} keys %map;
  }
  s/\b(?:$re)\b/$map{$&}/g' your-file

The key => value mapping can also be expressed as:

%map = qw(
   key1 value1
   key2 value2
);

Or read from CSVs or other structured formats using the corresponding perl module (Text::CSV, JSON)... perl is a general-purpose programming language geared towards text manipulation, so it's the obvious choice here and there's no limit to what you can do with it.

For a simple TSV, that could be:

<map.tsv perl -i -pe '
  BEGIN {
    <STDIN>; # skip header
    while (<STDIN>) {
      chomp;
      my ($k, $v) = split /\t/;
      $map{$k} = $v;
    }
    $re = join "|", map {qr{\Q$_\E}} keys %map;
  }
  s/\b(?:$re)\b/$map{$&}/g' your-file
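If you would rather not hand-roll the TSV parsing, the Text::CSV module mentioned above could be used instead. A sketch (the map.tsv file name, the tab separator and the presence of a header row are assumptions on my part):

perl -MText::CSV -i -pe '
  BEGIN {
    my $csv = Text::CSV->new({ binary => 1, sep_char => "\t" });
    open my $fh, "<", "map.tsv" or die "map.tsv: $!";
    $csv->getline($fh);                      # skip the header row
    while (my $row = $csv->getline($fh)) {
      $map{ $row->[0] } = $row->[1];         # key => value
    }
    close $fh;
    $re = join "|", map {qr{\Q$_\E}} keys %map;
  }
  s/\b(?:$re)\b/$map{$&}/g' your-file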

Note that, if you were doing:

sed -i -e 's/\bK1\b/V1/g' file
sed -i -e 's/\bK2\b/V2/g' file

That can be simplified to:

sed -i '
  s/\bK1\b/V1/g
  s/\bK2\b/V2/g' file

Or for your TSV:

<map.tsv awk -F'\t' '
   NR > 1 {
     # escape regexp operators in keys to emulate perl \Q \E:
     gsub(/[][\/\\*.^$]/, "\\\\&", $1)
     # escape /, \ and & in replacement:
     gsub(/[\\/&]/, "\\\\&", $2)
     print "s/\\b"$1"\\b/"$2"/g"
   }' | sed -i -f - your-file

which reads (and writes) the file only once.

But in both cases, you'd run into problems if some of the values are also among the keys. For instance, with s/\bA\b/B/g followed by s/\bB\b/C/g, you'd end up with the As turned into Cs instead of Bs. The perl approach above doesn't have that problem as it runs only one substitute operator.
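A minimal demonstration of that ordering pitfall (using GNU sed, since \b is non-standard):

$ echo 'A B' | sed -e 's/\bA\b/B/g' -e 's/\bB\b/C/g'
C C

Both the original B and the freshly substituted B end up as C.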

Also note that perl in its regexps processes alternation from left to right, so if you have s/\b(?:foo|foo bar)\b/$map{$&}/g, on a foo bar input, it would replace foo, not foo bar.
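For illustration (the brackets just mark what got matched and replaced):

$ echo 'foo bar' | perl -pe 's/\b(?:foo|foo bar)\b/[$&]/g'
[foo] bar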

And bear in mind that the order in which an associative array is walked is random.

sed (with those implementations that support extended regexps with -E / -r or \| in BREs) in contrast will try to find the longest match.

You could get the same behaviour with perl by sorting the keys by length before joining with |, for instance by replacing keys %map with sort {length$b <=> length$a} keys %map.
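Concretely, the line building $re would then read:

$re = join "|", map {qr{\Q$_\E}} sort { length $b <=> length $a } keys %map;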

A final note: perl by default processes its input byte-wise, with word characters (\b matching the boundary between a word and a non-word character) limited to ASCII letters, digits and underscore, while sed implementations typically decode the input according to the locale's charset. If your input or keys/values contain non-ASCII characters, you can add the -Mopen=locale option to decode it as per the locale's charset, or, if it's in UTF-8 (the locale encoding most commonly used these days), just add the -C option.

  • In gsub(/[\\/&]/, "\\\\&", $2), does $2 contain the newline \n? I got the replacement, but with a line feed each time... OK, I see: the map file has to be Unix-formatted (LF). – Sandburg Dec 20 '23 at 15:39

Assuming you only need to handle sunny day mappings as in your provided example (i.e. no regexp or backreference metachars, no case changes, no substrings, no cyclic mappings, etc.), then using any awk:

$ awk -F'\t+' '
    NR==FNR { if (NR>1) map[$1]=$2; next }
    { for (key in map) gsub(key,map[key]); print }
' map_file input_file
This has been covered by both the UN and WHO. This is the main domain of the IAEA. EPA is a federal agency supervising this matter.

If that's not all you need then edit your question to provide more truly representative sample input/output.


Using Raku (formerly known as Perl_6)

~$ raku -pe 'BEGIN my %h = ("United Nations" => "UN",  \
             "United States Environmental Protection Agency" => "EPA",  \
             "International Atomic Energy Agency" => "IAEA",  \
             "World Health Organization" => "WHO");  \
             s:g/@(%h.keys)/%h{$/}/;'   file

OR:

~$ raku -ne 'BEGIN my %h = ("United Nations" => "UN",  \
             "United States Environmental Protection Agency" => "EPA",  \
             "International Atomic Energy Agency" => "IAEA",  \
             "World Health Organization" => "WHO");  \
             put S:g/@(%h.keys)/%h{$/}/ given $_;'   file

Sample Input:

This has been covered by both the United Nations and World Health Organization. This is the main domain of the International Atomic Energy Agency. United States Environmental Protection Agency is a federal agency supervising this matter.

Sample Output:

This has been covered by both the UN and WHO. This is the main domain of the IAEA. EPA is a federal agency supervising this matter.

Raku is a programming language in the Perl family, and this answer is basically a translation of the excellent Perl answer posted by @Stéphane_Chazelas. Probably the best use case for Raku is if you need to handle Unicode substitutions consistently, as Raku provides high-level Unicode support built in.

Briefly, a %h hash is created with the key/value pairs of interest. A heads up here--if you try to use hashes directly in a Raku regex you'll get a warning: "The use of hash variables in regexes is reserved." Instead, the hash's keys are first taken with %h.keys and coerced to an array with @(…) in the matcher half (a $-sigiled or @-sigiled variable within a regex matcher instructs Raku to interpolate the stringified contents literally). In the substitution half, the $/ match variable is used to look up the corresponding value of the key/value pair.

[The second example uses -ne command line flags with Raku's "big-S" S/// notation--which returns the resultant string].

Of course, to more fully replicate other answers given, you can use <|w> or <?wb>, which are Raku's zero-width word-boundary anchors--equivalent to \b in other languages. Thus the last line above becomes:

s:g/ <?wb> @(%h.keys) <?wb> /%h{$/}/;

You can even use Raku's << left-word boundary and >> right-word boundary (the Unicode symbols « and » also work):

s:g/ << @(%h.keys) >> /%h{$/}/;


Starting from a TSV file:

The code simplifies quite a bit if you take key/value pairs from a 2-column TSV file (instead of inline as above). Use Raku's Text::CSV module at the command line as follows (note: delete the .skip(1) call if the TSV file has no header). Don't forget to include the .[*;*] indexing bracket code, as it is essential for Raku to treat each row as two elements (key and value) to be added to the %h hash:

~$ raku -MText::CSV -pe 'BEGIN my %h = csv(in => "/path/to/kv_pairs.tsv", sep => "\t").skip(1).[*;*];   
                         s:g/ << @(%h.keys) >> /%h{$/}/;'   file
This has been covered by both the UN and WHO. This is the main domain of the IAEA. EPA is a federal agency supervising this matter.

OR:

~$ raku -MText::CSV -ne 'BEGIN my %h = csv(in => "/path/to/kv_pairs.tsv", sep => "\t").skip(1).[*;*];   
                         put S:g/ << @(%h.keys) >> /%h{$/}/ given $_;'   file
This has been covered by both the UN and WHO. This is the main domain of the IAEA. EPA is a federal agency supervising this matter.

https://docs.raku.org/language/regexes
https://docs.raku.org
https://raku.org


Assuming that

  • the replacement values in your key-value map file cannot themselves contain keys that would require replacement (including their own associated key!), and
  • the mapfile is tab-separated

the following awk program should work:

awk -F'\t' 'NR==FNR{repl[$1]=$2;klen[$1]=length($1);next}
            {for (key in repl) {
               while (i=index($0,key)) {
                 $0=substr($0,1,i-1) repl[key] substr($0,i+klen[key])
               }
             }
            }1' mapfile.txt input.txt

This will first set the input field separator to TAB, and process first the map file and then the actual input file.

  • While processing the first file (indicated by FNR, the per-file line counter, being equal to NR, the global line counter), it fills the repl array with the replacements to be performed and keeps track of the "key" lengths in a separate array klen. Afterwards, it skips processing to the next line.
  • When processing the second file, where the NR==FNR condition is no longer fulfilled and thus skipped, it will, for every input line, loop over all replacement keys (i.e. all indices of the repl array) and use the index() function to check whether they occur on the input line.
  • If so, the occurrence of the key is replaced by re-assembling the input line from the substring before the key, the replacement, and then the substring after the key.
  • It does so in a while loop to ensure all occurrences are replaced, in case a key occurs multiple times on an input line.
  • The reason for using this "manual" approach, rather than a regex-based gsub(), is that this way you have no limitations on what your keys may look like. If we used gsub(), keys containing characters that are special to regular expressions might lead to unexpected behavior, as the small example below shows.
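A tiny illustration of that gsub() pitfall (the strings are invented for the example): the unescaped "." in the key is treated as a regex metacharacter and matches any character.

$ echo 'price 1x50' | awk '{ gsub("1.50", "X"); print }'
price X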

For your input example, the output would look as follows:

This has been covered by both the UN and WHO. This is the main domain of the IAEA. EPA is a federal agency supervising this matter.

Note that not all awk versions and implementations can perform in-place editing (the equivalent of the -i flag). If you have a reasonably new GNU Awk (> 4.1.0), you can use the -i inplace extension for that function.
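If your awk lacks in-place editing, a sketch of the usual workaround is to write to a temporary file and move it over the original (the input.txt.new name is arbitrary):

awk -F'\t' 'NR==FNR{repl[$1]=$2;klen[$1]=length($1);next}
            {for (key in repl) {
               while (i=index($0,key)) {
                 $0=substr($0,1,i-1) repl[key] substr($0,i+klen[key])
               }
             }
            }1' mapfile.txt input.txt > input.txt.new && mv input.txt.new input.txt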

Also note that in its current form this program does not implement the "word-boundary" constraint on replacements.
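For completeness, here is a rough sketch of one way such a word-boundary check could be bolted on, under the assumption that "word characters" means alphanumerics and underscore; treat it as a starting point rather than a drop-in replacement:

awk -F'\t' 'NR==FNR { repl[$1]=$2; klen[$1]=length($1); next }
    {
      for (key in repl) {
        start = 1
        # scan the line left to right for literal occurrences of key
        while (i = index(substr($0, start), key)) {
          i += start - 1                          # absolute position in $0
          before = (i > 1) ? substr($0, i-1, 1) : ""
          after  = substr($0, i + klen[key], 1)
          if (before !~ /[[:alnum:]_]/ && after !~ /[[:alnum:]_]/) {
            # delimited by non-word characters (or line edges): replace it
            $0 = substr($0, 1, i-1) repl[key] substr($0, i + klen[key])
            start = i + length(repl[key])         # continue after the replacement
          } else {
            start = i + 1                         # embedded in a longer word: skip it
          }
        }
      }
    } 1' mapfile.txt input.txt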
