Note that -i and \b are both non-standard extensions that some sed implementations have copied from perl. So why not use perl in the first place:
perl -i -pe '
BEGIN {
%map = (
"key1" => "value1",
"key 2" => "value2"
);
$re = join "|", map {qr{\Q$_\E}} keys %map;
}
s/\b(?:$re)\b/$map{$&}/g' your-file
The key => value mapping can also be expressed as:
%map = qw(
key1 value1
key2 value2
);
Or read from CSVs or other structured formats using the corresponding perl module (Text::CSV, JSON)... perl is a generic programming language geared for text manipulation, so it's the obvious choice here and there's no limit to what you can do with it.
For a simple TSV, that could be:
<map.tsv perl -i -pe '
BEGIN {
<STDIN>; # skip header
while (<STDIN>) {
chomp;
my ($k, $v) = split /\t/;
$map{$k} = $v;
}
$re = join "|", map {qr{\Q$_\E}} keys %map;
}
s/\b(?:$re)\b/$map{$&}/g' your-file
Note that, if you were doing:
sed -i -e 's/\bK1\b/V1/g' file
sed -i -e 's/\bK2\b/V2/g' file
That can be simplified to:
sed -i '
s/\bK1\b/V1/g
s/\bK2\b/V2/g' file
Or for your TSV:
<map.tsv awk -F'\t' '
NR > 1 {
# escape regexp operators in keys to emulate perl \Q \E:
gsub(/[][\/\\*.^$]/, "\\\\&", $1)
# escape /, \ and & in replacement:
gsub(/[\\/&]/, "\\\\&", $2)
print "s/\\b"$1"\\b/"$2"/g"
}' | sed -i -f - your-file
which reads (and writes) the file only once.
But in both cases, you'd run into problems if some of the values are also among the keys. For instance, with s/\bA\b/B/g followed by s/\bB\b/C/g, you'd end up with the As turned into Cs instead of Bs. The perl approach above doesn't have that problem as it runs only one substitute (s) operator.
Also note that perl processes alternation in its regexps from left to right, so if you have s/\b(?:foo|foo bar)\b/$map{$&}/g, on a foo bar input, it would replace foo, not foo bar.
And bear in mind that the order in which an associative array is walked is random.
sed (with those implementations that support extended regexps with -E / -r or \| in BREs) in contrast will try to find the longest match.
You could get the same behaviour with perl by sorting the keys by length before joining with |, for instance by replacing keys %map with sort {length$b <=> length$a} keys %map.
A final note: perl by default processes its input byte-wise, with word characters (\b matching the boundary between a word and non-word character) limited to the ASCII letters and digits and underscore, while sed implementations typically decode it according to the locale's charset. If your input or key/values contain non-ASCII characters, you can add the -Mopen=locale option to decode it as per the locale's charset, or if it's in UTF-8 (the locale encoding most commonly used these days), just add the -C option.
[...] key=value? Or key\tvalue? Or maybe key value? Can you have spaces in either key or value? Just show us a few lines and the output you want from those lines so we can understand. – terdon Aug 18 '22 at 11:41
[...] sed is the wrong way to go about it (partly due to the risk of updating the wrong thing, and partly due to the fact that values in these formats are encoded). – Kusalananda Aug 18 '22 at 12:20
[...] key-value pairs is not important. I can re-arrange them in any possible format. – Googlebot Aug 18 '22 at 12:28
[...] sed substitution instructions: s/World Health Organization/WHO/g. You may then use that list of instructions with sed -f to apply the instructions to whatever file you have. – Kusalananda Aug 18 '22 at 16:19