
I have a script that reads a text stream and generates a file of sed commands that is later run with sed -f. The generated sed commands are like:

s/cid:image002\.gif@01CC3D46\.926E77E0/https:\/\/mysite.com\/files\/1922/g
s/cid:image003\.gif@01CC3D46\.926E77E0/https:\/\/mysite.com\/files\/1923/g
s/cid:image004\.jpg@01CC3D46\.926E77E0/https:\/\/mysite.com\/files\/1924/g

Assume the script which generates the sed commands is something like:

while read cid fileid
do
    cidpat="$(echo $cid | sed -e s/\\./\\\\./g)"
    echo 's/'"$cidpat"'/https:\/\/mysite.com\/files\/'"$fileid"'/g' >> sedscr
done

How can I improve the script to ensure all regex metacharacters in the cid string are escaped and interpolated properly?

dan

1 Answer


To escape variables to be used on the left hand side and right hand side of a s command in sed (here $lhs and $rhs respectively), you'd do:

escaped_lhs=$(printf '%s\n' "$lhs" | sed 's:[][\\/.^$*]:\\&:g')
escaped_rhs=$(printf '%s\n' "$rhs" | sed 's:[\\/&]:\\&:g; $!s/$/\\/')

sed "s/$escaped_lhs/$escaped_rhs/"

Note that $lhs cannot contain a newline character.

That is, on the LHS, escape all the regexp operators (][.^$*), the escaping character itself (\), and the separator (/).

On the RHS, you only need to escape &, the separator, backslash and the newline character (which you do by inserting a backslash at the end of each line except the last one ($!s/$/\\/)).
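
As a rough sketch of how the two escaping commands above could be applied to the generator loop from the question: the function names escape_lhs, escape_rhs and gen_sed_script are made up for illustration, and the URL prefix is the one from the question. It assumes one "cid fileid" pair per input line and the default BRE mode with / as the separator.

```shell
#!/bin/sh
# Hypothetical helper names; the sed expressions are the ones from the answer.

# Escape a string for use as a BRE on the left-hand side of s/// (no newlines allowed).
escape_lhs() {
  printf '%s\n' "$1" | sed 's:[][\\/.^$*]:\\&:g'
}

# Escape a string for use on the right-hand side of s///.
escape_rhs() {
  printf '%s\n' "$1" | sed 's:[\\/&]:\\&:g; $!s/$/\\/'
}

# Read "cid fileid" pairs on stdin and print one sed s command per pair.
gen_sed_script() {
  while read -r cid fileid; do
    # %s prints the arguments as-is, so backslashes in them are not reinterpreted.
    printf 's/%s/%s/g\n' \
      "$(escape_lhs "$cid")" \
      "$(escape_rhs "https://mysite.com/files/$fileid")"
  done
}
```

Something like `gen_sed_script < pairs.txt > sedscr` (pairs.txt being a placeholder name) would then produce the file to pass to sed -f. Escaping the whole replacement string, rather than hard-coding `https:\/\/...`, means the URL prefix goes through the same rules as the file id.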

Note: you don't want to add backslashes before characters that do not have a special meaning, because in doing so, you could end up giving them a special meaning. For instance, <, + and t have no special meaning in BREs, but \<, \+ and \t do in some implementations of sed (and for \t, including on the RHS).

That assumes you use / as the separator in your sed s commands and that you don't enable Extended REs with -r (GNU sed/ssed/ast/busybox sed) or -E (BSDs, ast, recent GNU, recent busybox), PCREs with -R (ssed), or Augmented REs with -A/-X (ast), all of which have extra RE operators.

For EREs, the most widely supported of those extensions, the equivalent would be:

escaped_lhs=$(printf '%s\n' "$lhs" | sed 's:[][\\/.^$*+?(){}|]:\\&:g')
escaped_rhs=$(printf '%s\n' "$rhs" | sed 's:[\\/&]:\\&:g; $!s/$/\\/')

sed -E "s/$escaped_lhs/$escaped_rhs/"
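
A small worked example of the ERE variant, using made-up sample strings (the values of lhs, rhs and the input line are only for illustration):

```shell
#!/bin/sh
# Sample values chosen to contain ERE operators and the RHS special character &.
lhs='(a+b)|c?'
rhs='[&]'

escaped_lhs=$(printf '%s\n' "$lhs" | sed 's:[][\\/.^$*+?(){}|]:\\&:g')
escaped_rhs=$(printf '%s\n' "$rhs" | sed 's:[\\/&]:\\&:g; $!s/$/\\/')

# With the operators escaped, sed -E treats $lhs as a literal string.
result=$(printf '%s\n' 'x (a+b)|c? y' | sed -E "s/$escaped_lhs/$escaped_rhs/")
printf '%s\n' "$result"   # prints: x [&] y
```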

A few ground rules when dealing with arbitrary data:

  • Don't use echo
  • quote your variables
  • consider the impact of the locale, especially its character set: the escaping sed commands should be run in the same locale, and with the same sed command, as the sed command that uses the escaped strings
  • don't forget about the newline character (here you may want to check if $lhs contains any and take action).
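
For the last point, one possible way to detect a newline in $lhs in plain sh before building the sed command is sketched below. has_newline is a hypothetical helper name, and what to do when it returns true (skip the entry, abort, switch to another tool) is left to the caller.

```shell
#!/bin/sh
# A variable holding a single newline character, for use in case patterns.
nl='
'

# Hypothetical helper: succeeds if its argument contains a newline.
has_newline() {
  case $1 in
    *"$nl"*) return 0 ;;
    *) return 1 ;;
  esac
}

lhs='first line
second line'
if has_newline "$lhs"; then
  # printf rather than echo, and the variable is quoted, per the rules above.
  printf '%s\n' 'lhs contains a newline; not building a sed command from it' >&2
fi
```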

A much safer option is to use perl instead of sed and pass the strings in the environment and use the \Q/\E perl regexp operators for taking strings literally:

A="$lhs" B="$rhs" perl -pe 's/\Q$ENV{A}\E/$ENV{B}/g'

perl (by default) will not be affected by the locale's character set as, in the above, it only considers the strings as arrays of bytes without caring about what characters (if any) they may represent for the user. With sed, you could achieve the same by fixing the locale to C with LC_ALL=C for all sed commands (though that will also affect the language of error messages, if any).

In some shells, you can also do the escaping without having to resort to external utilities.

In zsh (here for BRE escaping):

set -o extendedglob
escaped_lhs=${lhs//(#m)[][\\.^*$\/&]/\\$MATCH}
escaped_rhs=${rhs//(#m)[\\&\/$'\n']/\\$MATCH}

In ksh93:

escaped_lhs=${lhs//[][\\.^*$\/&]/\\\0}
escaped_rhs=${rhs//[\\&\/$'\n']/\\\0}

In fish 3.4.0+:

set escaped_lhs (
  string replace -ar -- '[][\\\\/.^$*]' '\\\\$0' "$lhs" |
    string collect --allow-empty
)

set escaped_rhs (
  string replace -ar -- '[\\\\&/'\n']' '\\\\$0' "$rhs" |
    string collect --allow-empty --no-trim-newlines
)

  • What if I need to escape double quotes? – Menon May 08 '15 at 07:31
  • @Menon, double quotes are not special to sed, you don't need to escape them. – Stéphane Chazelas May 08 '15 at 08:26
  • This cannot be used for pattern matching using wildcard, can it? – Menon May 13 '15 at 07:16
  • @Menon, no, wildcard pattern matching as with find's -name is different from regular expressions. There you only need to escape ?, *, backslash and [ – Stéphane Chazelas May 13 '15 at 07:35
  • With my present version of Bash 5.2.15 it is also possible to do the escaping without external utilities (and without requiring shopt -s extglob). This is 8 years after your answer, so maybe it's newly possible due to a functional change in Bash since then. This works for me => escaped_lhs="${lhs//[][\\\/.^\$*]/\\&}"; escaped_rhs="${rhs//[\\\/&$'\n']/\\&}"; escaped_rhs="${escaped_rhs%\\$'\n'}"$'\n' – Rowan Thorpe Mar 28 '23 at 22:09