
I need a way to URL-encode strings in a shell script on an OpenWrt device running an old version of BusyBox. Right now I have ended up with the following code:

urlencode() {
echo "$@" | awk -v ORS="" '{ gsub(/./,"&\n") ; print }' | while read l
do
  c="`echo "$l" | grep '[^-._~0-9a-zA-Z]'`"
  if [ "$l" == "" ]
  then
    echo -n "%20"
  else
    if [ -z "$c" ]
    then
      echo -n "$l"
    else
      printf %%%02X \'"$c"
    fi
  fi
done
echo ""
}

This works more or less fine, but there are a few flaws:

  1. Some characters are skipped, like "\", for example.
  2. The result is returned character by character, so it's extremely slow. It takes about 20 seconds to URL-encode just a few strings in a batch.

My version of bash doesn't support substring extraction like ${var:x:y}.

  • There is a small set of characters that are allowed in URLs. Everything else must be encoded. You may be better off having only two branches, one for allowed characters, the other to escape, using some simple way to get the hex value of the character (the encoding is not dark magic, it's just the hex value, %20 is 0x20 which is a space in ASCII and UTF-8). See http://stackoverflow.com/a/10660730/1079308 – njsg Jan 08 '13 at 19:20
  • Thanks! Reducing the number of if branches sped up the encoding dramatically, to the point that it's sufficient for my use case. However, I'm still unable to encode certain characters like "$$" or "". Would really appreciate advice on how to escape the content of the variable properly. – Serg Jan 08 '13 at 21:46
  • Note that I've updated my answer: my first versions were all forking an external [ utility on my router. Avoiding the external [ is a significant speed improvement. – Gilles 'SO- stop being evil' Jan 09 '13 at 11:29
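
As the comments note, the encoding is just each byte's hex value, and printf can produce it directly: POSIX printf treats a numeric argument that starts with a quote as "the code of the following character". A minimal illustration of that building block (not specific to any particular solution):

```shell
# A leading quote in a numeric printf argument yields the character's code:
printf '%d\n' "' "      # prints 32  (the space character)
printf '%%%02X\n' "'A"  # prints %41 (the URL escape for "A")
```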

2 Answers


[TL;DR: use the urlencode_grouped_case version in the last code block.]

Awk can do most of the job, except that it annoyingly lacks a way to convert from a character to its number. If od is present on your device, you can use it to convert all characters (more precisely, bytes) into the corresponding number (written in decimal, so that awk can read it), then use awk to convert valid characters back into literals and quoted characters into the proper form.

urlencode_od_awk () {
  echo -n "$1" | od -t d1 | awk '{
      for (i = 2; i <= NF; i++) {
        printf(($i>=48 && $i<=57) || ($i>=65 && $i<=90) || ($i>=97 && $i<=122) ||
                $i==45 || $i==46 || $i==95 || $i==126 ?
               "%c" : "%%%02x", $i)
      }
    }'
}
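
To see what awk receives here: od -t d1 prints an offset followed by each input byte in signed decimal, which is why the loop starts at i = 2. A quick check of that premise (using printf %s, which sidesteps the portability caveats of echo -n outside BusyBox and bash):

```shell
# Each od output line is: an offset field, then one field per byte (decimal).
printf %s 'Hi!' | od -t d1 | awk 'NR == 1 { print $2, $3, $4 }'   # prints 72 105 33
```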

If your device doesn't have od, you can do everything inside the shell; this will significantly help performance (fewer calls to external program — none if printf is a builtin) and be easier to write correctly. I believe all Busybox shells support the ${VAR#PREFIX} construct to trim a prefix from a string; use it to strip the first character of the string repeatedly.
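
The idea is to peel off the first character with ${VAR#PREFIX} and its suffix-trimming counterpart, for example:

```shell
# Extract the first character using only parameter expansion:
string='a*c'
tail=${string#?}         # '*c' – everything after the first character
head=${string%"$tail"}   # 'a'  – quoting $tail keeps the * from acting
                         #        as a glob pattern during removal
printf '%s %s\n' "$head" "$tail"   # prints: a *c
```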

urlencode_many_printf () {
  string=$1
  while [ -n "$string" ]; do
    tail=${string#?}
    head=${string%"$tail"}  # quote $tail so glob characters in it stay literal
    case $head in
      [-._~0-9A-Za-z]) printf %c "$head";;
      *) printf %%%02x "'$head"
    esac
    string=$tail
  done
  echo
}

If printf is not a builtin but an external utility, you will again gain performance by invoking it only once for the whole function instead of once per character. Build up the format and parameters, then make a single call to printf.
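
The pattern looks like this in isolation (a sketch with hypothetical values):

```shell
# Accumulate one format string plus one argument list, then fork printf once.
format=; set --
format=$format%s;     set -- "$@" "foo"   # literal text passes through %s
format=$format%%%02x; set -- "$@" "'A"    # leading quote: printf sees the byte value
printf "$format\n" "$@"                   # prints foo%41
```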

urlencode_single_printf () {
  string=$1; format=; set --
  while [ -n "$string" ]; do
    tail=${string#?}
    head=${string%"$tail"}
    case $head in
      [-._~0-9A-Za-z]) format=$format%c; set -- "$@" "$head";;
      *) format=$format%%%02x; set -- "$@" "'$head";;
    esac
    string=$tail
  done
  printf "$format\\n" "$@"
}

This is optimal in terms of external calls (there's a single one, and you can't do it with pure shell constructs unless you're willing to enumerate all characters that need to be escaped). If most of the characters in the argument are to be passed unchanged, you can process them in a batch.
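
The batch split relies on ${string%%PATTERN} chopping off the longest suffix that begins at the first reserved byte, for example:

```shell
# Take the longest leading run of unreserved characters in one expansion:
string='abc def'
literal=${string%%[!-._~0-9A-Za-z]*}   # 'abc'  – safe to copy verbatim
rest=${string#"$literal"}              # ' def' – still needs encoding
printf '%s|%s\n' "$literal" "$rest"    # prints abc| def
```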

urlencode_grouped_literals () {
  string=$1; format=; set --
  while
    literal=${string%%[!-._~0-9A-Za-z]*}
    if [ -n "$literal" ]; then
      format=$format%s
      set -- "$@" "$literal"
      string=${string#$literal}
    fi
    [ -n "$string" ]
  do
    tail=${string#?}
    head=${string%"$tail"}
    format=$format%%%02x
    set -- "$@" "'$head"
    string=$tail
  done
  printf "$format\\n" "$@"
}

Depending on compilation options, [ (a.k.a. test) may be an external utility. We're only using it for string matching, which can also be done within the shell with the case construct. Here are the last two approaches rewritten to avoid [ entirely, first going character by character:

urlencode_single_fork () {
  string=$1; format=; set --
  while case "$string" in "") false;; esac; do
    tail=${string#?}
    head=${string%"$tail"}
    case $head in
      [-._~0-9A-Za-z]) format=$format%c; set -- "$@" "$head";;
      *) format=$format%%%02x; set -- "$@" "'$head";;
    esac
    string=$tail
  done
  printf "$format\\n" "$@"
}

and copying each literal segment in a batch:

urlencode_grouped_case () {
  string=$1; format=; set --
  while
    literal=${string%%[!-._~0-9A-Za-z]*}
    case "$literal" in
      ?*)
        format=$format%s
        set -- "$@" "$literal"
        string=${string#$literal};;
    esac
    case "$string" in
      "") false;;
    esac
  do
    tail=${string#?}
    head=${string%"$tail"}
    format=$format%%%02x
    set -- "$@" "'$head"
    string=$tail
  done
  printf "$format\\n" "$@"
}
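
The case-instead-of-test trick can be seen on its own: the case statement returns false exactly when the string is empty, and forks nothing:

```shell
# Loop over a string without ever calling [ or test:
string=abc
n=0
while case "$string" in "") false;; esac; do
  string=${string#?}
  n=$((n + 1))
done
echo "$n"   # prints 3
```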

I tested on my router (MIPS processor, DD-WRT-based distribution, BusyBox with ash, external printf and [). Each version is a noticeable speed improvement on the previous one. Moving to a single fork is the most significant improvement; it's the one that makes the function respond almost instantly (in human terms) as opposed to after a few seconds for a realistic long URL parameter.

Note that the code above may fail in fancy locales (not likely on a router). You may need export LC_ALL=C if you use a non-default locale.

  • Nice answer. I'll add that busybox (and other similar things) could replace commonly-builtin commands (printf, even echo!) with an invocation of busybox itself (I saw this behaviour in mobaXterm 3's version of busybox bash, for example), making some scripts surprisingly slow (a simple for i in ... ; do echo i ; done loop in the busybox version of 'bash' will invoke busybox n times, 1 per echo, whereas in bash echo would be builtin). Try really hard to have only 1 invocation of any subcommand to avoid multiple invocations of busybox (each with a big overhead). set -x helps to find them. – Olivier Dulac Jan 09 '13 at 08:03
  • @OlivierDulac Whether BusyBox ash forks on builtins like echo depends on a compilation option (ENABLE_FEATURE_SH_NOFORK), which I think makes it faster but buggy in corner cases (traps, effects of special builtins). – Gilles 'SO- stop being evil' Jan 09 '13 at 10:58
  • thanks, good to know (I can't really recompile the one in mobaXterm, and I prefer they chose the most compatible approach, but good to know anyway) – Olivier Dulac Jan 09 '13 at 12:11
  • Excellent answer! Maybe you should note that, if $LANG is set, you need to override character matching using LC_ALL=C, if you want to encode special chars like umlauts: [a-z] will also match äöüß and would therefore not be encoded (see https://unix.stackexchange.com/questions/87745/what-does-lc-all-c-do#answer-87763 for further information). – flederwiesel Nov 24 '21 at 09:15
  • In the urlencode_od_awk function, you need to use echo -n to avoid adding and converting a trailing newline. Also, you're missing a space before $i<=90, and you need to wrap the condition logic of the ternary in parentheses. Excellent answer though! – mtalexan Feb 28 '23 at 19:26

While BusyBox utilities in general are stripped-down versions of the corresponding POSIX utilities (often with GNU extensions, though), BusyBox awk is actually a full-fledged, almost fully POSIX-compliant awk implementation, with extensions on top.

So, you should be able to do everything in one awk invocation:

urlencode() {
  LC_ALL=C awk -- '
    BEGIN {
      for (i = 1; i <= 255; i++) hex[sprintf("%c", i)] = sprintf("%%%02X", i)
    }
    function urlencode(s,  c,i,r,l) {
      l = length(s)
      for (i = 1; i <= l; i++) {
        c = substr(s, i, 1)
        r = r "" (c ~ /^[-._~0-9a-zA-Z]$/ ? c : hex[c])
      }
      return r
    }
    BEGIN {
      for (i = 1; i < ARGC; i++)
        print urlencode(ARGV[i])
    }' "$@"
}

This prints the URL encoding of each argument on a separate line of output.
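
A quick check of the behaviour (the function is repeated here so the snippet stands on its own):

```shell
# Same function as above; encodes each command-line argument on its own line.
urlencode() {
  LC_ALL=C awk -- '
    BEGIN {
      for (i = 1; i <= 255; i++) hex[sprintf("%c", i)] = sprintf("%%%02X", i)
    }
    function urlencode(s,  c,i,r,l) {
      l = length(s)
      for (i = 1; i <= l; i++) {
        c = substr(s, i, 1)
        r = r "" (c ~ /^[-._~0-9a-zA-Z]$/ ? c : hex[c])
      }
      return r
    }
    BEGIN {
      for (i = 1; i < ARGC; i++)
        print urlencode(ARGV[i])
    }' "$@"
}

urlencode "a b" "x&y=1"
# prints:
# a%20b
# x%26y%3D1
```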