Modify and Replace $1 (awk) or \1 (sed) Values from Decimal to Hexadecimal Globally in a String?

Question

Is it possible to modify and replace $1 (awk) or \1 (sed) values from decimal to hexadecimal globally in a string? The string may contain any number of decimal values, each of which needs to be replaced with its hexadecimal equivalent.

awk example:

echo "&#047;Test&#045;Test&#045;Test&#045;Test&#045;Test&#047;Test&#047;Test&#047;" | awk '{gsub("&#([0-9]+);", $1, $0); print}'

sed example:

echo "&#047;Test&#045;Test&#045;Test&#045;Test&#045;Test&#047;Test&#047;Test&#047;" | sed -E 's/&#([0-9]+);/$(printf "%X" \1)/g;'
echo "&#047;Test&#045;Test&#045;Test&#045;Test&#045;Test&#047;Test&#047;Test&#047;" | sed -E 's/&#([0-9]+);/$(echo "obase=16; \1" | bc)/g;'

I've tried command substitution and piping with printf "%X" and bc, but have been unable to combine them with the substitution to get the decimal-to-hexadecimal replacement.

expected output:

%2FTest%2DTest%2DTest%2DTest%2DTest%2FTest%2FTest%2F

Your assistance is greatly appreciated.

score 1 · Answer 1 · edited Jan 05 '22 at 07:17

Using GNU awk for the 3rd arg to match():

$ echo "&#047;Test&#045;Test&#045;Test&#045;Test&#045;Test&#047;Test&#047;Test&#047;" |
awk '{
    while ( match($0,/(.*)&#([0-9]+);(.*)/,a) ) {
        $0 = a[1] sprintf("%%%02X",a[2]) a[3]
    }
    print
}'
%2FTest%2DTest%2DTest%2DTest%2DTest%2FTest%2FTest%2F

Otherwise, using any awk in any shell on every Unix box:

$ echo "&#047;Test&#045;Test&#045;Test&#045;Test&#045;Test&#047;Test&#047;Test&#047;" |
awk '{
    while ( match($0,/&#[0-9]+;/) ) {
        $0 = substr($0,1,RSTART-1) sprintf("%%%02X",substr($0,RSTART+2,RLENGTH-3)) substr($0,RSTART+RLENGTH)
    }
    print
}'
%2FTest%2DTest%2DTest%2DTest%2DTest%2FTest%2FTest%2F
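
To see what the portable version extracts on each pass, here is a minimal sketch that prints RSTART, RLENGTH and the digits picked out by substr() for the first entity only. The function name first_entity is made up for illustration; the rest is standard awk.

```shell
# Hypothetical helper: report where the first &#NNN; entity sits in each line.
first_entity() {
    awk '{
        if ( match($0, /&#[0-9]+;/) ) {
            # RSTART+2 skips "&#"; RLENGTH-3 drops "&#" and the trailing ";"
            print RSTART, RLENGTH, substr($0, RSTART+2, RLENGTH-3)
        }
    }'
}

printf '%s\n' '&#047;Test&#045;' | first_entity
# prints: 1 6 047
```

sprintf("%%%02X", "047") then treats the string as the decimal number 47, which is why the result is %2F.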

Ed... I appreciate your response, but I was looking more for a one-liner solution. — Gary C. New, Jan 05 '22 at 18:19

Stéphane Chazelas · Accepted Answer · 2022-01-07T15:04:44.340

With GNU awk, where the Record Separator can be a regexp, and what it matches is stored in RT:

gawk -v RS='&#[0-9]+;' -v ORS= '1;RT{printf("%%%02X", substr(RT,3))}'
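
Applied to a shortened version of the sample from the question, and assuming gawk is installed (a regexp RS and the RT variable are GNU extensions), it could be run as sketched below. The wrapper name gawk_entities is hypothetical.

```shell
# Hypothetical wrapper around the gawk one-liner above.
# Each &#NNN; acts as the record separator: "1" prints the text before it
# (ORS is empty), and the RT block prints the separator itself as %XX.
gawk_entities() {
    gawk -v RS='&#[0-9]+;' -v ORS= '1;RT{printf("%%%02X", substr(RT,3))}'
}

printf '%s\n' '&#047;Test&#045;Test&#047;' | gawk_entities
# prints: %2FTest%2DTest%2F
```

substr(RT,3) still carries the trailing ";" (for example "047;"), but awk only uses the leading numeric part when converting a string to a number, so the value passed to %X is 47.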

Personally, I'd use perl instead:

perl -pe 's{&#(\d+);}{sprintf "%%%02X", $1}ge'

See also:

perl -MURI::Escape -MHTML::Entities -lpe '$_ = uri_escape decode_entities $_'

Which here gives:

%2FTest-Test-Test-Test-Test%2FTest%2FTest%2F

As the hyphen doesn't need to be encoded in a URI. It would also take care of converting % to %25, space to %20, & to %26 and much more.

There's also the question of what to do with non-ASCII characters (characters above &#127;). If they should be converted to the URI encoding of their UTF-8 encoding, for instance for € (&#8364;, U+20AC, &euro;) to be converted to %E2%82%AC (the 3 bytes of the UTF-8 encoding of that character), then that should rather be:

perl  -MURI::Escape -MHTML::Entities -lpe '$_ = uri_escape_utf8 decode_entities $_'

With uri_escape, you'd get the ISO8859-1 (aka latin1) encoding, which in this day and age is unlikely to be what you want (and is limited to characters up to ÿ). The other solutions would convert &#8364; to %20AC, for instance, which is definitely wrong.

Nice! I like how you read between the lines on this request. The awk one-liner is what I am looking for for basic URL encoding needs. What if we combine them, such as sed 's/&#045;/-/g;' | awk -v RS='&#[0-9]+;' -v ORS= '1;RT{printf("%%%02X", substr(RT,3))}', to produce the desired URL-encoded output? I would normally consider using perl, but I need a solution that can be used in a minimalistic shell environment. Thank you! — Gary C. New, Jan 05 '22 at 18:17

guest_7 · Answer 3 · 2022-01-07T10:08:48.997

With GNU sed, which has the /e modifier on the s/// command, we can do it as shown:

$ sed -E ":a;s/(.*)&#([0-9]+);(.*)/printf %s '\\1' \"\$(dc -e '37an16o\\2f')\" '\\3'/e;ta" file

GNU sed in extended regex mode (-E).
GNU dc to turn decimals into hex.
The substitution is repeated by means of the t command, which branches back to the :a label as long as a substitution was made.

In case your GNU sed doesn't yet support the /e modifier to the s/// command, we can then turn the input line into chunks of GNU dc code and pipe it to dc:

< file \
sed -E '1i\
16o
  s/&#([0-9]+);/\n37an\1n\n/g
  s/([^\n]*)/[&]/g
  s/([^\n]*\n){2}/&x/g
  s/\nx/x /g;y/\n/n/
  s/$/pc/;s/.*/[&]x/
' | dc

What it is essentially doing is:

output is in hex (16o)
turn the decimal -> %HEX equivalent
delimit the hex portion with newlines
turn these islands of non newline chunks into dc strings, to be later passed onto dc for execution.

Nice! A sed solution. Unfortunately, my version of sed must be a trimmed down version as it does not support the /e modifier. Is anyone else able to validate? Thanks! — Gary C. New, Jan 07 '22 at 03:22
I've added a method for GNU seds that don't support the /e flag of the s/// command. — guest_7, Jan 07 '22 at 10:10
