How to split with awk escaping all special characters

Question

I'm trying to create a char array using split, and it works so far.

The problem arises when a character in the input string is preceded by \ . The \ is not treated as a char: it escapes the following character and is lost, so it never reaches the array.

The goal is to store everything in charArray for later use.

function getLineChars {
   l=1
   for line in ${fileLinesArray[@]}; do
      charArray=$(echo | awk -v str="${line}" '{
         split(str, lineChars, "")
         for (i=1; i<=length(str); i++) {
            printf("%s ", lineChars[i])
         }
      }')
      l=$(($l+1))
      echo "${charArray[@]}"
   done
}

Nearly every special or unusual character ends up in the array, except in a case like this input:

3\zKhj

for which awk warns:

awk: warning: escape sequence `\z' treated as plain `z'

and the array comes out as:

3 z K h j

The \ character is missing, and I want it included in the array.

What can be done about this? Is it ok to try and use awk, or would you suggest something different?

Thanks in advance.

Maybe related? https://unix.stackexchange.com/questions/654388/break-down-string-into-array-in-shell-script/654692#654692 — jubilatious1, Dec 10 '23 at 16:09

markp-fuso · Answer 1 · 2023-12-08T15:15:44.347

If you really need to use awk then feed ${line} as a here-string:

function getLineChars {
   l=1
   for line in "${fileLinesArray[@]}"; do
      charArray=$( awk '{ split($0, lineChars, "")
                          for (i=1; i<=length($0); i++) {
                              printf("%s ", lineChars[i])
                          }
                        }' <<< "${line}" )
      l=$(($l+1))
      echo "${charArray[@]}"
   done
}

Taking for a test drive:

$ fileLinesArray=( '3\zKhj' )
$ getLineChars
3 \ z K h j

But, what's actually in charArray[@]?

$ typeset -p charArray
declare -- charArray="3 \\ z K h j "

It's actually a single string, with a trailing space.

If you really want an array of characters then replace charArray=$( awk ... ) with charArray=( $( awk ... ) ); making the change and taking for a test drive:

$ getLineChars
3 \ z K h j
$ typeset -p charArray
declare -a charArray=([0]="3" [1]="\\" [2]="z" [3]="K" [4]="h" [5]="j")

So now we have an actual array of characters.

I'd probably opt for something a bit simpler, eg:

function getLineChars {
   l=1
   for line in "${fileLinesArray[@]}"; do
      mapfile -t charArray < <( grep -o . <<< "${line}" )
      l=$(($l+1))
      echo "${charArray[@]}"
   done
}

NOTE: updated to use mapfile (synonym for readarray; thanks Ed Morton).

Taking for a test drive:

$ getLineChars
3 \ z K h j
$ typeset -p charArray
declare -a charArray=([0]="3" [1]="\\" [2]="z" [3]="K" [4]="h" [5]="j")

Or we could eliminate the $( grep ... ) subprocess calls via a regex and the BASH_REMATCH[] array:

getLineChars() {
    l=1
    for line in "${fileLinesArray[@]}"; do 
        [[ "${line}" =~ ${line//?/(.)} ]] && charArray=( "${BASH_REMATCH[@]:1}" )
        l=$(($l+1))
        echo "${charArray[@]}" 
    done
}

Where:

${line//?/(.)} - replace each character with the literal string (.) thus giving us a capture group for each character (NOTE: do not wrap this in double quotes)
"${BASH_REMATCH[@]:1}" - grab all array entries starting with index == 1 and going to the end of the array

Taking for a test drive:

$ getLineChars
3 \ z K h j
$ typeset -p charArray
declare -a charArray=([0]="3" [1]="\\" [2]="z" [3]="K" [4]="h" [5]="j")
$ typeset -p BASH_REMATCH
declare -a BASH_REMATCH=([0]="3\\zKhj" [1]="3" [2]="\\" [3]="z" [4]="K" [5]="h" [6]="j")
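
To make the ${line//?/(.)} step more concrete, here is a small sketch that prints the generated regex before using it. The variable names pattern and chars are only illustrative and do not appear in the answer above.

```shell
#!/usr/bin/env bash
line='3\zKhj'

# Replace every character of the line with the literal text "(.)",
# which yields one capture group per character.
pattern=${line//?/(.)}
printf '%s\n' "$pattern"        # (.)(.)(.)(.)(.)(.)

# Match the line against the generated regex (unquoted on purpose) and
# copy the capture groups, skipping index 0 which holds the whole match.
if [[ $line =~ $pattern ]]; then
    chars=( "${BASH_REMATCH[@]:1}" )
fi
printf '<%s>' "${chars[@]}"; echo   # <3><\><z><K><h><j>
```

Because the regex consists only of (.) groups, special characters in the original line never end up in the pattern itself.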

Very nice, thank you so much. All of the alternatives worked as desired, I opted for REMATCH as it seems to be the most direct way to process. Most cases worked, except for two specific cases in my test file, which continue to be ignored as chars because they probably represent a newline: ^_ and ^L . They were keyboard generated by the way. — KrOo Pine, Dec 08 '23 at 20:34
assuming those are <Ctrl> + <underscore> and <Ctrl> + <capital_L>, all 3x solutions show 2 new entries on the end of the charArray[] array: [6]=$'\037' [7]=$'\f' — markp-fuso, Dec 08 '23 at 20:48

Ed Morton · Answer 2 · 2023-12-08T20:21:22.780

Splitting on a null separator, split(str, lineChars, ""), is undefined behavior, so it will do different things in different awks. Using -v to pass the value of a variable to awk expands escape sequences by design, which is not what you want (see how-do-i-use-shell-variables-in-an-awk-script for alternatives). And using echo and a pipe introduces unnecessary overhead and fragility (it will break depending on which chars and which echo version you use).
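
As a rough illustration of the -v point, the sketch below passes the same string to awk twice: once with -v and once through the environment via ENVIRON. It uses \t rather than \z, since \t is an escape sequence awk recognizes; the variable names via_v and via_env are made up for this example.

```shell
#!/usr/bin/env bash
line='3\tK'

# -v processes escape sequences, so the two characters \t become one TAB.
via_v=$(awk -v str="$line" 'BEGIN { printf "%s", str }')

# ENVIRON hands the value to awk verbatim, backslash included.
via_env=$(str="$line" awk 'BEGIN { printf "%s", ENVIRON["str"] }')

printf '%s\n' "${#via_v}"      # 3  (3, TAB, K)
printf '%s\n' "${#via_env}"    # 4  (3, \, t, K)
```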

charArray in your code:

charArray=$(echo | awk '...')

is a scalar, not an array. I think you meant to do:

charArray=( $(echo | awk '...') )

but populating an array from command output using array=( $(command) ) exposes the command output to the shell for word splitting and filename expansion (globbing), so never do that for any command; use readarray instead. E.g. try both of these:

$ line='a*b c'; array=( $(grep -o . <<<"$line") )
$ declare -p array
<output will not include the `*` or blank char from `$line` but will include the names of all files in your current directory>

$ line='a*b c'; readarray -t array < <(grep -o . <<<"$line")
$ declare -p array
declare -a array=([0]="a" [1]="*" [2]="b" [3]=" " [4]="c")

So, do this instead for robustness and portability (assuming you're using bash as your shell) IF you were going to do this with a shell loop calling awk:

$ line='3\zK*h jÃk'
$ readarray -t charArray < <(
    awk '
        BEGIN {
            line = ARGV[1]
            ARGV[1] = ""
            lgth = length(line)
            for (i=1; i<=lgth; i++) {
                print substr(line,i,1)
            }
        }
    ' "$line"
)
$ declare -p charArray
declare -a charArray=([0]="3" [1]="\\" [2]="z" [3]="K" [4]="*" [5]="h" [6]=" " [7]="j" [8]="Ã" [9]="k")

But there's almost certainly a better way to do whatever it is you want to do than having a shell loop call awk one line at a time. Post a new question with sample input/output if you want help with that bigger issue.
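
For the narrow task of splitting one line into characters, one possible variant that starts no subprocess at all is bash substring expansion. This is only a sketch and not something the answer proposes. It splits on characters as defined by the current locale, so multibyte input depends on the locale settings.

```shell
#!/usr/bin/env bash
line='3\zK*h j'

charArray=()
# Walk the string by index and append one character per iteration;
# quoting the expansion keeps "*", " " and "\" from being interpreted.
for (( i = 0; i < ${#line}; i++ )); do
    charArray+=( "${line:i:1}" )
done

declare -p charArray
```

Since nothing is passed through awk or echo, no escape processing takes place and the backslash stays an ordinary array element.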

Oh, and never name a variable l, as it looks far too much like the number 1 and so obfuscates your code. There are also some other issues with your function that copy/pasting it into http://shellcheck.net will tell you about and help you fix.

Thanks for your comment and advice. I tried this, but readarray loses the strangest characters like Ã and such. Neither readarray nor split seems to recognize sequences such as ^_ or ^L; these seem to be totally lost and represented as an empty newline with some extra spaces. — KrOo Pine, Dec 08 '23 at 20:13
That's almost certainly related to your locale setting, nothing to do with the code. I just added Ã to the example in my answer to show it working in my locale, en_US.UTF-8, try setting LC_ALL=C (the POSIX default) before calling readarray. — Ed Morton, Dec 08 '23 at 20:22

jubilatious1 · Answer 3 · 2023-12-16T19:50:25.920

Using Perl and/or Raku to keep backslash-escaped characters intact

Perl Solution:

~$ echo -n '3\zKh j' | perl -ne 'print split /(?<!\\)/'
3\zKh j
#visualize split with Data::Dumper module
~$ echo -n '3\zKh j' | perl -MData::Dumper -ne 'print Dumper split /(?<!\\)/'
$VAR1 = '3';
$VAR2 = '\z';
$VAR3 = 'K';
$VAR4 = 'h';
$VAR5 = ' ';
$VAR6 = 'j';
#and also Unicode (add -CSDA to command line)
~$ echo -n '3\zKh jÃkΣ' | perl -CSDA -MData::Dumper -ne 'print Dumper split /(?<!\\)/'
$VAR1 = '3';
$VAR2 = '\z';
$VAR3 = 'K';
$VAR4 = 'h';
$VAR5 = ' ';
$VAR6 = 'j';
$VAR7 = "\x{c3}";
$VAR8 = 'k';
$VAR9 = "\x{3a3}";

Raku (language formerly known as Perl6) Solution:

~$ echo -n '3\zKh j' | raku -ne '.comb(/ \\? . /).print'
3 \z K h   j
#visualize split with raku built-in
~$ echo -n '3\zKh j' | raku -ne '.comb(/ \\? . /).raku.print'
("3", "\z", "K", "h", " ", "j").Seq
#and also Unicode (enabled by default)
~$ echo -n '3\zKh jÃkΣ' | raku -ne '.comb(/ \\? . /).raku.print'
("3", "\z", "K", "h", " ", "j", "Ã", "k", "Σ").Seq

Perl References:
https://perldoc.perl.org
https://www.perl.org

Raku References:
https://docs.raku.org
https://raku.org

score 0 · Answer 4 · answered Dec 16 '23 at 18:59

If you want to pass a variable to awk by splicing its value into the awk program text:

awk 'BEGIN {var="'"$BASH_variable"'"}'

then you can use this function from my library:


declare g_RV  #-- g_RV ... global return value
#-- call:        g_serialize_STR_ForAWK  [string to serialize STR] [option bINT]
#-- description: converts a string so it can be embedded in an awk variable declaration: 'BEGIN { var="'[serialized string STR]'" ..}'
#--              '\' becomes '\\', '"' becomes '\"', $'\n' becomes '\n'
#-- parameters:  $1 ... string to serialize STR - a string you want to pass to awk via a variable declaration (var="...")
#--              $2 ... option bINT, optional - convert with bash (0) or with sed (1); default is 0
#-- returnValue: written to g_RV - the converted string STR
#-- depends on:  variables - g_RV
function g_serialize_STR_ForAWK ()
{
    local -i option=$2
    #-- use sed for converting (sed -z is a GNU extension)
    if ((option)); then
        g_RV=$(sed -z 's/\\/\\\\/g; s/"/\\"/g; s/\n/\\n/g' <<< "$1")
        #-- drop the escaped newline that the here-string appended
        g_RV=${g_RV%'\n'}
    #-- use bash for converting
    else
        g_RV=${1//'\'/'\\'}; g_RV=${g_RV//'"'/'\"'}; g_RV=${g_RV//$'\n'/'\n'}
    fi
}
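
A minimal usage sketch of the idea follows. To stay self-contained it uses escape_for_awk, a hypothetical stand-in that performs the same three substitutions as the bash branch above; the names input and output are likewise only illustrative.

```shell
#!/usr/bin/env bash
# escape_for_awk is a hypothetical helper mirroring the bash branch above:
# it escapes backslashes, double quotes and newlines, and stores the result in g_RV.
escape_for_awk() {
    local bs='\' s=$1
    s=${s//"$bs"/"$bs$bs"}      # \       -> \\
    s=${s//'"'/"$bs"'"'}        # "       -> \"
    s=${s//$'\n'/"${bs}n"}      # newline -> \n
    g_RV=$s
}

input='3\zK"h'
escape_for_awk "$input"

# Splice the escaped value into the awk program text as a string literal.
output=$(awk 'BEGIN { var="'"$g_RV"'"; printf "%s", var }')
printf '%s\n' "$output"         # 3\zK"h
```

awk undoes the escaping when it parses the string literal, so the value arrives unchanged.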
