Break down string into array in shell script

Question

I am trying to convert a string, for example string=11111001, to an array whose elements I can access by index, like

arr[0]=1, arr[1]=0

I am new to shell scripting and, from what I read, the string doesn't have a separator, so I am stuck.

Can someone help me?

Maybe I'm reading the string backwards; aren't the first two values both 1's? Your question indicates arr[1]=0 with a string of 11..... — Jeff Schaller, Jun 15 '21 at 15:55
Since you're new to shell scripting, a piece of advice: bash is great for dealing with files and processes. For dealing with data, strings, or algorithms, not so much. When you start doing computationally tricky stuff, generally the best way to do it in bash is to call an external command. Also see https://unix.stackexchange.com/a/303387/135943 — Wildcard, Jun 18 '21 at 18:30

score 7 · Answer 1 · edited Jun 16 '21 at 21:52


bash already has a form of this by way of string slicing:

$ word="word"
$ printf "%s\n" "${word:0:1}"
w
$ printf "%s\n" "${word:1:1}"
o

The syntax for this is ${variable:start:length}; it returns the next length characters, starting at the start-th character (zero-indexed).

$ printf "%s\n" "${word:2:2}"
rd
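To get the array the question asks for, the slicing syntax above can be combined with a loop over the string's length (${#string}). This is only a minimal sketch; the names string, arr and i are illustrative choices, not part of the answer:

```shell
#!/usr/bin/env bash
string=11111001
arr=()

# Take one character at each zero-based offset and append it to the array.
for (( i = 0; i < ${#string}; i++ )); do
    arr+=( "${string:i:1}" )
done

printf '%s\n' "arr[0]=${arr[0]}" "arr[5]=${arr[5]}" "count=${#arr[@]}"
```

Inside ${string:i:1} the offset is evaluated arithmetically, so i needs no leading $.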

edited Jun 16 '21 at 21:52

Jeff Schaller

answered Jun 15 '21 at 15:58

DopeGhoti


The manual uses the term "offset" in place of "start". Nevertheless, great advice. – glenn jackman Jun 15 '21 at 16:18

and ${#word} to get the length – ilkkachu Jun 16 '21 at 17:38

Stéphane Chazelas · Answer 2 · 2021-06-16T09:35:15.813

For completeness, with zsh, to split a string into:

its character constituents:

chars=( ${(s[])string} )

(if $string contains bytes not forming parts of valid characters, each of those will still be stored as separate elements)

its byte constituents

you can do the same but after having unset the multibyte option, for instance locally in an anonymous function:

(){ set -o localoptions +o multibyte
  bytes=( ${(s[])string} )
}

its grapheme cluster constituents.

You can use PCRE's ability to match them with \X:

zmodload zsh/pcre
(){
  graphemes=()
  local rest=$string match
  pcre_compile -s '(\X)\K.*'
  while pcre_match -v rest -- "$rest"; do
    graphemes+=($match[1])
  done
}

(that one assumes the input contains text properly encoded in the locale's charmap).

With string=$'Ste\u0301phane', those give:

chars=( S t e ́ p h a n e )
bytes=( S t e $'\M-L' $'\M-\C-A' p h a n e )
graphemes=( S t é p h a n e )

That is because the e + U+0301 grapheme cluster (which display devices usually represent the same as the é U+00E9 precomposed equivalent) is made up of 2 characters (U+0065 and U+0301). In locales using UTF-8 as their charmap, the first one is encoded on one byte (0x65) and the second on two bytes (0xcc 0x81, also known as Meta-L and Meta-Ctrl-A).

For strings made up only of ASCII characters like your 11111001, all three will be equivalent.

Note that in zsh like in all other shells except ksh/bash, array indices start at 1, not 0.
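For readers on bash rather than zsh, the byte split can be approximated by running the substring loop under the C locale, where every byte counts as one character. This is a rough sketch, not an equivalent of the zsh code above: split_bytes and the global array bytes are hypothetical names, it relies on bash applying a function-local LC_ALL assignment, and the string is spelled with explicit UTF-8 bytes so that it does not depend on the current locale:

```shell
#!/usr/bin/env bash
# "e" followed by U+0301, written out as its UTF-8 bytes 0xcc 0x81.
string=$'Ste\xcc\x81phane'

# Hypothetical helper: fills the global array "bytes", one byte per element.
split_bytes() {
    local LC_ALL=C s=$1 i   # in the C locale, one character is one byte
    bytes=()
    for (( i = 0; i < ${#s}; i++ )); do
        bytes+=( "${s:i:1}" )
    done
}

split_bytes "$string"
printf '%d bytes\n' "${#bytes[@]}"
```

Note that these bash indices start at 0, unlike the zsh arrays above.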

score 4 · Answer 3 · answered Jun 15 '21 at 16:01

You could split the string on individual characters:

string=11111001
echo "$string" | grep -o .

and read them back as an array:

readarray -t arr <<<"$(grep -o . <<<"$string")"

Then, of course, each character would be at each index of the arr array.

$ declare -p arr
declare -a arr=([0]="1" [1]="1" [2]="1" [3]="1" [4]="1" [5]="0" [6]="0" [7]="1")

But why create a new array when bash can access each individual character directly, like this:

$ string=11111001
$ echo "${string:5:1}" "${string:7:1}"
0 1

Read about ${parameter:offset:length} in man bash.
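As an illustration of working on the string directly without an intermediate array, the sketch below counts the 1 characters and reads the last character with a negative offset. The variable names ones and last are illustrative assumptions:

```shell
#!/usr/bin/env bash
string=11111001
ones=0

# Inspect each character in place; no array is created.
for (( i = 0; i < ${#string}; i++ )); do
    if [[ ${string:i:1} == 1 ]]; then
        ones=$(( ones + 1 ))
    fi
done

# A negative offset counts from the end; the space before -1 is required
# so that the expansion is not parsed as ${parameter:-default}.
last=${string: -1}

printf 'ones=%d last=%s\n' "$ones" "$last"
```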

score 4 · Answer 4 · answered Jun 15 '21 at 18:23


A more verbose way to read a string one character at a time:

string=11111001
arr=()
while IFS= read -r -d "" -n 1 char; do
    arr+=("$char")
done < <(printf '%s' "$string")
declare -p arr

outputs

declare -a arr=([0]="1" [1]="1" [2]="1" [3]="1" [4]="1" [5]="0" [6]="0" [7]="1")
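One thing the -d "" in that loop appears to buy is that a newline inside the string is read as an ordinary character instead of ending the input line. A small sketch of the same loop with a string chosen here for illustration:

```shell
#!/usr/bin/env bash
string=$'a\nb'   # three characters: a, newline, b
arr=()

# -d '' makes NUL the delimiter, so the newline is read like any other
# character; IFS= keeps whitespace characters from being stripped.
while IFS= read -r -d '' -n 1 char; do
    arr+=( "$char" )
done < <(printf '%s' "$string")

declare -p arr
```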

answered Jun 15 '21 at 18:23

glenn jackman


Note that a string such as string=$'\xf0\x80++' in a UTF-8 locale would be split into $'\xf0\x80+' and +. – Stéphane Chazelas Jun 16 '21 at 15:45

score 3 · Answer 5 · answered Jun 16 '21 at 16:07


With bash 4.4+, as bash can't store NUL characters in its variables anyway, you could call a different utility to do the splitting and print the result NUL-delimited, which you can read into an array with readarray -td ''.

If your system comes with the GNU implementation of grep, you could do:

readarray -td '' bytes < <(printf %s "$string" | LC_ALL=C grep -zo .)
readarray -td '' chars < <(printf %s "$string" | grep -zo .)
readarray -td '' graphemes < <(printf %s "$string" | grep -zPo '\X')

All but the first will skip bytes that don't form part of valid characters in the locale (at least with GNU grep 3.4). For instance, with string=$'Ste\u0301phane \\\xf0\x80z.' (the trailing part not forming valid UTF-8), in a UTF-8 locale, that gives:

declare -a bytes=([0]="S" [1]="t" [2]="e" [3]=$'\314' [4]=$'\201' [5]="p" [6]="h" [7]="a" [8]="n" [9]="e" [10]=" " [11]="\\" [12]=$'\360' [13]=$'\200' [14]="z" [15]=".")
declare -a chars=([0]="S" [1]="t" [2]="e" [3]="́" [4]="p" [5]="h" [6]="a" [7]="n" [8]="e" [9]=" " [10]="\\" [11]="z" [12]=".")
declare -a graphemes=([0]="S" [1]="t" [2]="é" [3]="p" [4]="h" [5]="a" [6]="n" [7]="e" [8]=" " [9]="\\" [10]="z" [11]=".")

If not on a GNU system, and assuming $string contains valid UTF-8 text, you could use perl instead:

readarray -td '' bytes < <(perl -0le 'print for split "", shift' -- "$string")
readarray -td '' chars < <(perl -CSA -0le 'print for split "", shift' -- "$string")
readarray -td '' graphemes < <(perl -CSA -0le 'print for shift =~ /\X/g' -- "$string")
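Since readarray -td '' needs bash 4.4 or later and grep -z is not available everywhere, a script could probe for both and otherwise fall back to a plain substring loop. This is only a sketch under those assumptions: split_chars and the global array chars are hypothetical names, and the two branches can differ on input that is not valid text in the locale (the grep branch skips such bytes, the loop does not):

```shell
#!/usr/bin/env bash
# Hypothetical helper: fills the global array "chars" with the characters of $1.
split_chars() {
    local s=$1 i
    chars=()
    if (( BASH_VERSINFO[0] > 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] >= 4) )) &&
       printf 'x' | grep -zo . >/dev/null 2>&1; then
        # bash 4.4+ and a grep that accepts -z: read NUL-delimited output.
        readarray -td '' chars < <(printf '%s' "$s" | grep -zo .)
    else
        # Fallback that works in older bash: one substring per offset.
        for (( i = 0; i < ${#s}; i++ )); do
            chars+=( "${s:i:1}" )
        done
    fi
}

split_chars 11111001
declare -p chars
```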

answered Jun 16 '21 at 16:07

Stéphane Chazelas


If I try your example string=$'Ste\u0301phane \\\xf0\x80z.' with Raku like so: arrayCH=($(printf %s "$string" | raku -e slurp.comb.print)) I obtain the error Malformed UTF-8 near bytes 5c f0 80 in block <unit> at -e line 1. (Errors with printf or echo on $string). However using echo directly on the string does not error out: arrayCH=($(echo 'Ste\u0301phane \\\xf0\x80z.' | raku -ne .comb.print)); echo "${arrayCH[@]}" returns S t e \ u 0 3 0 1 p h a n e \ \ \ x f 0 \ x 8 0 z . – jubilatious1 Jun 18 '21 at 16:12

@jubilatious1, you'd need echo -E $'...' or echo -e '...' assuming an echo implementation that supports those options, if you want those \u0301, \x... to be expanded. – Stéphane Chazelas Jun 18 '21 at 16:15
Thanks, Stéphane. It looks like my echo is too old. – jubilatious1 Jun 18 '21 at 17:12

@jubilatious1, you'd get the echo builtin of bash, but looking at your answer, that would indeed be from a 15 year old version of bash (3.2 is from 2006), so while it would support -e and -E (as long as the xpg_echo and posix options are not enabled), it likely wouldn't support \u0301. To output the UTF-8 encoding of the U+0301 character in there, you'd use: printf '\314\201'. In any case readarray -d needs 4.4 (from 2016) or above. – Stéphane Chazelas Jun 18 '21 at 17:17
It can be done in Raku like so: arrayCH=($(raku -e '"Ste\x[301]phane".comb.print')); echo "${arrayCH[@]}" returns: S t é p h a n e . I don't think Raku supports the \u0301 format, but maybe a regex might do, to generate the input string above. – jubilatious1 Jun 18 '21 at 17:42

GAD3R · Answer 6 · 2021-06-15T19:17:54.910


string=11111001
read -a array <<< $(echo "$string" | sed 's/./& /g')

sed inserts a space after each character, and read -a then splits the result on those spaces into the array.

edited Jun 15 '21 at 19:17

answered Jun 15 '21 at 16:45

GAD3R



That assumes $string doesn't contain backslashes nor newlines nor characters of $IFS and is not something like -Enee and that $IFS contains the space character. – Stéphane Chazelas Jun 16 '21 at 15:47
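Some of the assumptions listed in that comment can be removed with small changes: printf instead of echo, read -r so that backslashes are kept, a quoted command substitution, and IFS set for the read command only. The variant below is a sketch, not the answerer's code, and it still mis-splits strings that themselves contain spaces or newlines:

```shell
#!/usr/bin/env bash
string=11111001

# printf avoids echo's option and escape handling (e.g. a string like -Enee);
# read -r keeps backslashes literal; IFS=' ' applies to this read only.
IFS=' ' read -r -a array <<< "$(printf '%s\n' "$string" | sed 's/./& /g')"

declare -p array
```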

jubilatious1 · Answer 7 · 2021-07-15T17:17:12.433

Using Raku (formerly known as Perl_6):

~$ OLDIFS="$IFS"
~$ IFS=" "
~$ string=11111001
~$ read -a array <<< "$(raku -e lines.comb.print <<<"$string")"
~$ declare -p array
declare -a array='([0]="1" [1]="1" [2]="1" [3]="1" [4]="1" [5]="0" [6]="0" [7]="1")'
~$ IFS="$OLDIFS"
~$ echo -n "$IFS" | raku -e 'dd($*IN.slurp);'
" \t\n"

Unicode in Raku:
According to the docs, "Raku applies normalization by default to all input and output except for file names, which are read and written as UTF8-C8; graphemes, which are user-visible forms of the characters, will use a normalized representation." So the code/characters below give the following results:

~$ OLDIFS="$IFS"
~$ IFS=" "
~$ string1="palmarés,Würdigung,Témoignages d'honneur"
~$ read -a array1a <<< "$(raku -e lines.subst\(/"\s"/,｢_｣\).split\(｢,｣\).print <<<"$string1")"
~$ echo "${array1a[@]}"
palmarés Würdigung Témoignages_d'honneur
~$ declare -p array1a
declare -a array1a='([0]="palmarés" [1]="Würdigung" [2]="Témoignages_d'\''honneur")'
~$ read -a array1b <<< "$(raku -e lines.comb.print <<<"${array1a[2]}")"
~$ echo "${array1b[@]}"
T é m o i g n a g e s _ d ' h o n n e u r
~$ declare -p array1b
declare -a array1b='([0]="T" [1]="é" [2]="m" [3]="o" [4]="i" [5]="g" [6]="n" [7]="a" [8]="g" [9]="e" [10]="s" [11]="_" [12]="d" [13]="'\''" [14]="h" [15]="o" [16]="n" [17]="n" [18]="e" [19]="u" [20]="r")'
~$ IFS="$OLDIFS"
~$ echo -n "$IFS" | raku -e 'dd($*IN.slurp);'
" \t\n"

https://docs.raku.org/language/unicode#Normalization
https://github.com/MoarVM/MoarVM/blob/master/docs/strings.asciidoc
[code tested on: GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)]

EDIT_1: The 'right' way to handle strings with embedded newlines appears to be slurping the string instead of reading it line by line with Raku's -n command-line flag, i.e. raku -e slurp.comb.print instead of raku -ne .comb.print. Then $IFS can be tuned to create an array using (or ignoring) newlines.

EDIT_2: As noted by @StephaneChazelas and @roaima, asterisks (*s) are problematic due to file-globbing. Here's code showing that the quoting here (and above) is proper:

~$ string_star="*11111001"
~$ echo "$string_star"
*11111001
~$ read -a array_star <<< "$(raku -e slurp.comb.print <<<"$string_star")"
~$ echo "${array_star[@]}"
* 1 1 1 1 1 0 0 1

Double-quoting is essential (above). As an extra measure, however, Raku can be used to delete every * by adding a call such as .subst(...) (here substituting with nothing). Work-in-progress code below (consider applying the same approach to delete other characters that are special in bash, such as \, [, and ?):

~$ read string_nostar <<< "$(raku -e slurp.subst\(｢*｣\).print <<<"$string_star")"
~$ read -a array_nostar <<< "$(raku -e slurp.comb.print <<<"$string_nostar")"
~$ echo "${array_nostar[@]}"
1 1 1 1 1 0 0 1

I edited my post, adding code to remove *'s from input strings. See "EDIT_2". (Also, I tried to address embedded newlines with "EDIT_1"). — jubilatious1, Jun 18 '21 at 18:16
* is not the only problem: wildcard characters, newlines and all characters in $IFS are too. It also assumes $IFS contains the SPC character. See Security implications of forgetting to quote a variable in bash/POSIX shells. When using the split+glob operator like you do here by leaving that $(...) unquoted in list context, you need to remember to tune it (set $IFS to the list of separators you want and disable glob if not wanted). – Stéphane Chazelas, Jun 18 '21 at 18:50
@StéphaneChazelas edited, generally after the reference you cited https://unix.stackexchange.com/q/171346/227738 , but also https://stackoverflow.com/a/11418930/7270649 — jubilatious1, Jun 21 '21 at 15:31
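The advice in these comments about tuning the split+glob operator can be sketched in bash as follows. split_on_spaces and the global array parts are hypothetical names, and sed stands in for the Raku call only so that the example needs no extra tools:

```shell
#!/usr/bin/env bash
# Hypothetical helper: splits $1 on spaces into the global array "parts",
# with globbing disabled so that * ? [ are kept as literal elements.
split_on_spaces() {
    local IFS=' ' had_noglob=0
    [[ $- == *f* ]] && had_noglob=1   # remember whether noglob was already set
    set -f                            # disable pathname expansion
    parts=( $1 )                      # deliberate unquoted split
    (( had_noglob )) || set +f        # restore globbing if we turned it off
}

string_star='*11111001'
spaced=$(printf '%s\n' "$string_star" | sed 's/./& /g')
split_on_spaces "$spaced"
declare -p parts
```

Without set -f, the lone * element would be replaced by the names of the files in the current directory.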
