
Note: Thank you to Jeff Schaller and steeldriver. But as neither posted as an answer, I'm not sure how to mark as solved. I now have a better understanding of pipes/subshells. I'm pretty sure I once knew this but it's been a long time since I tried anything complex in bash.

Both assigning the filtered result from awk to a variable and process substitution worked for me. My final code to read unsorted unique lines from stdin:

while read -r FILE
do
    ...
done < <(awk '!x[$0]++')

Reading up on process substitution is worthwhile for anyone who finds this question while looking for a solution to a similar problem.
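For completeness, here is a fuller version of that loop as a sketch. The here-string stands in for stdin, and the names `list` and `files` are only illustrative, not part of the original script:

```bash
#!/usr/bin/env bash
# Illustrative data standing in for stdin.
list=$'red apple\nyellow banana\npurple grape\norange orange\nyellow banana'

files=()
while IFS= read -r file; do
    files+=("$file")    # runs in the current shell, so the array persists
done < <(awk '!x[$0]++' <<< "$list")

echo "array length = ${#files[@]}"
```

Because the loop reads from a redirection rather than sitting on the right-hand side of a pipe, `files` is still populated after `done`.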

ORIGINAL QUESTION:

I searched the site, but I can't find an answer to my problem.

I'm building an array from stdin and need to filter for unique lines. To do this, I'm using awk '!x[$0]++' which I've read is shorthand for:

awk 'BEGIN { while ((getline s) > 0) { if (!seen[s]) print s; seen[s]=1 } }'
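One way to convince yourself of what the one-liner does is to compare it with an ordinary pattern-action form. This is only a sketch; the sample data and the names `short` and `long` are made up for illustration:

```bash
list=$'red apple\nyellow banana\nred apple\npurple grape'

# Terse form: x[$0]++ evaluates to 0 (false) the first time a line is seen,
# so !x[$0]++ is true only then, and awk's default action prints the line.
short=$(awk '!x[$0]++' <<< "$list")

# Spelled-out form with an explicit test and counter.
long=$(awk '{ if (!seen[$0]) print; seen[$0]++ }' <<< "$list")

[ "$short" = "$long" ] && echo "same output"
printf '%s\n' "$short"
```

Both forms keep the first occurrence of each line and preserve the input order.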

The filter works as desired but the issue is that the resulting array from the while read loop is empty.

For example (using $list as a surrogate for stdin):

list=$'red apple\nyellow banana\npurple grape\norange orange\nyellow banana'
while read -r line; do
    array[count++]=$line
done <<< "$list"
echo "array length = ${#array[@]}"
counter=0
while [  $counter -lt ${#array[@]} ]; do
    echo ${array[counter++]}
done

produces:

array length = 5
red apple
yellow banana
purple grape
orange orange
yellow banana

But filtering $list with awk:

list=$'red apple\nyellow banana\npurple grape\norange orange\nyellow banana'
awk '!x[$0]++' <<< "$list" | while read -r line; do
    array[count++]=$line
done
echo "array length = ${#array[@]}"
counter=0
while [  $counter -lt ${#array[@]} ]; do
     echo ${array[counter++]}
done

produces:

array length = 0

But the output of awk '!x[$0]++' <<< "$list" appears fine:

red apple
yellow banana
purple grape
orange orange

I've tried examining each line in the while read loop:

list=$'red apple\nyellow banana\npurple grape\norange orange\nyellow banana'
i=0
awk '!x[$0]++' <<< "$list" | while read -r line; do
    echo "line[$i] = $line"
    let i=i+1
done

and it appears fine:

line[0] = red apple
line[1] = yellow banana
line[2] = purple grape
line[3] = orange orange

What am I missing here?

In case it's important, I'm using bash 3.2.57:

GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin15) Copyright (C) 2007 Free Software Foundation, Inc.

Mike

2 Answers

awk '!x[$0]++' <<< "$list" | while read -r line; do
    array[count++]=$line
done

Here the array is assigned inside the while loop, and that loop runs in a subshell because it sits on the right-hand side of the pipe.

Both $line and $array hold values only while the subshell is alive, so to speak.

Once the subshell finishes, aka dies, everything it set is discarded. The parent (spawning) shell's environment was never modified, so any variables set in the subshell are gone.

In this case:

  • $array removed,
  • $line removed.

Try this:

list=$'red apple\nyellow banana\npurple grape\norange orange\nyellow banana'
awk '!x[$0]++' <<< "$list" | while read -r line; do
    array[count++]=$line
    printf "array[%d] {\n" "${#array[@]}"  # array[num_of_elements] {
    printf "       %s\n" "${array[@]}"     # elements
    printf "}\n"                           # } end of array

done

printf "\n[ %s ]\n\n" "END OF SUBSHELL (PIPE)"

printf "array[%d] {\n" ${#array[@]}
printf "       %s\n" "${array[@]}"
printf "}\n"

Yields:

array[1] {
       red apple
}
array[2] {
       red apple
       yellow banana
}
array[3] {
       red apple
       yellow banana
       purple grape
}
array[4] {
       red apple
       yellow banana
       purple grape
       orange orange
}

[ END OF SUBSHELL (PIPE) ]

array[0] {

}

Or, in the words of the Bash manual.

We can start with Pipelines

[…] Each command in a pipeline is executed in its own subshell (see Command Execution Environment). […]

And the Command Execution Environment expands the adventure as follows:

[…] A command invoked in this separate environment cannot affect the shell’s execution environment.

Command substitution, commands grouped with parentheses, and asynchronous commands are invoked in a subshell environment that is a duplicate of the shell environment, except that traps caught by the shell are reset to the values that the shell inherited from its parent at invocation. Builtin commands that are invoked as part of a pipeline are also executed in a subshell environment. Changes made to the subshell environment cannot affect the shell’s execution environment. […]

It cannot affect the parent's environment, and therefore it cannot set variables there.
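The same rule can be seen with a plain counter. A minimal sketch, assuming bash with its default options (in particular, `lastpipe` not enabled); the variable names are illustrative:

```bash
# Pipe: the while loop runs in a subshell, so its increments are lost.
piped=0
printf 'a\nb\nc\n' | while read -r item; do
    piped=$((piped + 1))
done
echo "after pipe: $piped"

# Redirect from process substitution: the loop runs in the current shell.
redirected=0
while read -r item; do
    redirected=$((redirected + 1))
done < <(printf 'a\nb\nc\n')
echo "after redirect: $redirected"
```

Under those assumptions the first counter is still 0 after the loop, while the second ends up at 3.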

We can, however, redirect into the loop instead of piping into it, along these lines:

list=$'red apple\nyellow banana\npurple grape\norange orange\nyellow banana'

while read -r line; do
    arr[count++]=$line
done <<<"$(awk '!x[$0]++' <<< "$list")"

echo "arr length = ${#arr[@]}"
count=0
while [[  $count -lt ${#arr[@]} ]]; do
    echo "${arr[count++]}"
done
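One difference between the here-string form and process substitution may be worth knowing: a here-string always supplies at least one line, even when the command substitution produced nothing. The sketch below uses empty input to show this; the array names `via_herestring` and `via_procsub` are made up for illustration:

```bash
# Empty input: awk prints nothing at all.
via_herestring=()
while IFS= read -r line; do
    via_herestring+=("$line")
done <<< "$(awk '!x[$0]++' < /dev/null)"

via_procsub=()
while IFS= read -r line; do
    via_procsub+=("$line")
done < <(awk '!x[$0]++' < /dev/null)

echo "here-string: ${#via_herestring[@]} element(s)"
echo "process substitution: ${#via_procsub[@]} element(s)"
```

The here-string variant ends up with one empty element because `<<<` appends a newline to the empty string, whereas the process substitution variant yields an empty array. If the input may be empty, that can be a reason to prefer `< <(...)`.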
  • Thank you. I know how long it takes to write a detailed and informative answer. – Mike Jun 21 '16 at 20:42
  • @Mike: Know you got your answer (and you "got it") while I was writing, but posted :P. Please fill in / edit if you find any improvements to be done to the answer (i.e. to make it clearer.) You are in a unique position to see what is badly explained and what is not :) – user162946 Jun 21 '16 at 21:09
  • Just one minor typo unless "subhell" is one of Dante's circles. ;p I'm embarrassed how many hours I spent testing this (before posting) trying to understand what was happening. It was obvious in hindsight since I had echo statements of the array elements correctly stored inside the subshell but empty after. I just couldn't see the forest for the trees. – Mike Jun 22 '16 at 04:29

Some solutions to your problem without the loop

# use bash's mapfile with process substitution 
mapfile -t arr < <( awk '!x[$0]++' <<<"$list" )

# use array assignment syntax (at least bash, ksh, zsh) 
# of a command-substituted value split at newline only
# and (if the data can contain globs) globbing disabled
# note: IFS stays set to newline afterwards; save and restore it if that matters
set -f; IFS=$'\n' arr=( $( awk '!x[$0]++' <<<"$list" ) ); set +f
  • Thanks for the response. I had checked mapfile previously but it is not installed by default on a Mac and I want to make this script as portable as possible. – Mike Jun 22 '16 at 04:22
  • By the way I upvoted you but as I'm just a newbie here, my upvotes aren't publicly displayed. :p – Mike Jun 22 '16 at 04:24
  • For anyone reading this later and confused why mapfile (a builtin command) is missing on the Mac. It's a bash 4.x command and Apple is stuck back on 3.2.57 (listed in original question). I would install a more up-to-date shell but I wanted this script to be portable on Macs. Joy! – Mike Jun 24 '16 at 20:31
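For a script that has to run on both bash 3.2 and newer versions, one possible approach is to test for the builtin and fall back to a read loop. This is a sketch rather than a tested portability layer; the sample `list` is the one from the question:

```bash
list=$'red apple\nyellow banana\npurple grape\norange orange\nyellow banana'

if [[ $(type -t mapfile) == builtin ]]; then
    # bash 4 and later: mapfile is available as a builtin
    mapfile -t arr < <(awk '!x[$0]++' <<< "$list")
else
    # bash 3.2 (e.g. the stock macOS shell): plain read loop
    arr=()
    while IFS= read -r line; do
        arr+=("$line")
    done < <(awk '!x[$0]++' <<< "$list")
fi

echo "arr length = ${#arr[@]}"
```

Both branches read from process substitution, so the array is filled in the current shell either way.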