4

Is there a way to sort the list of positional parameters in POSIX sh? Each positional parameter could contain arbitrary characters (e.g. spaces, newlines, tabs, etc.). The sort algorithm should be general enough to sort the list based on any comparison defined by the programmer (e.g. numeric/lexicographical sorting using expr comparison, sorting that considers only a substring of each positional parameter, etc.).

It seems that POSIX sh's list of positional arguments has the characteristics of both a stack and a queue. It supports push (set -- "$x" "$@"), pop (x="$1"; shift), enqueue (set -- "$@" "$x") and dequeue (x="$1"; shift) operations. However, there can only be a single list, which means that sorting must be done in-place and I cannot use common algorithms such as merge sort and quicksort.

Sample:

#!/bin/sh

set -- "Hello" "world" 9 8 7 "$(printf "0\n1")" "$(printf "1\t2")"

How do I sort the positional parameters above using comparison

expr "$x" \< "$y" such that the result is the list below?

"$(printf "0\n1")" "$(printf "1\t2")" 9 8 7 "Hello" "world"

Note 1: I am looking for a POSIX solution that works for arbitrary positional arguments supplied by a user and arbitrary comparisons specified by the programmer.

Note 2: I am not trying to solve any practical problem. I have challenged myself to this sh sorting problem but I could not come up with a solution, so I am posting it here on Stack Exchange.

Flux
  • 2,938
  • I assume clunky workarounds like printing to a temp file, then using sort and reading the file are not what you're after, right? – terdon Nov 19 '23 at 16:36
  • 1
    Note that your "pop" and "dequeue" operations are the same. There's no reasonable way remove an element off the end of the list. (In Bash, you could do set -- "${@:1: $#-1}", but it's not POSIX and will croak if the list if empty to begin with.) This is probably one of those things that would be better done in another programming language with better support for data structures. – ilkkachu Nov 19 '23 at 17:15
  • That is to say, POSIX sh only allows indexing into the list of positional parameters for getting the values, but you can't assign values in the middle, and you can't remove them from the middle. Any algorithm that involves swapping list items is pretty much impossible to use. You could swap the first two with something like x=$1 y=$2; shift; shift; set -- "$y" "$x" "$@", though, but generalizing that to an arbitrary pair, or even an arbitrary consecutive pair gets rather difficult. – ilkkachu Nov 19 '23 at 17:23
  • You could move the values to variables called e1, e2 etc. and do eval tricks to "index" that, but that way lies madness. You can't really even shove the values to an outside program (like sort), since to be able to hold arbitrary values, you'd need to read NUL-separated output back from the program, and that's not really easy POSIXly. – ilkkachu Nov 19 '23 at 17:25

4 Answers4

3

Probably easiest is to resort to awk which can do strcoll(), strcmp(), and number comparisons (including of floating points).

To avoid reinventing the wheel, we can use the quicksort awk implementation at https://rosettacode.org/wiki/Sorting_algorithms/Quicksort#AWK (GPLv2).

Then use (Edit: see further down for a better, safer approach):

code=$(
  awk -v q="'" -- '
    code-from-that-link-above
    BEGIN {
      delete ARGV[0]
      n = qsortArbIndByValue(ARGV, sorted)
      printf "set --"
      for (i = 1; i <= n; i++) {
        s = ARGV[sorted[i]]
        gsub(q, q "\\" q q, s)
        printf " %s", q s q
      }
    }' "$@"
) || exit
eval "$code"

That assumes the positional parameters contain valid text in the user's locale and that the list is small enough to fit in command line arguments (and in awk's array size limit).

That uses awk's < operator which will do number comparison if operands are recognised as numbers or strcoll() comparison otherwise. You can force strcoll() comparison by changing the comparisons from a < b to a"" < b"" (fix the locale to C for strcmp() comparison) and force number comparison by changing to a+0 < b+0 (+a < +b would also be POSIX but in practice not portable), or you can always write a custom compare() awk function to do whatever comparison you want.

Note that to be POSIX compliant, that code should have gsub(q, q "\\\\" q q, s) instead of gsub(q, q "\\" q q, s), however the latter, even though yielding unspecified behaviour by POSIX is more portable as the former doesn't work properly with gawk unless $POSIXLY_CORRECT is in the environment nor busybox awk for instance.

If the data is not guaranteed to be valid text in the user's locale, one can set the locale to C and then the strings will be considered as arrays of bytes and sorted as if by strcmp() (on ASCII-based systems), not as per the user's locale collation order.

Giving eval something that is potentially the result of unspecified behaviour is quite uncomfortable, but on second thought, it should be possible to make it reliable if instead of having awk output something like set -- '3rd argument hoped to be quoted correctly' 'first' 'second', we have it output set -- "${3}" "${1}" "${2}".

That should also be easier to do, shorter and more efficient:

code=$(
  awk -- '
    code-from-that-link-above
    BEGIN {
      delete ARGV[0]
      n = qsortArbIndByValue(ARGV, sorted)
      printf "set --"
      for (i = 1; i <= n; i++)
        printf " \"${%s}\"", sorted[i]
    }' "$@"
) || exit
eval "$code"
2

you need to use the shell to eval these string first, and then can sort the results, applying the same operations to the strings that get sorted, and the array of original indices. I'll illustrate below.

You could certainly implement your own sorting algorithm in shell script. Because the language is so badly suited for data structures/algorithms like that, it probably won't make much sense to aim for anything more complex (and theoretically less inefficient) than Bubble Sort. It'd be fairly short, ugly and hilarious.

  1. original array as given in your question.
  2. get the length of the array, and use seq to generate an array 1, 2, …, N
  3. make a new array of identical length let the shell evaluate every entry of your original entry (so that $(printf "0 \n" becomes 0 \n). This is what we'll actually need to sort
  4. implement bubble sort:
    1. compare the first element to the second. If the first is larger, according to your sorting criterion (which I'm not sure about, you're a bit muddy on that), then swap (using a temporary variable) the two values, as well as the two indices in the same position
    2. continue to compare the second and third; repeat the rule above; rinse and repeat for each pair of values in sequence.
    3. after doing the comparison (and swapping) of the next-to-last with the last element, start from the beginning again (you can stop comparing one element earlier than in the previous run)
    4. at some point, you've got everything sorted. Stop!
  5. using the reordered index array, you now construct the reordered array containing the original (uninterpreted) values.
  • how do you create another array to hold the indexes when POSIX sh only has the list of positional parameters to play with? – ilkkachu Nov 19 '23 at 17:17
  • different ways I considered, one that I liked was to keep the indices in a tab-separated string indices, the index into that string in a local variable start, and then use printf '%s\n' indices | cut -f -${start},$((start+2)),$((start+1)),$((start+3))- to get the elements before the potentially swapped, then the formerly second, now first element, the the formerly first, now second element, then the rest of the sequence; then increase start. (not sure whether POSIX sh does $(( )), might need a temporary variable to build that fields selector) – Marcus Müller Nov 19 '23 at 17:29
  • The trick is that unlike the arguments that need to be sorted, indices are pretty restricted in valid characters, so we can throw cut at them. – Marcus Müller Nov 19 '23 at 17:30
  • and how to keep both the original and the evaluated argument list around: do the evaluation and sorting in a separate shell process, and print only the permuted index vector, which the "outer" shell uses. – Marcus Müller Nov 19 '23 at 17:32
  • 1
    (Frankly, POSIX specifies a little program called c99; I'm told that is turing-complete…) – Marcus Müller Nov 19 '23 at 17:33
  • 1
    cut doesn't reorder the fields, though (printf '%s\t' a b c | cut -f 3,1 prints a\tc). But ok, sure you can do shenanigans like that. I'm still going to say it's insane. :) c99 would help, for sure. – ilkkachu Nov 19 '23 at 18:00
  • Insane? Maybe. Totally useless in practice? 100%! – Marcus Müller Nov 19 '23 at 18:02
0

Here's an idea:

  • In sh, escape all newline and backslash characters in each positional parameter, so that we could use the newline character as delimiter.
  • Pipe the newline-separated positional parameters to awk.
  • In awk, use quicksort to sort the parameters and write the newline-separated sorted parameters to standard output.
  • In sh, replace the unsorted list of positional parameters with the newline-separated sorted parameters from awk by splitting on newlines.
  • Unescape all newline and backslash characters in the list of sorted positional parameters.

Implementation:

#!/bin/sh
# Run this shell script to sort this list of positional parameters:
set -- \
    'Hello' \
    '  0  1  ' \
    '' \
    '*' \
    '\0150\0151' \
    "$(printf '\a\b\e\f\n\r\t\v')" \
    '\a\b\e\f\n\r\t\v%%\\' \
    'qwerty
uiop' \
    '

new

lines

'

printf '====== NOT YET SORTED ======\n' for param; do printf '%s' "$param" | od -ab done

quicksort() { for param; do # * Escape newlines (newline -> "\n" [two characters]). # Rationale: # * To be able to use newline as AWK record separator. # * To be able to use it as the shell's $IFS for separating records. # * Escape backslashes (backslash -> "\" [two characters]). # Rationale: # * To distinguish between escaped newlines and the two-character # string "\n" (escaped version: "\n" [three-character string]). # * To avoid special interpretation of backslashes when passed to # printf '%b' .... # # sed solution adapted from: # https://unix.stackexchange.com/questions/761885/how-to-convert-all-newlines-to-n-in-posix-sh-strings printf '%s\n' "$param" | LC_ALL=C sed ' :1 $ ! { N b1 } s/\/\\/g s/\n/\n/g' done | awk -f ./quicksort.awk }

Create temporary file.

tmpfile="$(printf 'mkstemp(template)'
| m4 -D template="${TMPDIR:-/tmp}/XXXXXX")" || exit 1 exec 3>"$tmpfile" 4<"$tmpfile" # Open tmpfile for writing and reading. rm -f -- "$tmpfile"

quicksort "$@" >&3 3>&-

set -- while IFS= read -r line; do # Unescape backslashes and newlines. # To preserve trailing newlines (if any), insert a trailing character 'x' # and later remove it. elem="$(printf '%bx' "$line")" set -- "$@" "${elem%x}" done <&4 4<&-

printf '\n====== SORTED RESULTS ======\n' for param; do printf '%s' "$param" | od -ab done

This is quicksort.awk:

# Takes newline-separated records from standard input and sorts them according
# to the `compare` function. Outputs the sorted newline-separated records to
# standard output.
#
# Backslash and newline characters supplied within each input record must be
# escaped like this:
# * backslash character -> "\\" (two backslash characters)
# * newline character -> "\n" (backslash character followed by the "n" character)
#
# Backslash and newline characters within each output record will be escaped in
# the same manner.
#
# Usage: awk -f quicksort.awk
#
# Attribution:
# Functions `quicksort` and `quicksort_swap` are adapted from the public domain
# quicksort implementation by Arnold Robbins retrieved from
# https://git.savannah.gnu.org/cgit/gawk.git/tree/awklib/eg/lib/quicksort.awk?h=gawk-5.3.0

function compare(a, b) { # Unescape backslashes and newlines. gsub(/\/, "\", a) gsub(/\/, "\", b) gsub(/\n/, "\n", a) gsub(/\n/, "\n", b)

# For sorting in ascending lexicographic order.
return a &lt; b

}

function quicksort(data, left, right, i, last) { if (left >= right) # Do nothing if array contains fewer than two elements. return

quicksort_swap(data, left, int((left + right) / 2))
last = left
for (i = left + 1; i &lt;= right; i++)
    if (compare(data[i], data[left]))
        quicksort_swap(data, ++last, i)
quicksort_swap(data, left, last)
quicksort(data, left, last - 1, less_than)
quicksort(data, last + 1, right, less_than)

}

function quicksort_swap(data, i, j, temp) { temp = data[i] data[i] = data[j] data[j] = temp }

{ lines[counter++] = $0 }

END { quicksort(lines, 0, counter - 1) for (k = 0; k < counter; k++) print lines[k] }

Tested in:

  • Debian 12, Dash 0.5.12, GNU sed 4.9, Gawk 5.2.1
  • Debian 12, Dash 0.5.12, Busybox 1.35.0's sed and awk
  • FreeBSD 13.2's sh, sed, and awk
Flux
  • 2,938
  • @StéphaneChazelas I have edited the shell script to fix the issues that arise when the positional parameters is set '*' or set a '' b. Please take a look. Thank you. – Flux Dec 01 '23 at 02:46
  • Note that setting IFS to newline for read doesn't make much sense since read only reads one line at a time, so there will never be any newline in it to split on. To read a line, the idiomatic syntax is IFS= read -r line. See Understanding "IFS= read -r line" – Stéphane Chazelas Dec 01 '23 at 10:17
  • If you set LC_ALL=C for sed, then you need to set it for awk as well, but then it will affect the string collation order. In the C locale, every byte is treated as a character. While in the user's locale in text utilities such as sed or awk (not printf), byte sequences are meant to be decoded into characters. In some locales, there are some characters whose encoding contains the encoding of backslash. – Stéphane Chazelas Dec 01 '23 at 11:03
  • @StéphaneChazelas Fixed the IFS issue. Thank you for the explanation. – Flux Dec 02 '23 at 22:08
  • @StéphaneChazelas Regarding the LC_ALL=C for sed: is it a good idea to remove LC_ALL=C and leave the choice of LC_ALL to the programmer/user? – Flux Dec 02 '23 at 22:12
  • Well with printf working at byte level, it's only guaranteed to work properly in the C locale (in practice that should also work OK in other single byte locales and in UTF-8 provided the strings are properly UTF-8 encoded). BTW, see the update to my own answer. – Stéphane Chazelas Dec 03 '23 at 09:20
0

Here is an insertion sort in POSIX sh:

sort ()
{
    f=false
    for x
    do
        if ! $f
        then
            set -- "$1"
            f=true
        else
            q=false
            # marginally faster than while + "$1" in my tests
            for y
            do
                if $q || "$cmp" "$x" "$y"
                then
                    set -- "$@" "$y"
                else
                    set -- "$@" "$x" "$y"
                    q=true
                fi
                shift
            done
            $q || set -- "$@" "$x"
        fi
    done
    "$cont" "$@"
}

islonger () { [ ${#1} -gt ${#2} ] }

puts () { printf %s\n "$@" }

cmp=islonger cont=puts sort "$@"

While not a pinnacle of efficiency, on my machine bash runs it fast enough to be useful on really small inputs:

$ shuf /usr/share/dict/words | head -n 50 | xargs -d '\n' sh -c 'time sh sort.sh "$@"' sh
sic
alto
Biko
needs
Capet
scuba
bowed
sicks
waxier
Kodaly
Tuvalu
hubbub
Gide's
panache
Joann's
peeling
mermaid
wingnut
savvies
crybaby
Python's
nitwit's
junction
tailored
tussocks
rotaries
Brandi's
leafiest
banknote
Spence's
Heriberto
prepaying
telephony
indelible
addendum's
stampeding
hatchway's
pathogenic
Englishman
escarole's
outstaying
synonymous
Youngstown
rebroadcast
overstuffed
interweaves
deliquescent
grandmothers
Cryptozoic's
mammography's

real 0m0.039s user 0m0.038s sys 0m0.001s

I believe many sorting algorithms can be implemented reasonably straightforwardly in POSIX shell. The biggest pain point is the non-existent swap operation. Of course, if you are willing to do things like serialization and eval, everything is possible.

Here is a merge sort using the same principles:

sort ()
{
    n=1
    while [ $n -lt $# ]
    do
        h=0
        k=$n
        l=$n
        while [ $h -lt $# ]
        do
            i=0
            j=0
            if [ $(($# - h)) -lt $((n << 1)) ]
            then
                if [ $(($# - h)) -le $n ]
                then
                    k=$(($# - h))
                    l=0
                else
                    l=$(($# % n))
                fi
            fi
            for x
            do
                [ $i -eq 0 ] && shift $k
                while [ $j -lt $l ]
                do
                    if [ $i -eq $k ] || "$cmp" "$x" "$1"
                    then
                        set -- "$@" "$1"
                        shift
                        j=$((j + 1))
                    else
                        break
                    fi
                done
                [ $i -eq $k ] && break
                set -- "$@" "$x"
                i=$((i + 1))
            done
            h=$((h + (n << 1)))
        done
        n=$((n << 1))
    done
    "$cont" "$@"
}
$ shuf /usr/share/dict/words | head -n 1000 | xargs -d '\n' sh -c 'time sh sort.sh "$@"' sh >/dev/null

real 0m19.918s user 0m19.917s sys 0m0.001s

gldrk
  • 1