
I am trying to sort my XML file, which looks like this, into alphabetical order. This is part of a larger bash script, so it needs to work within that script:

<Module>
    <Settings>
        <Dimensions>
            <Volume>13000</Volume>
            <Width>5000</Width>
            <Length>2000</Length>
        </Dimensions>
        <Stats>
            <Mean>1.0</Mean>
            <Max>3000</Max>
            <Median>250</Median>
        </Stats>
    </Settings>
    <Debug>
        <Errors>
            <Strike>0</Strike>
            <Wag>1</Wag>
            <MagicMan>0</MagicMan>
        </Errors>
    </Debug>
</Module>

I want the end result to look like this; I only want the innermost tags to be sorted:

<Module>
    <Settings>
        <Dimensions>
            <Length>2000</Length>
            <Volume>13000</Volume>
            <Width>5000</Width>
        </Dimensions>
        <Stats>
            <Max>3000</Max>
            <Mean>1.0</Mean>
            <Median>250</Median>
        </Stats>
    </Settings>
    <Debug>
        <Errors>
            <MagicMan>0</MagicMan>
            <Strike>0</Strike>
            <Wag>1</Wag>
        </Errors>
    </Debug>
</Module>

I am trying to use sort like this, where -t'>' makes > the field delimiter and -k4 sorts by the 4th field (which would be the innermost value), but it is not working:

sort -t'>' -k4 file > final.xml

I get funky output that sorts the other columns in with the sorted inner tags.

Any help would be appreciated

Ed Morton
  • 31,617
  • 1
    You have to use an XML parser to parse XML data. sort is only line-based, so it just can't handle XML. – glenn jackman Jul 20 '21 at 22:03
  • The usual text processing tools like grep, sed, awk etc are not well-suited to context-free languages such as XML. There are, however, tools made for processing XML. – berndbausch Jul 20 '21 at 22:04
  • @berndbausch I was able to use awk to successfully remove duplicate lines like this: awk '!seen[$0]++' $file > results.xml. It got rid of duplicate lines that were accidentally added. Is there definitely no way to sort with awk? – palacetrading Jul 20 '21 at 22:08
  • awk is probably Turing-complete, which would mean that you can implement an XML parser with it. However, why reinvent the wheel? – berndbausch Jul 20 '21 at 22:14
  • If you have more information about the XML file, it may be easier. For example, if you are certain that lines to be sorted contain the opening and closing tag, you could look for such lines and sort them. However, XML does not impose such restrictions. – berndbausch Jul 20 '21 at 22:18
  • 2
    Taking a step backwards for a moment, why do you need the XML file to be sorted? The usual XML parsing tools don't generally need to care – Chris Davies Jul 20 '21 at 22:20
  • 1
    Here is an example of what you're looking at; https://stackoverflow.com/q/9161934/7552 – glenn jackman Jul 20 '21 at 22:21
  • @berndbausch i'm working on an embedded platform so I don't have access to xml parser – palacetrading Jul 20 '21 at 22:23
  • 1
    @colinodowd when you say "working on an embedded platform" does that mean you only have the mandatory POSIX toolset (e.g. you have grep, sed, and awk but not perl or any other non-mandatory tools) or something else? Are they the GNU versions of those tools or something else (e.g. what does awk --version output)? – Ed Morton Jul 20 '21 at 22:51
  • Embedded as in one of the Entware platforms? That has an XML parser as well as many GNU tools - should you choose to install them – Chris Davies Jul 20 '21 at 23:03
  • 1
    xsltproc is a very widely used very widely ported relatively modest resource consumer (if bash runs, it's unlikely xsltproc can't run). One copy of xsltproc plus one question to the XSLT folks would probably get you a tiny XSLT program that will do the job efficiently and correctly (i.e., working no matter what the XML formatting is). Also provides some insurance for any future XML manipulation that may crop up for you. – Ron Burk Jul 21 '21 at 00:39
  • All the advice about using XML-aware tools is good in general BUT platforms do exist where there are no XML-aware tools yet tools still have to produce and consume text in some format and it's no less useful immediately to choose a restricted subset of XML as that format than it is to invent/use "Timmy's Interchange Format" and choosing to use XML at least gives you something you can pull off your box if/when necessary and run XML-aware tools on plus makes it easier in future if/when you CAN get XML-aware tools on your box. In those case, you still need some way to parse your files. – Ed Morton Jul 21 '21 at 11:52
  • So when someone posts a question and says "we use XML but we have no XML-aware tools", it's not an outrageous situation and saying "you shouldn't use XML" or "you need to have/install XML tools" isn't useful or necessary since they're invariably just using a small, well-formed, consistent, simple subset of XML that's just as easily handled with awk or similar as any other simple text file format they might have invented/chosen instead. – Ed Morton Jul 21 '21 at 11:55
  • @EdMorton I probably ought to take this to meta, but while I agree with you I also think there are times to question whether the solution is sufficiently robust or if alternative approaches would provide an all-round better result. Anything with XML could be argued as one of those times. – Chris Davies Jul 21 '21 at 12:55
  • 1
    @roaima I agree it's worth asking the question, but the OP has already replied that they do not have an XML parser, people are still telling them they need to have one, and a moderator removed the awk tag from this question (I added it back) when an awk solution is very likely to be exactly what the OP needs. I disagree with the "you must have an XML parser to use XML" mantra - there's nothing wrong with using a restricted set of XML on a box that doesn't have an XML parser, it's just engineering judgement, and If the OP decided to use "Timmy's Interchange Format", we'd help them. – Ed Morton Jul 21 '21 at 13:11
  • 1
    @roaima XML schemas can require XML payloads to be order dependent. – Will Hartung Jul 21 '21 at 14:46

3 Answers

8

[with a generous assist from Kusalananda]

You can do it using the xq wrapper from yq (a jq wrapper for YAML/XML) to leverage jq's sorting capabilities:

$ xq -x 'getpath([paths(scalars)[0:-1]] | unique | .[])
    |= (to_entries|sort_by(.key)|from_entries)' file.xml
<Module>
  <Settings>
    <Dimensions>
      <Length>2000</Length>
      <Volume>13000</Volume>
      <Width>5000</Width>
    </Dimensions>
    <Stats>
      <Max>3000</Max>
      <Mean>1.0</Mean>
      <Median>250</Median>
    </Stats>
  </Settings>
  <Debug>
    <Errors>
      <MagicMan>0</MagicMan>
      <Strike>0</Strike>
      <Wag>1</Wag>
    </Errors>
  </Debug>
</Module>

Explanation:

  • paths(scalars) generates a list of all paths, from root to leaf, then the array slice [0:-1] removes the leaf component, resulting in a list of paths to the deepest non-leaf nodes:

    ["Module","Settings","Dimensions"]
    ["Module","Settings","Dimensions"]
    ["Module","Settings","Dimensions"]
    ["Module","Settings","Stats"]
    ["Module","Settings","Stats"]
    ["Module","Settings","Stats"]
    ["Module","Debug","Errors"]
    ["Module","Debug","Errors"]
    ["Module","Debug","Errors"]
    
  • [paths(scalars)[0:-1]] | unique | .[] puts the list into an array so that it may be de-duplicated by unique. The iterator .[] turns it back to a list:

    ["Module","Debug","Errors"]
    ["Module","Settings","Dimensions"]
    ["Module","Settings","Stats"]
    
  • getpath() turns the de-duplicated list into the bottom-level objects whose contents may be sorted and updated with the |= update-assign operator.

The -x option tells xq to convert the result back to XML rather than leaving it as JSON.

Note that while plain sort works here in place of sort_by(.key), the former implicitly sorts by values as well as keys when the keys are non-unique.
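The heart of the update, to_entries|sort_by(.key)|from_entries, can be seen in isolation on a bare JSON object (a minimal sketch assuming jq is installed; the object and its values are illustrative, taken from the Dimensions block):

```shell
# Sort an object's keys alphabetically: explode it into {key, value}
# entries, sort those entries by key, and rebuild the object.
echo '{"Volume":13000,"Width":5000,"Length":2000}' |
  jq 'to_entries | sort_by(.key) | from_entries'
# emits the same object with its keys in Length, Volume, Width order
```

xq applies exactly this transformation to each innermost object before converting the result back to XML.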

steeldriver
  • 81,074
  • Maybe someone with stronger jq-fu can figure out how (map? with_entries?) to remove some of the duplication... – steeldriver Jul 20 '21 at 23:29
  • .Module[][] |= (to_entries|sort_by(.key)|from_entries) This makes it more explicit that you're sorting by the keys, and sorts all 2nd-layer objects down from .Module. You can't use with_entries() here without rethinking as you will need to have a construct like to_entries|map(something)|from_entries to use that. – Kusalananda Jul 21 '21 at 06:16
  • 1
    @Kusalananda thanks I always forget that [] can be used to iterate nested objects, not just arrays. I was trying to do something from "the other end" using paths(scalars)[0:-1] but couldn't make it work. – steeldriver Jul 21 '21 at 11:54
  • We were just lucky that all the keys on the same level in the structure needed sorting. If it had been a more uneven structure to the document, you would have needed to do something like what your code does. – Kusalananda Jul 21 '21 at 11:55
  • 1
    @Kusalananda finally figured out a way to do it ... I think – steeldriver Jul 21 '21 at 14:48
  • That'll work, but looks slightly awkward. I would still go via .Module[][] I think, at least in this instance, but it's good see see alternatives. The nice thing about your solution is that it lends itself to situations where you need to hand-pick paths to modify. – Kusalananda Jul 21 '21 at 16:04
4

Using any awk, sort, and cut in any shell on every Unix box, and assuming your input is always formatted like the sample in your question — the lines to be sorted always have start/end tags on one line, the other lines don't, and <s don't appear anywhere else in the input:

$ cat tst.sh
#!/usr/bin/env bash

awk '
BEGIN { FS="<"; OFS="\t" }
{
    idx = ( (NF == 3) && (pNF == 3) ? idx : NR )
    print idx, $0
    pNF = NF
}
' "${@:--}" |
sort -k1,1n -k2,2 |
cut -f2-

$ ./tst.sh file
<Module>
    <Settings>
        <Dimensions>
            <Length>2000</Length>
            <Volume>13000</Volume>
            <Width>5000</Width>
        </Dimensions>
        <Stats>
            <Max>3000</Max>
            <Mean>1.0</Mean>
            <Median>250</Median>
        </Stats>
    </Settings>
    <Debug>
        <Errors>
            <MagicMan>0</MagicMan>
            <Strike>0</Strike>
            <Wag>1</Wag>
        </Errors>
    </Debug>
</Module>

The above uses awk to decorate the input to sort so that we can just run sort once on the whole file and then use cut to remove the number that awk added. Here are the intermediate steps so you can see what's happening:

awk '
BEGIN { FS="<"; OFS="\t" }
{
    idx = ( (NF == 3) && (pNF == 3) ? idx : NR )
    print idx, $0
    pNF = NF
}
' file
1       <Module>
2           <Settings>
3               <Dimensions>
4                   <Volume>13000</Volume>
4                   <Width>5000</Width>
4                   <Length>2000</Length>
7               </Dimensions>
8               <Stats>
9                   <Mean>1.0</Mean>
9                   <Max>3000</Max>
9                   <Median>250</Median>
12              </Stats>
13          </Settings>
14          <Debug>
15              <Errors>
16                  <Strike>0</Strike>
16                  <Wag>1</Wag>
16                  <MagicMan>0</MagicMan>
19              </Errors>
20          </Debug>
21      </Module>

awk '
BEGIN { FS="<"; OFS="\t" }
{
    idx = ( (NF == 3) && (pNF == 3) ? idx : NR )
    print idx, $0
    pNF = NF
}
' file | sort -k1,1n -k2,2
1       <Module>
2           <Settings>
3               <Dimensions>
4                   <Length>2000</Length>
4                   <Volume>13000</Volume>
4                   <Width>5000</Width>
7               </Dimensions>
8               <Stats>
9                   <Max>3000</Max>
9                   <Mean>1.0</Mean>
9                   <Median>250</Median>
12              </Stats>
13          </Settings>
14          <Debug>
15              <Errors>
16                  <MagicMan>0</MagicMan>
16                  <Strike>0</Strike>
16                  <Wag>1</Wag>
19              </Errors>
20          </Debug>
21      </Module>

Alternatively, using GNU awk for sorted_in:

$ cat tst.awk
BEGIN { FS="<" }
NF == 3 {
    rows[$0]
    f = 1
    next
}
f && (NF < 3) {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (row in rows) {
        print row
    }
    delete rows
    f = 0
}
{ print }

If you don't have GNU awk you can use any awk and any sort for that same approach:

$ cat tst.awk
BEGIN { FS="<" }
NF == 3 {
    rows[$0]
    f = 1
    next
}
f && (NF < 3) {
    cmd = "sort"
    for (row in rows) {
        print row | cmd
    }
    close(cmd)
    delete rows
    f = 0
}
{ print }

but it'll be much slower than the first two solutions above as it spawns a subshell to call sort for every block of nested lines.

Ed Morton
  • 31,617
2

Answered as asked: pure(ish) bash solution (still calls sort however). Produces specified output from example input. Fragile, of course, as any solution that treats XML as line-oriented must be.

#!/bin/bash

function FunkySort(){
    local inputfile="$1"
    local -a linestosort=()
    local line ltchars
    while IFS= read -r line; do
        # strip all but less-than characters
        ltchars="${line//[^<]}"
        # if we guess it is an "innermost" tag
        if [ ${#ltchars} -gt 1 ]; then
            # append to array
            linestosort+=("${line}")
        else
            # if non-innermost but we have accumulated innermost lines
            if [ ${#linestosort[@]} -gt 0 ]; then
                # then emit accumulated lines in sorted order
                printf "%s\n" "${linestosort[@]}" | sort
                # and reset the array
                linestosort=()
            fi
            printf "%s\n" "$line"
        fi
    done < "$inputfile"
}

FunkySort "test.xml" >"test.out"

Ron Burk
  • 155