3

I have a file that looks like this:

# Time-averaged data for fix avetimeall
# TimeStep Number-of-rows
# Row c_gyrationchunkall
1000 3
1 2.09024e-14
2 4.88628
3 5.69321
2000 3
1 2.10518e-14
2 8.33702
3 8.83162
3000 3
1 1.96656e-14
2 12.1396
3 11.5835
...

In my file, the first three lines are always headers. After the headers, my file lists blocks of data of the same size, each starting with a labeling subheader. I want to reorganize the data so that each block becomes a single line that starts with the relevant portion of that block's subheader (the timestep) and then lists the block's data values, all separated by spaces. As an example, I want to convert the sample above into:

# Time-averaged data for fix avetimeall
# TimeStep c_gyrationchunkall
1000 2.09024e-14 4.88628 5.69321
2000 2.10518e-14 8.33702 8.83162
3000 1.96656e-14 12.1396 11.5835
...

How do I do this in Bash? I have some experience in Bash, but I'm afraid not enough to handle this problem swiftly...

  • bash is a poor tool for multiline text processing. Perl (man perl) would be a better choice. – waltinator Dec 03 '23 at 00:59
  • for the line 1000 3, what is the significance of the 3? does this designate the number of follow-on lines? will all such lines always have a 3 in the 2nd field or could it vary, and if it can vary, then please update the sample to show an example – markp-fuso Dec 03 '23 at 02:14
  • @markp-fuso Yes, 3 designates the number of follow-on lines, and it’s always there. All subheaders have that 3 there. – Felipe Evaristo Dec 03 '23 at 03:51
  • Hello, it looks like the header lines change somewhat in the output? Will the output always be first header line as is (no change), and second header line as a combination of rows 2 and 3? Thx! – jubilatious1 Dec 03 '23 at 05:08
  • @jubilatious1 Yes, that is correct. – Felipe Evaristo Jan 12 '24 at 23:07

5 Answers

6

Using any awk, and whether or not that 3 (the number of lines in each block) can vary:

$ awk '
    NR == 2 { $3=""; saved=$0; next }
    NR == 3 { $0=saved $3 }
    NR  < 4 { print; next }
    !numLines {
        numLines = $2
        printf "%s%s", $1, OFS
        next
    }
    { printf "%s%s", $2, (--numLines ? OFS : ORS) }
' file
# Time-averaged data for fix avetimeall
# TimeStep c_gyrationchunkall
1000 2.09024e-14 4.88628 5.69321
2000 2.10518e-14 8.33702 8.83162
3000 1.96656e-14 12.1396 11.5835
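
In case it helps to follow what each rule above is doing, here is the same script again with explanatory comments added (the logic is unchanged; only comments were added):

$ awk '
    NR == 2 { $3=""; saved=$0; next }   # header line 2: blank out "Number-of-rows", save "# TimeStep "
    NR == 3 { $0=saved $3 }             # header line 3: append c_gyrationchunkall (its 3rd field) to the saved text
    NR  < 4 { print; next }             # print header lines 1 and 3 (line 2 was merged into 3 above)
    !numLines {                         # counter exhausted, so this is a subheader line
        numLines = $2                   # number of data rows that follow
        printf "%s%s", $1, OFS          # start the output line with the timestep
        next
    }
    { printf "%s%s", $2, (--numLines ? OFS : ORS) }   # data row: print the value; end the line after the last one
' file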

Following up on a discussion under Xavier G.'s answer about a preference in style for readability, here is an awk script written in the same style as that shell script (and contained in a shell script so it behaves the same way externally), but it will run orders of magnitude faster* than the shell script and be more robust and portable:

$ cat ./script_filename
#!/usr/bin/env bash

awk '
BEGIN {
    # Reformat comments:
    getline first_line
    print first_line
    getline; split($0,line2)
    getline; split($0,line3)
    printf "# %s %s\n", line2[2], line3[3]

    # Reformat data:
    while ( getline > 0 ) {
        timestep=$1; number_of_rows=$2
        printf "%s", timestep
        for ( i=1; i<=number_of_rows; i++ ) {
            getline; row_value=$NF
            printf " %s", row_value
        }
        print ""
    }
}
'

$ ./script_filename < input
# Time-averaged data for fix avetimeall
# TimeStep c_gyrationchunkall
1000 2.09024e-14 4.88628 5.69321
2000 2.10518e-14 8.33702 8.83162
3000 1.96656e-14 12.1396 11.5835

* Here are the third-run timing results from running the bash script vs the above awk script on a file containing 90,000 of the OP's records:

$ time ./script_bash < file > /dev/null

real    0m9.425s
user    0m5.062s
sys     0m4.139s

$ time ./script_awk < file > /dev/null

real    0m0.265s
user    0m0.171s
sys     0m0.000s

Ed Morton
  • Puzzled why not #!/usr/bin/env awk -f instead of the shell script with a big constant. – Joshua Dec 03 '23 at 16:02
  • @Joshua you should never use a shebang to call awk, see https://stackoverflow.com/a/61002754/1745001 for just some of the issues you have to deal with if you do. You obviously don't need to use the same indenting as I used, you could just have awk 'BEGIN { and the final }' at the start of a line and start every line in between indented once or start them at the start of a line too if you prefer, it won't make much difference to the readability. – Ed Morton Dec 03 '23 at 16:17
  • @Joshua Besides what's mentioned in the answer linked by Ed, #!/usr/bin/env awk -f wouldn't work on Linux, because everything after the first space is a single argument. env wouldn't be able to find the executable awk -f. – JoL Dec 04 '23 at 01:00
  • @JoL: Now I understand why muawk exists. (muawk filename just did exec awk -f filename $@ although it wasn't written in shell.) I still have a copy of it on a superformatted floppy disk if I can find a way to read it again. Having not ever actually needed to worry about env awk -f I never found out it wouldn't work. Hmm. I wonder if I'm hallucinating a memory or if it is actually smart enough to split its argument if invoked directly by #!. – Joshua Dec 04 '23 at 04:17
  • @Joshua I don't know muawk. Trying to make sense of it from your description, I guess it was actually exec awk -f "$@", so the shebang would be #!/usr/bin/env muawk. muawk wouldn't have any need to split arguments. You know, #!/bin/awk -f does also work. It just means you hardcode the location instead of using $PATH via env. – JoL Dec 04 '23 at 05:31
  • @JoL: All the uses of it I saw in the wild were either #!/bin/muawk or #!/bin/muawk options. Note that -f didn't ever need to be given. Had the author known about -- (which he most definitely didn't) I'm pretty sure it would have added -- after the first argument. – Joshua Dec 04 '23 at 15:31
  • It was a very strange system. Most of the textutils were actually written in awk, and sendmail was written in this weird combination of awk and sh. – Joshua Dec 04 '23 at 15:39
3

Using Raku (formerly known as Perl_6)

Use skip to forget about header lines for the moment:

~$ raku -e 'my @a = lines.skip(3).rotor(4, partial => True).map: *.words; .[0,3,5,7].put for @a;'  file

#OR

~$ raku -e 'my @a = lines.skip(3).batch(4).map: *.words; .[0,3,5,7].put for @a;' file

Above is an answer coded in Raku, a member of the Perl-family of programming languages. Briefly, lines are read in, skipping the first 3 header lines. Every 4 lines are rotored/batched together, including final partial "rotorings" at the end of the file. While we're at it, let's break each rotor/batch into whitespace-separated words.

These batches of 4 lines, each broken on whitespace, are saved in an @-sigiled Array called @a; each batch flattens to eight words (for the first block: 1000 3 1 2.09024e-14 2 4.88628 3 5.69321). Finally (in the second statement), a for loop iterates over each element of @a and outputs it, dropping the unwanted elements and keeping only the timestep and the three values via the indexing brackets [0,3,5,7].

Sample Input:

# Time-averaged data for fix avetimeall
# TimeStep Number-of-rows
# Row c_gyrationchunkall
1000 3
1 2.09024e-14
2 4.88628
3 5.69321
2000 3
1 2.10518e-14
2 8.33702
3 8.83162
3000 3
1 1.96656e-14
2 12.1396
3 11.5835

Sample Output:

1000 2.09024e-14 4.88628 5.69321
2000 2.10518e-14 8.33702 8.83162
3000 1.96656e-14 12.1396 11.5835

Regarding the header lines, it would be just as easy to start the Raku code with two put statements, e.g. put "Time-averaged data..."; etc. But indeed, the following works to give the output desired by the OP:

~$ raku -e 'lines[0].put; .words[0..1, *-1].put for lines[0..1].rotor(2);  \
            my @a = lines.rotor(4, partial => True).map: *.words;          \
            .[0,3,5,7].put for @a;'  file
# Time-averaged data for fix avetimeall
# TimeStep c_gyrationchunkall
1000 2.09024e-14 4.88628 5.69321
2000 2.10518e-14 8.33702 8.83162
3000 1.96656e-14 12.1396 11.5835

https://raku.org

jubilatious1
3

Using awk (this first version assumes each block has exactly 3 data rows, i.e. 4 lines per block, as in the sample):

$ awk '
    NR==2{sub(/[[:space:]]+[^[:space:]]+$/,"");rec = $0; next}
    NR==3{$0 = rec OFS $NF};
    NR<4;
    NR>3{printf "%s", (NR%4==0) ? ((NR==4) ? "" : ORS) $1 : ($1="")$0 }
    END{if (NR)print ""}' file
Or, if the number of data rows per block can vary, reading the row count from each block's subheader instead:

$ awk '
    NR==2{sub(/[[:space:]]+[^[:space:]]+$/,"");rec = $0; next}
    NR==3{$0 = rec OFS $NF};
    NR<4;
    $NF ~ /^[0-9]+$/{a=$NF;n=NR+a; sub(/[[:space:]]+[^[:space:]]+$/,""); printf "%s", $0; next}
    NR<=n{$1 =""; printf "%s", $0((NR==n) ? ORS : "") }' file
2

Quick and dirty answer -- feel free to run shellcheck on this:

#!/usr/bin/env bash

# Reformat comments:

read -r first_line
echo "${first_line}"
read -r sharp line2_word1 line2_word2
read -r sharp line3_word1 line3_word2
echo "# ${line2_word1} ${line3_word2}"

# Reformat data:

while read -r timestep number_of_rows; do
    echo -n "${timestep}"
    for (( i=1; i<=number_of_rows; i++ )); do
        read -r row value
        echo -n " ${value}"
    done
    echo
done

Usage: ./script_filename < input

Limitations:

  • this script assumes data lines are ordered (i.e. 1, 2, 3, as shown in the example)
  • this script does not handle interrupted data (e.g. announcing 3 lines of data but providing only 1); a possible guard is sketched after this list
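
If that second limitation matters, one minimal guard (a sketch only, not part of the original script above) is to check whether the inner read succeeds and stop with a warning when a block is shorter than announced:

# Reformat data, bailing out if a block is shorter than announced:
while read -r timestep number_of_rows; do
    printf '%s' "${timestep}"
    for (( i=1; i<=number_of_rows; i++ )); do
        if ! read -r row value; then
            printf '\n'
            echo "warning: block for timestep ${timestep} is truncated" >&2
            exit 1
        fi
        printf ' %s' "${value}"
    done
    printf '\n'
done
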
  • My first thought was to answer with a well-presented Perl one-liner but I decided to stick with bash because the question is "How do I do this in Bash?" and I still think "Don't" is too dogmatic, especially since I perceive that readability is more important than performance here (I could be wrong though, only OP can tell). In the meantime, I ran shellcheck on my side and adjusted a couple things. Do you see security issues in the code above? – Xavier G. Dec 03 '23 at 14:27
  • When people ask "how do I do this in bash?" they're never asking for how to do it using only bash builtins. I don't see any security issues but those echos will do different things depending on which version of echo you're picking up and what the values of the variables are. You could use printf '# %s %s\n' "$line2_word1" "$line3_word2", for example, to remove that issue. Other than that it'd just be orders of magnitude slower to run than an awk (or sed, perl, etc.) script. – Ed Morton Dec 03 '23 at 14:35
  • If it's that you find the style you used to be more readable than idiomatic awk, you can write exactly the same style of code in awk using getline instead of read. I added an awk script to the end of my answer that shows how to do that. – Ed Morton Dec 03 '23 at 14:48
  • Ah, yes, the builtin echo could be disabled and we could pick a variant that does not support -n. (also, xpg_echo could be on) printf could also be disabled, but the behaviour of %s and \n remains reliable. Essentially, the fact that bash builtins can be disabled is indeed a strong argument. – Xavier G. Dec 03 '23 at 14:56
  • Your second version of your awk answer is indeed much more appealing. – Xavier G. Dec 03 '23 at 15:00
  • I understand why you'd feel that way. It took me a bit of usage to really appreciate and understand the benefits of awk already having the while-read loop, splitting input into fields, and condition{action} body structure built into the tool/language. – Ed Morton Dec 03 '23 at 15:04
  • I just added some timing results to the bottom of my answer too in case you're interested. – Ed Morton Dec 03 '23 at 15:28
2

With the caveats mentioned in your question, and using your sample input as file q762948, you can do this with a simple awk command:

$ head -2 q762948 >result.txt
# dump the comments as required
$ tail +4 q762948 | awk '{c=(NR-1)%4} c==0{p=$1;print ""} c>0{p=$2}{printf p"  "}'>>result.txt    
$ cat result.txt
# Time-averaged data for fix avetimeall
# TimeStep Number-of-rows
1000 2.09024e-14 4.88628 5.69321
2000 2.10518e-14 8.33702 8.83162
3000 1.96656e-14 12.1396 11.5835
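
Note that head -2 keeps the second header line exactly as it appears in the input (# TimeStep Number-of-rows) rather than the merged # TimeStep c_gyrationchunkall header requested in the question. If that matters, the comment-dumping step could be swapped for something along these lines (a sketch, reusing the same q762948 file name):

$ awk 'NR==1; NR==2{h=$2} NR==3{print "# " h, $3; exit}' q762948 >result.txt

The rest (the tail +4 | awk ... >>result.txt step) stays the same.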

user9101329