Make new column at each occurence of a specific row value using awk

Question

I have a text file column.txt with two columns like shown below:

  1 1.1
  2 4.0
  3 3.2
  start newset
  1 2.2
  2 6.1
  3 10.3
  4 2.1
  start newset
  1 18.2
  2 4.3

I need to convert this to multiple columns such that a pair of new column is created when start newset is reached. Therefore, my desired output text file would look like (I would like to discard the row with string start newset):

  1 1.1 1 2.2 1 18.2
  2 4.0 2 6.1 2 4.3
  3 3.2 3 10.3
        4 2.1

score 2 · Accepted Answer · edited May 20 '18 at 21:45

You can try this awk

awk '
    /^start newset/ {
        max = max>i ? max : i
        i = 0
        set++
        next
    }
    {
        ++i
        a[i][set] = $0
    }
    END {
        for( i=1 ; i<=max ; i++ ) { 
            for( j=1 ; j<=set ; j++ )
                b = b OFS a[i][j]
            sub( "\t" , "" , b )
            print b
            b=""
        }
    }
' set=1 OFS='\t' column.txt

Gaultheria · Answer 2 · 2018-05-21T08:21:55.647

csplit & paste

Use csplit to break one file into multiple files at a pattern. Then use paste to join the new files together.

awk 'NF' column.txt | csplit --suppress-matched -s -z -f INTERIM -n 4 - '/start newset/' '{*}' ; paste INTERIM* | expand -t 6,13 ; rm -f INTERIM*

The same code, reformatted for clarity:

awk 'NF' column.txt | \
csplit --suppress-matched -s -z -f INTERIM -n 4 - '/start newset/' '{*}' ;

paste INTERIM* | \
expand -t 6,13 ;

rm -f INTERIM*

Description:

awk 'NF' column.txt
Remove empty lines. Otherwise, empty lines in the input file would place extra column separators in the output.
csplit
- --suppress-matched
  Don't include lines containing the splitting pattern in the output.
- -s
  Don't show summary information about the output files.
- -z
  Don't produce empty output files (ie, when two adjacent lines of the input file contain the splitting pattern).
- -f INTERIM
  Filenames of the split files begin with this string.
- -n 4
  Filenames of the split files end with a number containing this many digits.
- -
  Take input from STDIN, since we're first running the input file through awk.
- '/start newset/'
  Split the input file at the first line containing this regular expression.
- '{*}'
  Keep splitting the input file on every additional line containing that regular expression.
paste INTERIM*
Join the interim files.
expand -t 6,13
Adjust the column spacing between the joined files (eg, start the second file at column 6 and the third file at column 13).
rm -f INTERIM*
Delete the interim files.

Example input file column.txt:

1 1.1
2 4.0
3 3.2
start newset
1 2.2
2 6.1
3 10.3
4 2.1
start newset
1 18.2
2 4.3

Example output:

1 1.1 1 2.2  1 18.2
2 4.0 2 6.1  2 4.3
3 3.2 3 10.3 
      4 2.1

It's a little more complicated if the lines of the input file and the final output are indented.

Example input file column.txt:

  1 1.1
  2 4.0
  3 3.2
  start newset
  1 2.2
  2 6.1
  3 10.3
  4 2.1
  start newset
  1 18.2
  2 4.3

Change awk 'NF' to awk 'NF { sub(/^ +/,"",$0) ; print $0 }' to remove the indentation before further processing.
Change expand -t 6,13 to awk '{ print " " $0 }' | expand -t 8,15 to indent the output.

Example output:

  1 1.1 1 2.2  1 18.2
  2 4.0 2 6.1  2 4.3
  3 3.2 3 10.3 
        4 2.1

Kusalananda · Answer 3 · 2018-05-20T09:57:34.587

1

Taking the route through some temporary files:

$ awk 'BEGIN { n = 1 } /^start newset/ { n++; next } { name = sprintf("tmp-%04d", n); print >name }' file

This will produce no output in the terminal, but will create files called tmp-n where n is a zero-filled four-digit integer greater than or equal to 1. There will be one file for each set of data.

Then we may paste these temporary files together:

$ paste tmp-*
1 1.1   1 2.2   1 18.2
2 4.0   2 6.1   2 4.3
3 3.2   3 10.3
        4 2.1

or, with spaces as delimiters instead of tabs:

$ paste -d ' ' tmp-*
1 1.1 1 2.2 1 18.2
2 4.0 2 6.1 2 4.3
3 3.2 3 10.3
 4 2.1

If there is a large number of sets in the data, then there are two issues with this:

You may run out of filedescriptors in awk. This may be protected against by changing the second awk code block from
```
{ name = sprintf("tmp-%04d", n); print >name }
```
to
```
{ name = sprintf("tmp-%06d", n); print >>name; close(name) }
```
(notice the change in the formatting string too, to allow for bigger numbers)
There may be issues executing the paste command since the pattern tmp-* expands to too many files. Let me know if this is an issue and I'll sort it out then (there will be a shell loop that builds up the result by adding columns from the tmp-* files).

edited May 20 '18 at 09:57

answered May 20 '18 at 09:48

Kusalananda

333,661

I have 136 sets of data with each set having somewhere between 450-650 lines. The awk code creates exactly that many files and everything is fine until this point. But when I execute the paste command I am not getting what is expected. – flappingwings May 20 '18 at 11:39
@flappingwings That's unfortunate. In what way is the result unexpected? – Kusalananda May 20 '18 at 11:42
It is adding extra lines beyond the last line of a set. Could it be due to different number of lines in each set? – flappingwings May 20 '18 at 11:47
@flappingwings that's because your terminal cannot show more than default I think 80 characters position, save the paste command result to a file and open with a text editor, that will display correctly – αғsнιη May 20 '18 at 12:01
@αғsнιη Thanks, but I am writing the output to a file and is using notepad++ to view the file. As I understand it is an issue of proper spacing. Because my second set has more rows (or lines) than the first set, all lines in second set beyond the number of lines in the first set is printed beneath the first. Btw, I tried the solution from the thread you pointed to as a possible answer to this question. It's not working for my problem. – flappingwings May 20 '18 at 12:07
for that, try paste tmp-*| column -s $'\t' -tn. The flagged question is exactly answer to your question – αғsнιη May 20 '18 at 12:15

Kamaraj · Answer 4 · 2018-05-20T12:17:47.040

$ awk '$1+0>=1{a[$1]=a[$1]" "$0}END{for (i in a)print a[i]}' file.txt 
 1 1.1 1 2.2 1 18.2
 2 4.0 2 6.1 2 4.3
 3 3.2 3 10.3
 4 2.1

the above awk command create an array called a and store/append the values based on the first column. once you fully read the file, just print the array values.

step 1 : a[1] = "1 1.1"
step 2 : a[2] = "2 4.0"
step 3 : a[3] = "3 3.2"
step 4 : ignore the line # 4. because the first column is not numeric
step 5 : a[1] = "1 1.1 1 2.2"
step 6 : a[2] = "2 4.0 2 6.1".
....
...
once the file is fully procssed by awk, then just print the array values a[1],a[2],a[3]...a[n]

Can you please add some explanations about the code? – Romeo Ninov May 20 '18 at 12:04 — Romeo Ninov, May 20 '18 at 12:04

Make new column at each occurence of a specific row value using awk

4 Answers4

csplit & paste