Converting rows to row groups

Question

Can somebody please help me with a shell script for the conversion below?

Source file (File 1), for example:

EXCHANGE_ID     :  192,                       410,
EXCHANGE_DTTM   :  2015-06-11+02:18:40+0000,        2015-06-11+02:12:28+0000,
PART_NAME       :  MRT,                     LR04,
PART_TRANS_ID   :  123,                       JAS04,
M_NAME      :  FAILED,  FAILED,
M_DTTM      :  2015-06-11T02:18:40+0000      2015-06-11T02:12:28+0000

OutPut as :

EXCHANGE _ID    :  192
EXCHANGE_ DTTM  :  2015-06-11T02:18:40+0000
PART_NAME       :  MRT
PART_TRANS_ID   :  123
M_NAME          :  FAILED
M _DTTM         :  2015-06-11T02:18:40+0000

EXCHANGE _ID    :  410
EXCHANGE_DTTM   :  2015-06-11T02:12:28+0000
PART_NAME       :  LR04
PART_TRANS_ID   :  JAS04
M_NAME          :  FAILED
M_DTTM          :  2015-06-11T02:12:28+0000

Here's what I've tried so far:

awk '{ for (i = 1; i <= NF; i++) f[i] = f[i] " " $i ; if (NF > n) n = NF } END { for (i = 1; i <= n; i++) sub(/^ */, "", f[i]) ; for (i = 1; i <= n; i++) print f[i] } ' FAILED.csv > TGT_FAILED.out

but that just prints things out as a CSV instead of the desired format. Here's an example of the actual output, as opposed to the desired output from above:

EXCHANGE_ID EXCHANGE_DTTM PART_NAME PART_TRANS_ID M_NAME M_DTTM
: : : : : :
192, 2015-06-11+02:18:40+0000, MRT, 123, FAILED, 2015-06-11T02:18:40+0000
410, 2015-06-11+02:12:28+0000, LR04, JAS04, FAILED, 2015-06-11T02:12:28+0000

Tried a few awk options but nothing worked for me ... would really appreciate a quick response. Thanks a lot in advance. — arsh, Jun 24 '15 at 15:15
Can you show us what you tried and what about it wasn't working for you? — Eric Renouf, Jun 24 '15 at 15:16
awk '{ for (i = 1; i <= NF; i++) f[i] = f[i] " " $i ; if (NF > n) n = NF } END { for (i = 1; i <= n; i++) sub(/^ */, "", f[i]) ; for (i = 1; i <= n; i++) print f[i] } ' FAILED.csv > TGT_FAILED.out — arsh, Jun 24 '15 at 15:24
I know my script is not giving me the right output. Can you help me with correct script to get the desired output ? — arsh, Jun 24 '15 at 15:43
Does your Source File 1 have more fields ? If so how many approximately. Also does Source File 1 have more repeating line groups like the 6 lines shown, and if so are they delimited by a blank line or suchlike? — Peter.O, Jun 24 '15 at 17:52
Source file does not have any more fields ... and this is exactly how the file looks... — arsh, Jun 25 '15 at 15:56

user1794469 · Answer 1 · 2015-06-24T18:35:52.783

How about:

awk -F'[ \t,]+' '{a=a$1"\t"$2"\t"$3"\n"; b=b$1"\t"$2"\t"$4"\n"} END {print a; print b}' data.txt

Here we treat one or more space, tab, or comma as a field separator. Then on each line we build the output. Finally we print the results. This is a pretty dirty one liner; for example it has to read the whole file before it prints anything so for large files it's going to be bad on memory, but for smallish files it should do the trick.

From your input, this should result in:

EXCHANGE_ID :   192
EXCHANGE_DTTM   :   2015-06-11+02:18:40+0000
PART_NAME   :   MRT
PART_TRANS_ID   :   123
M_NAME  :   FAILED
M_DTTM  :   2015-06-11T02:18:40+0000

EXCHANGE_ID :   410
EXCHANGE_DTTM   :   2015-06-11+02:12:28+0000
PART_NAME   :   LR04
PART_TRANS_ID   :   JAS04
M_NAME  :   FAILED
M_DTTM  :   2015-06-11T02:12:28+0000

If you want the fields to be nicely spaced you can add sprintf to the call like this:

awk -F'[ \t,]+' '{label=sprintf("%-10s",$1); a=a""label"\t"$2"  "$3"\n"; b=b""label"\t"$2"  "$4"\n"} END {print a; print b}' data.txt

This gives the prettier output of:

EXCHANGE_ID     :  192
EXCHANGE_DTTM   :  2015-06-11+02:18:40+0000
PART_NAME       :  MRT
PART_TRANS_ID   :  123
M_NAME          :  FAILED
M_DTTM          :  2015-06-11T02:18:40+0000

EXCHANGE_ID     :  410
EXCHANGE_DTTM   :  2015-06-11+02:12:28+0000
PART_NAME       :  LR04
PART_TRANS_ID   :  JAS04
M_NAME          :  FAILED
M_DTTM          :  2015-06-11T02:12:28+0000
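
The one-liners above hard-code two data columns ($3 and $4). As a rough sketch only, the same idea can be stretched to any number of columns by keeping one accumulator string per column. The function name rows_to_groups and the 16-character label width are assumptions made for illustration, and the sketch assumes that no line starts with whitespace.

```shell
# Hypothetical helper: turn "LABEL : v1, v2, ..." rows into one block per value column.
rows_to_groups() {
  awk -F'[ \t,]+' '
    NF == 0 { next }                 # skip empty lines
    {
      n = NF
      if ($n == "") n--              # a trailing comma leaves an empty last field
      for (i = 3; i <= n; i++)       # $1 is the label, $2 is the colon
        out[i] = out[i] sprintf("%-16s:  %s\n", $1, $i)
      if (n > max) max = n
    }
    END {
      # one block per data column, separated by a blank line
      for (i = 3; i <= max; i++)
        printf "%s%s", out[i], (i < max ? "\n" : "")
    }' "$@"
}
```

It can be called as rows_to_groups data.txt, or fed from a pipe. Like the one-liner, it holds the whole file in memory before printing.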

Thanks a lot for your response, but I am still not getting the desired output .... :( — arsh, Jun 24 '15 at 17:16
Note that your output is a little inconsistent. For example, EXCHANGE_ID has a space before the _ and DTTM has one before but only in the first case. I'm guessing the field names are contiguous and that's just a typo. — user1794469, Jun 24 '15 at 17:23
if col1 col2 col3 a b c 1 2 3 x y z
or

if col1 a,1,x col2 b,2,y col3 c,3,z

Required Output is

Col1 : a Col2 : b Col3 : c

Col1 : 1 Col2 : 2 Col3 : 3

Col1 : x Col2 : y Col3 : x

Appreciate your help — arsh, Jun 24 '15 at 19:38

Peter.O · Answer 2 · 2015-06-26T11:18:08.083

To handle an input file with just 1 group of 6 lines and just 2 data columns, a simple approach (limited, and greedy on resources) keeps the coding overhead minimal, i.e., there is no need for arrays to hold data from all 6 lines:

f='src.txt'       # input file
d=' +: +| +|, *'  # field delimiter regex
set {2,3}         # data columns - not label (which is column 1)
for c; do paste \
   <(gawk -F"$d" 'NR<=6{print $1"\t:"}' "$f") \
   <(gawk -F"$d" 'NR<=6{print $c}' c=$c "$f") |
     column -t; echo; done

The following awk method handles unlimited repeating 6-line groups, and any number of fields per line.

awk 'BEGIN{ FS=" +: +|, +|,| +"; OFS="\t"; maxw=length("EXCHANGE_DTTM") }
     /^EXCHANGE_ID/,/^M_DTTM/{ rn++
       if($NF=="") NF--
       for(fn=1;fn<=NF;fn++) cell[rn"."fn]=$fn
       if(rn==6){
         for(fn=2;fn<=NF;fn++) 
           for(rn=1;rn<=6;rn++)
             printf("%-"maxw"s : %s\n"(rn==6?"\n":""), cell[rn"."1], cell[rn"."fn]) 
         rn=0 }}' <"$f"

Output:

EXCHANGE_ID   : 192
EXCHANGE_DTTM : 2015-06-11+02:18:40+0000
PART_NAME     : MRT
PART_TRANS_ID : 123
M_NAME        : FAILED
M_DTTM        : 2015-06-11T02:18:40+0000

EXCHANGE_ID   : 410
EXCHANGE_DTTM : 2015-06-11+02:12:28+0000
PART_NAME     : LR04
PART_TRANS_ID : JAS04
M_NAME        : FAILED
M_DTTM        : 2015-06-11T02:12:28+0000

Here is a bash version of the awk script above. The logic flow is the same. I've added some error checking and blank-line skipping.

set EXCHANGE_DTTM; maxw=${#1}; nl=0    # length of longest label; line number
set -f; declare -A cell; cm=" "; nf=0  # no-globbing; cells-array; cell-margin; number-of-fields 
while IFS= read -r line; do ((nl+=1))  # increment line number
  [[ $line =~ ^[[:blank:]]*$ ]] && continue            # skip blank/empty lines 
  [[ $line =~ ^EXCHANGE_ID\ * ]] && rn=1 || ((rn+=1))  # reset/increment record number
  IFS=" ,"; f=(${line/ : / }); IFS=; f=(${f[@]})       # split line into fields
  (( nf )) && (( nf!=${#f[@]} )) && { echo ERROR: field count is not consistent; exit 1; } || nf=${#f[@]} 
  for (( fn=0; fn<nf; fn++ ));do cell[$rn.$fn]="${f[$fn]}"; done  # build cells-array
  (( rn==6 )) && {
    [[ $line =~ ^M_DTTM\ .* ]] || { echo ERROR: unexpected label found - record $rn$'\n'"$line"; exit 2; }
    for (( fn=1; fn<nf; fn++ )) ;do
      for (( rn=1; rn<=6; rn++ )) ;do
        (( rn==6 )) && b=$"\n" || b=""
        printf "%-${maxw}s${cm}:${cm}%s\n$b" "${cell[$rn.0]}" "${cell[$rn.$fn]}"
        done; done; } done <"$f"

awk '{print $1 ,": " $3}' Src.txt | column -t and awk '{for(x=2;$x;++x) print $1, $x "\n"}' Src.txt | column -t — arsh, Jun 25 '15 at 22:15
@arsh: I'm not sure what the point of your comment is. It seems you have 2 awk | column snippets, neither of which deals with the delimiters properly. Aside from the problem of delimiters, the first one deals with only the first data column. The second one's output is in the wrong 'orientation': though each output block is row-based, as per the input, each column needs to be processed individually (in blocks of 6 rows). What I wanted from my answer was to not use column, and to be able to handle any number of data columns and any number of 6-row blocks of data. — Peter.O, Jun 26 '15 at 00:06
the only reason I am using column -t is for proper indenting ... but I am not able to achieve the orientation of column names and data values ... — arsh, Jun 26 '15 at 02:13
By proper indenting do you mean a margin of 2 spaces on both sides of the : as in your output sample? My script uses a margin of 1 space on both sides of the :, but that is very easy to change. — Peter.O, Jun 26 '15 at 02:31
awk '{for(x=2;$x;++x) print $1, $x "\n"}' Src.txt Can you help me with proper looping for this command ? — arsh, Jun 26 '15 at 02:42
@arsh: Looping is not the primary issue. The main thing is that you need to store the relevant data from *all* 6 lines before you can start the looping ... arrays are the way to do it. Refer to my code above, where I store each *cell* in an array (awk's version of a 2-dimensional array); then, when all 6 lines have been read, the looping begins: see for(fn=2;fn<=NF;fn++) and for(rn=1;rn<=6;rn++) — Peter.O, Jun 26 '15 at 05:02
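
As an illustration of the store-then-loop idea in the comment above, here is a minimal sketch applied to the small col1/col2/col3 example from the earlier comment thread. The function name transpose_cells is hypothetical, and the sketch assumes the label is the first field and that values are separated by spaces or commas.

```shell
# Hypothetical helper: buffer every cell first, then print column by column.
transpose_cells() {
  awk -F'[ ,]+' '
    NF {
      rows++
      label[rows] = $1                       # first field is the label
      for (fn = 2; fn <= NF; fn++)
        if ($fn != "") {
          cell[rows "." fn] = $fn            # emulate a 2-D array with a "row.col" key
          if (fn > cols) cols = fn
        }
    }
    END {
      # outer loop over data columns, inner loop over the stored rows
      for (fn = 2; fn <= cols; fn++) {
        for (rn = 1; rn <= rows; rn++)
          printf "%s : %s\n", label[rn], cell[rn "." fn]
        if (fn < cols) print ""
      }
    }' "$@"
}
```

Nothing is printed until the END block, which is the point being made: the first output block needs a value from every input row, so the rows have to be stored first.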

score 0 · Answer 3 · answered Jun 26 '15 at 05:52

If you haven't yet considered how usefully sed might be applied to your data, please, allow me to demonstrate:

sed '    s/\([^, ]\{1,\}\),*/\1,/2
         s//\1/3;H;y/,/\n/;P;/^M_D.*/!d
         s///;x;s/[^, ]*, *//g
'        <<\IN
EXCHANGE_ID:  192,                       410,
EXCHANGE_DTTM:  2015-06-11+02:18:40+0000,        2015-06-11+02:12:28+0000,
PART_NAME:  MRT,                     LR04,
PART_TRANS_ID:  123,                       JAS04,
M_NAME:  FAILED,  FAILED,
M_DTTM:  2015-06-11T02:18:40+0000      2015-06-11T02:12:28+000
IN

So, on the first go-round we Print only up to the first occurring comma, and that would be far simpler if all of the fields were comma-delimited. But they're not, so about half of the work there is spent ensuring they are.

Anyway, we print only the header and the first following sequence of non-space characters. Next, we append a copy of the current line to Hold space, then delete it, unless it matches ^M_D. In that case we exchange hold and pattern spaces, then with a single s///ubstitution simultaneously remove the second sequence of non-space characters and all following spaces from every line we have just saved.

The result is printed to stdout:

EXCHANGE_ID:  192
EXCHANGE_DTTM:  2015-06-11+02:18:40+0000
PART_NAME:  MRT
PART_TRANS_ID:  123
M_NAME:  FAILED
M_DTTM:  2015-06-11T02:18:40+0000

EXCHANGE_ID:  410
EXCHANGE_DTTM:  2015-06-11+02:12:28+0000
PART_NAME:  LR04
PART_TRANS_ID:  JAS04
M_NAME:  FAILED
M_DTTM:  2015-06-11T02:12:28+000
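
The hold-space steps this answer relies on (H to append each line, x to swap the buffers on the final line) may be easier to see in isolation. The following is only a small sketch on unrelated toy input, not part of the script above; the function name join_lines is hypothetical.

```shell
# Hypothetical helper: join all input lines with commas using the hold space.
join_lines() {
  # H     append the current line to the hold space (preceded by a newline)
  # $     on the last line only:
  #   x   exchange hold and pattern spaces
  #   s   turn the embedded newlines into commas, then drop the leading one
  #   p   print the result (-n suppresses all other output)
  sed -n 'H;${x;s/\n/,/g;s/^,//;p;}' "$@"
}
```

For example, printf 'a\nb\nc\n' | join_lines collects the three lines in the hold space and emits them as a single comma-separated line once the last line has been read.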
