1

Please help me to convert the row values into column.  My input file.dat is:

First two fields are alpha string and numbers respectively and input will always have same number of fields i.e., 6.

INPUT

A|1|DLT|07:30|10:30|34
A|1|STG|07:30|10:30|NA
A|1|MAIN|06:30|10:30|NA
A|2|STG|07:30|10:30|NA
A|2|UNLD|04:30|10:30|90
B|1|DLT|03:30|10:30|34
B|1|STG|07:30|09:30|NA
B|1|MAIN|07:25|10:30|NA
B|1|UNLD|05:30|12:30|8

There is a change in the output. Output fields have 18 fields for all the rows and each row will have 1st two fields of input file followed with 3rd,...,18th fields in below order in which blank fields are missing rows in input file and order of records are DLT,STG,MAIN,UNLD following with corresponding records.

OUTPUT

A|1|DLT|07:30|10:30|34|STG|07:30|10:30|NA|MAIN|06:30|10:30|NA|||| 
A|2|||||STG|07:30|10:30|NA|||||UNLD|04:30|10:30|90
B|1|DLT|03:30|10:30|34|STG|07:30|09:30|NA|MAIN|07:25|10:30|NA|UNLD|05:30|12:30|8
  • UNLD is missing for A|1
  • DLT and MAIN are missing for A|2
  • none of the records are missing

Similarly if any of the records corresponding to 3rd field of input file are missing then it has to be replaced with blank values so that output fields will have 18 fields.

  • Hi Valentin, i want to convert the data mentioned as input file to output file format – Yuva Raj Nov 16 '16 at 09:33
  • Do you guarantee that the input will be sorted?  What general statement can you make about the first two fields? (For example, always one letter and one digit? or an alpha string and a number?)  Will there always be exactly six fields on each line?  Do you want your output data lines separated by blank lines (as the question currently shows)? Please do not respond in comments; [edit] your question to make it clearer and more complete. – G-Man Says 'Reinstate Monica' Nov 17 '16 at 01:30
  • I have edited my answer according to your new output requirements. – Valentin B. Nov 18 '16 at 16:22

3 Answers3

2

My awk proposition:

awk -F'|' '{ if (a[$1$2] == "") {
               a[$1$2] = $0
             } 
             else {
               a[$1$2] = a[$1$2]","$3"|"$4"|"$5"|"$6
             }
           }
           END {
             for (key in a) {
               print a[key]
             }
           }' <input.txt | sort

Explanation

The -F'|' option defines the field separator (what awk uses to parse fields within a line) as character '|', since this is how your file is formatted.

a[...] is an array. Arrays in awk function a bit like python dictionnaries, instead of indexes you can have keys that are in fact strings. For each line of the input file, the test if (a[$1$2] == "") checks for the key that corresponds to the first 2 fields ($1$2 = A1 for the first line for example) if there is an entry. If not (first A|1|... line read), the whole line is stored at this key (a[$1$2] = $0). If there is already something (another A|1|... line has already been stored), then we concatenate the entry with a comma and the fields from 3 to 6 separated by "|" (a[$1$2] = a[$1$2]","$3"|"$4"|"$5"|"$6).

Finally, when we have gone through the file, we need to output the entries for each key. We do this in an END block (those instructions are executed once all the file has been read). To do this we simply go through all the keys of the array (for (key in a)) and print the entry for each one.

The final output is then piped to sort, because awk will not run through array keys in alphanumeric order, so it's cleaner to sort the output so that you get the A|1|... line followed by A|2|... and so on.


Your last edit made the whole thing a bit trickier. The awk instructions needed are going to get a bit furry, so I would advise that you create a awk script file (create a text file with .awk extension e.g. myScript.awk). Copy the following script inside it:

BEGIN { FS="|" }

$3 == "DLT" {
  dlt[$1"|"$2]=$3"|"$4"|"$5"|"$6
  a[$1"|"$2]++
}

$3 == "STG" {
  stg[$1"|"$2]=$3"|"$4"|"$5"|"$6
  a[$1"|"$2]++
}

$3 == "MAIN" {
  main[$1"|"$2]=$3"|"$4"|"$5"|"$6
  a[$1"|"$2]++
}

$3 == "UNLD" {
  unld[$1"|"$2]=$3"|"$4"|"$5"|"$6
  a[$1"|"$2]++
}

END {
  for (key in a) {
    if (dlt[key] == "") dlt[key]="|||"
    if (stg[key] == "") stg[key]="|||"
    if (main[key] == "") main[key]="|||"
    if (unld[key] == "") unld[key]="|||"
    print key"|"dlt[key]"|"stg[key]"|"main[key]"|"unld[key]          
  }
}

To use it :

awk -f myScript.awk <input.txt | sort

If you understood the explanation to my initial answer, you should be able to understand this algorithm. This time we make an array for every data type (dlt, stg, main and unld) and store their values at the key corresponding to the first two fields. Array "a" is used to keep track of all possible keys. At the end we go through the keys of array a, and if one of the data arrays is empty at this key, fill it "|||" as you wanted, so that every line ends up with 18 fields.

0

A slight variation on Valentin B.’s answer:

awk -F"|" '
     {
        key = $1 "|" $2
        a[key] = 1
        b[key][$3] = $3 "|" $4 "|" $5 "|" $6
     }
 END {
        subtype[1] = "DLT"
        subtype[2] = "STG"
        subtype[3] = "MAIN"
        subtype[4] = "UNLD"
        for (key in a)
            {
                output = key
                for (i = 1; i <= 4; i++)
                    {
                        st = subtype[i]
                        if (b[key][st] == "")
                                output = output "||||"
                        else
                                output = output "|" b[key][st]
                    }
                print output
            }
     }' file.dat

As in Valentin B.’s answer:

  • -F"|" sets the field separator to |, so we can extract the fields from the input lines.
  • key gets set to A|1, A|2, or B|1 (or, in general, the concatenation of the first two fields).  We use this to combine the data from the (up to) four lines that relate to that key combination.
  • Set a[key] to 1 to keep a record of what data (keys) we’ve seen.
  • Set b[key][$3] to the rest of the line after the key.  For example, b["A|1"]["DLT"] will be set to "DLT|07:30|10:30|34".  This way, we keep a record of all the data we’ve seen (assuming that lines have no more than six fields, as specified in the question).  Note that we do this without knowing what might be in $3.

After we’ve read all the data:

  • Define an array of the subtypes (i.e., valid $3 values, which are “DLT”, “STG”, “MAIN”, and “UNLD”) so we don’t need to list them more than once.
  • Iterate over all the keys we have seen.  Begin to build the output line.
  • Iterate over the four subtypes (listed above).  If we have data for this key and this subtype, append it to the output line; otherwise, append four blank fields.
  • Print the line.

Pipe this into sort if you want the output sorted.  (I say “if” because, while the question shows the example output sorted, it doesn’t say that the output must be sorted.)

The question is imprecise about what will be in the second field; it says only that they are “numbers”.  If they might be integers with varying numbers of digits, and you want them sorted as numbers, use

sort -t"|" -k1,1 -k2,2n
-1

capture the values from 4th filed and store it in variable called "val"

create a array and append the value of array based on the first and second filed.

in the end, print the array values and do small adjustments ( replace, remove..etc )

$ awk -F\| '{ val=$3;
              for (i=4;i<=NF;i++) {val=val"|"$i}
              Arr[$1OFS$2]=Arr[$1OFS$2]","val;
              next
            }
        END { for (i in Arr) {
                  A=Arr[i];
                  sub(" ","|",i);
                  print i,A
              }
            }' test.txt | sed "s/ ,/\|/"
A|1|DLT|07:30|10:30|34 ,STG|07:30|10:30|NA ,MAIN|06:30|10:30|NA
B|1|DLT|03:30|10:30|34 ,STG|07:30|09:30|NA ,MAIN|07:25|10:30|NA ,UNLD|05:30|12:30|8
A|2|UNLD|04:30|10:30|90 ,DLT|08:30|11:30|4 ,STG|07:30|10:30|NA
Kamaraj
  • 4,365