
I want to merge different columns within my file that share the same column header. The file looks like this (it can be tab-separated or use some other delimiter):

AaBbN    CcDdEeN    FfN     AaBbN    FfN
1        5          4   
3        1          2
2        NA         1
                            1        3
                            3        2
                            NA       4

So there are numbers or the string "NA" in the fields. The result would look like this:

AaBbN    CcDdEeN    FfN
1        5          4
3        1          2
2        NA         1
1                   3
3                   2
NA                  4

There are a lot of columns, and they are not ordered, so the column headers would need to be read automatically rather than specified manually one by one. There are also a lot of empty fields. I've been looking into the paste and join commands to do the job. join in particular seems to do what I need, except that it works on separate files, whereas my columns are within the same file.

So I tried to separate the columns into separate files and then combine them with join. I used an awk command that I derived from here:

https://www.unix.com/unix-for-dummies-questions-and-answers/204303-splitting-up-text-file-into-multiple-files-columns.html

awk '{ for (i = 1; i <= NF; i++) printf("%s\n", $i) > (i ".txt") }' file.txt

which gives me separate columns, but here I ran into the first problem: any column that had empty fields between the header and its data was not processed correctly. The output files for those columns contained only the column header.
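If the columns are really fixed-width rather than delimiter-separated, one way to keep those empty fields when splitting is to cut by character position with substr() instead of letting awk split on whitespace. A sketch, with the column widths (9 and 3 characters) assumed purely for illustration:

```shell
# Sketch: cut fixed-width columns by position with substr(), so an
# empty field survives as a run of spaces instead of vanishing.
# The widths (9 and 3) are assumptions for this example.
printf 'AaBbN    FfN\n         4\n' |
awk '{ printf "[%s][%s]\n", substr($0, 1, 9), substr($0, 10, 3) }'
```

The brackets make the otherwise invisible empty first field visible in the second row.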

My second problem is with join: when I try to merge the files back together, I get errors because the input is not sorted. Sorting is of course out of the question, since it would destroy the row correspondence I am trying to preserve.
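As an aside, paste (unlike join) matches rows purely by position, so it needs no sort key at all; a minimal sketch with made-up file names:

```shell
# Sketch: paste aligns rows by position, so no sort key is needed;
# join, by contrast, matches on a key field and requires sorted input.
printf 'a\nb\nc\n' > col1.txt
printf '1\n2\n3\n' > col2.txt
paste col1.txt col2.txt    # rows stay in their original order
```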

So here I am at a dead end. Is there a more convenient way to merge the columns directly within a file?


Edit:

AdminBee's solution comes closest to solving the problem, but the result is not quite right. Here is the result of the awk script applied to the example above. I made sure that all entries are tab-separated with sed -i "s/[[:space:]]/ /g" (the replacement character is a literal tab, inserted with CTRL+V and TAB).

AaBbN   CcDdEeN FfN     FfN
1   5   4   

3   1   2

2   NA  1

            1
            3
            NA
– And
  • Is your input file space- or tab-separated? If it is tab-separated, there may be a way to do it in awk, otherwise more effort will be needed ... – AdminBee Apr 09 '20 at 08:47
  • I can change it to anything. If tab-separation is the best, then please use that. – And Apr 09 '20 at 08:48
  • Are you sure you can change to tabs? It won't be trivial in a large file with fixed-width columns like this. If you can, however, it would make things much easier. Is each field, including the headers, always only one character? – terdon Apr 09 '20 at 09:37
  • Are "duplicates" possible, i.e. a value for each "A" in line 1? What to do, then? – RudiC Apr 09 '20 at 09:41
  • The headers are actually strings of letters with varying length. And duplicates are not present in my file. – And Apr 09 '20 at 10:00
  • @And then please [edit] your question and show us a representative example of your file. Such details are essential. If you can indeed convert it to a tab-separated, or any other proper column-based format, instead of just visually aligning the fields, please do so. If each field is separated by one unique character, that makes everything easier. If we need to deal with multiple consecutive spaces and no set field delimiter, the problem is far more complicated. – terdon Apr 09 '20 at 10:03

3 Answers


If your input is tab-separated:

awk -F"\t" '
NR == 1 {for (i=1; i<=NF; i++)  COL[i] = $i
        }
        {for (i=1; i<=NF; i++) OUT[NR, COL[i]] = $i
        }
END     {for (n=1; n<=NR; n++)  {split ("", DUP)
                                 for (i=1; i<=NF; i++)  if (!DUP[COL[i]]++) printf "%s" FS, OUT[n,COL[i]]
                                 printf RS
                                }
        }
' file
AaBbN   CcDdEeN FfN
1       5       4
3       1       2
2       NA      1
1               3
3               2
NA              4

It saves the column headers for use as partial indices, then, for each line, collects the values into an array indexed by line number and header. In the END section, it prints that array in the original sequence, taking care of duplicate column headers.

Duplicate handling may become a major effort for more intricate file structures.
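If duplicates did occur, one possible refinement (a sketch of my own, not part of the answer above) is to keep the first non-empty value per row for a duplicated header, rather than letting later columns overwrite earlier ones:

```shell
# Sketch: with duplicated headers, take the first non-empty value per
# row instead of overwriting with the last one (file name is made up).
printf 'A\tB\tA\n1\t5\t\n\t6\t7\n' > dup.tsv
awk -F'\t' '
NR == 1 {for (i=1; i<=NF; i++) COL[i] = $i}
        {for (i=1; i<=NF; i++)
             if (OUT[NR, COL[i]] == "" && $i != "") OUT[NR, COL[i]] = $i}
END     {for (n=1; n<=NR; n++) {split("", DUP)
             for (i=1; i<=NF; i++)
                 if (!DUP[COL[i]]++) printf "%s%s", OUT[n, COL[i]], FS
             printf "%s", RS}}
' dup.tsv
```

Here the second data row has values in both "A" columns only in the duplicate position, so the merged "A" column picks up 7 from the later column without clobbering rows where the first column already carried a value.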

RudiC
  • It almost works, only the last few columns are empty except for the header. Also, an extra line with a tab is added between each line. I tried to delete those with sed '2~2d' file.txt, but it did not work. Do you know why? I updated my example to match my real data more closely. – And Apr 10 '20 at 09:32
  • A bit surprising. Above code applied to exactly your NEW input data gives exactly the desired output. – RudiC Apr 10 '20 at 09:50

A slightly different approach which doesn't require "buffering" the entire file:

AWK script colmerge.awk:

FNR==1{
    for (i=1; i<=NF; i++)
    {
        hdr[i]=$i;
        if (map[$i]==0) {map[$i]=i; uniq_hdr[++u]=$i; printf("%s",$i);}
        if (i==NF) printf("%s",ORS); else printf("%s",OFS);
    }
}

FNR>1{
    delete linemap;
    for (i=1; i<=NF; i++) if ($i!="") linemap[hdr[i]]=$i;
    for (i=1; i<=u; i++)
    {
        printf("%s",linemap[uniq_hdr[i]]);
        if (i==u) printf("%s",ORS); else printf("%s",OFS);
    }
}

Use as

awk -F'\t' -v OFS='\t' -f colmerge.awk file

This gathers all headers, identifying the "unique" ones and their first occurrence on line 1. For each subsequent line it builds a map from header to non-empty value, which is then printed in the order of the unique headers established while processing the first line.
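As a quick check (my own sketch, reusing the question's headers in a minimal tab-separated sample), the script can be exercised like this:

```shell
# Sketch: run colmerge.awk on a minimal tab-separated sample built
# from the question's headers; the empty fields are real empty columns.
cat > colmerge.awk <<'EOF'
FNR==1{
    for (i=1; i<=NF; i++)
    {
        hdr[i]=$i;
        if (map[$i]==0) {map[$i]=i; uniq_hdr[++u]=$i; printf("%s",$i);}
        if (i==NF) printf("%s",ORS); else printf("%s",OFS);
    }
}
FNR>1{
    delete linemap;
    for (i=1; i<=NF; i++) if ($i!="") linemap[hdr[i]]=$i;
    for (i=1; i<=u; i++)
    {
        printf("%s",linemap[uniq_hdr[i]]);
        if (i==u) printf("%s",ORS); else printf("%s",OFS);
    }
}
EOF
printf 'AaBbN\tCcDdEeN\tFfN\tAaBbN\tFfN\n1\t5\t4\t\t\n\t\t\t1\t3\n' > file
awk -F'\t' -v OFS='\t' -f colmerge.awk file
```

Note that the header line still carries trailing separators for the duplicated columns, because the script prints OFS even when it suppresses a duplicate header name.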

This only works, however, if your input file is tab-separated, as this is the only way to reliably detect "empty" fields.

Note also that the delete statement for the entire array linemap may not be supported by all awk implementations (should work on gawk, mawk and nawk, however).
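Where the whole-array delete is unavailable, split("", array) is a portable way to empty an array in POSIX awk; a small sketch:

```shell
# Sketch: split("", a) empties the array in POSIX awk, where the
# whole-array form "delete a" may be rejected.
awk 'BEGIN {
    a["x"] = 1; a["y"] = 2
    split("", a)               # clears every element
    n = 0; for (k in a) n++
    print n                    # number of remaining elements
}'
```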

AdminBee
  • Your answer works best so far, only some of the columns behave funny. After the end of the first iteration of columns, some columns lack their header, or are empty altogether. I updated my example to match my data more closely. – And Apr 10 '20 at 09:16
  • That is interesting. I tried the script with your updated sample input and it still works for me. Can you post sample input where it doesn't work (like amending your question with an "Edit" section that addresses the issues with my proposed script specifically), and also provide the output when you apply the script to that sample input? On a side note, such effects could happen if the field delimiters are either not all tabs, or some are spaces instead of tabs etc., so it may be worth verifying that the input file is really tab-separated without exception. – AdminBee Apr 14 '20 at 06:53
  • I see you updated the question. I think the sed line you are using is dangerous, because it requires that in the original form, the fields of the input file are separated by exactly one space. If that is not the case, the number of tabs in the transformed file will be wrong, and the column association fails. I copy-and-pasted your sample input into a file and manually changed all separations between the columns to exactly one tab, and the script worked in that case. So, I think the remaining problem is your transformation process. Please check how the columns are separated originally. – AdminBee Apr 14 '20 at 08:42
  • Also, it was already mentioned, but if you generated the text file on Windows, the line-breaks will be different from what Unix/Linux tools expect and that can lead to strange side-effects; you will need to run dos2unix to fix that. – AdminBee Apr 14 '20 at 09:31
  • Alright, dos2unix fixes the problem with the example above. But in my original data I still have a problem. The empty columns in between are gone, but at some point the last columns contain only the header and no information. I am using an Ubuntu shell on Windows 10, and use Excel to process the data. Do you have an idea what else could cause trouble in that regard? – And Apr 14 '20 at 09:51
  • That is difficult to assess without the actual data. Can you somehow post an anonymized version of the original data (at least the section that has distorted output) and what the output looks like when applying the script? Also, I would be curious to see what the output for the sample data you already provided looks now, it may contain clues ... – AdminBee Apr 14 '20 at 10:02

For tab-separated input:

Read the headers and their column number(s) into an array in the order they appear in the input file; then split the input file column by column, appending every column that shares a header to the same file, headerName.txt. Finally, paste the files together and use the column command to beautify the output.

awk -F'\t' '
    ## Find all column number(s) where the same header occurs and store them in
    ## the `h` array: the key is the column number, the value the header name.
    ## For example, for the header value "AaBbN" the keys are columns 1 and 4.
    NR==1{ while (++i<=NF) h[i]=$i; next; }

         { for (i=1; i<=NF; i++) {

    ## Write the field content to the file named after the header of the
    ## current column. For example, the first field sits in column 1, column 1
    ## belongs to header "AaBbN", so the value is appended to "AaBbN.txt",
    ## but only if it is not empty.
               if ($i!=""){ print $i> h[i]".txt" };
         }; }

    ## At the end, paste all those files together and beautify the output with
    ## the `column` command. The number of .txt files is limited to the number
    ## of unique headers in your input.
END{ system("paste *.txt | column -s\"\t\" -tn") }' infile

comments-free command:

awk -F'\t' '
    NR==1{ while (++i<=NF) h[i]=$i; next; }
         { for (i=1; i<=NF; i++) {
               if ($i!=""){ print $i> h[i]".txt" };
         }; }
END{ system("paste *.txt | column -s\"\t\" -tn") }' infile
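One caveat worth adding (my note, not from the answer): the per-header .txt files persist between runs, and paste *.txt will happily pick up stale ones from an earlier run, so clear them first, assuming no unrelated .txt files share the directory:

```shell
# Sketch: remove leftover per-header files from a previous run so
# "paste *.txt" only sees files produced by the current run.
rm -f ./*.txt
```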
αғsнιη
  • Unfortunately this does not work. There are a whole bunch of lines added with seemingly arbitrary content, and in other lines the data is definitely matched wrongly. I updated my example to match my real data more closely. – And Apr 10 '20 at 09:19
  • @And as mentioned, the data should be tab-separated, with exactly one tab between each pair of columns (including empty ones). My answer doesn't change for your new updated sample, as it is data-type independent. – αғsнιη Apr 10 '20 at 09:40
  • Yes, this is what I did. I have the information stored in an excel file and exported the sheet with tab-separation, if this is important. – And Apr 10 '20 at 09:44
  • @And OK, I think you first need to do dos2unix filename (or tr -d '\r' <filename), then apply the awk command above, assuming you exported the file on Windows – αғsнιη Apr 10 '20 at 10:10