
A similar question about transposing multiple columns into rows has been asked before. However, the column format I am dealing with is slightly different: my file contains blocks with a varying number of columns, and some column headers are duplicated across blocks.

For example:

100  
1  
2
3  

200  300  
1    1  
2    2
3    3

100  
1  
2  
3

400  500  600  700  800  900  
1    1    1    1    1    1
2    2    2    2    2    2
3    3    3    3    3    3

100 400 700   
1   1   1
2   2   2
3   3   3

I would like to transpose these columns into rows that look like the following:

100 1 2 3
200 1 2 3
300 1 2 3
100 1 2 3
400 1 2 3
500 1 2 3
600 1 2 3
700 1 2 3
800 1 2 3
900 1 2 3
100 1 2 3
400 1 2 3
700 1 2 3 

And then sort the rows by the values of the first column as follows:

100 1 2 3
100 1 2 3
100 1 2 3    
200 1 2 3
300 1 2 3
400 1 2 3
400 1 2 3    
500 1 2 3
600 1 2 3
700 1 2 3
700 1 2 3    
800 1 2 3
900 1 2 3 

And sum the values for duplicated rows as follows:

100 3 6 9    
200 1 2 3
300 1 2 3
400 2 4 6
500 1 2 3
600 1 2 3
700 2 4 6   
800 1 2 3
900 1 2 3 

You will notice that because the rows keyed 100, 400, and 700 had duplicates, their column values have been summed.

Any insights or suggestions are greatly appreciated.

Dave
  • How many rows and columns do you expect the output file to have? (i.e., can you build a structure in memory, or will you have to use temporary files of some sort) – cjm Nov 20 '13 at 23:10
  • @cjm I expect the output file to have a fixed number of columns (four columns in the above example), and a fixed number of rows (nine rows in the above example). However, the original input file might have varying occurrences of any one column. In the above example, the "100" column appears three times. But another input file may have the "100" column appear five times. – Dave Nov 21 '13 at 00:05
  • No, I'm trying to get a sense of the maximum size of the problem. Will it be 500 columns by 400 rows, or 5 billion columns by 4 billion rows? If the problem is small enough to hold the final output file in memory, it's much simpler to solve. – cjm Nov 21 '13 at 08:56

1 Answer


I would use Perl's paragraph mode for this:

#!/usr/bin/env perl

use strict;
use warnings;

my %final_lines;        # header value => arrayref of (summed) row values
my $filename = shift // die "No input file given!\n";
open my $IN, '<', $filename or die "Failed to open $filename for input: $!\n";

PARAGRAPH: {
    local $/ = "";              # Paragraph mode: read blank-line-separated blocks
    while (<$IN>) {             # Read a new "paragraph" (one block of columns)
        my @lines  = split /\n/;
        my @fields = split /\s+/, shift @lines;   # First line holds the column headers
        my $line_number = 0;
        for my $line (@lines) {
            my @data = split /\s+/, $line;
            # Accumulate each value under its column header; duplicated
            # headers across blocks are summed automatically
            $final_lines{ $fields[$_] }->[$line_number] += $data[$_]
                for 0 .. $#data;
            $line_number++;
        }
    }
}

# Print the accumulated rows, sorted numerically by header value
for my $key (sort { $a <=> $b } keys %final_lines) {
    local $, = ' ';             # Separate printed fields with a space
    print STDOUT $key, @{ $final_lines{$key} };
    print STDOUT "\n";
}

Use it like so:

$ /path/to/script input_file > output_file
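Run against the sample input from your question, it should produce exactly the requested result:

100 3 6 9
200 1 2 3
300 1 2 3
400 2 4 6
500 1 2 3
600 1 2 3
700 2 4 6
800 1 2 3
900 1 2 3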

This code is tested and should work fine. As @cjm pointed out in the comments, though, it will probably take some time on a large input file; the step most likely to dominate is the final numeric sort of the keys.
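If you haven't seen paragraph mode before: setting the input record separator $/ to the empty string makes the readline operator return whole blank-line-separated blocks instead of single lines, which is what lets the script treat each group of columns as one unit. Here is a minimal, self-contained sketch of just that mechanism (the sample data in the __DATA__ section is made up to mirror the question's format):

#!/usr/bin/env perl
use strict;
use warnings;

local $/ = "";                  # Paragraph mode: one record per blank-line-separated block
while (my $para = <DATA>) {
    my @lines = split /\n/, $para;
    print "Block with headers: $lines[0]\n";
}

__DATA__
100
1
2
3

200  300
1    1
2    2
3    3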

Joseph R.