
I have a correlation matrix of 22,000 genes, and for some analysis I need to split each row of the matrix into a new file. That means I need to create 22,000 individual files.

I don't want to use the split command, because I want each output file to be named after the gene (gene_name.txt). Example input:

                IGHD2-15    IGHD3-22    IGHD3-16    IGHD3-10    
       IGHD2-15 1   0.696084    0.799736    0.818788    
       IGHD3-22 0.696084    1   0.691419    0.67505 
       IGHD3-16 0.799736    0.691419    1   0.810656    
       IGHD3-10 0.818788    0.67505 0.810656    1   
Priya
  • Example input is a good first step, but we'll also need an example of the output you'd like to achieve. ;) – n.st Nov 29 '18 at 23:27
  • By the name of gene eg IGHD2-15 – Priya Nov 29 '18 at 23:50
  • Output file for IGHD2-15:

    IGHD2-15 1 0.696084 0.799736 0.818788

    – Priya Nov 29 '18 at 23:52
  • This question is completely on topic and welcome to stay here, but for future reference, you might be interested in our sister site: [bioinformatics.se]. – terdon Nov 29 '18 at 23:58

3 Answers


Assuming your gene names are in the first column, all you need is:

awk '{print >> $1".txt"; close($1".txt")}' matrix.txt

That will print each line into a file whose name is the 1st field of that line plus a (completely optional) .txt extension. If you don't want the gene name in the file, use:

awk '{n=$1; $1="";print >> n".txt"; close(n".txt")}' matrix.txt

And, if your first line is a header, use:

awk 'NR>1{print >> $1".txt"; close($1".txt")}' matrix.txt

Finally, in the unlikely case where the first field of a line isn't a simple gene name but might contain something dangerous, such as NULL or a valid path, you should sanitize your input first:

awk 'NR > 1 && ($1 ~ /^[A-Z0-9-]+$/) { print >> $1".txt"; close($1".txt") }' matrix.txt
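To see the header-skipping variant in action, here is a quick sanity check on a two-gene version of the example matrix (the file name matrix.txt and the sample values are just the question's example data):

```shell
# Build a small matrix file matching the question's layout:
# a header row of gene names, then one labelled row per gene.
cat > matrix.txt <<'EOF'
         IGHD2-15 IGHD3-22
IGHD2-15 1 0.696084
IGHD3-22 0.696084 1
EOF

# Skip the header row and append each remaining line to <gene>.txt,
# closing each file immediately to stay under the open-file limit.
awk 'NR>1 {print >> $1".txt"; close($1".txt")}' matrix.txt

cat IGHD2-15.txt
# IGHD2-15 1 0.696084
```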
terdon
  • Expecting more than 20,000 files, you might want to close() each file after printing... – RudiC Nov 30 '18 at 18:38
  • @RudiC why? There is no reason to assume the input file will be sorted, or that the genes will all be unique, that's why I'm using >>. What benefit would there be if I added close(n".txt") so the file would be closed each time? Actually, isn't that what awk will do anyway? How would an explicit close help? – terdon Nov 30 '18 at 18:43
  • 1
    awk will crash if you exceed the OPEN_MAX system configuration value, or its internal maximum (if different). Like: awk: cannot open "/tmp/1022" for output (Too many open files) and getconf OPEN_MAX 1024 – RudiC Nov 30 '18 at 18:53
  • @rudic I see, yes that makes a lot of sense. Answer edited, thanks! – terdon Dec 01 '18 at 00:59
  • @mosvy yes, I know, that's why the last line there is "if your first line is a header". – terdon Dec 01 '18 at 01:02
  • @mosvy I'm not at my machine now, but isn't that exactly what my last one produces? As for the names, yes that's a fair point but since this is tightly controlled data, I don't feel it's a big issue. That said, you're quite right and if the files have an existing path as the first field, that can indeed be dangerous. – terdon Dec 01 '18 at 01:32
  • @mosvy OK, I added that. Although I ran some tests and 200000 files seemed to cause no problems to my GNU awk. And while I agree that it's better safe than sorry, this sort of sanitation is rarely useful in this particular field because the input data is almost always pretty well controlled. – terdon Dec 01 '18 at 16:21

Since you didn't give an example of what you wanted each file to contain, or what the files should be named, I'm guessing.

This one will take the file "DATA" from your current directory, create a new file (in the same directory) named after the first column of each row, then fill that file with the data from the rest of the columns.

Meaning

IGHD2-15 1   0.696084    0.799736    0.818788

Creates a file called IGHD2-15 and puts this in it

1   0.696084    0.799736    0.818788

Script:

#!/bin/bash

while read -r line; do
        newFileName="$(echo "$line" | awk '{print $1}')"
        newFileData="$(echo "$line" | awk '{$1 = ""; sub(/^ +/, ""); print}')"
        echo "$newFileData" > "$newFileName"
done < DATA
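The same idea can be done without spawning two awk processes per line, by letting read itself split off the first field. A sketch under the same assumptions as the script above (input file named DATA, output files named after the first column, no .txt suffix); the heredoc only builds sample input:

```shell
# Sample input in the question's layout (two rows shown).
cat > DATA <<'EOF'
IGHD2-15 1 0.696084
IGHD3-22 0.696084 1
EOF

# read splits the first whitespace-separated field into "gene";
# "rest" keeps everything after it, so no per-line subshells are needed.
while read -r gene rest; do
        printf '%s\n' "$rest" > "$gene"
done < DATA
```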
Wayne
  • This is going to be very slow for a file this size. Also, as a general rule, using the shell for this sort of thing is a bad idea. – terdon Nov 29 '18 at 23:56
  • yea i like yours answer better. I didn't even know you could do that. I tried to make mine easy to understand how to change everything, like the file name and what data to include in the file – Wayne Nov 30 '18 at 00:00
  • 2
    Oh, don't worry, I used to do things like this too before I started hanging out here and the local gurus beat it out of me :) – terdon Nov 30 '18 at 00:01

I tried the method below and checked that it works.

Here each individual line is copied to a new file; the file name is the first column of that line.

cat data_file.txt
IGHD2-15 1   0.696084    0.799736    0.818788
IGHD3-22 0.696084    1   0.691419    0.67505
IGHD3-16 0.799736    0.691419    1   0.810656
IGHD3-10 0.818788    0.67505 0.810656    1


[root@praveen_linux_example dev]# j=`cat data_file.txt | wc -l`
[root@praveen_linux_example dev]# for ((z=1;z<=$j;z++));  do filename=`awk -v line="$z" 'NR==line{print $1}' data_file.txt`; sed -n ''$z'p' data_file.txt >$filename.txt;done
[root@praveen_linux_example dev]#