1

I have gene.csv file which is about 1.3 GB in size and has got 300 columns and more than a million rows. it looks like following

id1      id2    id3    id4         count1      count2
S1001    450    GAF    ARHGAP18    1.56E-05    1483
S1001    450    GAF    ARHGAP12    5E-05       3698
S1001    450    GAF    ARHGAP15    2.75E-06    93
S1001    450    GAF    ARHGAP17    3E-05       1889
S1001    450    GAF    ARHGAP19    4.291E-06   596
S1002    450    GAF    ARHGAP18    5.955E-05   5353
S1002    450    GAF    ARHGAP12    8.578E-08   14
S1002    450    BAF    ARHGAP15    2.91E-05    5381
S1002    450    BAF    ARHGAP17    1.78E-06    105
S1002    450    BAF    ARHGAP19    3.62E-05    5764
S1003    450    BAF    ARHGAP18    5.2697E-06  330
S1003    450    BAF    ARHGAP12    5.955E-05   2263
S1003    450    BAF    ARHGAP15    8.578E-08   3147
S1003    450    CAF    ARHGAP17    2.91E-05    50
S1003    450    CAF    ARHGAP19    5.955E-05   1595
S1004    450    CAF    ARHGAP18    8.578E-08   970
S1004    450    CAF    ARHGAP12    2.91E-05    816
S1004    450    CAF    ARHGAP15    5.955E-05   4981
S1004    450    CAF    ARHGAP17    8.578E-08   816
S1004    450    CAF    ARHGAP19    2.91E-05    4981

I want only id1, id4 and count2 data fetched in the following format (as a matrix)

id4        S1001   S1002 S1003  S1004
ARHGAP18    1483   5353  330    970
ARHGAP12    3698   14    2263   816
ARHGAP15    93     5381  3147   4981
ARHGAP17    1889   105   50     816
ARHGAP19    596    5764  1595   4981

Please note i need to be repeated only once and as a column name in a new file (Since id1 is repeated in each row for each gene). How can i get this in from simple bash command (matrix of gene names(rows) and sample ids (columns).

Rui F Ribeiro
  • 56,709
  • 26
  • 150
  • 232
David
  • 13
  • 3

2 Answers2

3


with Miller, using reshape

mlr --tsv cut -o -f id4,id1,count2 then reshape -s id1,count2 input.tsv

you will have

id4     S1001   S1002   S1003   S1004
ARHGAP18        1483    5353    330     970
ARHGAP12        3698    14      2263    816
ARHGAP15        93      5381    3147    4981
ARHGAP17        1889    105     50      816
ARHGAP19        596     5764    1595    4981

As input and output I have set a TSV file. Is you text file tab separated?

Some comment:

  • --tsv is to set input & output format;
  • cut to extract only id4,id1 and count2;
  • reshape to modify the structure.
aborruso
  • 2,855
  • it did not output as desired, in fact after reshape it showed only 3 columns. my command was mlr reshape -s icgc_specimen_id,raw_read_count miller_1.tsv – David Jan 14 '19 at 05:56
  • @David I think you are using another input file. Please could you share some rows here? – aborruso Jan 14 '19 at 09:00
  • I have made an example file with only min rows, but the original file is of 2.7 GBs.Please have a look at the following link, here I need only [https://drive.google.com/file/d/1joMO2D1RuppJBbepeEu3-lLA7LQZJeJR/view?usp=sharing] i need icgc_specimen_id as colmns, gene_id as row and raw_read_count column. Thanks a lot :) – David Jan 14 '19 at 13:54
  • @David there is no link – aborruso Jan 14 '19 at 14:00
  • hi @dave the command for this new file is mlr --tsv cut -o -f icgc_specimen_id,gene_id,raw_read_count then reshape -s gene_id,raw_read_count then unsparsify --fill-with "" exp_seq_example.csv >out.tsv. Here the output – aborruso Jan 14 '19 at 14:14
  • Thanks, can i get icgc_specimen_id as Column id instead of gene id? – David Jan 14 '19 at 14:17
  • Hi @David could you insert a link of an example of output you want? Because in my output icgc_specimen_id is a column – aborruso Jan 14 '19 at 15:41
  • Hi @David with mlr --tsv cut -o -f gene_id,icgc_specimen_id,raw_read_count then reshape -s icgc_specimen_id,raw_read_count then unsparsify --fill-with "" exp_seq_example.csv >tmp.tsv you have this. Is it right? – aborruso Jan 14 '19 at 16:16
  • 1
    Awesome! It worked! Thanks a lot for your kind help... You made my day! – David Jan 15 '19 at 04:46
  • @David how long is the processing of this big file? – aborruso Jan 15 '19 at 07:07
  • 1
    First, it crashed my system, Later I had split the file into 4 parts and the file was processed in half an hour. – David Jan 15 '19 at 13:18
0

What you are asking for is a cross-tabulation (or pivot table). You can achieve it using GNU Datamash

datamash --header-in --whitespace crosstab id4,id1 unique count2 < file
    S1001   S1002   S1003   S1004
ARHGAP12    3698    14  2263    816
ARHGAP15    93  5381    3147    4981
ARHGAP17    1889    105 50  816
ARHGAP18    1483    5353    330 970
ARHGAP19    596 5764    1595    4981

(you can skip the --whitespace if your data are tab delimited).

However there may be dedicated bioinformatics tools that are better suited to the task.

steeldriver
  • 81,074