1

I need to do something very similar to "Replace string with sequential index", but instead of appending a number to a column, I want to replace a whole column with incrementing numbers. Like this:

0   0   chr1    3000575 3000801 0   chr1    4340023 4340249 32  32  
0   0   chr1    3000641 3000801 -1  chr1    3311943 3311783 32  32  
0   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32  
0   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32  
0   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32

becomes

0   0   chr1    3000575 3000801 0   chr1    4340023 4340249 32  32  
1   0   chr1    3000641 3000801 -1  chr1    3311943 3311783 32  32  
2   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32  
3   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32  
4   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32

(I don't care whether it starts with 0 or 1)

I feel very stupid, but I can't adjust the solution from that question to my case...

Phlya
  • 113

3 Answers

3

To number lines, you may use nl. To remove columns (or rather, to select the ones you want to keep), you may use cut:

$ cut -f 2- cols.txt | nl
     1  0       chr1    3000575 3000801 0       chr1    4340023 4340249 32      32
     2  0       chr1    3000641 3000801 -1      chr1    3311943 3311783 32      32
     3  0       chr1    3000674 3000801 -1      chr1    3001534 3001407 32      32
     4  0       chr1    3000674 3000801 -1      chr1    3001534 3001407 32      32
     5  0       chr1    3000674 3000801 -1      chr1    3001534 3001407 32      32

The only annoying thing with nl is that it pads the line number with spaces at the start of the line (by default, the line number field is 6 characters wide). Lowering the width is not a safe fix, because some nl implementations truncate numbers that do not fit in the field, while others, such as GNU nl, simply let the field grow. We may get rid of the padding like so:

$ cut -f 2- cols.txt | nl | sed 's/^ *//'
1       0       chr1    3000575 3000801 0       chr1    4340023 4340249 32      32
2       0       chr1    3000641 3000801 -1      chr1    3311943 3311783 32      32
3       0       chr1    3000674 3000801 -1      chr1    3001534 3001407 32      32
4       0       chr1    3000674 3000801 -1      chr1    3001534 3001407 32      32
5       0       chr1    3000674 3000801 -1      chr1    3001534 3001407 32      32

The cut utility takes a list of the fields (columns) that you want to extract from the input. In our case that's column 2 onwards (-f 2-). Since your data is tab-delimited, and tab is cut's default delimiter, no further options are needed. For other delimiters, tell cut which one to use with -d.

The sed command will simply substitute those spaces at the start of the line from nl with nothing.
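As a small sketch of two variations on the pipeline above: the question does not mind whether numbering starts at 0 or 1, and nl's -v option sets the starting number; for input that is not tab-delimited, cut's -d can be paired with nl's -s so the number is followed by the same delimiter. The sample rows and the variable names (tsv, csv, zero_based, comma) are made up for illustration only.

```shell
# Hypothetical sample data: two tab-delimited rows, and the same rows comma-delimited.
tsv=$(printf '0\t0\tchr1\t3000575\n0\t0\tchr1\t3000641\n')
csv=$(printf '0,0,chr1,3000575\n0,0,chr1,3000641\n')

# Tab-delimited input, numbering from 0 instead of 1 (-v sets the first number).
zero_based=$(printf '%s\n' "$tsv" | cut -f 2- | nl -v 0 | sed 's/^ *//')

# Comma-delimited input: cut gets the delimiter via -d, and nl is told to
# put a comma (rather than its default tab) after the number via -s.
comma=$(printf '%s\n' "$csv" | cut -d , -f 2- | nl -s , | sed 's/^ *//')

printf '%s\n' "$zero_based" "$comma"
```

Because the sed step strips the padding afterwards, the width that nl uses for the number field only matters on implementations that truncate.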

Kusalananda
  • 333,661
  • I compared the time these two commands took, and although user time was similar, real time was smaller here, so accepted, thanks! – Phlya Jun 25 '16 at 18:19
  • @Phlya awk is actually too big a beast for doing simple things sometimes. – Kusalananda Jun 25 '16 at 18:20
  • I know it is, I am just not familiar with it unfortunately! – Phlya Jun 25 '16 at 18:21
  • @Phlya forgot to mention that the output is all still tab-delimited. nl puts a tab after the line number by default. – Kusalananda Jun 25 '16 at 18:22
  • @Phlya You're working on genomic data. If the line numbers get too big, increase the width for nl with nl -w 10. – Kusalananda Jun 25 '16 at 18:26
  • I am, there are ~200-300 mln lines. What do you mean by too big? – Phlya Jun 25 '16 at 18:31
  • @Phlya nl uses 6 characters for the line numbers and will truncate them if they get bigger than '999999'. With nl -w 10 you would allow for line numbers ranging from 1 to 9999999999. – Kusalananda Jun 25 '16 at 18:38
  • Maybe on my system the default is different, but I get longer numbers without this parameter, but thanks for the explanation! – Phlya Jun 25 '16 at 18:41
3

With awk

$ awk '{$1=FNR-1; print}' OFS='\t' file
0   0   chr1    3000575 3000801 0   chr1    4340023 4340249 32  32
1   0   chr1    3000641 3000801 -1  chr1    3311943 3311783 32  32
2   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32
3   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32
4   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32
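
One thing to be aware of with the command above: with awk's default field splitting, any run of whitespace separates fields, so an empty column would collapse and shift the remaining columns left. A minimal sketch of a stricter variant, assuming the data is purely tab-delimited, sets both FS and OFS to a tab. The sample row and the variable names (row, loose, strict) are hypothetical.

```shell
# Hypothetical input: one row whose second column is empty.
row=$(printf '9\t\tchr1\t3000575\n')

# Default splitting on runs of whitespace: the empty column is lost.
loose=$(printf '%s\n' "$row" | awk '{$1=FNR-1; print}' OFS='\t')

# Splitting strictly on tabs: the empty column is preserved.
strict=$(printf '%s\n' "$row" | awk 'BEGIN { FS = OFS = "\t" } { $1 = FNR - 1; print }')

printf '%s\n' "$loose" "$strict"
```

For the data shown in the question, which has no empty columns, both forms give the same result.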
steeldriver
  • 81,074
  • Seems like the other answer is a little faster, so accepted it, but thanks for your answer! – Phlya Jun 25 '16 at 18:20
2

With ed (using a literal tab, typed as Ctrl-V followed by Tab, in the substitution):

$ ed -s file << EOF
,s/0    //
,n
q
EOF

1   0   chr1    3000575 3000801 0   chr1    4340023 4340249 32  32  
2   0   chr1    3000641 3000801 -1  chr1    3311943 3311783 32  32  
3   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32  
4   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32  
5   0   chr1    3000674 3000801 -1  chr1    3001534 3001407 32  32

The (.,.)n command prints the addressed lines, preceding each line by its line number and a tab - perfect for your tab-delimited format.

steeldriver
  • 81,074