(EDITED) I have a file containing quite some problems. For example as below:
Chr1_RagTag_p AUGUSTUS transcript 393571 396143 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 393571 393638 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "1";
Chr1_RagTag_p AUGUSTUS exon 393732 393945 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "2";
Chr1_RagTag_p AUGUSTUS exon 394047 394094 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "3";
Chr1_RagTag_p AUGUSTUS exon 394178 394259 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "4";
Chr1_RagTag_p AUGUSTUS exon 394457 394559 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "5";
Chr1_RagTag_p AUGUSTUS exon 394698 394818 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "6";
Chr1_RagTag_p AUGUSTUS exon 394911 394958 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "7";
Chr1_RagTag_p AUGUSTUS exon 395153 395236 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "8";
Chr1_RagTag_p AUGUSTUS exon 395347 395411 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "9";
Chr1_RagTag_p AUGUSTUS exon 395716 395767 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "10";
Chr1_RagTag_p AUGUSTUS exon 395957 395995 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "11";
Chr1_RagTag_p AUGUSTUS exon 396069 396143 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "12";
Chr1_RagTag_p manual transcript 396451 399224 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p manual exon 396451 397570 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "1";
Chr1_RagTag_p manual exon 397661 397848 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "2";
Chr1_RagTag_p manual exon 397923 398146 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "3";
Chr1_RagTag_p manual exon 398367 399224 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "4";
Chr1_RagTag_p AUGUSTUS transcript 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "XLOC_000004"; oId "g15.t1"; cmp_ref "AT1G01150.1"; class_code "x"; cmp_ref_gene "AT1G01150"; tss_id "TSS4"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "XLOC_000004"; exon_number "1";
You could see that clearly there are two genes in the example according to the gene_name
, but somehow the software merges these two genes into one gene (gene_id
). I would like to correct these problems by using the gene_name
to replace the gene_id
for line with the same transcript_name
. Also, for the third gene (gene_id "XLOC_000004"
), I would like to replace its name by the things after oId
without the part .t[0-9]
output would be like this
Chr1_RagTag_p AUGUSTUS transcript 393571 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 393571 393638 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "1";
Chr1_RagTag_p AUGUSTUS exon 393732 393945 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "2";
Chr1_RagTag_p AUGUSTUS exon 394047 394094 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "3";
Chr1_RagTag_p AUGUSTUS exon 394178 394259 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "4";
Chr1_RagTag_p AUGUSTUS exon 394457 394559 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "5";
Chr1_RagTag_p AUGUSTUS exon 394698 394818 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "6";
Chr1_RagTag_p AUGUSTUS exon 394911 394958 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "7";
Chr1_RagTag_p AUGUSTUS exon 395153 395236 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "8";
Chr1_RagTag_p AUGUSTUS exon 395347 395411 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "9";
Chr1_RagTag_p AUGUSTUS exon 395716 395767 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "10";
Chr1_RagTag_p AUGUSTUS exon 395957 395995 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "11";
Chr1_RagTag_p AUGUSTUS exon 396069 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "12";
Chr1_RagTag_p manual transcript 396451 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p manual exon 396451 397570 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "1";
Chr1_RagTag_p manual exon 397661 397848 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "2";
Chr1_RagTag_p manual exon 397923 398146 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "3";
Chr1_RagTag_p manual exon 398367 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "4";
Chr1_RagTag_p AUGUSTUS transcript 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "g15"; oId "g15.t1"; cmp_ref "AT1G01150.1"; class_code "x"; cmp_ref_gene "AT1G01150"; tss_id "TSS4"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "g15"; exon_number "1";
I guess the logic would be grep the gene_name
by transcript_id
, and then replace the gene_id
in every line with gene_name
according to the transcript_id
number.
So far I had created a list with transcript_id
and gene_name
TCONS_00000070 AT1G02100
TCONS_00000071 AT1G02110
TCONS_00000004 g15
Then I need to replace the gene_id
in each line by the related gene_name
according to transcript_id
. But I'm not sure how to do it. Can I use sed
?
Last but not least, I should mention that my real file is huge, which contains 35000 different trnascript_id
.
Thanks lot in advance!
gffcompare
to generate this file. – zzz Apr 19 '22 at 19:36gene_name
s in your description, but I only see onegene_name
--and that's"AT1G02110"
. I see 2 differentgene_id
; they are"XLOC_000060"
and"XLOC_000004"
. You then refer totranscript_name
but there are no elements that fit that description. Not sure what's going on, but the Sample Input you've posted no longer matches your textual description. – jubilatious1 Jun 16 '22 at 04:47