replace string x by string y according to string z

Question

(EDITED) I have a file containing quite some problems. For example as below:

Chr1_RagTag_p   AUGUSTUS    transcript  393571  396143  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p   AUGUSTUS    exon    393571  393638  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "1";
Chr1_RagTag_p   AUGUSTUS    exon    393732  393945  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "2";
Chr1_RagTag_p   AUGUSTUS    exon    394047  394094  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "3";
Chr1_RagTag_p   AUGUSTUS    exon    394178  394259  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "4";
Chr1_RagTag_p   AUGUSTUS    exon    394457  394559  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "5";
Chr1_RagTag_p   AUGUSTUS    exon    394698  394818  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "6";
Chr1_RagTag_p   AUGUSTUS    exon    394911  394958  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "7";
Chr1_RagTag_p   AUGUSTUS    exon    395153  395236  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "8";
Chr1_RagTag_p   AUGUSTUS    exon    395347  395411  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "9";
Chr1_RagTag_p   AUGUSTUS    exon    395716  395767  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "10";
Chr1_RagTag_p   AUGUSTUS    exon    395957  395995  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "11";
Chr1_RagTag_p   AUGUSTUS    exon    396069  396143  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "12";
Chr1_RagTag_p   manual  transcript  396451  399224  .   +   .   transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p   manual  exon    396451  397570  .   +   .   transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "1";
Chr1_RagTag_p   manual  exon    397661  397848  .   +   .   transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "2";
Chr1_RagTag_p   manual  exon    397923  398146  .   +   .   transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "3";
Chr1_RagTag_p   manual  exon    398367  399224  .   +   .   transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "4";
Chr1_RagTag_p   AUGUSTUS    transcript  77905   78201   .   +   .   transcript_id "TCONS_00000004"; gene_id "XLOC_000004"; oId "g15.t1"; cmp_ref "AT1G01150.1"; class_code "x"; cmp_ref_gene "AT1G01150"; tss_id "TSS4"; num_samples "1";
Chr1_RagTag_p   AUGUSTUS    exon    77905   78201   .   +   .   transcript_id "TCONS_00000004"; gene_id "XLOC_000004"; exon_number "1";

You could see that clearly there are two genes in the example according to the gene_name, but somehow the software merges these two genes into one gene (gene_id). I would like to correct these problems by using the gene_name to replace the gene_id for line with the same transcript_name. Also, for the third gene (gene_id "XLOC_000004"), I would like to replace its name by the things after oId without the part .t[0-9]

output would be like this

Chr1_RagTag_p   AUGUSTUS    transcript  393571  396143  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p   AUGUSTUS    exon    393571  393638  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "1";
Chr1_RagTag_p   AUGUSTUS    exon    393732  393945  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "2";
Chr1_RagTag_p   AUGUSTUS    exon    394047  394094  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "3";
Chr1_RagTag_p   AUGUSTUS    exon    394178  394259  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "4";
Chr1_RagTag_p   AUGUSTUS    exon    394457  394559  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "5";
Chr1_RagTag_p   AUGUSTUS    exon    394698  394818  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "6";
Chr1_RagTag_p   AUGUSTUS    exon    394911  394958  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "7";
Chr1_RagTag_p   AUGUSTUS    exon    395153  395236  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "8";
Chr1_RagTag_p   AUGUSTUS    exon    395347  395411  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "9";
Chr1_RagTag_p   AUGUSTUS    exon    395716  395767  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "10";
Chr1_RagTag_p   AUGUSTUS    exon    395957  395995  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "11";
Chr1_RagTag_p   AUGUSTUS    exon    396069  396143  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "12";
Chr1_RagTag_p   manual  transcript  396451  399224  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p   manual  exon    396451  397570  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "1";
Chr1_RagTag_p   manual  exon    397661  397848  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "2";
Chr1_RagTag_p   manual  exon    397923  398146  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "3";
Chr1_RagTag_p   manual  exon    398367  399224  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "4";
Chr1_RagTag_p   AUGUSTUS    transcript  77905   78201   .   +   .   transcript_id "TCONS_00000004"; gene_id "g15"; oId "g15.t1"; cmp_ref "AT1G01150.1"; class_code "x"; cmp_ref_gene "AT1G01150"; tss_id "TSS4"; num_samples "1";
Chr1_RagTag_p   AUGUSTUS    exon    77905   78201   .   +   .   transcript_id "TCONS_00000004"; gene_id "g15"; exon_number "1";

I guess the logic would be grep the gene_name by transcript_id, and then replace the gene_id in every line with gene_name according to the transcript_id number.

So far I had created a list with transcript_id and gene_name

TCONS_00000070 AT1G02100
TCONS_00000071 AT1G02110
TCONS_00000004 g15

Then I need to replace the gene_id in each line by the related gene_name according to transcript_id. But I'm not sure how to do it. Can I use sed?

Last but not least, I should mention that my real file is huge, which contains 35000 different trnascript_id.

Thanks lot in advance!

IMO, your process will be simplified if you can figure out how to fix "somehow the software merges these two genes into one" rather than trying to transform the output. — glenn jackman, Apr 19 '22 at 19:31
I would love to do that, but I also don't know how to fix it... Do you know something about it? I use gffcompare to generate this file. — zzz, Apr 19 '22 at 19:36
You might want to try over at bioinformatics.stackexchange.com — steeldriver, Apr 19 '22 at 22:18
In your own answer you wrote I would like to adapt the command from Philippos above to cope with no gene_name, and also make it easier to print out the transcript_id for each replacement. Could you edit your question to show what you mean by that and adapt the example and desired output to cover those cases? Also, you should mention that it is a huge file, so it should be a one-pass solution (I hope mine was fast enough). — Philippos, Apr 20 '22 at 15:22
@Philippos I post the edited file. Thanks! Btw, I read your code explanation again, and this time I understand more! Thanks! Your explanation is pretty clear, but the last part of the sed command (the last point in your explanation) always cause me headache. Doesn't mean your explanation is not clear. It's just my own problem in understanding the expression. I will try to read several times more... — zzz, Apr 20 '22 at 18:14
I updated the answer according to the updated request. I also tried to illustrate the complex regular expression. — Philippos, Apr 20 '22 at 18:57
You refer to "two gene_names in your description, but I only see one gene_name--and that's "AT1G02110". I see 2 different gene_id; they are "XLOC_000060" and "XLOC_000004". You then refer to transcript_name but there are no elements that fit that description. Not sure what's going on, but the Sample Input you've posted no longer matches your textual description. — jubilatious1, Jun 16 '22 at 04:47

Philippos · Accepted Answer · 2022-04-20T18:56:53.537

I can't judge whether this will cover all edge cases, but for your example it does what you want:

sed '/.*gene_name/{h;s///;s/;.*//;x;};G;s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/' file

What it does is to extract the gene_name if present, store it in hold space and use it as replacement for all following gene_ids until a new gene_name shows up.

More verbose:

/.*gene_name/ is an address, so everything inside {} will only be applied for lines with that pattern
Before we mess up everything, we store the original line to the hold space
s/// deletes the previous pattern (everything until gene_name); s/;.*// removes everything starting from the semicolon. So what remains is the whitespace and the string in double quotes.
x exchanges both spaces, to now we have the replacement in hold space and the original line in pattern space
Everything from now on is applied to all lines: G appends the hold space to each line, so we have the line, a newline character and the replacement
s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/' is easier to write than to read: [^;]matches everything betweengene_idand the;, thus the part to be replaced. The (.)parts cover the text before and after the embedded newline, so we can refer to them as\1and\2` in the replacement.

To illustrate the last step see what the buffer will look like after appending the hold space with the gene_name with <CR> as the embedded newline:

Chr1_RagTag_p ………; gene_id "XLOC_000060"; exon_number "1";<CR> "AT1G02100"
                          \______v_____/\________v_______/    \____v____/
                   gene_id     [^;]*          \(.*\)       \n    \(.*\)

Maybe it's a little easier to read with extended regular expressions (option -E):

sed -E '/.*gene_name/{h;s///;s/;.*//;x;};G;s/(gene_id)[^;]*(.*)\n(.*)/\1\3\2/' file

Update considering the no gene_name case as in the updated question

I simply add the extraction of the oId similar to the gene_name extraction, but before it. So if there comes a gene_name afterwards, it will overwrite the oId. This time in separate lines for better readability:

sed '
  /.*oId/{
    h
    s///
    s/\..*/"/
    x
  }
  /.*gene_name/{
    h
    s///
    s/;.*//
    x
  }
  G
  s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/' file

+1 for extending your explanation to include the -E version of the regular expression. The change is minimal but illustrate well the way BRE and ERE syntax vary when applying postfix operators, i.e. when group syntax goes from \(..\) to (...). I think such complete explanations are all too rare and deserve more appreciation than they get. — Cbhihe, Apr 21 '22 at 06:58

Ed Morton · Answer 2 · 2022-04-20T12:17:18.620

Using any awk in any shell on every Unix box:

$ awk '
    match($0,/gene_name "[^"]+"/) {
        name = substr($0,RSTART+10,RLENGTH-10)
    }
    match($0,/gene_id "[^"]+"/) {
        $0 = substr($0,1,RSTART+7) name substr($0,RSTART+RLENGTH)
    }
    { print }
' file
Chr1_RagTag_p   AUGUSTUS    transcript  393571  396143  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p   AUGUSTUS    exon    393571  393638  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "1";
Chr1_RagTag_p   AUGUSTUS    exon    393732  393945  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "2";
Chr1_RagTag_p   AUGUSTUS    exon    394047  394094  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "3";
Chr1_RagTag_p   AUGUSTUS    exon    394178  394259  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "4";
Chr1_RagTag_p   AUGUSTUS    exon    394457  394559  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "5";
Chr1_RagTag_p   AUGUSTUS    exon    394698  394818  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "6";
Chr1_RagTag_p   AUGUSTUS    exon    394911  394958  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "7";
Chr1_RagTag_p   AUGUSTUS    exon    395153  395236  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "8";
Chr1_RagTag_p   AUGUSTUS    exon    395347  395411  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "9";
Chr1_RagTag_p   AUGUSTUS    exon    395716  395767  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "10";
Chr1_RagTag_p   AUGUSTUS    exon    395957  395995  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "11";
Chr1_RagTag_p   AUGUSTUS    exon    396069  396143  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "12";
Chr1_RagTag_p   manual  transcript  396451  399224  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p   manual  exon    396451  397570  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "1";
Chr1_RagTag_p   manual  exon    397661  397848  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "2";
Chr1_RagTag_p   manual  exon    397923  398146  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "3";
Chr1_RagTag_p   manual  exon    398367  399224  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "4";

Thanks @Ed Morton! It looks super, and I will go and search about match function in awk. Thanks a lot! — zzz, Apr 20 '22 at 18:20
+1: your solution is the better one, for being generalizable, Ed. No doubt about it, even though it "looks" more complicated. ;-) . — Cbhihe, Apr 20 '22 at 20:43
@Cbhihe thanks but it may be over-engineered for what the OP needs and then I'd argue that yours is the better one. All depends what the input might contain now and in future - it's good that the OP has options! — Ed Morton, Apr 20 '22 at 21:12

Cbhihe · Answer 3 · 2022-04-20T20:36:32.393

Or,again using gawk (GNU Awk), a simpler treatment would be:

$ awk '/gene_name/ {gene_id=$14; $12=$14; print; next;} {$12=gene_id; print;}' file

or, even simpler, as suggested by @EdMorton:

$ awk '/gene_name/ {gene_id=$14} {$12=gene_id; print}' file
Chr1_RagTag_p AUGUSTUS transcript 393571 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 393571 393638 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "1";
Chr1_RagTag_p AUGUSTUS exon 393732 393945 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "2";
Chr1_RagTag_p AUGUSTUS exon 394047 394094 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "3";
Chr1_RagTag_p AUGUSTUS exon 394178 394259 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "4";
Chr1_RagTag_p AUGUSTUS exon 394457 394559 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "5";
Chr1_RagTag_p AUGUSTUS exon 394698 394818 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "6";
Chr1_RagTag_p AUGUSTUS exon 394911 394958 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "7";
Chr1_RagTag_p AUGUSTUS exon 395153 395236 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "8";
Chr1_RagTag_p AUGUSTUS exon 395347 395411 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "9";
Chr1_RagTag_p AUGUSTUS exon 395716 395767 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "10";
Chr1_RagTag_p AUGUSTUS exon 395957 395995 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "11";
Chr1_RagTag_p AUGUSTUS exon 396069 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "12";
Chr1_RagTag_p manual transcript 396451 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p manual exon 396451 397570 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "1";
Chr1_RagTag_p manual exon 397661 397848 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "2";
Chr1_RagTag_p manual exon 397923 398146 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "3";
Chr1_RagTag_p manual exon 398367 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "4";

Note:
I don't think that solution can get any simpler,although in its present form, it excludes the possibility of having any current field with one or more blank spaces. For that @EdMorton's solution is needed.

Thanks a lot @Cbhihe! Finally I find the code I can understand! — zzz, Apr 20 '22 at 18:17

frabjous · Answer 4 · 2022-04-20T14:55:44.890

Sometimes it's best to actually write a little program rather than trying to do everything with grep, awk and sed. (Though I'm sure some awk-genius will swoop in with a three line solution any minute or so.)

Bash wouldn't be my first choice for this, but it's the only thing you're more or less guaranteed to have on a modern Linux/Unix system, so let's see what we can do.

Here's a script. Save it as gene_name_fixer.sh and make it executable chmod a+x gene_name_fixer.sh and then run it as ./gene_name_fixer.sh inputfile.txt.

The output should be more or less what yours is, and you can redirect it to a file if you'd like:

gene_name_fixer.sh inputfile.txt > outputfile.txt

#!/bin/bash
inputfile="$1"
turn the file into an array of lines
readarray -t inputlines < "$inputfile"
declare an array for collecting transcript_id => gene_name pairs
declare -A genenames
populate this array by parsing those lines that contain both 'gene_name'
and 'transcript_id'
start with all lines
for line in "${inputlines[@]}" ; do
    # ignore lines that don't contain both "transcript_id" and 
    # "gene_name"
    if [[ ! "$line" =~ "transcript_id &quot;" ]] ||
       [[ ! "$line" =~ "gene_name &quot;" ]] ; then
        continue
    fi
    # get the transcript_id
    # first cut off everything up to including 'transcript id "'
    transcriptid="${line#transcript_id &quot;}"
    # then cut off everything after the first double quote
    transcriptid="${transcriptid%%&quot;}"
# get the gene_name
# first cut off everything up to an including 'gene_name &quot;'
genename=&quot;${line#*gene_name \&quot;}&quot;
# then cut off everything after the first double quote
genename=&quot;${genename%%\&quot;*}&quot;

# add this pair to our associative array if it's not already defined
if [[ -z &quot;${genenames[&quot;$transcriptid&quot;]}&quot; ]] ; then
    genenames[&quot;$transcriptid&quot;]=&quot;$genename&quot;
fi

done
now that we've populated our associative array we can process the
other lines of the file
for line in "${inputlines[@]}" ; do
    # make sure line has "gene_id" and "transcript_id" in it
    if [[ "$line" =~ "gene_id &quot;" ]] && 
       [[ "$line" =~ "transcript_id &quot;" ]] ; then
        # read transcript_id for the line; done like in previous loop
        # first cut off everything up to including 'transcript id "'
        transcriptid="${line#transcript_id &quot;}"
        # then cut off everything after the first double quote
        transcriptid="${transcriptid%%&quot;}"
    # ensure our associative array has a value for this transcript_id
    genename=&quot;${genenames[&quot;$transcriptid&quot;]}&quot;
    if [[ -n &quot;$genename&quot; ]] ; then
        # output everything up to and including 'gene_id \&quot;'
        echo -n &quot;${line%gene_id \&quot;*}gene_id \&quot;&quot;
        #echo
        # output the genename
        echo -n &quot;$genename&quot;
        # determine remainder of line; first cut off up to
        # including 'gene_id &quot;'
        remainder=&quot;${line#*gene_id \&quot;}&quot;
        # then cut off the gene_id itself
        remainder=&quot;${remainder#*\&quot;}&quot;
        # echo rest of line
        echo '&quot;'&quot;$remainder&quot;
    else
        # no genename is set for this transcript_id
        # exit with an error
        echo &quot;ERROR: No gene_name set for transcript_id $transcriptid&quot; &gt;&amp;2
        exit 1
    fi
else
    # if it doesn't have both &quot;gene_id&quot; and &quot;transcript_id&quot; in it,
    # echo it as is
    echo &quot;$line&quot;
fi

done

Basically this reads the file into an array of lines; first goes through those lines to pair up transcript_id's with gene_names, then goes through them again and does the substitutions.

This is not very robust. Expect problems if there are lines with multiple gene_ids, gene_names or transcript_ids, or if they're missing quotation marks, or... well, a lot of things could go wrong if your files aren't exactly like the sample you've shown. You may need to tweak things, but maybe this will get you started.

Far better in the long run to sort out the messed up output from gffcompare, but if you have data already in this format, maybe something like this can help?

If your files are huge, it's probably better to read the lines from the files one by one rather than load the whole file into an array in RAM; I can probably help with that if need be.

Edit: Here's a version that doesn't load the whole file into an array first, but grabs the lines from the file one by one when needed. If your speed issue is related to running out of ram and needing swap it might help, if you really need speed, you'd need to rewrite it in a real, faster, language, which, sorry, I'm not going to do for you. Besides you have other, possibly better answers.

#!/bin/bash
inputfile="$1"
get the line count
linecount="$(wc -l < "$inputfile")"
readarray -t inputlines < "$inputfile"
declare an array for collecting transcript_id => gene_name pairs
declare -A genenames
populate this array by parsing those lines that contain both 'gene_name'
and 'transcript_id'
start with all lines
for ((i=1 ; i < $linecount ; i++)) ; do
    line="$(sed -n "$i"p "$inputfile")"
    # ignore lines that don't contain both "transcript_id" and 
    # "gene_name"
    if [[ ! "$line" =~ "transcript_id &quot;" ]] ||
       [[ ! "$line" =~ "gene_name &quot;" ]] ; then
        continue
    fi
    # get the transcript_id
    # first cut off everything up to including 'transcript id "'
    transcriptid="${line#transcript_id &quot;}"
    # then cut off everything after the first double quote
    transcriptid="${transcriptid%%&quot;}"
# get the gene_name
# first cut off everything up to an including 'gene_name &quot;'
genename=&quot;${line#*gene_name \&quot;}&quot;
# then cut off everything after the first double quote
genename=&quot;${genename%%\&quot;*}&quot;

# add this pair to our associative array if it's not already defined
if [[ -z &quot;${genenames[&quot;$transcriptid&quot;]}&quot; ]] ; then
    genenames[&quot;$transcriptid&quot;]=&quot;$genename&quot;
fi

done
now that we've populated our associative array we can process the
other lines of the fil
for ((i=1 ; i < $linecount ; i++)) ; do
    line="$(sed -n "$i"p "$inputfile")"
    # ignore lines that don't contain both "transcript_id" and
make sure line has "gene_id" and "transcript_id" in it
if [[ &quot;$line&quot; =~ &quot;gene_id \&quot;&quot; ]] &amp;&amp; 
   [[ &quot;$line&quot; =~ &quot;transcript_id \&quot;&quot; ]] ; then
    # read transcript_id for the line; done like in previous loop
    # first cut off everything up to including 'transcript id &quot;'
    transcriptid=&quot;${line#*transcript_id \&quot;}&quot;
    # then cut off everything after the first double quote
    transcriptid=&quot;${transcriptid%%\&quot;*}&quot;

    # ensure our associative array has a value for this transcript_id
    genename=&quot;${genenames[&quot;$transcriptid&quot;]}&quot;
    if [[ -n &quot;$genename&quot; ]] ; then
        # output everything up to and including 'gene_id \&quot;'
        echo -n &quot;${line%gene_id \&quot;*}gene_id \&quot;&quot;
        #echo
        # output the genename
        echo -n &quot;$genename&quot;
        # determine remainder of line; first cut off up to
        # including 'gene_id &quot;'
        remainder=&quot;${line#*gene_id \&quot;}&quot;
        # then cut off the gene_id itself
        remainder=&quot;${remainder#*\&quot;}&quot;
        # echo rest of line
        echo '&quot;'&quot;$remainder&quot;
    else
        # no genename is set for this transcript_id
        # exit with an error
        echo &quot;ERROR: No gene_name set for transcript_id $transcriptid&quot; &gt;&amp;2
        exit 1
    fi
else
    # if it doesn't have both &quot;gene_id&quot; and &quot;transcript_id&quot; in it,
    # echo it as is
    echo &quot;$line&quot;
fi

done

Hi, thanks for the script! I forgot to mention in the file that the transcription_id is unique and every line must have this id. So probably this makes the script a bit simpler. But I would like to ask you is: how can I make the script work faster? My whole file has 35000 transcripts. So far when I tried the script, it's around 1s per transcript, which means I would need to wait for 9 hours to get this renaming finish... — zzz, Apr 20 '22 at 11:49
The issue with the transcript_id on every line won't really make a difference I think. Using bash for this kind of job just isn't ideal. I've added a version that doesn't load everything into RAM first, but without a large file to test with, I have no clue if it'll be faster. I'd use a real programming language if you needed speed. But some of the other answers are likely faster as well. — frabjous, Apr 20 '22 at 14:57

zzz · Answer 5 · 2022-04-20T12:20:09.877

thank you all so much for scripting for me! I have been pondering last night, and thought of the following way to do this.

Let's call my input file with above mentioned info as sample.gtf

I created a file with the transcript_id, gene_id, and gene_name:

$ grep "oId" sample.gtf | awk '{print $10,$12,$14}' | sed 's/\"//g' | sed 's/\;//g' | sed '
s/\.[^\.]*$/ /g' > gene_id_replace_list.txt
# here I use `oId` because I found that some of the lines has no `gene_name`. In a case of no `gene_name`, I would like to grab `oId` as the a `gene_name` for replacement of `gene_id`.
$ head gene_id_replace_list.txt
TCONS_00000070 XLOC_000060 AT1G02100
TCONS_00000071 XLOC_000060 AT1G02110

and then I wrote a loop to sed line by line, which sed command is more or less similar to what Phillipos posted above.

while IFS= read -r line; do
TransId=$(echo $line | awk '{print $1}')
GeneId=$(echo $line | awk '{print $2}')
TairId=$(echo $line | awk '{print $3}')
sed -i "/${TransId}/ s/${GeneId}/${TairId}/g" sample.gtf
done < gene_id_replace_list.txt

However, I'm not so satisfied with this code, because my entire file has 35000 transcripts, and so far this script runs nearly 1 second per transcript. So does anyone have better suggestions?

I would like to adapt the command from Philippos above to cope with no gene_name, and also make it easier to print out the transcript_id for each replacement (I have to see what the code replace in the end to be sure, sorry). Probably one line sed command similar to what Philippos proposed is much faster?

See why-is-using-a-shell-loop-to-process-text-considered-bad-practice — Ed Morton, Apr 20 '22 at 12:12
You may want to rewrite the script in a language that can be easier profiled if you're supposed to crunch a lot of data performance is a concern. Python might be a good option for that. — undercat, Apr 20 '22 at 14:20
General comment: practically every time you see 2 out of {sed, awk, grep} used together with pipes, something fishy is afoot from a scripting standpoint. It is not that those 3 utilities are incompatible or should never be used together, rather they somewhat overlap in their functionality and are rarely needed together. Awk in particular can replicate grep and sed functionalities and has more features for advanced text processing. It is not to say it is "better", just that each is used differently and rarely requires another. Two solutions are pure Awk and one is pure sed. HTH. — Cbhihe, Apr 20 '22 at 15:29

replace string x by string y according to string z

5 Answers5

turn the file into an array of lines

declare an array for collecting transcript_id => gene_name pairs

populate this array by parsing those lines that contain both 'gene_name'

and 'transcript_id'

start with all lines

now that we've populated our associative array we can process the

other lines of the file

get the line count

declare an array for collecting transcript_id => gene_name pairs

populate this array by parsing those lines that contain both 'gene_name'

and 'transcript_id'

start with all lines

now that we've populated our associative array we can process the

other lines of the fil

make sure line has "gene_id" and "transcript_id" in it