How to replace a string by using part of the string in the same line?

Question

I have a file like the following on a Linux machine, and the gene_id is wrong. So I would like to replace the string "PB" in the gene_id with the transcript_id from the same line.

$ cat try.gff
transcript  30351   32332   .   +   .   gene_id "PB"; transcript_id "PB.1.66";
exon    30351   31677   .   +   .   gene_id "PB"; transcript_id "PB.1.59";
exon    31758   31871   .   +   .   gene_id "PB"; transcript_id "PB.1.40";
exon    31968   32178   .   +   .   gene_id "PB"; transcript_id "PB.1.30";
exon    32257   32332   .   +   .   gene_id "PB"; transcript_id "PB.1.20";
transcript  30351   32332   .   +   .   gene_id "PB"; transcript_id "PB.28.309";
exon    30351   31677   .   +   .   gene_id "PB"; transcript_id "PB.58.900";
exon    31758   31871   .   +   .   gene_id "PB"; transcript_id "PB.10000.1001";
exon    31968   32178   .   +   .   gene_id "PB"; transcript_id "PB.19897.1087541";
exon    32257   32332   .   +   .   gene_id "PB"; transcript_id "PB.1.11";

Expected result

transcript  30351   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.66";
exon    30351   31677   .   +   .   gene_id "PB.1"; transcript_id "PB.1.59";
exon    31758   31871   .   +   .   gene_id "PB.1"; transcript_id "PB.1.40";
exon    31968   32178   .   +   .   gene_id "PB.1"; transcript_id "PB.1.30";
exon    32257   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.20";
transcript  30351   32332   .   +   .   gene_id "PB.28"; transcript_id "PB.28.309";
exon    30351   31677   .   +   .   gene_id "PB.58"; transcript_id "PB.58.900";
exon    31758   31871   .   +   .   gene_id "PB.10000"; transcript_id "PB.10000.1001";
exon    31968   32178   .   +   .   gene_id "PB.19897"; transcript_id "PB.19897.1087541";
exon    32257   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.11";

I managed to get the right text for replacement with the code awk -F";" '{gsub(" transcript_id ","");print $2}' try.gff | sed 's/"//g' | cut -d '.' -f 1,2

PB.1
PB.1
PB.1
PB.1
PB.1
PB.28
PB.58
PB.10000
PB.19897
PB.1

How can I replace the "PB" in the gene_id part?

Is the file you have shown your complete, actual file (I assume not) or a really representative example? Specifically, will the transcript_id always have two dots? Will there ever be anything on the line after the transcript_id? Will the gene_id ever not match the first two letters of the transcript_id (e.g., gene_id "AB"; transcript_id "CD.1.59";? If so, what would you want to happen? — G-Man Says 'Reinstate Monica', Mar 16 '22 at 16:18
What kind of transcript IDs are those? I've never seen these identifiers before and can't find them on NCBI. — terdon, Mar 17 '22 at 10:18

Ed Morton · Accepted Answer · 2022-03-16T15:49:47.703

Using any sed in any shell on every Unix box:

$ sed 's/\(.*gene_id "\)[^"]*\(.*"\([^.]*\.[^.]*\).*\)/\1\3\2/' try.gff
transcript  30351   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.66";
exon    30351   31677   .   +   .   gene_id "PB.1"; transcript_id "PB.1.59";
exon    31758   31871   .   +   .   gene_id "PB.1"; transcript_id "PB.1.40";
exon    31968   32178   .   +   .   gene_id "PB.1"; transcript_id "PB.1.30";
exon    32257   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.20";
transcript  30351   32332   .   +   .   gene_id "PB.28"; transcript_id "PB.28.309";
exon    30351   31677   .   +   .   gene_id "PB.58"; transcript_id "PB.58.900";
exon    31758   31871   .   +   .   gene_id "PB.10000"; transcript_id "PB.10000.1001";
exon    31968   32178   .   +   .   gene_id "PB.19897"; transcript_id "PB.19897.1087541";
exon    32257   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.11";

In the regexp part every $ starts a capture group that's terminated by the matching $. In the replacement part every \<digit> refers to the string that matched the regexp segment within the capture group. So \1 refers to the string that matched $.*gene_id "$ and so on. The rest is just basic regexp segments (e.g. .* means 0 or more repetitions of any character, [^"]* means zero or more reps of any char except ", and \. means a literal .).

Thanks a lot! Could you please explain a bit the command? I was trying to search how to write those complicated text in the sed command, but I didn't find the explanation... — zzz, Mar 16 '22 at 15:16
You're welcome. I added an explanation to my answer. Let me know if you have any specific questions after reading it. — Ed Morton, Mar 16 '22 at 15:46

score 0 · Answer 2 · answered Mar 16 '22 at 17:25

Using sed

$ sed 's/"[A-Z]\+\("[^"]*\(.*\)\..*\)/\2\1/' input_file
transcript  30351   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.66";
exon    30351   31677   .   +   .   gene_id "PB.1"; transcript_id "PB.1.59";
exon    31758   31871   .   +   .   gene_id "PB.1"; transcript_id "PB.1.40";
exon    31968   32178   .   +   .   gene_id "PB.1"; transcript_id "PB.1.30";
exon    32257   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.20";
transcript  30351   32332   .   +   .   gene_id "PB.28"; transcript_id "PB.28.309";
exon    30351   31677   .   +   .   gene_id "PB.58"; transcript_id "PB.58.900";
exon    31758   31871   .   +   .   gene_id "PB.10000"; transcript_id "PB.10000.1001";
exon    31968   32178   .   +   .   gene_id "PB.19897"; transcript_id "PB.19897.1087541";
exon    32257   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.11";

jubilatious1 · Answer 3 · 2022-06-18T06:11:22.123

Using Raku (formerly known as Perl_6)

The simple way (assumes TSV file like gff format):

raku -e 'my @a = lines>>.split("\t").flat; @a[7,17,27...*] = @a[9,19,29...*];    \
         @a[7,17,27...*].=comb(/ <alpha>+ \. <digit>+ /).=map(q["] ~ * ~ q["]);  \
         .put for @a.rotor(10)>>.join("\t");'

The code above simply splits each line on \t tabs, and flattens elements to make a completely linear array (10 rows x 10 columns = 100 elements). Each 8th-column gene_id (zero-index = 7) is overwritten by the 10th-column transcript_id (zero-index = 9). Raku figures out the remainder of the indexing sequence, provided you give it enough hints.

Once the full transcript_id has been copied, it's simple enough to comb out <alpha>+ \. \d+ the desired characters, q["] ~ * ~ q["] wrap them in quotation marks, and rotor each 10 elements to reconstruct the 10 x 10 table.

The purely regex way (harder):

~$ raku -pe 's/ <?after ^ .+? gene_id \s \" > (.+?)  (\"\; \s transcript_id \s \")  (<alpha>+ \. \d+) <?before \.> /$2$1$2/;' try.gff

OR

~$ raku -pe 's/  <?after ^ .+? gene_id \s \c[QUOTATION MARK] > (.+?)  (\c[QUOTATION MARK] \; \s transcript_id \s \")  (<alpha>+ \. \d+) <?before \.> /$2$1$2/;' try.gff

Here Raku's -pe autoprinting linewise flags are used, in conjunction with the familiar s/// substitution operator. Five regex atoms are searched for and three are captured with parentheses:

a zerowidth positive look-behind <?after … > targeting the ^ start-of-string up to gene_id \s \" (includes trailing whitespace and quotation mark),
[captured as $0] a non-greedy .+? any-character one-or-more times which is the to-be-discarded gene_id itself,
[captured as $1] the text comprising transcript_id section of the line (including surrounding whitespace and punctuation),
[captured as $2] the desired <alpha>+ \. \d+ portion of the transcript_id, then
a zerowidth positive look-ahead <?before … >, which stops the preceding capture before the second \. dot.

Finally in the replacement, the captured $2 is used to replace the $0 capture. The $1 and $2 captures are restored to their normal positions.

Sample Input:

transcript  30351   32332   .   +   .   gene_id "PB"; transcript_id "PB.1.66";
exon    30351   31677   .   +   .   gene_id "PB"; transcript_id "PB.1.59";
exon    31758   31871   .   +   .   gene_id "PB"; transcript_id "PB.1.40";
exon    31968   32178   .   +   .   gene_id "PB"; transcript_id "PB.1.30";
exon    32257   32332   .   +   .   gene_id "PB"; transcript_id "PB.1.20";
transcript  30351   32332   .   +   .   gene_id "PB"; transcript_id "PB.28.309";
exon    30351   31677   .   +   .   gene_id "PB"; transcript_id "PB.58.900";
exon    31758   31871   .   +   .   gene_id "PB"; transcript_id "PB.10000.1001";
exon    31968   32178   .   +   .   gene_id "PB"; transcript_id "PB.19897.1087541";
exon    32257   32332   .   +   .   gene_id "PB"; transcript_id "PB.1.11";

Sample Output:

transcript  30351   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.66";
exon    30351   31677   .   +   .   gene_id "PB.1"; transcript_id "PB.1.59";
exon    31758   31871   .   +   .   gene_id "PB.1"; transcript_id "PB.1.40";
exon    31968   32178   .   +   .   gene_id "PB.1"; transcript_id "PB.1.30";
exon    32257   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.20";
transcript  30351   32332   .   +   .   gene_id "PB.28"; transcript_id "PB.28.309";
exon    30351   31677   .   +   .   gene_id "PB.58"; transcript_id "PB.58.900";
exon    31758   31871   .   +   .   gene_id "PB.10000"; transcript_id "PB.10000.1001";
exon    31968   32178   .   +   .   gene_id "PB.19897"; transcript_id "PB.19897.1087541";
exon    32257   32332   .   +   .   gene_id "PB.1"; transcript_id "PB.1.11";

https://docs.raku.org/language/regexes
https://raku.org

How to replace a string by using part of the string in the same line?

3 Answers3