Deleting text after a character multiple times in one column

Question

So I have certain text set up in the second and third columns of my file like so:

GO:0005634^cellular_component^nucleus`GO:0003677^molecular_function^DNA binding`

I want to get rid of all of the text related to function and have the output as so:

GO:0005634`GO:0003677

I'm not sure how to approach this using sed or awk.

Note: the number of GO:xxxxxxx terms varies from line to line.


slm · Answer 1 · 2018-07-06T08:07:34.463

This does what I believe you're asking for. NOTE: input.txt is your input file.

just sed

$ sed 's/\^[^`]*//g' input.txt
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`

Explanation

sed removes every substring that begins with a caret (^) and runs up to, but not including, the next backtick. The bracket expression [^`]* matches any run of characters other than a backtick, so each match stops just before the backtick and the backtick itself is kept. The g flag repeats the substitution for every such match on the line. This removes all the ^.... strings and leaves the GO IDs separated (and terminated) by backticks.
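The desired output in the question has no trailing backtick, whereas the command above leaves one. A minimal sketch that adds a second substitution to drop it, assuming the GO field runs to the end of the line as in the sample (the variable `line` and the function name `strip_go` are made up for illustration):

```shell
# Sample line copied from the question.
line='GO:0005634^cellular_component^nucleus`GO:0003677^molecular_function^DNA binding`'

strip_go() {
    # 1st expression: delete each caret and everything after it, up to
    #                 (but not including) the next backtick.
    # 2nd expression: delete a backtick at the very end of the line, if any.
    sed -e 's/\^[^`]*//g' -e 's/`$//'
}

printf '%s\n' "$line" | strip_go
```

If the GO field is followed by further columns, the second expression would need to target the backtick before the column separator instead of the end of the line.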

grep + paste + sed

$ grep -o 'GO:[0-9]\+' input.txt | paste -d'`' - - | sed 's/$/`/'
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`

Explanation

grep -o prints each GO:XXXXX string from input.txt on its own line, paste joins every two of those lines with a backtick between them, and the final sed appends a backtick to the end of each line. Note that paste - - assumes exactly two GO terms per input line.
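Since the question says the number of GO terms varies per line, the two-column paste may regroup terms across lines. One possible workaround, sketched here under the assumption that processing the file one line at a time is fast enough, is to run grep and `paste -s` per input line (the function name `join_go_terms` is hypothetical; `grep -o` is not in POSIX but is available in GNU and BSD grep):

```shell
join_go_terms() {
    # Read the input one line at a time so terms from different
    # lines are never joined together.
    while IFS= read -r l; do
        # grep -o: print each GO ID on its own line.
        # paste -s: join those lines into one, separated by a backtick.
        printf '%s\n' "$l" | grep -o 'GO:[0-9][0-9]*' | paste -s -d '`' -
    done
}

printf '%s\n' \
    'GO:0000001^a^b`GO:0000002^c^d`GO:0000003^e^f`' \
    'GO:0000004^x^y`' | join_go_terms
```

This spawns two processes per line, so for large files the sed approach is likely the better choice.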


score 1 · Answer 2 · answered Jul 06 '18 at 08:01

It looks like the data uses backticks as record separators and circumflex as field delimiters.

printf 'GO:0005634^cellular_component^nucleus`GO:0003677^molecular_function^DNA binding`' |
awk -F '^' -v RS='`' -v ORS='`' '{ print $1 }'

This prints only the first field of each record (the GO term), with backticks as the output record separator.

Output:

GO:0005634`GO:0003677`

(no trailing newline)

score 0 · Answer 3 · answered Jul 05 '18 at 20:17


With GNU Awk (gawk):

gawk 'BEGIN{FPAT="`?GO:[0-9]+"; OFS=""} {$1=$1} 1' file

Example:

$ echo 'GO:0005634^cellular_component^nucleus`GO:0003677^molecular_function^DNA binding`' | 
  gawk 'BEGIN{FPAT="`?GO:[0-9]+"; OFS=""} {$1=$1} 1'
GO:0005634`GO:0003677
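FPAT is a gawk extension. Where gawk is not installed, a similar result can be sketched with the standard awk match() function, which sets RSTART and RLENGTH for each hit. The function name `extract_go` is made up for illustration, and the loop is an alternative technique rather than what the answer itself uses:

```shell
extract_go() {
    awk '{
        out = ""; sep = ""; rest = $0
        # Repeatedly find the next GO ID in the unprocessed remainder.
        while (match(rest, /GO:[0-9]+/)) {
            out = out sep substr(rest, RSTART, RLENGTH)
            sep = "`"                               # separator before every ID but the first
            rest = substr(rest, RSTART + RLENGTH)   # continue after this match
        }
        print out
    }'
}

echo 'GO:0005634^cellular_component^nucleus`GO:0003677^molecular_function^DNA binding`' | extract_go
```

Because the separator is only added between IDs, the output has no trailing backtick regardless of how many GO terms a line holds.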

steeldriver

score 0 · Answer 4 · answered Jul 06 '18 at 07:55

perl -lne 'print /((?:^|`)GO:\d+)/g' genes.file

Explanation:

Run perl in line-by-line mode (-n), which reads each record without printing it automatically, so output happens only through the explicit print; -l strips the newline on input and adds it back on output.
The regex /((?:^|`)GO:\d+)/g matches, in the current record, the string GO: followed by digits and preceded by either the beginning of the line or a backquote. Because of the /g flag, it is matched as many times as it occurs, and the list of matches is handed to print, which writes them to stdout joined by the default output field separator ($,), which is empty.
