So I have certain text set up in the second and third columns of my file like so:

GO:0005634^cellular_component^nucleus`GO:0003677^molecular_function^DNA binding`  

I want to get rid of all of the text related to function and have the output as so:

GO:0005634`GO:0003677

I'm not sure how to approach this using sed or awk.

Note: the number of GO:xxxxxxx terms varies from line to line.

slm
    Register your account; then you'll be able to edit your question (and accept answers, and comment on answers, and...) – Jeff Schaller Jul 05 '18 at 18:53

4 Answers

1

This does what I believe you're asking for. NOTE: input.txt is your input file.

just sed

$ sed 's/\^[^`]*//g' input.txt
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`

Explanation

The sed expression removes every substring that starts with a caret (^) and continues through any characters other than a backtick. Each match stops just before the next backtick, so the backticks themselves are kept. The g flag repeats the substitution for every such match on the line, which removes all of the ^.... strings.
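The output above still ends in a backtick, while the question asks for none. As a minimal sketch, a second sed expression can strip a trailing backtick after the first pass. The function name strip_go_annotations is hypothetical and used only for illustration.

```shell
# Hypothetical helper: reads lines on stdin, writes cleaned lines to stdout.
strip_go_annotations() {
    # 1st expression: delete each "^..." run up to (not including) the next backtick.
    # 2nd expression: delete a single backtick at the end of the line, if present.
    sed -e 's/\^[^`]*//g' -e 's/`$//'
}

printf '%s\n' 'GO:0005634^cellular_component^nucleus`GO:0003677^molecular_function^DNA binding`' |
    strip_go_annotations
# prints: GO:0005634`GO:0003677
```

To run it on a file, redirect the file into the function, for example strip_go_annotations < input.txt.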

grep + paste + sed

$ grep -o 'GO:[0-9]\+' input.txt | paste -d'`' - - | sed 's/$/`/'
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`
GO:0005634`GO:0003677`

Explanation

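grep -o prints every GO:XXXXX string from input.txt on its own line, paste joins them back together two at a time with a backtick between each pair, and the final sed appends a backtick to the end of each line. Because paste is given two - operands, this version assumes exactly two GO terms per input line.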
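Since the question notes that the number of GO terms varies per line, a fixed two-column paste may regroup them incorrectly. One possible sketch that keeps the per-line grouping uses awk's match() in a loop; it relies only on POSIX awk features. The function name extract_go_ids is hypothetical, and the sketch scans the whole line rather than specific columns.

```shell
# Hypothetical helper: for each input line, print its GO IDs joined by backticks.
extract_go_ids() {
    awk '{
        out = ""
        rest = $0
        # Repeatedly find the next GO:<digits> token in the unscanned remainder.
        while (match(rest, /GO:[0-9]+/)) {
            id = substr(rest, RSTART, RLENGTH)
            out = (out == "" ? id : out "`" id)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }'
}

printf '%s\n' \
    'GO:0000001^a^b`GO:0000002^c^d`GO:0000003^e^f`' \
    'GO:0000004^x^y`' |
    extract_go_ids
# prints:
# GO:0000001`GO:0000002`GO:0000003
# GO:0000004
```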

slm
1

It looks like the data uses backticks as record separators and circumflex as field delimiters.

printf 'GO:0005634^cellular_component^nucleus`GO:0003677^molecular_function^DNA binding`' |
awk -F '^' -v RS='`' -v ORS='`' '{ print $1 }'

This prints only the first field of each record (the GO term), with backticks as the output record separator.

Output:

GO:0005634`GO:0003677`

(no trailing newline)
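The question mentions that the GO text sits in the second and third columns of the file. A line-oriented sketch that touches only those two columns could look like the following. It assumes the columns are tab-separated, which the question does not state, and the function name clean_go_columns is hypothetical.

```shell
# Hypothetical helper; assumes tab-separated columns (not confirmed by the question).
clean_go_columns() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        # Only columns 2 and 3 are modified; all other columns pass through unchanged.
        for (i = 2; i <= 3 && i <= NF; i++) {
            gsub(/\^[^`]*/, "", $i)   # drop each "^..." run up to the next backtick
            sub(/`$/, "", $i)         # drop a trailing backtick, if any
        }
        print
    }'
}

printf 'gene1\tGO:0005634^cellular_component^nucleus`GO:0003677^molecular_function^DNA binding`\tGO:0000001^biological_process^x`\tkeep^this\n' |
    clean_go_columns
```

For this sample line, columns 2 and 3 become GO:0005634`GO:0003677 and GO:0000001, while the first and fourth columns are left as they were.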

Kusalananda
0

With GNU Awk (gawk):

gawk 'BEGIN{FPAT="`?GO:[0-9]+"; OFS=""} {$1=$1} 1' file

Example:

$ echo 'GO:0005634^cellular_component^nucleus`GO:0003677^molecular_function^DNA binding`' | 
  gawk 'BEGIN{FPAT="`?GO:[0-9]+"; OFS=""} {$1=$1} 1'
GO:0005634`GO:0003677
steeldriver
0
perl -lne 'print /((?:^|`)GO:\d+)/g' genes.file

Explanation:

  • -n runs the code once per input line without printing anything automatically, so output happens only through the explicit print. -l strips the trailing newline from each input line and adds one to each print.
  • The regex /((?:^|`)GO:\d+)/g matches, in the current record, the string GO: followed by digits and preceded by either the beginning of the line or a backquote. With the /g option, every such match is captured, and the resulting list is handed to print, which writes the items to stdout joined by the default output field separator ($,), which is empty.