1

I'm working with a file that looks like the following, containing over 50,000 lines, each a gene ID followed by its sequence:

gene_A:3342234 CTCTTTCTTTTACGCCT
gene_A:1244-5205 CTCTTTCTTTTACGCCT
gene_A:1838438 CTCTTTCTTTTACGCCT
gene_B:1848584 CTCTTTCTTTTACGCCT
gene_B:1029-4920 CTCTTTCTTTTACGCCT
gene_C:3849029 CTCTTTCTTTTACGCCT

Each line has the gene ID, followed by a colon and then a reference number of 7-9 digits (some include dashes).

I want to replace the gene IDs with their actual names, for example geneA and geneB, whilst keeping the information that follows them. Desired output:

geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneC CTCTTTCTTTTACGCCT

This is my first time using sed, so I'm really not quite sure where to even start. I know how to replace all lines containing gene_A with 's/gene_A.*/geneA/' but I'm not sure how to preserve the information following the gene IDs.

jubilatious1
  • 1
    Could you please add your failing attempts? It's a very easy task for sed/perl/awk. You need a basic regex and a substitution s/to replace/replacement/modifiers – Gilles Quénot Feb 03 '24 at 21:36
  • One more thing, you need a capture group in sed. – Gilles Quénot Feb 03 '24 at 22:45
  • I'm not familiar with capture groups, how would they apply in this case? I know how to solve this line by line with a basic s/to replace/replacement/modifiers, however since this is quite a large file I'm looking for a solution where I could replace the first part of each line containing gene_idA with geneA – bryophyta Feb 03 '24 at 23:21
  • As was stated, edit the question and add what you have tried and where it didn't work. Also state whether the part to be replaced is always gene_id followed by the same kind of string, i.e. a single uppercase character followed by seven digits. Add this to the question and do not post it in the comments. – Nasir Riley Feb 03 '24 at 23:30
  • Thank you, I have updated my post! – bryophyta Feb 03 '24 at 23:44
  • 1
    (1) Do you really have genes whose IDs are gene_A and gene_B and whose names are geneA and geneB? If so, you could use s/_//. If that's not what you mean, edit your question to be more representative of your situation and your data.  (2) If you want to replace the gene IDs with their actual names, do 's/gene_A/geneA/' (without the .*). Or do 's/gene_A:/geneA/' if you want to remove the colons (or 's/gene_A:/geneA /' if you want to remove the colon and add a space).  (3) If you want to remove the reference numbers (but keep the information that follows them), *say that.* – G-Man Says 'Reinstate Monica' Feb 04 '24 at 18:12
  • Maybe something like this? https://unix.stackexchange.com/a/767403/227738 – jubilatious1 Feb 07 '24 at 21:04

4 Answers

1

I suspect your example is bad and you actually have to map gene IDs to gene names in some way rather than programmatically transform them, e.g. with a file like this:

$ cat ids2names
gene_A when
gene_B chapmen
gene_C billies

If so then you can, with any awk, do:

$ awk -F'[: ]' 'NR==FNR{map[$1]=$2; next} {print map[$1], $3}' ids2names file
when CTCTTTCTTTTACGCCT
when CTCTTTCTTTTACGCCT
when CTCTTTCTTTTACGCCT
chapmen CTCTTTCTTTTACGCCT
chapmen CTCTTTCTTTTACGCCT
billies CTCTTTCTTTTACGCCT
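
One thing to watch with the mapping approach is that an ID missing from ids2names would print as an empty name. A minimal sketch of a fallback that keeps the original ID in that case, assuming the same two-column ids2names layout as above (the gene_D line and the scratch directory are made up for illustration, not taken from the question):

```shell
# Work in a scratch directory so no existing file is overwritten.
tmp=$(mktemp -d)
cat > "$tmp/ids2names" <<'EOF'
gene_A when
gene_B chapmen
EOF
cat > "$tmp/file" <<'EOF'
gene_A:3342234 CTCTTTCTTTTACGCCT
gene_B:1029-4920 CTCTTTCTTTTACGCCT
gene_D:1234567 CTCTTTCTTTTACGCCT
EOF

# First file: build the ID -> name map.
# Second file: use the mapped name, or the unchanged ID if there is no entry.
mapped=$(awk -F'[: ]' '
    NR==FNR { map[$1] = $2; next }
    { name = ($1 in map) ? map[$1] : $1; print name, $3 }
' "$tmp/ids2names" "$tmp/file")
printf '%s\n' "$mapped"
rm -r "$tmp"
```

Testing with `$1 in map` rather than reading `map[$1]` directly avoids creating empty entries in the array for unknown IDs.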

Otherwise if a gene name really is just a gene ID with the _ removed as in your example then...

Using any sed:

$ sed 's/_\([^:]*\)[^ ]*/\1/' file
geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneC CTCTTTCTTTTACGCCT

or any awk:

$ awk -F'[_: ]' '{print $1 $2, $4}' file
geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneC CTCTTTCTTTTACGCCT

If the spaces in your input aren't always single blanks then change -F'[: ]' to -F'[:[:blank:]]+' or -F'[: \t]+' in the awk scripts (keep the _ when present) and [^ ] to [^[:blank:]] or [^ \t] in the sed script.
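
As a rough illustration of those whitespace-tolerant variants, here is a sketch run against made-up input that mixes a tab, several spaces and a single space as separators (the sample lines and variable names are assumptions for the demo). The sed version leaves the original separator untouched, while the awk version rebuilds each line with a single space:

```shell
# Sample input: separators are a tab, three spaces, and one space.
input=$(printf 'gene_A:3342234\tCTCTTTCTTTTACGCCT\ngene_B:1029-4920   CTCTTTCTTTTACGCCT\ngene_C:3849029 CTCTTTCTTTTACGCCT\n')

# sed: drop the "_" and everything from ":" up to the first blank character.
sed_out=$(printf '%s\n' "$input" | sed 's/_\([^:]*\)[^[:blank:]]*/\1/')

# awk: split on runs of "_", ":", space or tab, then rebuild the line.
awk_out=$(printf '%s\n' "$input" | awk -F'[_: \t]+' '{print $1 $2, $4}')

printf '%s\n' "$sed_out" "$awk_out"
```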

when chapmen billies

Ed Morton
  • 31,617
1

Using Raku (formerly known as Perl_6)

~$ raku -pe 's/^  \w+  <( \: <[0..9-]>+  //;'   file

#OR

~$ raku -pe 's/^ (\w+) \: <[0..9-]>+ /$0/;' file

Above are answers written in Raku, a member of the Perl-family of programming languages. These two answers use Raku's -pe sed-like autoprinting linewise flags (other answers can be envisioned that use Raku's -ne awk-like non-autoprinting linewise flags).

Briefly, the ^ start-of-string (line) is detected, followed by \w+ one-or-more alnums-or-underscores, followed by \: colon, followed by <[0..9-]>+ a custom character class consisting of Arabic digits and - hyphen. (To include Unicode digits, use <+ :N + [-]>+ instead).

Sample Input:

gene_A:3342234 CTCTTTCTTTTACGCCT
gene_A:1244-5205 CTCTTTCTTTTACGCCT
gene_A:1838438 CTCTTTCTTTTACGCCT
gene_B:1848584 CTCTTTCTTTTACGCCT
gene_B:1029-4920 CTCTTTCTTTTACGCCT
gene_C:3849029 CTCTTTCTTTTACGCCT

Sample Output (both code examples):

gene_A CTCTTTCTTTTACGCCT
gene_A CTCTTTCTTTTACGCCT
gene_A CTCTTTCTTTTACGCCT
gene_B CTCTTTCTTTTACGCCT
gene_B CTCTTTCTTTTACGCCT
gene_C CTCTTTCTTTTACGCCT

Explanation of each answer:

  1. Raku's <( capture marker excludes everything matched before it (here, the gene ID) from the final match; <( and )> can each be used on their own or together. What remains of the match, the colon and the reference number, is then "substituted-with-nothing" in the replacement half, i.e. deleted.

  2. Raku's () numerical capture parens are used to capture the gene name. The capture, now in $0, is placed in the replacement half, so the rest of the match (the colon and the reference number) is deleted.


Note: If it's truly necessary to "capture-then-modify" (i.e. removing underscores in the gene-names), then Raku lets you run code in the replacement half. Below translates all underscores into empty-string (i.e. nothing), returning gene-names with the underscores deleted:

~$ raku -pe 's/^ (\w+) \: <[0..9-]>+ /{$0.trans("_" => "")}/;'  file

#OR (more simply)

~$ raku -pe 's/^ (\w+) \: <[0..9-]>+ /$0.trans("_" => "")/;' file

https://docs.raku.org/language/regexes
https://raku.org

jubilatious1
  • 3,195
  • 8
  • 17
0
sed -E 's/(gene_[A-Z]):[0-9-]+/\1/' input.txt

This command uses the -E option to enable extended regular expressions and captures the gene ID in a group, (gene_[A-Z]), which must be followed by a colon and a sequence of digits and dashes, :[0-9-]+. The replacement \1 refers back to the captured gene ID, so the colon and the reference number are dropped. input.txt is the file containing the list of gene IDs followed by their sequences.
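
If the names cannot be derived from the IDs and have to be written out explicitly, one possible sketch is a separate substitution per gene, with the chosen name placed in the replacement half. The names geneA and geneB come from the question; the three sample lines and the variable names are only for the demo:

```shell
input='gene_A:3342234 CTCTTTCTTTTACGCCT
gene_A:1244-5205 CTCTTTCTTTTACGCCT
gene_B:1848584 CTCTTTCTTTTACGCCT'

# One substitution per gene: match the ID plus its reference number at the
# start of the line, and write the wanted name in the replacement.
renamed=$(printf '%s\n' "$input" | sed -E \
    -e 's/^gene_A:[0-9-]+/geneA/' \
    -e 's/^gene_B:[0-9-]+/geneB/')
printf '%s\n' "$renamed"
```

This only stays manageable for a handful of genes; with many IDs, a lookup file read by awk is likely the easier route.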

karel
  • 2,030
  • Where do I input the gene name for replacement? – bryophyta Feb 04 '24 at 00:13
  • I input the list of gene IDs followed by their sequence in a file that I named input.txt in this example, but you can name this file whatever you want and change the name of the file in the sed command so that it matches the name of this file. – karel Feb 04 '24 at 00:40
0

You could use sed to remove all unwanted characters:

$ sed  -E 's/[-:_0-9]//g' input_file
geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneC CTCTTTCTTTTACGCCT

You can also do it with the tr command:

$ tr -cd 'a-zA-Z\n ' < input_file
geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneC CTCTTTCTTTTACGCCT
Kusalananda
user9101329
  • If you use the same character list in tr as in your sed then tr -d ':_0-9-' is a bit more concise than tr -cd 'a-zA-Z\n ' (and would handle tabs in the input correctly), you just have to move the - to the end or add -- after -d (tr -d -- '-:_0-9') so the character list doesn't look like an option. – Ed Morton Feb 07 '24 at 10:01
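
A small sketch of the difference described in that comment, using made-up input in which the first line is tab-separated (sample data and variable names are assumptions for the demo):

```shell
# First line uses a tab as separator, second line a space.
input=$(printf 'gene_A:3342234\tCTCTTTCTTTTACGCCT\ngene_B:1029-4920 CTCTTTCTTTTACGCCT\n')

# Delete only the listed characters; "--" ends option parsing so the leading
# "-" in the set is not taken as an option. The tab survives.
deleted=$(printf '%s\n' "$input" | tr -d -- '-:_0-9')

# Keep only letters, newlines and spaces; the tab is deleted as well,
# which glues the name to the sequence on the first line.
complement=$(printf '%s\n' "$input" | tr -cd 'a-zA-Z\n ')

printf '%s\n' "$deleted" "$complement"
```

Both forms delete the listed characters anywhere on the line, which is harmless here only because the sequences contain nothing but letters.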