Using Raku (formerly known as Perl_6)
~$ raku -pe 's/^ \w+ <( \: <[0..9-]>+ //;' file
#OR
~$ raku -pe 's/^ (\w+) \: <[0..9-]>+ /$0/;' file
Above are answers written in Raku, a member of the Perl family of programming languages. Both answers use Raku's -pe flags, which process the input line by line and autoprint each line, as sed does (other answers can be envisioned that use Raku's -ne flags, which are awk-like: linewise but non-autoprinting).
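As a rough sketch of the -ne alternative mentioned above (not one of the original answers): with -ne nothing is printed automatically, so the result has to be output explicitly. Here .subst returns the modified copy of each line and .put prints it.

```raku
~$ raku -ne '.subst(/^ \w+ <( \: <[0..9-]>+ /, "").put;' file
```

The regex is the same as in the first answer; only the printing differs.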
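Briefly, the pattern matches ^ (start of string, which here is the start of each line), followed by \w+ (one or more alphanumerics or underscores), followed by \: (a literal colon, which must be escaped in a Raku regex), followed by <[0..9-]>+ (one or more characters from a custom character class consisting of the Arabic digits 0 to 9 and the hyphen). (To include Unicode digits, use <+ :N + [-]>+ instead.)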
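To see what the pattern described above actually consumes, here is a minimal standalone sketch. It uses two lines from the sample input and the pattern without the <( marker, so the whole match is visible; the variable names are only illustrative.

```raku
# Two lines taken from the sample input in this answer.
my @lines = 'gene_A:3342234 CTCTTTCTTTTACGCCT',
            'gene_B:1029-4920 CTCTTTCTTTTACGCCT';

for @lines -> $line {
    # Start of line, word characters, a literal colon, then digits/hyphens.
    my $m = $line ~~ / ^ \w+ \: <[0..9-]>+ /;
    say $m.Str if $m;
}
# prints:
# gene_A:3342234
# gene_B:1029-4920
```

The match stops at the first space because a space is neither a digit nor a hyphen, so the sequence that follows is left alone.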
Sample Input:
gene_A:3342234 CTCTTTCTTTTACGCCT
gene_A:1244-5205 CTCTTTCTTTTACGCCT
gene_A:1838438 CTCTTTCTTTTACGCCT
gene_B:1848584 CTCTTTCTTTTACGCCT
gene_B:1029-4920 CTCTTTCTTTTACGCCT
gene_C:3849029 CTCTTTCTTTTACGCCT
Sample Output (both code examples):
gene_A CTCTTTCTTTTACGCCT
gene_A CTCTTTCTTTTACGCCT
gene_A CTCTTTCTTTTACGCCT
gene_B CTCTTTCTTTTACGCCT
gene_B CTCTTTCTTTTACGCCT
gene_C CTCTTTCTTTTACGCCT
Explanation of each answer:
In the first answer, Raku's <( capture marker excludes everything before the colon from the match (the <( and )> markers can be used individually or together). What remains of the match, i.e. the colon and the numbers, is then "substituted-with-nothing" in the replacement half, i.e. deleted.
In the second answer, Raku's ( … ) positional capture parentheses capture the gene name. The whole match (gene name, colon, and numbers) is then replaced by the capture, which is now in $0, so the remainder of the match is deleted.
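The difference between the two approaches can be sketched by inspecting the match objects directly. This is an illustrative snippet, not part of the one-liners; $m1 and $m2 are hypothetical names.

```raku
my $line = 'gene_A:3342234 CTCTTTCTTTTACGCCT';

# First approach: <( narrows the overall match to what follows the marker.
my $m1 = $line ~~ / ^ \w+ <( \: <[0..9-]>+ /;
say $m1.Str;       # :3342234

# Second approach: the parentheses store the gene name as capture 0,
# while the overall match still spans name, colon, and numbers.
my $m2 = $line ~~ / ^ (\w+) \: <[0..9-]>+ /;
say $m2[0].Str;    # gene_A
say $m2.Str;       # gene_A:3342234
```

In a substitution, the overall match is what gets replaced, which is why the first answer replaces with nothing and the second replaces with $0.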
Note: If it's truly necessary to "capture-then-modify" (e.g. to remove underscores from the gene names), then Raku lets you run code in the replacement half. Below, trans translates all underscores into the empty string (i.e. nothing), returning gene names with the underscores deleted:
~$ raku -pe 's/^ (\w+) \: <[0..9-]>+ /{$0.trans("_" => "")}/;' file
#OR (more simply)
~$ raku -pe 's/^ (\w+) \: <[0..9-]>+ /$0.trans("_" => "")/;' file
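The same "capture-then-modify" idea can be tried outside a one-liner. The sketch below assumes the .subst method with a callable replacement, which receives the match object (named $m here purely for illustration), and applies it to one line of the sample input.

```raku
my $line = 'gene_A:3342234 CTCTTTCTTTTACGCCT';

# The callable is run for the match; $m[0] is the captured gene name,
# and trans removes its underscores before it is put back.
my $result = $line.subst(
    / ^ (\w+) \: <[0..9-]>+ /,
    -> $m { $m[0].trans("_" => "") }
);
say $result;
```

Because .subst returns a new string, $line itself is left unmodified, unlike the in-place s/// used with -pe.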
https://docs.raku.org/language/regexes
https://raku.org
s/to replace/replacement/modifiers – Gilles Quénot Feb 03 '24 at 21:36
sed. – Gilles Quénot Feb 03 '24 at 22:45
s/to replace/replacement/modifiers, however since this is quite a large file I'm looking for a solution where I could replace the first part of each line containing gene_idA with geneA – bryophyta Feb 03 '24 at 23:21
gene_id followed by a that same string which is a single uppercase character followed by seven digits. Add this to the question and do not post it in the comments. – Nasir Riley Feb 03 '24 at 23:30
gene_A and gene_B and whose names are geneA and geneB? If so, you could use s/_//. If that's not what you mean, edit your question to be more representative of your situation and your data. (2) If you want to replace the gene IDs with their actual names, do 's/gene_A/geneA/' (without the .*). Or do 's/gene_A:/geneA/' if you want to remove the colons (or 's/gene_A:/geneA /' if you want to remove the colon and add a space). (3) If you want to remove the reference numbers (but keep the information that follows them), *say that.* – G-Man Says 'Reinstate Monica' Feb 04 '24 at 18:12