
I have a file bla.tsv (FS = \t):

>hCoV-19/xxx/xxx-YYY/xxx
>hCoV-19/xxx/xxx-ZZZ/xxx

Clarifications:

  • A character that is written literally appears exactly like that on every line.
  • xxx stands for text that is always present but differs from line to line (it could be a group of letters, digits, or something else).
  • YYY and ZZZ are the patterns that I'm interested in; they can be digits or letters.

I want to transform the file so that it has a new first column:

YYY >hCoV-19/xxx/xxx-YYY/xxx
ZZZ >hCoV-19/xxx/xxx-ZZZ/xxx

I know that I need a regex that matches after the third / and goes back to the preceding -, but I haven't found one after many tries on https://regexr.com/. Do you have an idea of how to write the regex and how to put the result in the first column? Thanks

1 Answer

$ cat file
>hCoV-19/xxx/xxx-YYY/xxx
>hCoV-19/xxx/xxx-ZZZ/xxx
$ awk -F '[/-]' '{ printf "%s %s\n", $5, $0 }' file
YYY >hCoV-19/xxx/xxx-YYY/xxx
ZZZ >hCoV-19/xxx/xxx-ZZZ/xxx

The awk code above treats the data as lines that are divided into fields on either / or -. The fifth such field is the field that you want to prepend to each line, which is what the printf statement does.
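To see why the wanted value lands in field 5, it can help to print every field together with its index. This is only an illustrative sketch; the function name show_fields is made up for this example and is not part of the original commands.

```shell
# show_fields: hypothetical helper. Prints each field of every input line
# with its index, splitting on either / or - like the awk command above.
show_fields() {
    awk -F '[/-]' '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }'
}

# One line from the question, used as sample input.
printf '%s\n' '>hCoV-19/xxx/xxx-YYY/xxx' | show_fields
```

The dash in hCoV-19 causes a split of its own, which is why YYY is the fifth field rather than the fourth.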

If the - is not a good delimiter (it wouldn't be if the string before the first slash sometimes didn't contain a dash, for example), then use only / as a delimiter, split the third slash-delimited field on -, and prepend the second bit of the result to the line:

$ awk -F / '{ split($3,a,"-"); printf "%s %s\n", a[2], $0 }' file
YYY >hCoV-19/xxx/xxx-YYY/xxx
ZZZ >hCoV-19/xxx/xxx-ZZZ/xxx

Using sed:

$ sed 's/.*-\([^/]*\).*/\1 &/' file
YYY >hCoV-19/xxx/xxx-YYY/xxx
ZZZ >hCoV-19/xxx/xxx-ZZZ/xxx

Or, if you're on Plan9 or using a port of the Plan9 sed implementation, which has issues with the / inside the bracketed expression, use an alternative delimiter for the s/// command:

$ sed 's,.*-\([^/]*\).*,\1 &,' file
YYY >hCoV-19/xxx/xxx-YYY/xxx
ZZZ >hCoV-19/xxx/xxx-ZZZ/xxx

The regular expression used here captures the run of non-/ characters that follows the last - on the line. The replacement then prepends this captured substring and a space to the line.

Note that the main difference between this sed solution and the awk solution further up is that the awk code uses the field-like structure of each line, while the sed code is more "sloppy", just looking for a string of non-slash characters after a dash.
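The difference shows up on input that has a dash later on the line. The sample line below is made up for illustration (the question does not say such lines occur), and sed_id and awk_id are hypothetical wrapper names around the two commands from this answer.

```shell
# Made-up input line with an extra dash in the last field.
line='>hCoV-19/abc/def-YYY/gh-2021'

# sed: captures whatever follows the LAST dash on the line.
sed_id() { sed 's/.*-\([^/]*\).*/\1 &/'; }

# awk: always works on the third slash-delimited field.
awk_id() { awk -F / '{ split($3, a, "-"); printf "%s %s\n", a[2], $0 }'; }

printf '%s\n' "$line" | sed_id   # prints: 2021 >hCoV-19/abc/def-YYY/gh-2021
printf '%s\n' "$line" | awk_id   # prints: YYY >hCoV-19/abc/def-YYY/gh-2021
```

If dashes can appear after the third field, the awk variant is the safer choice of the two.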


The https://regexr.com/ site currently supports JavaScript regular expressions and Perl-compatible regular expressions (PCRE). You are not using either of those here, so whatever the site tells you may not work. awk uses POSIX extended regular expressions (EREs), and most other standard Unix tools for text manipulation, including sed, use POSIX basic regular expressions (BREs).
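As a small illustration of how the syntax differs, here is the same substitution written as a BRE and as an ERE. This is a sketch that assumes a sed supporting the -E option (GNU and BSD sed do); bre_id and ere_id are hypothetical names.

```shell
# BRE (sed's default): a capture group is written \( ... \).
bre_id() { sed 's/.*-\([^/]*\).*/\1 &/'; }

# ERE (sed -E): a capture group is written ( ... ) without backslashes.
# Assumes the local sed accepts -E.
ere_id() { sed -E 's/.*-([^/]*).*/\1 &/'; }

printf '%s\n' '>hCoV-19/xxx/xxx-YYY/xxx' | bre_id
printf '%s\n' '>hCoV-19/xxx/xxx-YYY/xxx' | ere_id
```

Both commands produce the same output; only the way the group is written changes. A pattern copied from a PCRE or JavaScript tester usually resembles the ERE form, and may additionally use features that neither POSIX flavour has.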

See also Why does my regular expression work in X but not in Y?

Kusalananda
  • Note that your sed code will not work in posixly sed due to the [^/] char class within forward slashes. Better change the delimiters. – guest_7 Mar 23 '21 at 11:02
  • Thanks for the detailed answer! – jasmine_hubs Mar 23 '21 at 11:53
  • @guest_7 I don't know of any sed implementation that has issues with this. Let me know if you find one, it would be good to know which one is special in this regard. In general, the characters in a bracketed expression are literal and not part of the sed syntax. – Kusalananda Mar 23 '21 at 11:58
  • @Kusalananda Probably irrelevant, but sed from this port of Plan9 tools (which does not claim to comply with POSIX, as far as I can see) complains about a "garbled command" if the command is 's/[^/]/x/', while it is fine with 's/[^\/]/x/' - where it notably does not consider \ as a literal character included in the bracket expression, contrary to other implementations. – fra-san Mar 23 '21 at 12:07
  • @Kusalananda Add a --posix option to your sed invocation and disable extended regex. It is not a question of a char being literal inside a bracketed expression, rather the s/...[^/]...// trips up the POSIX sed parser. – guest_7 Mar 23 '21 at 12:09
  • @guest_7 sed does not use extended regular expressions, so they can't be disabled (and I'm not using -E here). Installing GNU sed and using it with --posix makes no difference here. Using --posix with other sed implementations obviously does not work. Do you have a full example where using / inside a bracketed expression doesn't work? – Kusalananda Mar 23 '21 at 12:14
  • @fra-san Ah, thanks! I did not see your comment there until now. Yes, I have verified that this is in fact a sed that does not behave as other implementations. I will make a note of that in the answer. – Kusalananda Mar 23 '21 at 12:24