4

I'm trying to match a line in a textfile with

if [[ ${regel} =~ ([\s][CN][G]{2}[A]{2}[T]) ]];

I also tried \A and \b instead of \s. A couple of examples of things I tried:

if [[ ${regel} =~ (\A[CN][G]{2}[A]{2}[T]) ]];
if [[ ${regel} =~ (\b[CN][G]{2}[A]{2}[T]) ]];
if [[ ${regel} =~ ([\A][CN][G]{2}[A]{2}[T]) ]];
if [[ ${regel} =~ ([\b][CN][G]{2}[A]{2}[T]) ]];

None of these match anything. If I remove the first element, leaving just

if [[ ${regel} =~ ([CN][G]{2}[A]{2}[T]) ]];

it matches what I want, but I also want it to match the space in front, so that it does not pick up the same sequence in the middle of a line as well.

Example of a line that should match:

OZBMN6HH1KI CGGAATGGGGGGGGGGGGGGGCGAGAATCTGAAATAGAGTGGTGACGTGCTGCGTTGACATAGGTCCTAGGGACCACCAG

What am I doing wrong? How can I make it match ␣CGGAAT?

ilkkachu

4 Answers

3

bash regexps in [[ =~ regex ]] are POSIX extended regexps. On systems whose extended regexps have extensions beyond what POSIX specifies (like GNU regexps that support \s (though not inside bracket expressions) or \b), you can only use them in bash as part of an unquoted expansion (unless you turn on bash-3.1 compatibility):

[[ a =~ \ba ]]                    # returns false
[[ a =~ $(printf %s '\ba') ]]     # returns true on GNU systems
BASH_COMPAT=3.1; [[ a =~ '\ba' ]] # returns true on GNU systems
re='\ba'; [[ a =~ $re ]]          # returns true on GNU systems.

If by \A you mean start of subject, then we're talking of perl or perl-compatible regexps, which are again different regexps.

Standard EREs don't have a concept of multiline mode where ^ could match at the beginning of the subject but also after each newline character like when using perl's (?m). Some ERE implementations like ast-open's ones do support it as an extension ([[ a =~ \Aa ]] does work in ksh93), but in any case that multiline mode would not be the default, so you might as well use ^ instead of \A.

Even in perl, [\A] would not match on start of subject. [...] is meant to match one character (or sometimes collating element). [\A] would match on either A or \ in ERE or A in perl REs. [\b] would match on b or \ in ERE and on the backspace character in perl RE. [\s] on s or \ in ERE and the same as \s (whitespace character) in perl RE.
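
A minimal sketch of the bracket-expression behaviour described above, assuming bash with a POSIX-conforming ERE implementation. The variable names `re` and `subject` are only for illustration:

```bash
#!/usr/bin/env bash
# In a POSIX ERE bracket expression the backslash is an ordinary character,
# so [\s] means "a backslash or the letter s", not "a whitespace character".
re='[\s]'

for subject in 's' '\' ' '; do
    if [[ $subject =~ $re ]]; then
        printf '<%s> matches\n' "$subject"
    else
        printf '<%s> does not match\n' "$subject"
    fi
done
```

The letter s and the backslash should be reported as matching and the space as not matching, which is why the `[\s]` attempt in the question never finds the space in front of the sequence.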

If you want to match on a [CN]G{2}A{2}T at the start of the subject (\A) or following a non-word character (\b), with standard EREs, you would do:

[[ $var =~ (^|[^[:alnum:]_])[CN]G{2}A{2}T ]]
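
As a rough check of that expression, here is a small sketch that stores it in a variable and tries it on a shortened copy of the line from the question. The `midline` value is made up for the test, and the names `re`, `line`, `midline` and `matched` are not from the original post:

```bash
#!/usr/bin/env bash
# ERE from the answer: start of subject or a non-word character,
# followed by C or N, then GGAAT.
re='(^|[^[:alnum:]_])[CN]G{2}A{2}T'

line='OZBMN6HH1KI CGGAATGGGGGGGGGGGGGGGCGAGAATCTG'   # shortened sample line
midline='OZBMN6HH1KI GGGCGGAATGGG'                   # made-up: sequence only mid-word

matched=''
if [[ $line =~ $re ]]; then
    # BASH_REMATCH[0] holds the whole match, including the leading space.
    matched=${BASH_REMATCH[0]}
    printf 'matched: <%s>\n' "$matched"
fi

if [[ $midline =~ $re ]]; then
    echo 'mid-line sequence matched (unexpected)'
else
    echo 'mid-line sequence ignored'
fi
```

The match is saved right away because the next `=~` test overwrites `BASH_REMATCH`.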
2

\A, \b and \s are Perl for "start of string", "word boundary" and "a whitespace character", respectively. (See the perlre man page) They're not supported in the extended regular expressions that Bash uses.

In ERE, the start of string is represented as ^, and any whitespace character can be matched with [[:space:]], or if you want to just match a space, with a literal space. On some systems (at least GNU), you can represent the left word boundary with \< and the right one with \>. On others, they might match the literal < and >.

However, with spaces and backslashes, you run into problems with how Bash parses the regular expression inside the conditional. Literal unquoted space ends the RE, and backslash still escapes characters. To get around that, store the regex in a variable first:

re=' [CN]GGAAT'
if [[ $regel =~ $re ]]; then echo y; fi

or, if \< works and you want to use that:

re='\<[CN]GGAAT'
if [[ $regel =~ $re ]]; then echo y; fi
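
Since the question is about lines of a text file, here is a sketch of the variable approach inside a read loop, using the portable [[:space:]] class instead of \<. The here-document stands in for the real file, its second and third lines are invented test data, and the `matches` counter is only there for illustration:

```bash
#!/usr/bin/env bash
# Keep the regex in a variable so bash does not re-interpret
# the brackets or any backslashes inside [[ ... ]].
re='[[:space:]][CN]GGAAT'

matches=0
while IFS= read -r regel; do
    if [[ $regel =~ $re ]]; then
        matches=$((matches + 1))
        printf 'match: %s\n' "$regel"
    fi
done <<'EOF'
OZBMN6HH1KI CGGAATGGGGGG
OZBMN6HH1KI GGGCGGAATGGG
OZBMN6HH1KJ NGGAATCTGAAA
EOF

printf 'lines matched: %d\n' "$matches"
```

To read from an actual file, the here-document can be replaced with `done < "$file"`, where `$file` is whatever path you are processing.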
ilkkachu
1

You can match a space with a quoted space:

if [[ ${regel} =~ ' '[CN]G{2}A{2}T  ]]

I removed the [] around single characters.

meuh
1

Replace [\s] with [[:space:]]. I'm not sure what the origin of [\s] is, but others have had a similar misconception. Hence, the correct form is

if [[ ${regel} =~ ([[:space:]][CN][G]{2}[A]{2}[T]) ]];
Sparhawk
  • 1
    \s comes from Perl, it matches "any whitespace character", similar to [[:space:]] (though it appears it's not exactly the same). [[:space:]] of course matches not just space, but tabs (and some less common whitespace) too. – ilkkachu Nov 10 '18 at 12:45
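
To illustrate the difference the comment points out, a short sketch with a tab-separated line (invented for the example; the variable names are also just illustrative). A literal space in the regex does not match the tab, while [[:space:]] does:

```bash
#!/usr/bin/env bash
# Hypothetical input where the ID and the sequence are separated by a tab.
tab_line=$'OZBMN6HH1KI\tCGGAATGGGG'

space_re=' [CN]GGAAT'             # literal space only
class_re='[[:space:]][CN]GGAAT'   # space, tab and other whitespace

if [[ $tab_line =~ $space_re ]]; then
    echo 'literal space: match'
else
    echo 'literal space: no match'
fi

if [[ $tab_line =~ $class_re ]]; then
    echo '[[:space:]]: match'
else
    echo '[[:space:]]: no match'
fi
```

Which of the two is right depends on whether the file is guaranteed to use single spaces as separators.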