-1

I'm trying to understand exactly which expressions the regular expression (^[0-9]..[a-zA-Z ]+$) matches in the grep command (Linux terminal).

I know that if I'd write the following command:

grep ^[0-9]..[a-zA-Z] filename.txt

I will detect any line that contains expressions such as 92afg. But I'm not sure what the +$ means, or what kind of expressions I will be able to detect with the command:

grep ^[0-9]..[a-zA-Z]+$ filename.txt

I tried creating a new text file and typing expressions that I thought would be detected, but none of them matched, so I'd appreciate an explanation for this.

AdminBee
  • 22,803

2 Answers

7

Let's break it down. First of all, note that this RegExp uses the "Extended regular expression" syntax (ERE) - the + is a metacharacter that doesn't work in the "Basic regular expression" syntax that grep uses by default (meaning it would match itself and require a literal + at that position), so if you want to use that RegEx with grep, you will need to pass the -E option.

  • The ^ is an anchor that ties this position of the regex to the start of the line.
  • The [0-9] is a character list and will match any single(1) character that falls into the sort range between 0 and 9. What exactly that comprises depends on the "collation order", determined among others by the environment variable LC_COLLATE.
  • The . matches any single character, so two .. means "any two characters".
  • The [a-zA-Z] again is a character list and will match characters(1) that fall between a and z and in addition those that fall between A and Z. Again, what that means depends on the collation order!
  • The + means "one or more of the previous"
  • The $ is an anchor that ties this position of the regex to the end of the line.

So, your RegEx is intended to(1) match any lines that

  • start with any digit
  • followed by any two characters
  • and only contain letters (but at least one) up to the end of the line.

(1)for what it might actually do, see below
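As a rough illustration of the breakdown above, here is a small sketch that feeds a few made-up sample lines to grep. It assumes a POSIX-style shell, uses -E with a quoted pattern, and sets LC_ALL=C (my choice for the demo, since LC_ALL overrides LC_COLLATE when both are set) so that the ranges behave as on the ASCII table.

```shell
# Made-up sample lines, one per argument; not taken from the question.
matches=$(printf '%s\n' '92afg' '1xyZ' '9..a' 'a12bc' '12a' '123abc4' |
    LC_ALL=C grep -E '^[0-9]..[a-zA-Z]+$')

# Print the lines that matched the whole pattern.
printf '%s\n' "$matches"
```

This should print 92afg, 1xyZ and 9..a. The line a12bc does not start with a digit, 12a has no letter left after the two arbitrary characters, and 123abc4 ends in a non-letter.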

Some notes

  1. In your example, you use the regular expression unquoted. That means any characters are open to interpretation by the shell before they are passed to the grep command. If your pattern contains $ or globbing characters (*, ? and [...] character lists!), the shell may try to perform variable expansion (thereby replacing parts of your RegEx) or expand globbing patterns into possibly multiple filenames, so that in the end you would have more arguments on the command-line than you originally intended. Other characters that are special to the shell (>, #, ; and the like) might lead to even more unexpected behavior. You should use

    grep -E '^[0-9]..[a-zA-Z]+$' filename.txt
    

    instead. Note that you can get rid of the opening and closing anchors by using the -x flag to enforce "whole-line" matching:

    grep -x -E '[0-9]..[a-zA-Z]+' filename.txt
    
  2. Character lists containing ranges (such as a-z) are dangerous because they might not give you what you think. Naively one might expect them to match all characters that lie between the start and end character on the ASCII table, but that is only true for the C locale. In other locales (and in particular in the usually set system locales such as en_US.UTF-8) the collation order is something like aAbB ... zZ so a-z would also match most upper-case letters. Also, the match is actually not on the level of single characters but "collation elements" which means in some locales, even combinations of several letters may match (e.g. dzs in Hungarian)! See this answer (or, in general, most answers by @Stéphane Chazelas about pattern matching) for more insight. If you want to ensure that your ranges work, set the collation order at least for the given command via

    LC_COLLATE="C" grep -E ' ... ' filename.txt
    
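The shell interference described in note 1 can be reproduced with a small sketch. It assumes a Bourne-style shell such as bash or dash (other shells treat ^ and [...] differently), and the oddly named file ^1..a is purely hypothetical, created in a temporary directory so that the unquoted pattern has something to expand to.

```shell
# Create a scratch directory containing one deliberately odd filename.
tmpdir=$(mktemp -d)
touch "$tmpdir/^1..a"

# Unquoted: the shell treats the pattern as a glob and expands it to the filename.
unquoted=$(cd "$tmpdir" && printf '%s\n' ^[0-9]..[a-zA-Z])

# Quoted: the pattern reaches the command unchanged.
quoted=$(cd "$tmpdir" && printf '%s\n' '^[0-9]..[a-zA-Z]')

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"
rm -rf "$tmpdir"
```

With the file present, the unquoted form is replaced by ^1..a before the command ever sees it, whereas the quoted form stays ^[0-9]..[a-zA-Z]. Had grep been the command, it would have searched for the filename rather than for the intended regular expression.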
AdminBee
  • 22,803
  • It would match unusual filenames (starting with ^), but still... – Kusalananda Nov 03 '21 at 10:58
  • 1
    @they, it's not only zsh and bash -O failglob. Also fish (though [...] is not a glob operator there), csh, tcsh and pre-Bourne shells. nullglob would also be a problem in the shells that have it. ^ is also special in many shells. – Stéphane Chazelas Nov 03 '21 at 10:59
  • 1
    +1. Also worth pointing out that it's the + that makes this require -E for Extended Regular Expressions (ERE) - the rest of the regex works the same in either BRE or ERE. Some, but by no means all, Basic Regular Expression (BRE) engines allow you to backslash-escape a + as \+ to make it mean "one-or-more" like in an ERE rather than a literal + character. – cas Nov 03 '21 at 11:33
  • @StéphaneChazelas Thank you; I included that too. You really ought to write up the "reference answer" that explains the entire misery ;) (or maybe you already have, and I just didn't find it so far ...) – AdminBee Nov 03 '21 at 13:08
  • Your answer still says [0-9] or [a-zA-Z] matches a single character though which remains potentially misleading (especially if that's intended as input sanitisation). – Stéphane Chazelas Nov 03 '21 at 14:27
  • @StéphaneChazelas I had mentioned it in the 2nd bullet of the "notes" section where I went more into detail on the pitfalls of character ranges. I will include a reference to that in the "breakdown" section so that it doesn't get overlooked, but for the sake of readability I would hesitate to explain it in detail there already ... – AdminBee Nov 03 '21 at 14:31
4

+ stands for "one or more repetitions of the previous", $ is "end of line". Note the difference versus *, which means "zero or more repetitions".

So it basically means: Any line starting with a digit, followed by two characters of any kind and subsequently one or more (possibly capital) letters¹ until the end of the line.

(¹ be careful, some locales might not only have the 26 letters you'd expect in A-Z or a-z, e.g. è or ŷ depending on language)
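To make the difference between + and * concrete, here is a minimal sketch on three made-up lines. LC_ALL=C is my addition so that the ranges only cover the 26 ASCII letters in each case.

```shell
# Three made-up input lines, stored with real newlines.
input=$(printf '%s\n' '1ab' '1abc' '1ab7')

# "+" needs at least one letter after the digit and the two arbitrary characters.
plus=$(printf '%s\n' "$input" | LC_ALL=C grep -E '^[0-9]..[a-zA-Z]+$')

# "*" also accepts zero letters there.
star=$(printf '%s\n' "$input" | LC_ALL=C grep -E '^[0-9]..[a-zA-Z]*$')

printf 'plus:\n%s\nstar:\n%s\n' "$plus" "$star"
```

The + variant should only report 1abc, while the * variant reports both 1ab and 1abc. The line 1ab7 matches neither, because the 7 is not a letter and the $ anchor requires the letters to run to the end of the line.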

For a good guide to regexes, I strongly suggest grymoire's beautiful website, which I also heartily recommend for e.g. sed and awk.


Why doesn't it match?

+ is part of the extended regular expressions (and otherwise is interpreted as a literal +-sign).

So for using + as "one or more repetitions", use the -E-flag in grep and also quote the regex to avoid any issues with shell special characters:

grep -E '^[0-9]..[a-zA-Z]+$' filename.txt
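The effect of leaving out -E can be seen in a short sketch with two made-up lines, one of which contains a literal + sign. LC_ALL=C is again my own addition to keep the ranges predictable.

```shell
# Two made-up input lines; the second one ends in a literal "+".
input=$(printf '%s\n' '92afg' '1xya+')

# Without -E (basic syntax), "+" only matches a literal plus sign.
bre=$(printf '%s\n' "$input" | LC_ALL=C grep '^[0-9]..[a-zA-Z]+$')

# With -E (extended syntax), "+" means "one or more of the previous".
ere=$(printf '%s\n' "$input" | LC_ALL=C grep -E '^[0-9]..[a-zA-Z]+$')

printf 'without -E: %s\nwith -E:    %s\n' "$bre" "$ere"
```

Without -E only 1xya+ is reported, and with -E only 92afg is. This would explain why ordinary test lines such as 92afg did not match in the original attempt.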
ilkkachu
  • 138,973
FelixJN
  • 13,566
  • I'd change that some locales might not have the 26 letters to most locales have a lot more than the 26 letters. I'm not aware of any locale on any system where [A-Z] doesn't match at least ABCDEFGHIJKLMNOPQRSTUVWXYZ. On some, it matches collating elements that are made of more than one character (like Dzs in some Hungarian locales) – Stéphane Chazelas Nov 03 '21 at 10:54
  • 1
    [A-Z], $ and sometimes ^ are also special shell operators, so should be quoted: grep -E '...'. See also the -x option to avoid having to use ^ and $ – Stéphane Chazelas Nov 03 '21 at 10:55
  • @StéphaneChazelas Yes, that would be more fitting - adjusted. – FelixJN Nov 03 '21 at 11:23