3

I have seen the pattern below used in several places (even on Stack Overflow) as an example of email address validation.

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b

The above is taken from https://www.regular-expressions.info/tutorial.html, which says:

 this pattern describes an email address. 

This pattern does not account for lowercase letters (unless I am missing something).

Is there anything further I need to understand about this pattern? If it cannot really be used in production, why is it so popular?

samshers
  • 678
  • 4
    See How to validate an email address using a regular expression? to understand what’s involved in checking the format of an email address (let alone validating it). You need PCRE or an equivalent regex engine to handle the grammar defined in RFC5322; alternatively, you can use a parser. – Stephen Kitt Sep 15 '20 at 08:14
  • 4
    The title doesn't match the body, and is probably why people are thinking that this is opinion-based. The title invites the response Well how on Earth should we know what goes through people's heads when they copy this stuff?. Whereas the body asks the better and quite different question of whether this regular expression can be used in production. – JdeBP Sep 15 '20 at 08:26
  • @JdeBP, like I said in the question, the pattern does not match lowercase, which is quite basic, so using it in production is not correct. Apart from that, I am more interested in knowing whether there is a reason why this pattern is so popular – samshers Sep 15 '20 at 08:29
  • 2
    Email addresses are not case-sensitive (with the possible exception of something from the 1980s -- UseNET?). You are probably expected to run the address through toupper() as a first step. – Paul_Pedant Sep 15 '20 at 08:37
  • 4
    No, it's only a MAY and a SHOULD. Only "Postmaster" MUST be case-insensitive. – JdeBP Sep 15 '20 at 08:52
  • 28
    The very sentence you quote from the linked page includes a link to a page that states, among other things, "This regex is intended to be used with your regex engine’s “case insensitive” option turned on. (You’d be surprised how many “bug” reports I get about that.)" – fra-san Sep 15 '20 at 09:41
  • @fra-san Please submit that as an answer since it indicates that OP has ignored sufficient context that his question is invalid. – Mark Morgan Lloyd Sep 16 '20 at 11:29
  • The pattern is also incomplete. The apostrophe (single quote) is a valid character to the left of the @ – doneal24 Sep 16 '20 at 16:56
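
To make the two points from the comments above concrete, here is a minimal sketch in Python (my choice of language; the thread does not name one). The sample addresses and the names `strict`, `relaxed` and `matches_whole` are made up for illustration. It compares the pattern with and without the case-insensitive option, then tries an address with an apostrophe.

```python
import re

# Pattern copied verbatim from the question.
PATTERN = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b"

strict = re.compile(PATTERN)                  # case-sensitive, as the question read it
relaxed = re.compile(PATTERN, re.IGNORECASE)  # case-insensitive, as the source page intends


def matches_whole(regex, text):
    """Return True only if the regex matches the entire string."""
    return regex.fullmatch(text) is not None


lower = "user@example.com"          # illustrative address
upper = "USER@EXAMPLE.COM"          # same address, upper case
apostrophe = "o'brien@example.com"  # illustrative address with an apostrophe

print(matches_whole(strict, lower))        # False
print(matches_whole(strict, upper))        # True
print(matches_whole(relaxed, lower))       # True
print(matches_whole(relaxed, apostrophe))  # False

# With search() instead of fullmatch(), the \b anchors let the pattern
# silently match only the part after the apostrophe.
print(relaxed.search(apostrophe).group(0))  # brien@example.com
```

The last line is worth noting: used as a search, the pattern does not reject the apostrophe address, it quietly extracts a different one.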

2 Answers

14

I have seen the below pattern is used in several places (even on sof) ... Why is it so popular?

Because people copy-paste the first Google search result into their answers, blogs and code, which are in turn picked up by search engines, which leads even more people to copy-paste it, generating an infernal vortex that ends up driving any better content off the internet.

unless I am missing something

Following the link from your question there's a long rambling digression which should "answer" all your questions.

Kusalananda
  • 333,661
  • 3
    This quote from the digression pretty much sums it up: "The virtue of my regular expression above is that it matches 99% of the email addresses in use today." - So it's OK to refuse 1% of people with valid email addresses from your site? To hell with that, it's just making excuses for a crappy RE. Or really, for using the wrong tool for the job (which REs so often are). – marcelm Sep 15 '20 at 21:22
  • The long rambling digression includes the following: This regex is intended to be used with your regex engine’s “case insensitive” option turned on. (You’d be surprised how many “bug” reports I get about that.) – Barmar Sep 16 '20 at 15:58
  • @marcelm Thank you. My work email uses the apostrophe to the left of the @. This has caused innumerable headaches with web sites saying I have an invalid email address. Even large tech companies have problems with my email. – doneal24 Sep 16 '20 at 16:58
13

It should not be used in production. For example "email me"@contoso.com is a syntactically valid email address but will not be matched by that naïve RE.

See RFC5322 section 3.4.1 for the definitive grammar.

Annoyingly perhaps, there is no BRE or ERE that can match that grammar definition, but you can get very close. However, a PCRE will do the trick. See How to validate an email address using a regular expression? on StackOverflow.
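
As a quick check of the claim about the quoted local part, here is a small Python sketch (Python and the names `SIMPLE`, `quoted` and `plain` are my assumptions, not part of the original answer). It applies the question's pattern, with case-insensitivity turned on, to the example address above.

```python
import re

# The question's pattern, compiled case-insensitively as its source page intends.
SIMPLE = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)

quoted = '"email me"@contoso.com'  # example address from this answer
plain = "me@contoso.com"           # illustrative unquoted address

print(SIMPLE.fullmatch(quoted))             # None
print(SIMPLE.search(quoted))                # None
print(SIMPLE.fullmatch(plain) is not None)  # True
```

The character before the `@` is a double quote, which is not in the pattern's character class, so neither a full match nor a search finds anything.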

Chris Davies
  • 116,213
  • 4
    Even if you use the RFC regex, you must validate the email address in a second way. Simply because that user might have a typo when he enters it. So why use a regex more complicated than .*@.*\..* at all? – Thomas Weller Sep 15 '20 at 20:18
  • 3
    @ThomasWeller And even that's actually too complicated, since the host part could technically be a TLD, e.g. thomas@de. With gTLDs, that actually becomes a viable option, too. (The host part could even be an IP address, but I don't mind those being rejected.) – marcelm Sep 15 '20 at 21:17
  • 1
    Absolutely. My recommendation would be to try sending to it. Either it'll work (in which case it's potentially valid) or it won't. If it's potentially valid you need some way to verify it's the user's address and not someone picked at random – Chris Davies Sep 15 '20 at 21:22
  • It also fails on explicit source routing. Example from RFC2821: "@ONE,@TWO:JOE@THREE" – Eric Towers Sep 15 '20 at 22:23
  • 2
    @marcelm new gTLDs are not allowed to do that. It was an option for traditional TLDs. Some time ago .io did provide such emails, but it stopped doing so, and I don't think any TLD does that now. – Ángel Sep 15 '20 at 23:04
  • @Ángel Really? I didn't know that! Do you happen to have more information? – marcelm Sep 16 '20 at 05:58
  • @marcelm it was "discouraged" in RFC822 § 6.2.7 and marked obsolete in RFC2822 and RFC5322. I haven't seen anything in the RFCs that indicates it's prohibited for new gTLDs, but if it's obsolete addressing I don't see why anyone should be permitted to use it for new gTLDs that arrived after RFC2822 – Chris Davies Sep 16 '20 at 08:28
  • @roaima I'm a little confused; your references seem to be about explicit source routing (which I'll happily believe is obsolete), not user@tld style e-mail addresses (which is what I was talking about). – marcelm Sep 16 '20 at 08:41
  • 1
    user@hostname could be valid on your internal company network! You are probably not doing email on your internal company network, but someone else running your software might! – user253751 Sep 16 '20 at 09:21
  • @marcelm sorry misread the threads – Chris Davies Sep 16 '20 at 09:48
  • 1
    @marcelm see on the Registry agreement the "DNS Service – TLD Zone Contents" section, it even clarifies at the end "The above language effectively does not allow, among other things, the inclusion of DNS resource records that would enable a dotless domain name (e.g., apex A, AAAA, MX records) in the TLD zone.)" Although it might be approved through a "Registry Services Evaluation Process (RSEP)" – Ángel Sep 17 '20 at 00:21
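
The trade-off discussed in these comments can be sketched in Python (my choice of language). `LOOSE` is the pattern from the comment above; `has_at_sign` is a hypothetical, even looser check that I made up for illustration, not something proposed in the thread. Neither validates anything: as the comments say, the real test is sending a message to the address.

```python
import re

# Pattern suggested in the comment: something, '@', something, '.', something.
LOOSE = re.compile(r".*@.*\..*")


def has_at_sign(address):
    """Hypothetical looser check: non-empty text on both sides of the last '@'."""
    local, sep, domain = address.rpartition("@")
    return bool(sep and local and domain)


samples = [
    '"email me"@contoso.com',  # quoted local part, from the answer
    "thomas@de",               # dotless domain, from the comments
    "user@hostname",           # internal-network style, from the comments
    "no-at-sign",              # illustrative non-address
]

for address in samples:
    print(address, bool(LOOSE.fullmatch(address)), has_at_sign(address))
# "email me"@contoso.com True True
# thomas@de False True
# user@hostname False True
# no-at-sign False False
```

The loose pattern accepts the quoted address that the popular pattern rejects, but it still turns away the dotless forms mentioned in the comments.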