
Hello, I have the awk command below in my script. The regex pattern is not working correctly for me. I want to validate email addresses, which can contain the characters [a-z], [0-9], [.] and @.


Here are the sample email patterns in the input file:
1. abc@gmail.com
2. abc123@hotmail.com
3. abc.xyz@yahoo.com
4. a1@gmail.net
5. a2@xcom.in

The pattern is extracted from a metadata file and passed as a script parameter. Here is the metadata line that defines the pattern for email ID validation:

1~4~~~char~Y~\"\@\.com\"~100

`sh -x` trace of the script code:

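# Values as expanded by sh -x: m carries literal double quotes and a glob, not a regex
val=$(
     awk -F , \
         -v n=4 \
         -v 'm="*@*.com"' \
         -v count=0 \
         'NR!=1 && $n !~ "^" m "$" {   # brace on the pattern line so the block only runs for non-matching rows
              printf "%s:%s:%s\n", FILENAME, FNR, $0 > "/dev/stderr"
              count++
          }
          END {print count}' BNC.csv
)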

The script code as seen in vi:

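val=$(awk -F "$sep" \
        -v n="$col_pos" \
        -v m="$col_patt" \
        -v count=0 \
        'NR!=1 && $n !~ "^" m "$" {   # pattern and action must share a line
             # report each data row whose column n fails the pattern
             printf "%s:%s:%s\n", FILENAME, FNR, $0 > "/dev/stderr"
             count++
         }
         END {print count}' "$input_file"
)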

1 Answer


If you're looking for a way to validate email addresses, FWIW this is what I have in an old awk script I have lying around that does that:

    # valid addrs regexp from http://www.regular-expressions.info/email.html
    # Specifically do NOT want to use [:alpha:] to drop Asian characters etc
    # Added a check that we have at least 2 consecutive alphabetic characters
    # both before and after the "@" to get rid of x@y.co etc. garbage
    (addr ~ /^[0-9a-zA-Z._%+-]+@[0-9a-zA-Z.-]+\.[a-zA-Z]{2,}$/) &&
    (addr ~ /^.*[a-zA-Z]{2}.*@.*[a-zA-Z]{2}.*\.[a-zA-Z]{2,}$/)

I'm sure that could be consolidated into 1 regexp but I don't care enough to do it and the end result would probably be less clear anyway.
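As a minimal sketch of how the two checks above could be wired into the kind of column validation the question describes: this assumes a comma-separated file with a header row and the address in column 1. The names `emails.csv` and `count_invalid` are hypothetical, and the `{2}` and `{2,}` intervals are spelled out as explicit repetition so the sketch also runs on awk builds without interval support.

```shell
#!/bin/sh
# Hypothetical sample file: the five addresses from the question plus one
# obviously bad line. The header row and single-column layout are assumptions.
cat > emails.csv <<'EOF'
email
abc@gmail.com
abc123@hotmail.com
abc.xyz@yahoo.com
a1@gmail.net
a2@xcom.in
not-an-email
EOF

# count_invalid FILE COLUMN
# Prints each offending row to stderr and the number of such rows to stdout.
count_invalid() {
    awk -F, -v n="$2" '
        # Same two checks as in the answer, with intervals written out:
        # [a-zA-Z]{2} becomes [a-zA-Z][a-zA-Z], and {2,} becomes [a-zA-Z][a-zA-Z]+
        function valid(addr) {
            return (addr ~ /^[0-9a-zA-Z._%+-]+@[0-9a-zA-Z.-]+\.[a-zA-Z][a-zA-Z]+$/) &&
                   (addr ~ /^.*[a-zA-Z][a-zA-Z].*@.*[a-zA-Z][a-zA-Z].*\.[a-zA-Z][a-zA-Z]+$/)
        }
        # Skip the header row; pattern and opening brace share a line.
        NR != 1 && !valid($n) {
            printf "%s:%s:%s\n", FILENAME, FNR, $0 > "/dev/stderr"
            count++
        }
        END { print count + 0 }   # + 0 prints 0 rather than an empty line
    ' "$1"
}

val=$(count_invalid emails.csv 1)
echo "invalid rows: $val"
```

With this input the sketch reports 3 invalid rows: `not-an-email`, plus `a1@gmail.net` and `a2@xcom.in`, which pass the first check but fail the second because they have no two consecutive letters before the `@`.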

Ed Morton
  • I was trying to understand why the pattern @.com didn't work in awk. Then steel driver showed me a good example of file globbing versus regex patterns. The surprise for me is that email ID validation is far more involved than I expected :) – daturm girl May 21 '21 at 18:14
  • Well, I don't know what other issues you experienced but a couple of obvious things are that . is a regexp metachar meaning any character so @.com would match @xcom, not just @.com, but more importantly valid email addresses don't contain @.com, they contain @<domain>.com. If you still have a question, though, then please update your question to contain concise, testable sample input and expected output that demonstrates the problem and reasonably formatted code plus a clear statement of what exactly the problem is, not just "The regex pattern is not working correctly for me". – Ed Morton May 21 '21 at 18:33
  • And yes, the set of valid email addresses is hard to validate (see https://stackoverflow.com/a/201378/1745001) but what I have in my answer is adequate for most purposes for people using the Roman alphabet. – Ed Morton May 21 '21 at 18:38
  • Also, when you talk about a regexp that matches an email address there's a couple of use-cases: a) finding email addresses in a block of arbitrary text, or b) validating that a specific string that should be an email address actually is one. You would write different code with slightly different regexps for each case. – Ed Morton May 21 '21 at 18:49
  • And there's the additional consideration of the text matching a regexp but it not actually being a valid email address, e.g. "k8e7klo9@a562nd71jd651j5l.foo" should match an email regexp but chances are there is no ".foo" TLD and if there is "a562nd71jd651j5l.foo" probably isn't a real domain at that TLD and if it is "k8e7klo9@" probably isn't a real username at that domain. – Ed Morton May 21 '21 at 18:51
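The metacharacter point in the comments can be checked directly. The sketch below assumes two hypothetical helper functions: `contains` does an unanchored search, and `matches` anchors at both ends the way the question's script builds `"^" m "$"`. The regex is passed through the environment rather than `-v` so that awk does not process backslash escapes in it first.

```shell
#!/bin/sh
# contains REGEX STRING: succeeds if REGEX matches anywhere in STRING
# (the "find an address in arbitrary text" use case).
contains() {
    re=$1 awk -v s="$2" 'BEGIN { exit !(s ~ ENVIRON["re"]) }'
}

# matches REGEX STRING: succeeds only if REGEX matches all of STRING
# (the "validate this one string" use case).
matches() {
    re=$1 awk -v s="$2" 'BEGIN { exit !(s ~ ("^" ENVIRON["re"] "$")) }'
}

# report LABEL COMMAND...: print whether COMMAND succeeded.
report() {
    label=$1
    shift
    if "$@"; then echo "$label: match"; else echo "$label: no match"; fi
}

report 'unanchored @.com on a2@xcom.in'         contains '@.com'      'a2@xcom.in'
report 'unanchored @\.com on a2@xcom.in'        contains '@\.com'     'a2@xcom.in'
report 'anchored @.com on abc@gmail.com'        matches  '@.com'      'abc@gmail.com'
report 'anchored .*@.*\.com on abc@gmail.com'   matches  '.*@.*\.com' 'abc@gmail.com'
report 'anchored .*@.*\.com on a1@gmail.net'    matches  '.*@.*\.com' 'a1@gmail.net'
```

The first call matches because `.` accepts the `x` in `@xcom`, and escaping the dot makes it fail. Once anchored, `@.com` can only match a five-character string, which is why such a pattern rejects every real address, while `.*@.*\.com` accepts `abc@gmail.com` and rejects `a1@gmail.net`.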