
I am trying to use an awk command to check whether a particular column fails to match a regex. Basically, I am validating that a column in a file has a uniform format; if it does not, I need to throw an error.

format=$2
col_pos=$1

val= awk -F "|’’ -v m="$format" -v n="$col_pos" '$n ~ "^"m"$"{print $1}' sample_file.txt

if [[ $val != "" ]]; then echo " column value is having unexpected format" fi


sh sample.sh  [a-z]{8}@gmail.com 3

The awk command is throwing an error. Can anybody help me correct it?

Input file:

fileid|filename|contactemail
1|file1.txt|src@gmail.com
2|file2.txt|rec@gmail.com
3|file3.txt|xyz  --------> invalid column value, as it does not satisfy the format @gmail.com

Here is the sample program run (expected to catch error as xyz is not a valid email)

$ sh sample.sh 3 [a-z]@gmail.com
$ sh -x sample.sh 3 [a-z]@gmail.com
+ format='[a-z]@gmail.com'
+ col_pos=3
++ awk -F '~' -v 'm=[a-z]@gmail.com' -v n=3 '$n ~ "^"m"$"{print $1}' sample_file.txt
+ val=
+ [[ '' != '' ]]

2 Answers


There are a few issues here.

  • Added a #!/bin/sh shebang to your script. If you make it executable with chmod +x sample.sh, you may call it as ./sample.sh ...
  • Fixed the field separator to '|'
  • Replaced deprecated command substitution backticks notation `...` with $(...) and removed space character in variable assignment
  • Added NR>1 to skip the first (header) line of the input file
  • If you want to match non-matching email addresses, negate the regex match: !~
  • The double bracket [[...]] test is not a valid sh construct and was changed to [...] in combination with the -n test operator, which is true if the following string is non-empty.

I also added $val to the echo output to be able to see where the error occurred and printed $n instead of $1. Change that back as needed. The output goes to stderr (>&2) and the script exits with non-zero exit status to indicate a failure.

Modified script:

#!/bin/sh

val=$( awk -F'|' -v n="$1" -v m="$2" 'NR>1 && $n !~ "^" m "$"{ print $n }' sample_file.txt )

if [ -n "$val" ]; then
  echo "column value is having unexpected format: $val" >&2
  exit 1
fi

Your regexes don't match the email addresses when the full field is anchored with ^ and $, because [a-z] matches only a single character. Using '[a-z]+@gmail.com' would work, for example. Make sure to quote at least the regex parameter to prevent possible shell interpretation.
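To make the anchoring point concrete, here is a minimal sketch that tests single values against an anchored regex. The helper name `matches` is hypothetical and not part of the script above. It also shows that an unescaped `.` matches any character; `[.]` is used for a literal dot because backslash escapes in `-v` assignments are processed by awk before the regex is built.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): prints "match" when the whole
# value ($1) matches the extended regex ($2), and "no match" otherwise.
matches() {
  printf '%s\n' "$1" |
    awk -v m="$2" '{ print (($0 ~ ("^" m "$")) ? "match" : "no match") }'
}

# [a-z] is a single character, so the anchored regex cannot cover "src".
matches 'src@gmail.com' '[a-z]@gmail.com'

# [a-z]+ allows one or more letters before the @.
matches 'src@gmail.com' '[a-z]+@gmail.com'

# An unescaped dot matches any character, so this also matches.
matches 'src@gmailxcom' '[a-z]+@gmail.com'

# A bracket expression forces a literal dot.
matches 'src@gmailxcom' '[a-z]+@gmail[.]com'
```

The regex arguments are single-quoted throughout so that the shell does not try to expand `[a-z]` as a filename pattern.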

Sample run:

$ ./sample.sh 3 '[a-z]+@gmail.com'
column value is having unexpected format: xyz
$ ./sample.sh 3 'xyz'
column value is having unexpected format: src@gmail.com
rec@gmail.com
Freddy

Building on @Freddy's excellent answer, you can have awk log the errors found in the input file to STDERR and then have the shell redirect STDERR to a log file with 2> (you can write directly to the error log file from awk if you want to, but it's more flexible to use the shell to redirect STDERR).

awk -F'|' -v n="$1" -v m="$2" '
    FNR>1 && $n !~ "^" m "$" {
      print NR ":" $0 > "/dev/stderr"
    }' input.txt 2> error.log

You can also make it return a count of errors on STDOUT, to be captured for the $val shell variable:

#!/bin/sh

val=$(awk -F'|' -v n="$1" -v m="$2" '
  FNR>1 && $n !~ "^" m "$" {
    printf "%s:%s:%s\n", FILENAME, FNR, $0 > "/dev/stderr"
    count++
  }
  END { print count+0 }' sample_file.txt 2> errors.log )

if [ "$val" != 0 ]; then
  echo "$val errors found in input:"
  cat errors.log
  exit 1
fi

For example:

$ ./sample.sh 3 xyz
2 errors found in input:
sample_file.txt:2:1|file1.txt|src@gmail.com
sample_file.txt:3:2|file2.txt|rec@gmail.com

Note: awk will use - for FILENAME if the input comes from STDIN, so the error log would look something like:

-:4:3|file3.txt|xyz
cas
  • @freddy and cas, thank you for the excellent help. Let me try this. I would like to accept both answers as right, but I can select only one – daturm girl May 12 '21 at 13:06
  • @daturmgirl on the SE sites, best practice is to upvote and accept the one that best answers your question and upvote any other answers you like or find useful. Pick Freddy's answer, obviously - mine didn't actually answer your question, just extended Freddy's answer with extra stuff. See What should I do when someone answers my question? – cas May 12 '21 at 13:10
  • Thanks @cas, I did it. I am fairly new to this site and to Unix. Thanks for the help – daturm girl May 12 '21 at 13:16
  • @freddy I was trying to test run your code and am facing a small issue. Can you please help? I am getting all the records instead of only the unmatched records. The test run and code are pasted in the next comments – daturm girl May 12 '21 at 13:36
  • $ sh -x poc_col_val_email.sh 3 '[a-z]+@gmail.com'
    ++ awk '-F|' -v n=3 -v 'm=[a-z]+@gmail.com' 'NR>1 && $n !~ "^" m "$"{ print $n }' /test/data/infa_shared/dev/SrcFiles/datawarehouse/poc_anjali/sample_file.txt
    + val='src@gmail.com
    rec@gmail.com xyz'
    + '[' -n 'src@gmail.com
    rec@gmail.com xyz' ']'
    + echo 'column value is having unexpected format: src@gmail.com
    rec@gmail.com xyz'
    column value is having unexpected format: src@gmail.com rec@gmail.com xyz
    + exit 1
    – daturm girl May 12 '21 at 13:37
  • #!/bin/sh

    val=$( awk -F'|' -v n="$1" -v m="$2" 'NR>1 && $n !~ "^" m "$"{ print $n }' /test/data/infa_shared/dev/SrcFiles/datawarehouse/poc_anjali/sample_file.txt )

    if [ -n "$val" ]; then
      echo "column value is having unexpected format: $val" >&2
      exit 1
    fi

    – daturm girl May 12 '21 at 13:38
  • cat sample_file*
    file_id|filename|contactemail
    1|file1.txt|src@gmail.com
    2|file2.txt|rec@gmail.com
    3|file3.txt|xyz
    – daturm girl May 12 '21 at 13:39
  • @daturmgirl did your files come from a Windows machine, with CR/LF line endings instead of just LF (aka \n or newline)? Run file sample_file.txt; if it mentions CRLF, then you need to convert the files to Unix-format text. Use dos2unix. If you don't have that, you can do it with: perl -p -i -e 's/\r\n/\n/' sample_file.txt – cas May 12 '21 at 14:29
  • Thank you @Cas, it worked with your suggestion. Yes, I had edited the input file in WinSCP. Now everything is looking good, thank you – daturm girl May 12 '21 at 14:35
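
The CR/LF problem from the comments can be reproduced without a Windows machine. The sketch below is an illustration only, not part of either answer: the helper name `bad_values` is hypothetical, and it reads from standard input instead of sample_file.txt. It strips a trailing carriage return inside awk before the anchored match, as an alternative to converting the file with dos2unix.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): reads pipe-delimited records from
# standard input and prints the values in column $1 that do not fully match
# the extended regex $2. The header line is skipped.
bad_values() {
  awk -F'|' -v n="$1" -v m="$2" '
    { sub(/\r$/, "") }                         # drop a trailing CR, if present
    NR > 1 && $n !~ ("^" m "$") { print $n }   # report non-matching fields
  '
}

# Demo input written with Windows (CR/LF) line endings.
printf 'fileid|filename|contactemail\r\n1|file1.txt|src@gmail.com\r\n3|file3.txt|xyz\r\n' |
  bad_values 3 '[a-z]+@gmail.com'
```

With the `sub()` line in place this should print only xyz. Without it, the last field of every record ends in a carriage return, so the `$` anchor never matches and every row is reported, which is the symptom described in the comments.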