column data type validation

Question

I am trying to use an awk command to check whether a particular column fails to match a regex. Basically, I am validating that a column in a file has a uniform format; if it does not, I need to throw an error.

format=$2
col_pos=$1
val= awk -F "|" -v m="$format" -v n="$col_pos" '$n ~ "^"m"$"{print $1}' sample_file.txt
if [[ $val != "" ]]; then
   echo " column value is having unexpected format"
fi

sh sample.sh  [a-z]{8}@gmail.com 3

Awk command is throwing an error. Can anybody help to correct the same?

Input file:

fileid|filename|contactemail
1|file1.txt|src@gmail.com
2|file2.txt|rec@gmail.com
3|file3.txt|xyz  --------> invalid column value, as it doesn't satisfy the format @gmail.com

Here is the sample program run (expected to catch error as xyz is not a valid email)

$ sh sample.sh 3 [a-z]@gmail.com
$ sh -x sample.sh 3 [a-z]@gmail.com
+ format='[a-z]@gmail.com'
+ col_pos=3
++ awk -F '~' -v 'm=[a-z]@gmail.com' -v n=3 '$n ~ "^"m"$"{print $1}' sample_file.txt
+ val=
+ [[ '' != '' ]]

And what does the content of filename look like? Please edit the question with the extra information ... — tink, May 11 '21 at 22:53
The obvious (awk) errors are (1) =~ should be just ~ and (2) ^ and $ in the computed regex need to be string constants i.e. $n ~ "^" m "$". There are additional issues at the shell level. — steeldriver, May 11 '21 at 23:00
Thank you @steeldriver, I edited the program and at least it runs now, but the logic issue is still there. — daturm girl, May 11 '21 at 23:22
@daturmgirl you're not actually assigning the awk output to the variable val, owing to the space after the = sign. Really you should not be using backticks at all (they are deprecated); use $(...) instead, so val=$(awk ...). Also, your actual script appears to still use the wrong field separator (-F '~' rather than -F '|' to match your sample data). — steeldriver, May 12 '21 at 00:17
... see Spaces in variable assignments in shell scripts for explanation — steeldriver, May 12 '21 at 00:19

score 3 · Accepted Answer · answered May 12 '21 at 01:35

There are a few issues here. I made the following changes:

Added a #!/bin/sh shebang to your script. If you make it executable with chmod +x sample.sh, you may call it as ./sample.sh ...
Fixed the field separator to '|'
Replaced deprecated command substitution backticks notation `...` with $(...) and removed space character in variable assignment
Added NR>1 to skip the first (header) line of the input file
To report the email addresses that do not match, negate the regex match with !~
The double bracket [[...]] test is not a valid sh construct and was changed to [...] in combination with the -n test operator, which is true if the following string is non-empty.

I also added $val to the echo output to be able to see where the error occurred and printed $n instead of $1. Change that back as needed. The output goes to stderr (>&2) and the script exits with non-zero exit status to indicate a failure.

Modified script:

#!/bin/sh
val=$( awk -F'|' -v n="$1" -v m="$2" 'NR>1 && $n !~ "^" m "$"{ print $n }' sample_file.txt )
if [ -n "$val" ]; then
    echo "column value is having unexpected format: $val" >&2
    exit 1
fi
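The effect of the space after the = sign can be seen in isolation. The following is a minimal sketch, not part of the original script: printf stands in for the awk command, and with_space and with_subst are hypothetical names used only to hold the two results.

```shell
#!/bin/sh
# Illustration only: printf stands in for the awk command.

# With a space after "=", the shell treats "val=" as a temporary empty
# assignment for the command that follows. The command runs, its output
# is not captured, and val is still unset afterwards.
unset val
val= printf 'xyz\n' >/dev/null
with_space=${val-}

# Command substitution captures the command's standard output.
val=$(printf 'xyz\n')
with_subst=$val

printf 'with space: [%s]\nwith $(...): [%s]\n' "$with_space" "$with_subst"
```

The first result is empty and the second holds xyz, which is why the original test on $val could never fire.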

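Your regexes don't match the email addresses once the full field is anchored with ^ and $: [a-z] matches exactly one letter and [a-z]{8} exactly eight, while the sample addresses have three letters before the @. Using '[a-z]+@gmail.com' would work, for example. Make sure to quote at least the regex parameter to prevent possible shell interpretation.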
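The difference between the patterns can be checked directly against the sample rows. This is a minimal sketch under stated assumptions: check is a hypothetical helper, the rows are fed inline instead of being read from sample_file.txt, and the column number is fixed at 3. The [.] variant is an addition of mine, not something the original script uses.

```shell
#!/bin/sh
# Hypothetical helper: print column 3 of every data row that fails the
# anchored regex passed as the first argument.
check() {
    printf '%s\n' \
        'fileid|filename|contactemail' \
        '1|file1.txt|src@gmail.com' \
        '2|file2.txt|rec@gmail.com' \
        '3|file3.txt|xyz' |
    awk -F'|' -v m="$1" 'NR>1 && $3 !~ "^" m "$" { print $3 }'
}

echo "--- [a-z]@gmail.com (exactly one letter before the @)"
check '[a-z]@gmail.com'

echo "--- [a-z]+@gmail.com (one or more letters)"
check '[a-z]+@gmail.com'

echo "--- [a-z]+@gmail[.]com (literal dot)"
check '[a-z]+@gmail[.]com'
```

With the one-letter pattern every row is reported; with + only xyz is. An unescaped . matches any character, so gmail.com would also accept a value such as gmailxcom; a bracket expression like [.] is one way to require a literal dot.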

Sample run:

$ ./sample.sh 3 '[a-z]+@gmail.com'
column value is having unexpected format: xyz
$ ./sample.sh 3 'xyz'
column value is having unexpected format: src@gmail.com
rec@gmail.com

score 1 · Answer 2 · answered May 12 '21 at 09:25

Building on @Freddy's excellent answer, you can have awk log the errors found in the input file to STDERR and then have the shell redirect STDERR to a log file with 2> (you can write directly to the error log file from awk if you want to, but it's more flexible to use the shell to redirect STDERR).

awk -F'|' -v n="$1" -v m="$2" '
    FNR>1 && $n !~ "^" m "$" {
      print NR ":" $0 > "/dev/stderr"
    }' input.txt 2> error.log

You can also make it return a count of errors on STDOUT, to be captured for the $val shell variable:

#!/bin/sh
val=$(awk -F'|' -v n="$1" -v m="$2" '
        FNR>1 && $n !~ "^" m "$" {
          printf "%s:%s:%s\n", FILENAME, FNR, $0 > "/dev/stderr"
          count++
        }
        # count+0 prints 0 rather than an empty string when no row fails
        END {print count+0}' sample_file.txt 2> errors.log
     )
if [ "$val" != 0 ]; then
    echo "$val errors found in input:"
    cat errors.log
    exit 1
fi

For example:

$ ./sample.sh 3 xyz
2 errors found in input:
sample_file.txt:2:1|file1.txt|src@gmail.com
sample_file.txt:3:2|file2.txt|rec@gmail.com

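Note: if the input comes from STDIN, there is no real file name to report. Depending on the awk implementation, and on whether - is given as the file operand, FILENAME is then either - or empty, so the error log would look something like: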

-:4:3|file3.txt|xyz

answered May 12 '21 at 09:25 by cas

@freddy and cas, thank you for the excellent help. Let me try this. I would like to accept both answers as right, but I can select only one. – daturm girl May 12 '21 at 13:06
@daturmgirl on the SE sites, best practice is to upvote and accept the one that best answers your question and upvote any other answers you like or find useful. Pick Freddy's answer, obviously - mine didn't actually answer your question, just extended Freddy's answer with extra stuff. See What should I do when someone answers my question? – cas May 12 '21 at 13:10
Thanks @cas, I did it. I am fairly new to this site and to Unix; thanks for the help. – daturm girl May 12 '21 at 13:16
@freddy I was trying to test run your code and am facing a small issue; can you please help? I am getting all the records instead of only the unmatched records. The test run and code are pasted in the next comments. – daturm girl May 12 '21 at 13:36
$ sh -x poc_col_val_email.sh 3 '[a-z]+@gmail.com'
++ awk '-F|' -v n=3 -v 'm=[a-z]+@gmail.com' 'NR>1 && $n !~ "^" m "$"{ print $n }' /test/data/infa_shared/dev/SrcFiles/datawarehouse/poc_anjali/sample_file.txt
+ val='src@gmail.com
rec@gmail.com xyz'
+ '[' -n 'src@gmail.com
rec@gmail.com xyz' ']'
+ echo 'column value is having unexpected format: src@gmail.com
rec@gmail.com xyz'
column value is having unexpected format: src@gmail.com rec@gmail.com xyz
+ exit 1
– daturm girl May 12 '21 at 13:37
#!/bin/sh
val=$( awk -F'|' -v n="$1" -v m="$2" 'NR>1 && $n !~ "^" m "$"{ print $n }' /test/data/infa_shared/dev/SrcFiles/datawarehouse/poc_anjali/sample_file.txt )

if [ -n "$val" ]; then
    echo "column value is having unexpected format: $val" >&2
    exit 1
fi
– daturm girl May 12 '21 at 13:38
cat sample_file*
file_id|filename|contactemail
1|file1.txt|src@gmail.com
2|file2.txt|rec@gmail.com
3|file3.txt|xyz
– daturm girl May 12 '21 at 13:39
@daturmgirl did your files come from a Windows machine, with CR/LF line endings instead of just LF (aka \n or newline)? Run file sample_file.txt; if it mentions CRLF, then you need to convert the file to Unix text format. Use dos2unix. If you don't have that, you can do it with: perl -p -i -e 's/\r\n/\n/' sample_file.txt – cas May 12 '21 at 14:29
Thank you @cas, it worked with your suggestion. Yes, I did edit the input file in WinSCP. Now everything is looking good, thank you. – daturm girl May 12 '21 at 14:35
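
If converting the file is not an option, the carriage returns discussed in the comments can also be stripped inside awk itself. This is a minimal sketch under stated assumptions: check_crlf is a hypothetical helper, the CRLF rows are generated inline with printf instead of being read from the real file, and the column and regex are hard-coded to the values used in the sample run.

```shell
#!/bin/sh
# Hypothetical helper: feed the sample rows with Windows (CRLF) line
# endings to awk and report column 3 of every row that fails the regex.
check_crlf() {
    printf '%s\r\n' \
        'fileid|filename|contactemail' \
        '1|file1.txt|src@gmail.com' \
        '2|file2.txt|rec@gmail.com' \
        '3|file3.txt|xyz' |
    awk -F'|' -v n=3 -v m='[a-z]+@gmail.com' '
        # Remove a trailing carriage return, if any; changing $0 makes
        # awk split the record into fields again.
        { sub(/\r$/, "") }
        NR>1 && $n !~ "^" m "$" { print $n }'
}

check_crlf
```

Without the sub() call, the last field of each row ends in a carriage return, so the anchored regex fails for every row, which matches the behaviour reported in the comments.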
