How to treat a string in a file as a numeric value when running awk?

Question

I have a file in which missing data points are shown as ****. I need to select rows that have 7 consecutive columns with values less than 10. When I run my script, it also returns rows that have **** in consecutive columns.

I could solve this easily by replacing every **** with a higher value, but I don't want to change my input file. I want my script to treat **** as a number greater than 10 (for example, **** = 100). How can I do that?

Sample input, consecutive7pointDown10.input:

2     3    4    5    6    7    8   0  12   14   23
2     3    4    12   6    7    8   0  1     2   23
**** **** **** **** **** **** **** 8 ****  **** 12

My script's result, consecutive7pointDown10.output:

2     3    4    5    6    7    8    0    12    14   23
**** **** **** **** **** **** ****  8   ****  ****  12

But the expected output is:

2     3    4    5    6    7    8    0    12  14   23

My script, consecutive7pointDown10, is as follows:

#!/bin/bash
########################################################################################################################
# This script prints rows that have values of at most 10°C in 7 consecutive points.
# input = scriptname.input
# output = scriptname.output
########################################################################################################################
input=`basename "$0"`.input
output=`basename "$0"`.output
awk '{
    for(i=4;i<=34-6;i++)
        {   
            if($i<=10 && $(i+1)<=10 && $(i+2)<=10 && $(i+3)<=10 && $(i+4)<=10 && $(i+5)<=10 && $(i+6)<=10)
            {
                print
                next
            }
        }
}' $input > $output

how about completely skipping lines having ****? or is there a case where such lines can still have 7 consecutive columns less than 10 and you want that in output? — Sundeep, Oct 23 '17 at 15:38
you could temporarily change input line (won't change input file) by using gsub(/\*{4}/,100) before the for-loop — Sundeep, Oct 23 '17 at 15:42
@Sundeep yes, I am finding this type. I will try this. I will notify you. — alhelal, Oct 23 '17 at 15:58
@Sundeep It doesn't give the right answer. (script image, output image) — alhelal, Oct 23 '17 at 16:01
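
The idea from the comments, treating **** as a large number in memory without touching the input file, can also be sketched with a small helper function. This is only a minimal sketch, not a tested drop-in for the real data: the function name `val`, the `start` variable, and the substitute value 100 are assumptions made for illustration. The sample rows are the ones from the question, and the window starts at column 1 so that the result matches the expected output shown there.

```shell
#!/bin/sh
# Sample rows from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
2     3    4    5    6    7    8   0  12   14   23
2     3    4    12   6    7    8   0  1     2   23
**** **** **** **** **** **** **** 8 ****  **** 12
EOF

# val() (hypothetical helper) returns the field as a number, or 100 when the
# field is not a plain non-negative number, such as the **** placeholder.
result=$(awk -v start=1 '
function val(s) { return (s ~ /^[0-9]+([.][0-9]+)?$/) ? s + 0 : 100 }
{
    # Slide a 7-column window over the fields that actually exist.
    for (i = start; i <= NF - 6; i++) {
        ok = 1
        for (j = i; j <= i + 6; j++)
            if (val($j) > 10) { ok = 0; break }
        if (ok) { print; next }
    }
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Bounding the loop by NF - 6 instead of a fixed column number keeps the window from running into fields that do not exist, which awk would otherwise treat as empty values.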

αғsнιη · Answer 1 · 2017-10-24T18:20:04.467

You can use awk as follows. Instead of testing 7 consecutive columns explicitly, keep a counter that is incremented while the columns meet the condition and reset as soon as one does not.

awk '{c=0; n=split($0,arr,/ +/);
    for(x=1;x<=n;x++) if(arr[x]<10 && arr[x]>=0) {
        if(++c==7){ print $0; next } }else{c=0} }' infile

Here awk's split function «split(string, array [, fieldsep [, seps ] ])» splits the line ($0 represents the whole line in awk) into the array named arr, using one or more spaces as the separator.

Next, loop over the array elements and check whether each value is at least 0 and less than 10. If it is, increment a counter called c and print the line once c reaches 7 (meaning 7 consecutive columns met the condition); otherwise reset the counter to 0.

Or do the same without splitting the line into an array:
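
The counter logic described above can be written out in expanded form. This is a sketch under stated assumptions: the variable names `start`, `limit`, `need`, and `run` are introduced here for illustration, the sample rows are made up, and an explicit regex test stands in for the `>= 0` comparison (which rejects **** because awk compares a non-numeric field with a number as strings).

```shell
#!/bin/sh
# Made-up sample rows: three label columns, then readings from column 4 on.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
a b c 1.5 2.0 **** 3.1 4.2 5.3 6.4 7.5 8.6 9.0
a b c 1.5 2.0 2.5 3.1 4.2 5.3 11.0 8.6 9.9 1.0
a b c 1.5 2.0 **** 3.1 4.2 5.3 **** 7.5 8.6 9.0
EOF

result=$(awk -v start=4 -v limit=10 -v need=7 '
{
    run = 0                          # reset the counter for every row
    for (i = start; i <= NF; i++) {
        # Count the field only if it is a plain number in [0, limit).
        if ($i ~ /^[0-9]+([.][0-9]+)?$/ && $i + 0 < limit) {
            if (++run == need) { print; next }
        } else {
            run = 0                  # streak broken by a large or missing value
        }
    }
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Only the first row is printed: after its ****, seven readings in a row stay below 10. In the other two rows the streak is cut by 11.0 or by a second ****.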

awk '{c=0; for(i=1;i<=NF;i++) if($i<10 && $i>=0) {
    if(++c==7){ print $0; next } }else{c=0} }' infile

In your case, since you want to filter from column 4 to the end, you would need something like the following. NF is the number of fields (columns) in the line being processed by awk.

$ time awk '{c=0; for(i=4;i<=NF;i++) if($i<10 && $i>=0) {
    if(++c==7) {print $0; next} }else{c=0} }' infile
real    0m0.317s
user    0m0.156s
sys     0m0.172s

Or, with a regex, again applied to your original file, which contains only floating-point numbers, you can use the grep command below. With the -P flag it is more efficient and ~6 times faster than awk (see "Grep -E, Sed -E - low performance when '[x]{1,9999}' is used, but why?"). The awk solution is more flexible, though: you can change the ranges, and it works with integers, floats, or a mix of both.

$ time grep -P '([^\d]\d\.\d[^\d]){7}' infile
real    0m0.060s
user    0m0.016s
sys     0m0.031s

Or in another way:

$ time grep -P '(\s+\d\.\d\s+){7}' infile
real    0m0.057s
user    0m0.000s
sys     0m0.031s

Or, for compatibility, the same pattern with grep -E, sed, or awk:

$ time grep -E '([^0-9][0-9]\.[0-9][^0-9]){7}' infile
real    0m0.419s
user    0m0.375s
sys     0m0.063s

$ time sed -En '/([^0-9][0-9]\.[0-9][^0-9]){7}/p' infile
real    0m0.367s
user    0m0.172s
sys     0m0.203s

$ time awk '/([^0-9][0-9]\.[0-9][^0-9]){7}/' infile
real    0m0.361s
user    0m0.219s
sys     0m0.172s

I think your answer should be this awk '{ for(i=4;i<=34;i++) if($i<10 && $i>=0){if(++c==7) print $0;next}else{c=0}}' infile instead of awk '{ for(i=4;i<=34;i++) if($i<10 && $i>=0){if(++c==7) print $0}else{c=0}}' infile — alhelal, Oct 24 '17 at 16:26
I added c=0 at the start of each row; otherwise the counter carries over the previous row's value when c<7 in that row. awk '{for(i=4;i<=NF;i++)if($i<10 && $i>=0){if(++c==7) {print $0;c=0;next}}else{c=0}}' infile returns one row, but that is not correct for this data — alhelal, Oct 24 '17 at 17:23
@alhelal, please check my updates; it now works even for the other data sample you shared. — αғsнιη, Oct 24 '17 at 18:51
That's used to measure the time a command takes to produce its output. — αғsнιη, Oct 25 '17 at 07:34
Why does this script (which is similar to yours) output nothing for this data? There may be a silly error that I haven't found. — alhelal, Oct 26 '17 at 11:26
Hi, it's not the same: in yours the else statement belongs to the inner if, while it must belong to the outer if statement. See mine above. — αғsнιη, Oct 26 '17 at 13:49
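
The difference between an else that pairs with the inner if and one that pairs with the outer if can be reproduced with a small experiment. This is a hypothetical reconstruction, not the script linked in the comment: the two one-liners and the sample rows below are assumptions made only to show the effect.

```shell
#!/bin/sh
# Made-up sample: the first row has 7 consecutive small values, the second does not.
tmp=$(mktemp)
printf '%s\n' '1 2 3 4 5 6 7 99' '1 2 3 99 5 6 7 8 9' > "$tmp"

# Without braces the else pairs with the nearest (inner) if, so the counter
# is reset after every small value that is not the 7th and never reaches 7.
inner=$(awk '{c=0; for(i=1;i<=NF;i++) if($i<10 && $i>=0) if(++c==7){print; next} else c=0}' "$tmp")

# With braces around the inner if, the else belongs to the outer range test.
outer=$(awk '{c=0; for(i=1;i<=NF;i++) if($i<10 && $i>=0) { if(++c==7){print; next} } else c=0}' "$tmp")

printf 'inner: [%s]\n' "$inner"
printf 'outer: [%s]\n' "$outer"
rm -f "$tmp"
```

The first variant prints nothing, while the second prints the row with seven consecutive values below 10.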

MiniMax · Answer 2 · 2017-10-24T12:26:38.670

awk '/(\<[0-9]\s+){7}/{print}' input.txt

or

sed -rn '/(\b[0-9]\s{1,}){7}/p' input.txt

will do the job.

Explanation for the awk (the same logic for the sed):

/(\<[0-9]\s+){7}/{print} - print lines containing the pattern.
\< - Matches a word boundary; that is it matches if the character to the right is a “word” character and the character to the left is a “non-word” character.
[0-9]\s+ - one digit from 0 to 9, then one or more spaces.
(\<[0-9]\s+){7} - matches, if the \<[0-9]\s+ pattern is repeated seven times.
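
Note that \< and \s are GNU extensions rather than POSIX syntax. A roughly equivalent pattern that uses only POSIX character classes might look like the sketch below. It is an assumption-laden variant, not the pattern from this answer: it anchors each digit on the line start or a whitespace character instead of a word boundary, and it also accepts a single digit at the end of the line.

```shell
#!/bin/sh
# Sample rows from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
2     3    4    5    6    7    8   0  12   14   23
2     3    4    12   6    7    8   0  1     2   23
**** **** **** **** **** **** **** 8 ****  **** 12
EOF

# A column starts at the line start or after whitespace; each of the seven
# repetitions is one digit followed by whitespace or the end of the line.
result=$(grep -E '(^|[[:space:]])([0-9]([[:space:]]+|$)){7}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because every repetition must end in whitespace or the line end, a two-digit value such as 12 or 23 breaks the run just as it does with the word-boundary version.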

Input

2     3    4    5    6    7    8   0  12   14   23
2     3    4    12   6    7    8   0  1     2   23
**** **** **** **** **** **** **** 8 ****  **** 12

Output

2     3    4    5    6    7    8   0  12   14   23

EDIT:

For floating-point numbers with one digit after the decimal point (9.2, 8.1, 7.5, etc.):

awk '/(\<[0-9]\.[0-9](\s+|$)){7}/{print}' input.txt

edited Oct 24 '17 at 12:26

answered Oct 23 '17 at 18:41

Clever way! But this assumes the values are always single digits, and the OP's original file doesn't guarantee that, so this doesn't answer his/her question. – αғsнιη Oct 23 '17 at 20:46
@αғsнιη How is it not guaranteed? OP said: "I need to select rows having consecutive 7 columns with value less than 10." If the number is less than 10, then one digit is guaranteed. Maybe negative numbers would be a problem, like -7, -5. – MiniMax Oct 23 '17 at 21:02
@αғsнιη Oh, the file. I'm looking at it now. Then yes, the pattern should be corrected. But that is easy, I think. I thought only integers (whole numbers) were possible. – MiniMax Oct 23 '17 at 21:05
@MiniMax then write one for floating-point numbers. – alhelal Oct 24 '17 at 03:49
@alhelal See EDIT section. – MiniMax Oct 24 '17 at 11:08
