3

I have a file in which some data points are missing, and the missing values are shown as ****. I need to select the rows that have 7 consecutive columns with values less than 10. When I run my script, it also returns rows that have **** in consecutive columns.

I could solve this easily by replacing every **** with a higher value, but I don't want to change my input file. I want my script to treat **** as a number greater than 10 (i.e. as if ****=100). How can I do that?

sample input consecutive7pointDown10.input-

2     3    4    5    6    7    8   0  12   14   23
2     3    4    12   6    7    8   0  1     2   23
**** **** **** **** **** **** **** 8 ****  **** 12

My script's result consecutive7pointDown10.output-

2     3    4    5    6    7    8    0    12    14   23
**** **** **** **** **** **** ****  8   ****  ****  12

But the expected output is:

2     3    4    5    6    7    8    0    12  14   23

My script consecutive7pointDown10 is as follows-

#!/bin/bash
########################################################################################################################
# This script prints rows that have at most 10°C at 7 consecutive points.
# input = scriptname.input
# output = scriptname.output
########################################################################################################################
input=$(basename "$0").input
output=$(basename "$0").output
awk '{
    for(i=4;i<=34-6;i++)
        {   
            if($i<=10 && $(i+1)<=10 && $(i+2)<=10 && $(i+3)<=10 && $(i+4)<=10 && $(i+5)<=10 && $(i+6)<=10)
            {
                print
                next
            }
        }
}' "$input" > "$output"
αғsнιη
  • 41,407
alhelal
  • 1,301

2 Answers

1

You can use awk as follows. Instead of spelling out the check for 7 consecutive columns, keep a counter that is incremented each time a column meets the condition and reset to 0 when one does not.

awk '{c=0; n=split($0,arr,/ +/);
    # walk the elements in order; "for (x in arr)" has no guaranteed order
    for(x=1;x<=n;x++) if(arr[x]<10 && arr[x]>=0) {
        if(++c==7){ print $0; next } }else{c=0} }' infile

Here awk's split function, «split(string, array [, fieldsep [, seps ] ])», splits each line ($0 represents the whole line in awk) into the array named arr, using one or more spaces as the separator.

Next, loop over the array elements and check whether each value is at least 0 and less than 10. If so, increment a counter called c and print the line once c reaches 7 (meaning 7 consecutive elements (columns) meet the condition); otherwise reset the counter to 0.
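
To see why the extra >=0 test matters for the missing values, here is a small sketch. It assumes the C locale (forced with LC_ALL=C), and compare_field is a hypothetical helper name, not part of the answer. A field such as **** does not look like a number, so awk compares it with the constants as a string: "****" sorts before "10" and also before "0", so the <10 test passes but the >=0 test fails.

```shell
# Hypothetical helper: show the result of the two tests for one field value.
# Output is two flags: (field < 10) and (field >= 0), 1 = true, 0 = false.
compare_field() {
    printf '%s\n' "$1" |
        LC_ALL=C awk '{ lt = ($1 < 10); ge = ($1 >= 0); print lt, ge }'
}

compare_field '****'   # prints: 1 0  (string comparison, fails >=0)
compare_field 5        # prints: 1 1  (numeric comparison)
compare_field 12       # prints: 0 1  (numeric comparison)
```

This is also why a script that tests only $i<=10 accepts **** columns.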


Or do the same without splitting the line into an array:

awk '{c=0; for(i=1;i<=NF;i++) if($i<10 && $i>=0) {
    if(++c==7){ print $0; next } }else{c=0} }' infile

In your case the filter should start from column 4 and run to the end of the line, so you would need something like the following. NF represents the number of fields (columns) in the line being processed by awk.

$ time awk '{c=0; for(i=4;i<=NF;i++) if($i<10 && $i>=0) {
    if(++c==7) {print $0; next} }else{c=0} }' infile
real    0m0.317s
user    0m0.156s
sys     0m0.172s
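
If you would rather do literally what the question asks and treat **** as a number above 10, a minimal sketch could look like the following. The names select_rows and val are hypothetical, and it assumes **** is the only non-numeric marker in the file. Unlike the commands above, it does not reject negative values.

```shell
# Hypothetical helper (not from the original answer): reads rows on stdin and
# prints those with 7 consecutive fields below 10, starting at column $1
# (default 1). A field equal to **** is counted as 100.
select_rows() {
    awk -v first="${1:-1}" '
    # Map the missing-value marker to 100; force everything else to a number.
    function val(s) { return (s == "****") ? 100 : s + 0 }
    {
        c = 0
        for (i = first; i <= NF; i++) {
            if (val($i) < 10) {
                if (++c == 7) { print; next }   # 7 in a row: keep the line
            } else {
                c = 0                           # streak broken: start over
            }
        }
    }'
}
```

For example, select_rows 4 < infile starts checking at column 4, and the input file itself is never modified.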

Or, in regex mode, again applied to your original file, which contains only floating-point numbers, you can use the grep command below. With the -P flag it is more efficient and about 6 times faster than awk (see Grep -E, Sed -E - low performance when '[x]{1,9999}' is used, but why?). The awk solution is more flexible, though: you can change the range, and it works with integers, floats, or a mix of both.

$ time grep -P '([^\d]\d\.\d[^\d]){7}' infile
real    0m0.060s
user    0m0.016s
sys     0m0.031s

Or in another way:

$ time grep -P '(\s+\d\.\d\s+){7}' infile
real    0m0.057s
user    0m0.000s
sys     0m0.031s

Or, for compatibility, the same pattern in grep -E, sed, or awk:

$ time grep -E '([^0-9][0-9]\.[0-9][^0-9]){7}' infile
real    0m0.419s
user    0m0.375s
sys     0m0.063s
$ time sed -En '/([^0-9][0-9]\.[0-9][^0-9]){7}/p' infile
real    0m0.367s
user    0m0.172s
sys     0m0.203s
$ time awk '/([^0-9][0-9]\.[0-9][^0-9]){7}/' infile
real    0m0.361s
user    0m0.219s
sys     0m0.172s
αғsнιη
  • 41,407
1
awk '/(\<[0-9]\s+){7}/{print}' input.txt

or

sed -rn '/(\b[0-9]\s{1,}){7}/p' input.txt

will do the job.

Explanation for the awk (the same logic for the sed):

  • /(\<[0-9]\s+){7}/{print} - print lines containing the pattern.

  • \< - Matches a word boundary; that is it matches if the character to the right is a “word” character and the character to the left is a “non-word” character.

  • [0-9]\s+ - one digit from 0 to 9, then one or more spaces.
  • (\<[0-9]\s+){7} - matches, if the \<[0-9]\s+ pattern is repeated seven times.
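
Note that \<, \b and \s are GNU extensions, so the commands above may not work with other awk or sed implementations. As a rough sketch of the same idea using only POSIX character classes, the following could be used instead; single_digit_rows is a hypothetical helper name, and the pattern additionally accepts a digit that is the last field on the line.

```shell
# Hypothetical helper: print lines from stdin that contain 7 consecutive
# single-digit fields. Each digit must start a field (start of line or
# whitespace before it) and end one (whitespace or end of line after it).
single_digit_rows() {
    grep -E '(^|[[:space:]])([0-9]([[:space:]]+|$)){7}'
}
```

It reads standard input, for example single_digit_rows < input.txt.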

Input

2     3    4    5    6    7    8   0  12   14   23
2     3    4    12   6    7    8   0  1     2   23
**** **** **** **** **** **** **** 8 ****  **** 12

Output

2     3    4    5    6    7    8   0  12   14   23

EDIT:

For floating-point numbers with one digit of precision (9.2, 8.1, 7.5, etc.):

awk '/(\<[0-9]\.[0-9](\s+|$)){7}/{print}' input.txt
MiniMax
  • 4,123
  • 1
    Clever way! But it assumes the values are always single digits, and since OP's original file doesn't guarantee that, this doesn't answer his/her question. – αғsнιη Oct 23 '17 at 20:46
  • @αғsнιη How is it not guaranteed? OP said: "I need to select rows having consecutive 7 columns with value less than 10." If the number is less than 10, then one digit is guaranteed. Maybe negative numbers will be a problem, like -7, -5. – MiniMax Oct 23 '17 at 21:02
  • @αғsнιη Oh, the file. I'm looking at it now. Then, yes, the pattern should be corrected. But it is easy, I think. I thought only integers (whole numbers) were possible. – MiniMax Oct 23 '17 at 21:05
  • @MiniMax Then write it for floating-point values. – alhelal Oct 24 '17 at 03:49
  • @alhelal See EDIT section. – MiniMax Oct 24 '17 at 11:08