
I just wrote a few lines to grab the smallest value from each of my files. It gives me the correct result, but some lines are repeated twice. Can you fix the bug?

What I am doing:

  • Grepping all the files
  • Removing the header
  • Sorting numerically on column nine, which is in scientific notation
  • Taking the first line after the sort, i.e. the smallest, and printing it with awk
  • I want the file name too, so I also printed $i

Script:

#!/bin/bash

for i in `ls -v *.txt`
do
    smallestPValue=`sed 1d $i | sort -k9 -g | head -1 | awk '{print $0}'`
    echo $i $smallestPValue >> smallesttPvalueAll.txt
done

Output:

U1.text 4 rsxxx 1672175 A ADD 759 0.0751 4.918 1.074e-06
U1.txt 4 rsxxxx 1672175 A ADD 759 0.0751 4.918 1.074e-06
U2.txt  16 rsxxxx 596342 T ADD 734 -0.05458 -5.204 2.535e-07
U2.txt 16 rsxxxx 596342 T ADD 734 -0.05458 -5.204 2.535e-07
U3.txt 2 rsxxxx 12426 T ADD 722 0.06825 5.285 1.669e-07

I am getting repetitions for a few lines while others are just fine; U3 above appears only once, and that's what I want. I can easily get rid of the duplicated lines with uniq or sort -u, but I'm curious what is causing this.

Desired output: each line printed once.

star
  • What is the output of ls -v *.txt? – cherdt Jul 28 '17 at 16:49
  • My guess is that you're getting dupes because smallesttPvalueAll.txt matches *.txt, so it is processed along with all the other .txt files. But there are so many things wrong with the way you're trying to do this that it's not even worth trying to fix; see my answer below for a better method. – cas Jul 29 '17 at 04:49
  • Well, in my folder I have just those thousand files I want to process – star Jul 29 '17 at 22:40
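The glob-matching cause cas guesses at in the comments fits the symptoms: >> appends, so a smallesttPvalueAll.txt left over from an earlier run both matches *.txt and still holds the previous results. Below is a sketch of a loop that avoids this; the sample files and their contents are hypothetical stand-ins for the real data:

```shell
#!/bin/bash
export LC_ALL=C
# Hypothetical sample data standing in for the real files.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'HEADER\n4 rsA 100 A ADD 759 0.07 4.9 1.0e-06\n4 rsB 200 A ADD 759 0.08 5.0 2.0e-03\n' > U1.txt
printf 'HEADER\n16 rsC 300 T ADD 734 -0.05 -5.2 2.5e-07\n' > U2.txt

out=smallesttPvalueAll.txt
rm -f "$out"                       # start fresh so old results are not re-read
for i in *.txt; do                 # glob directly instead of parsing ls
    [ "$i" = "$out" ] && continue  # never process the output file itself
    smallest=$(sed 1d "$i" | sort -k9 -g | head -n 1)
    printf '%s %s\n' "$i" "$smallest" >> "$out"
done
cat "$out"
```

Running it a second time now still gives one line per input file, because the output file is removed up front and skipped inside the loop. (If you relied on ls -v's version ordering, GNU sort -V on the result can restore it.)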

1 Answer


If I'm interpreting it right, you can probably do what you're trying to do with just awk and sort: no need for a loop, parsing ls (subtle hint: DON'T DO THAT!), head, or sed.

awk 'FNR > 1 {print FILENAME, $0}' *.txt | sort -k10 -g | sort -u -k1,1

This skips the first line of each file, then prints all remaining lines prefixed with the filename and a space (awk's default output field separator, OFS). It then pipes the result through sort to do a general numeric sort on field 10. Finally, it does a unique sort on the first field only (-k1,1, the filename), so that only the first line with each filename is output.

Note that we have to sort on field 10 here, not field 9, because we've added the filename as the first field, so every other field number is shifted up by one.
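The difference between -g and a plain numeric sort is easy to check: -n stops parsing at the e, so it compares only the mantissas, while -g understands the whole scientific-notation value (the numbers below are taken from the output in the question):

```shell
export LC_ALL=C
vals='1.074e-06
2.535e-07
1.669e-07'
echo "$vals" | sort -n | head -n 1   # prints 1.074e-06 (wrong: compares 1.074 < 1.669 < 2.535)
echo "$vals" | sort -g | head -n 1   # prints 1.669e-07 (the true smallest value)
```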

FNR and FILENAME are built-in awk variables. FNR is the line number ("input record number" in awk-lingo) of the current file, and FILENAME is the current filename.
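As a quick sanity check of the pipeline, here it is run on two hypothetical sample files (the file names and rows are made up; each file has a header line and space-separated columns with the p-value in field 9):

```shell
export LC_ALL=C
# Create two small sample files in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'HEADER\n4 rsA 100 A ADD 759 0.07 4.9 1.0e-06\n4 rsB 200 A ADD 759 0.08 5.0 2.0e-03\n' > U1.txt
printf 'HEADER\n16 rsC 300 T ADD 734 -0.05 -5.2 2.5e-07\n' > U2.txt

awk 'FNR > 1 {print FILENAME, $0}' *.txt | sort -k10 -g | sort -u -k1,1
# U1.txt 4 rsA 100 A ADD 759 0.07 4.9 1.0e-06
# U2.txt 16 rsC 300 T ADD 734 -0.05 -5.2 2.5e-07
```

The second sort relies on GNU sort's -u keeping the first line of each equal run, i.e. the line the -g sort placed earliest, which is exactly the smallest p-value per file.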


Here's another way of doing it, this time using only awk:

#!/usr/bin/awk -f

# For each data line (FNR > 1 skips each file's header), remember the
# smallest 9th field seen so far per file, and the line it came from.
FNR > 1 && (! s[FILENAME] || $9 < s[FILENAME]) {
  s[FILENAME] = $9
  l[FILENAME] = $0
}

# After the last file is read, print each filename with its stored line.
END {
  for (f in s) {
    print f, l[f]
  }
}

Save it as, e.g., smallest-pvalue.awk, make it executable with chmod +x smallest-pvalue.awk, and run it as ./smallest-pvalue.awk *.txt.

This awk script keeps track of the smallest value seen for field 9 of each input file in an array called s, and also keeps the matching input line in array l.

Once it has processed all the files, it prints out the filename and the line containing the smallest 9th field for each file.
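A quick check of this approach on the same kind of hypothetical sample data, with the program passed inline to awk rather than saved to a file (the trailing sort is there only because for (f in s) visits filenames in no particular order):

```shell
export LC_ALL=C
# Hypothetical sample files: header line plus space-separated rows,
# with the p-value in field 9.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'HEADER\n4 rsA 100 A ADD 759 0.07 4.9 1.0e-06\n4 rsB 200 A ADD 759 0.08 5.0 2.0e-03\n' > U1.txt
printf 'HEADER\n16 rsC 300 T ADD 734 -0.05 -5.2 2.5e-07\n' > U2.txt

awk 'FNR > 1 && (! s[FILENAME] || $9 < s[FILENAME]) {
       s[FILENAME] = $9; l[FILENAME] = $0
     }
     END { for (f in s) print f, l[f] }' *.txt | sort
# U1.txt 4 rsA 100 A ADD 759 0.07 4.9 1.0e-06
# U2.txt 16 rsC 300 T ADD 734 -0.05 -5.2 2.5e-07
```

One caveat: the ! s[FILENAME] test would also fire if a stored p-value were exactly 0; writing it as !(FILENAME in s) sidesteps that edge case.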

cas
  • Well, I need to understand the second part with the awk arrays, as it's a little more advanced. I appreciate your comment; I will try to understand it and use it. I am doing it that way because I need to make three files using the smallest five percent and ten percent of values from my thousand files, and I only shared the first part of my script. I can get rid of these duplicated lines with sort -u, but I thought maybe there was another way – star Jul 29 '17 at 22:56
  • The arrays aren't hard to understand. Instead of using numbers as array indices, they use strings (the filenames), e.g. s[U1.txt] and l[U1.txt]. The awk-only version loops through each file, and if the line number is > 1 (FNR > 1) and either s[FILENAME] doesn't exist or the current line's p-value ($9) is smaller than s[FILENAME], then it sets s[FILENAME] to the current line's p-value and l[FILENAME] to the entire current line ($0). The END {...} block runs when there's nothing left to read and prints out each filename along with its stored input line. – cas Jul 30 '17 at 02:00
  • What do you mean by "the current line's p-value ($9) is smaller than s[FILENAME]"? I want to sort in ascending order and record the smallest value from all 1000 files, so the resulting file should have 1000 lines. – star Jul 31 '17 at 20:00
  • I thought you wanted the smallest such value from each file; that's what the sort -u -k1,1 and your original head -1 do. The standalone awk script is just another way to do that: instead of printing all lines and then using sort -u to throw away everything but the smallest for each input file, it only prints the smallest values. – cas Aug 01 '17 at 01:50
  • I tried to run this script: saved it like you said and executed it, but it gives an error; even with awk -f scriptname it doesn't work. – star Aug 02 '17 at 02:49
  • What error message? And what exactly do you mean by "it doesn't work"? The script works, in that it does what I said it does; I tested it before posting, of course. – cas Aug 02 '17 at 02:51
  • Actually, I messed it up a bit when editing it into this answer; it should be fixed now. The changes were to add -f to the #!/usr/bin/awk line and to change the print line to print f, l[f] rather than print f, l[i] (my copy of the script used i as the loop variable, but I changed that to f here to make it clearer that f stands for filename, and only changed part of the script). Or just copy the new version of the script from my answer above. – cas Aug 02 '17 at 02:57