I need to speed up a script that essentially determines whether or not all the "columns" for each row are the same, then writes a new file containing either one of the identical elements, or a "no_match". The file is comma delimited, consists of around 15,000 rows, and contains varying numbers of "columns".
For example:
1-69
4-59,4-59,4-59,4-61,4-61,4-61
1-46,1-46
4-59,4-59,4-59,4-61,4-61,4-61
6-1,6-1
5-51,5-51
4-59,4-59
The script should then write a new file:
1-69
no_match
1-46
no_match
6-1
5-51
4-59
The second and fourth rows are replaced with "no_match" because they contain non-identical columns.
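The per-row rule described above can be sketched in pure bash as follows. This is only an illustration under stated assumptions: the function name `row_value` is hypothetical, and it compares whole comma-separated fields rather than searching for substrings.

```bash
#!/bin/bash
# row_value (hypothetical name): print the shared value if every
# comma-separated field of the given line is identical, else "no_match".
row_value() {
    local line=$1 first field
    local -a fields
    # Split the line on commas into an array.
    IFS=, read -r -a fields <<< "$line"
    first=${fields[0]}
    for field in "${fields[@]}"; do
        # Quoting the right-hand side forces a literal string comparison.
        if [[ $field != "$first" ]]; then
            echo "no_match"
            return
        fi
    done
    echo "$first"
}

row_value "1-46,1-46"                        # prints 1-46
row_value "4-59,4-59,4-59,4-61,4-61,4-61"    # prints no_match
```

Because whole fields are compared, a row such as `3-30,3-30-3` is reported as "no_match" even though the first field is a prefix of the second.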
Here is my far from elegant script:
#!/bin/bash
ind=$1 #file in
num=`wc -l "$ind"|cut -d' ' -f1` #number of lines in 'file in'
echo "alleles" > same_alleles.txt #new file to write to
#loop over every line of 'file in'
for (( i =2; i <= "$num"; i++));do
    #take first column of row being looped over (string to check match of other columns with)
    match=`awk "FNR=="$i" {print}" "$ind"|cut -d, -f1`
    #counts how many matches there are in the looped row
    match_num=`awk "FNR=="$i" {print}" "$ind"|grep -o "$match"|wc -l|cut -d' ' -f1`
    #counts number of commas in each looped row
    comma_num=`awk "FNR=="$i" {print}" "$ind"|grep -o ","|wc -l|cut -d' ' -f1`
    #number of columns in each row
    tot_num=$((comma_num + 1))
    #writes one of the identical elements if all contents of row are identical, or writes "no_match" otherwise
    if [ "$tot_num" == "$match_num" ]; then
        echo $match >> same_alleles.txt
    else
        echo "no_match" >> same_alleles.txt
    fi
done
#END
Currently, the script takes around 11 min to do all ~15,000 rows. I'm not really sure how to speed this up (I'm honestly surprised I could even get it to work). Any time knocked off would be fantastic. Below is a smaller excerpt of 100 rows that could be used:
allele
4-39
1-46,1-46,1-46
4-39
4-4,4-4,4-4,4-4
3-23,3-23,3-23
3-21,3-21
4-34,4-34
3-33
4-4,4-4,4-4
4-59,4-59
3-23,3-23,3-23
1-45
1-46,1-46
3-23,3-23,3-23
4-61
1-8
3-7
4-4
4-59,4-59,4-59
1-18,1-18
3-21,3-21
3-23,3-23,3-23
3-23,3-23,3-23
3-30,3-30-3
4-39,4-39
4-61
2-70
4-38-2,4-38-2
1-69,1-69,1-69,1-69,1-69
1-69
4-59,4-59,4-59,4-61,4-61,4-61
1-46,1-46
4-59,4-59,4-59,4-61,4-61,4-61
6-1,6-1
5-51,5-51
4-59,4-59
1-18
3-7
1-69
4-30-4
4-39
1-69
1-69
4-39
3-23,3-23,3-23
4-39
2-5
3-30-3
4-59,4-59,4-59
3-21,3-21
4-59,4-59
3-9
4-59,4-59,4-59
4-31,4-31
1-46,1-46
1-46,1-46,1-46
5-51,5-51
3-48
4-31,4-31
3-7
4-61
4-59,4-59,4-59,4-61,4-61,4-61
4-38-2,4-38-2
3-21,3-21
1-69,1-69,1-69
3-23,3-23,3-23
4-59,4-59
3-48
3-48
1-46,1-46
3-23,3-23,3-23
3-30-3,3-30-3
1-46,1-46,1-46
3-64
3-73,3-73
4-4
1-18
3-7
1-46,1-46
1-3
4-61
2-70
4-59,4-59
5-51,5-51
3-49,3-49
4-4,4-4,4-4
4-31,4-31
1-69
1-69,1-69,1-69
4-39
3-21,3-21
3-33
3-9
3-48
4-59,4-59
4-59,4-59
4-39,4-39
3-21,3-21
1-18
My script takes about 7 seconds to process this excerpt.
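For comparison, here is a minimal single-pass sketch that reads the file once with awk instead of starting several processes per row, which is probably where most of the time goes. The function name `same_alleles` and the file name `input.csv` in the usage line are assumptions made for illustration. Like the script above, it replaces the first line with an "alleles" header.

```bash
#!/bin/bash
# same_alleles (hypothetical name): print one line per input row,
# either the shared value or "no_match", reading the file only once.
same_alleles() {
    awk -F, '
        # The first line is a header; emit "alleles" in its place.
        NR == 1 { print "alleles"; next }
        {
            out = $1
            for (i = 2; i <= NF; i++) {
                # Appending "" forces a string comparison of whole fields.
                if ($i "" != $1 "") { out = "no_match"; break }
            }
            print out
        }
    ' "$1"
}

# Usage (input.csv is an assumed file name):
#   same_alleles input.csv > same_alleles.txt
```

Comparing whole fields also means that a row such as `3-30,3-30-3` yields "no_match", whereas counting `grep -o` matches of the first field would count the substring `3-30` twice.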