As you have titled your question bash programming, I'll submit a semi-bash example.
Pure bash:
You could read the IP filter file and then check the input line by line, matching against those entries. But at this volume it would be really slow.
You could rather easily implement bubble, selection, insertion, merge sort etc., but for this kind of volume that would be a goner and most likely worse than a line-by-line compare. (Depends a lot on the volume of the filter file.)
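To make the point concrete, a naive sketch of that pure-bash line-by-line compare could look like this (the function name and file layout are made up for illustration; it is O(n·m) and will crawl on large files):

```shell
# Naive O(n*m) pure-bash filter: for every IP in the filter file,
# scan the whole data file and print lines whose first field matches.
filter_by_ip() {
    local file_filter="$1" file_data="$2" ip line
    while IFS= read -r ip; do
        [[ -z "$ip" ]] && continue
        while IFS= read -r line; do
            # ${line%% *} strips everything from the first space (keeps the key).
            [[ "${line%% *}" == "$ip" ]] && printf '%s\n' "$line"
        done < "$file_data"
    done < "$file_filter"
}

# Usage: filter_by_ip file_filter file_data
```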
sort + bash:
Another option would be to sort the file with sort
and process the input in-house with e.g. a binary search. This, too, would be much slower than the other suggestions posted here, but let's give it a try.
Firstly, it is a question of bash version. Since version 4 we have mapfile
, which reads a file into an array. This is a lot faster than the traditional read -ra …
. Combined with sort
it could be scripted by something like (for this task):
mapfile arr <<< "$(sort -bk1,1 "$file_in")"
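For contrast, here is a self-contained toy example showing the traditional read loop that mapfile replaces (file contents and array names are made up); building the array element by element is what makes it slow on big files:

```shell
# Build the same sorted array two ways. mapfile does it in one builtin
# call; the read loop appends one element per line. "file_in" is a
# placeholder for the data file.
file_in=$(mktemp)
printf '2.2.2.2 second\n1.1.1.1 first\n' > "$file_in"

# Fast: mapfile slurps the sorted output in one go.
mapfile arr <<< "$(sort -bk1,1 "$file_in")"

# Slow: traditional loop, one read per line.
arr2=()
while IFS= read -r line; do
    arr2+=("$line"$'\n')   # mapfile (without -t) keeps the newline; mimic it.
done < <(sort -bk1,1 "$file_in")

rm -f "$file_in"
```

Both arrays end up identical; only the construction cost differs.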
Then it is a question of having a search algorithm to find matches in this array. A simple way is a binary search. It is efficient: on e.g. an array of 1,000,000 elements a lookup takes at most about 20 comparisons.
declare -i match_index

function in_array_bs()
{
    local needle="$1"
    local -i max=$arr_len
    local -i min=0
    local -i mid
    while ((min < max)); do
        (( (mid = ((min + max) >> 1)) < max )) || break
        if [[ "${arr[mid]// *}" < "$needle" ]]; then
            ((min = mid + 1))
        else
            max=$mid
        fi
    done
    if [[ "$min" == "$max" && "${arr[min]// *}" == "$needle" ]]; then
        match_index=$min
        return 0
    fi
    return 1
}
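A quick self-contained demo of the function on a tiny pre-sorted array (the sample data is made up; arr holds "key value" lines sorted on the key, as in the script below):

```shell
# Tiny demo of in_array_bs: arr is sorted on its first field.
declare -a arr=("10.0.0.1 a" "10.0.0.5 b" "192.168.1.9 c")
declare -i arr_len=${#arr[@]}
declare -i match_index

function in_array_bs()
{
    local needle="$1"
    local -i max=$arr_len
    local -i min=0
    local -i mid
    while ((min < max)); do
        (( (mid = ((min + max) >> 1)) < max )) || break
        # ${arr[mid]// *} keeps only the first field (the key).
        if [[ "${arr[mid]// *}" < "$needle" ]]; then
            ((min = mid + 1))
        else
            max=$mid
        fi
    done
    if [[ "$min" == "$max" && "${arr[min]// *}" == "$needle" ]]; then
        match_index=$min
        return 0
    fi
    return 1
}

in_array_bs "10.0.0.5" && echo "found at index $match_index"   # found at index 1
in_array_bs "8.8.8.8"  || echo "not found"                     # not found
```

Note that the comparison is lexical, so the array must be sorted the same way (plain sort, not sort -n).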
Then you would say:
for x in "${filter[@]}"; do
    if in_array_bs "$x"; then
        … # check match_index+0, +1, +2 etc. to cover duplicates.
    fi
done
A sample script (not debugged), merely as a starter. For lower volumes, where one would want to depend only on sort
, it could be a template. But again: s.l.o.w.e.r b.y a l.o.t.:
#!/bin/bash
file_in="file_data"
file_srch="file_filter"
declare -a arr # The entire data file as array.
declare -i arr_len # The length of "arr".
declare -i index # Matching index, if any.
# Time print helper function for debug.
function prnt_ts() { date +"%H:%M:%S.%N"; }
# Binary search.
function in_array_bs()
{
    local needle="$1"
    local -i max=$arr_len
    local -i min=0
    local -i mid
    while ((min < max)); do
        (( (mid = ((min + max) >> 1)) < max )) || break
        if [[ "${arr[mid]// *}" < "$needle" ]]; then
            ((min = mid + 1))
        else
            max=$mid
        fi
    done
    if [[ "$min" == "$max" && "${arr[min]// *}" == "$needle" ]]; then
        index=$min
        return 0
    fi
    return 1
}
# Search.
# "index" is set to matching index in "arr" by in_array_bs().
re='^[^ ]+ +([^ ]+)'
function search()
{
    if in_array_bs "$1"; then
        while [[ "${arr[index]// *}" == "$1" ]]; do
            [[ "${arr[index]}" =~ $re ]]
            printf "%s\n" "${BASH_REMATCH[1]}"
            ((++index))
        done
    fi
}
sep="--------------------------------------------"
# Timestamp start.
ts1=$(date +%s.%N)
# Print debug information.
printf "%s\n%s MAP: %s\n%s\n" \
    "$sep" "$(prnt_ts)" "$file_in" "$sep" >&2
# Read sorted file to array.
mapfile arr <<< "$(sort -bk1,1 "$file_in")"
# Print debug information.
printf "%s\n%s MAP DONE\n%s\n" \
    "$sep" "$(prnt_ts)" "$sep" >&2
# Define length of array.
arr_len=${#arr[@]}
# Print time start search.
printf "%s\n%s SEARCH BY INPUT: %s\n%s\n" \
    "$sep" "$(prnt_ts)" "$file_srch" "$sep" >&2
# Read filter file.
re_neg_srch='^[ '$'\t'$'\n'']*$'
debug=0
while IFS=$'\n'$'\t'-" " read -r ip time trash; do
    if ! [[ "$ip" =~ $re_neg_srch ]]; then
        ((debug)) && printf "%s\n%s SEARCH: %s\n%s\n" \
            "$sep" "$(prnt_ts)" "$ip" "$sep" >&2
        # Do the search.
        search "$ip"
    fi
done < "$file_srch"
# Print time end search.
printf "%s\n%s SEARCH DONE\n%s\n" \
    "$sep" "$(prnt_ts)" "$sep" >&2
# Print total time.
ts2=$(date +%s.%N)
echo $ts1 $ts2 | awk '{printf "TIME: %f\n", $2 - $1}' >&2