
I have two text files. File 2 contains over 1,000,000 log lines. File 1 contains IP addresses, one per line. I want to read the lines of file 2 and match them against the addresses in file 1, I mean:

file 1:

34.123.21.32
45.231.43.21
21.34.67.98

file 2:

34.123.21.32 0.326 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21 6.334 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21  3.673 - [30/Oct/2013:06:00:06 +0200]
34.123.21.32 4.754 - [30/Oct/2013:06:00:06 +0200]
21.34.67.98 1.765 - [30/Oct/2013:06:00:06 +0200]
...

I want to search for each IP from file 1, line by line, in file 2 and print the time field (for example: 0.326) to a new file.

How can I do this?

  • What output are you expecting? Especially with duplicate IP addresses? – Bernhard Nov 04 '13 at 15:25
  • I want to see the time parameters for the duplicate IP addresses – DessCnk Nov 05 '13 at 09:01
  • And I want to grep these time parameters too: the time parameter (e.g. 4.008) has to be bigger than 10 seconds (bigger than 10.000). I have to add an egrep "[0-9][0-9].[0-9][0-9]" command to my script. – DessCnk Nov 05 '13 at 09:06

3 Answers


Join + sort

If you're trying to find the IPs that are present in both files, you can use the join command, but you'll need to use sort to pre-sort the files prior to joining them.

$ join -o 2.2 <(sort file1) <(sort file2)

Example

$ join -o 2.2 <(sort file1) <(sort file2)
1.765
0.326
4.754
3.673
6.334

Another example

file 1a:

$ cat file1a
34.123.21.32
45.231.43.21
21.34.67.98
1.2.3.4
5.6.7.8
9.10.11.12

file 2a:

$ cat file2a
34.123.21.32 0.326 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21 6.334 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21  3.673 - [30/Oct/2013:06:00:06 +0200]
34.123.21.32 4.754 - [30/Oct/2013:06:00:06 +0200]
21.34.67.98 1.765 - [30/Oct/2013:06:00:06 +0200]
1.2.3.4 1.234 - [30/Oct/2013:06:00:06 +0200]
4.3.2.1 4.321 - [30/Oct/2013:06:00:06 +0200]

Running the join command:

$ join -o 2.2 <(sort file1) <(sort file2)
1.234
1.765
0.326
4.754
3.673
6.334

NOTE: The original order of file2 is lost with this method, because we sorted it first. In return, this method only needs to scan file2 a single time.
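If you need to keep file2's original order and still scan each file only once, a lookup table in awk is a common alternative (a sketch, not part of the join approach above):

$ awk 'NR==FNR { ips[$1]; next } $1 in ips { print $2 }' file1 file2

NR==FNR is true only while awk reads the first file, so file1's addresses are loaded as keys of the ips array; after that, every file2 line whose first field is a known key has its second field printed, in file2's original order.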

grep

You can use grep to search file2 for matches using the lines in file1, but this method isn't as efficient as the first method I showed you: it scans file2 looking for each line in file1.

$ grep -f file1 file2 | awk '{print $2}'

Example

$ grep -f file1 file2 | awk '{print $2}'
0.326
6.334
3.673
4.754
1.765
1.234

Improving grep's performance

You can speed up grep by forcing the plain C locale, which avoids the cost of locale-aware (multibyte) matching:

$ LC_ALL=C grep -f file1 file2 | awk '{print $2}'

You can also tell grep that the strings in file1 are fixed strings, not regular expressions (-F), which will also help in getting better performance.

$ LC_ALL=C grep -Ff file1 file2 | awk '{print $2}'

Generally in software you try to avoid this kind of approach, since it's basically a loop-within-a-loop solution. But there are times when it's the best that can be achieved.
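As a side note, the threshold mentioned in the comments (keeping only times above 10 seconds) can be applied numerically in the same awk step instead of an egrep pattern. A sketch, with times.txt as an assumed output file name:

$ LC_ALL=C grep -Ff file1 file2 | awk '$2 > 10 {print $2}' > times.txt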


slm
  • Thanks for the solution, but the results are not right. I mean: line 1 of file1 is 34.123.21.32, and I want to search for this IP in all lines of file2. I want to search for all the IPs in file1, one by one, in file2, and write the results to a new file. – DessCnk Nov 05 '13 at 12:35
  • You want to search for each IP in file1 for all occurrences of it in file2, yes? This solution does that, but it sorts the files first and then pulls out the matching results. Do you need to keep them in the original order? This approach is more efficient b/c it doesn't have to keep rescanning the file for matches; it only has to scan it one time! – slm Nov 05 '13 at 13:04
  • @user50591 - see updates; I've added a 2nd method that uses grep and explained its drawbacks. I've also added additional info that hopefully explains why join + sort is the fastest method. – slm Nov 05 '13 at 13:26
  • Thanks a lot slm. Now it is OK and working more efficiently. I edited all my scripts and it's OK! – DessCnk Nov 05 '13 at 13:33
  • I would have thought that grep -f is smart enough to go through the file only once. I stand corrected. – Joseph R. Nov 05 '13 at 14:17
  • @JosephR. - I'm like 90% sure that's the way it works. I believe Stephane mentioned it in another Q not too long ago. – slm Nov 05 '13 at 14:28
  • Then what is the "loop within a loop" situation you're referring to? – Joseph R. Nov 05 '13 at 15:44
  • @JosephR. - sorry that it's a loop within a loop. – slm Nov 05 '13 at 16:21
  • I don't understand what you mean by "a loop within a loop" if grep -f will go through the file only once. Can you please explain? – Joseph R. Nov 05 '13 at 16:31
  • grep -f goes through file2 looking for lines from file1. One at a time. – slm Nov 05 '13 at 16:38
  • Right. I totally misunderstood you. Looking back at the comments, it seems I had a stupid moment there :D Weird about grep, though. – Joseph R. Nov 05 '13 at 20:33

You can tell grep to obtain its patterns from a file using the -f switch (which is in the POSIX standard):

sort file1 | uniq |           # Avoid duplicate entries in file1
  grep -f /dev/stdin file2 |  # Search in file2 for the patterns piped on stdin
  awk '{print $2}' > new_file # Print the second field (time); redirect to a new file

Note that if one IP address appears multiple times in file2, all its time entries will be printed.

This did the job in less than 2 seconds on a 5 million-line file on my system.
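A variant of the same pipeline, sketched here as an aside: since the patterns are literal IP addresses (an unescaped . in a regex matches any character), grep's -F flag treats them as fixed strings, and sort -u does the work of sort | uniq in one step:

sort -u file1 | grep -Ff /dev/stdin file2 | awk '{print $2}' > new_file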

Joseph R.

As you have titled your question bash programming, I'll submit a semi-bash example.

Pure bash:

You could read the IP filter file into memory and then check the log line by line, matching it against those entries, but at this volume that would be really slow. A minimal sketch of that naive approach follows below.
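For illustration only (not debugged production code, file names as in the question), the naive nested loop rescans file2 once per filter address, which is what makes it so slow:

while read -r ip; do                      # outer loop: one filter IP per pass
    while read -r log_ip time rest; do    # inner loop: every log line, again
        [[ "$log_ip" == "$ip" ]] && printf '%s\n' "$time"
    done < file2
done < file1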

You could rather easily implement bubble, selection, insertion, merge sort, etc., but, again, for this kind of volume it would be a goner and most likely worse than a line-by-line comparison. (It depends a lot on the volume of the filter file.)

sort + bash:

Another option would be to sort the file with sort and process the input in-house with e.g. a binary search. This, as well, would be much slower than the other suggestions posted here, but let's give it a try.


Firstly, it is a question of bash version. As of version 4 we have mapfile, which reads a file into an array. This is a lot faster than the traditional read -ra loop. Combined with sort, it could be scripted by something like (for this task):

mapfile arr <<< "$(sort -bk1,1 "$file_in")"

Then it is a question of having a search algorithm to find matches in this array. A simple way is a binary search. It is efficient: on an array of e.g. 1,000,000 elements it gives a fairly quick lookup, roughly 20 comparisons, since each step halves the range.

declare -i match_index
function in_array_bs()
{
    local needle="$1"
    local -i max=$arr_len
    local -i min=0
    local -i mid
    while ((min < max)); do
        (( (mid = ((min + max) >> 1)) < max )) || break
        if [[ "${arr[mid]// *}" < "$needle" ]]; then
            ((min = mid + 1))
        else
            max=$mid
        fi
    done
    if [[ "$min" == "$max" && "${arr[min]// *}" == "$needle" ]]; then
        match_index=$min
        return 0
    fi
    return 1
}

Then you would say:

for x in "${filter[@]}"; do
    if in_array_bs "$x"; then
        … # check match_index+0,+1,+2 etc. to cover duplicates.
    fi
done

A sample script (not debugged), merely as a starter. For lower volumes, where one wants to depend only on sort, it could serve as a template. But, again: s.l.o.w.e.r, by a lot:

#!/bin/bash

file_in="file_data"
file_srch="file_filter"

declare -a arr      # The entire data file as array.
declare -i arr_len  # The length of "arr".
declare -i index    # Matching index, if any.

# Time print helper function for debug.
function prnt_ts() { date +"%H:%M:%S.%N"; }

# Binary search.
function in_array_bs()
{
    local needle="$1"
    local -i max=$arr_len
    local -i min=0
    local -i mid
    while ((min < max)); do
        (( (mid = ((min + max) >> 1)) < max )) || break
        if [[ "${arr[mid]// *}" < "$needle" ]]; then
            ((min = mid + 1))
        else
            max=$mid
        fi
    done
    if [[ "$min" == "$max" && "${arr[min]// *}" == "$needle" ]]; then
        index=$min
        return 0
    fi
    return 1
}

# Search.
# "index" is set to matching index in "arr" by in_array_bs().
re='^[^ ]+ +([^ ]+)'
function search()
{
    if in_array_bs "$1"; then
        while [[ "${arr[index]// *}" == "$1" ]]; do
            [[ "${arr[index]}" =~ $re ]]
            printf "%s\n" "${BASH_REMATCH[1]}"
            ((++index))
        done
    fi
}

sep="--------------------------------------------"

# Timestamp start.
ts1=$(date +%s.%N)

# Print debug information.
printf "%s\n%s MAP: %s\n%s\n" \
    "$sep" "$(prnt_ts)" "$file_in" "$sep" >&2

# Read sorted file to array.
mapfile arr <<< "$(sort -bk1,1 "$file_in")"

# Print debug information.
printf "%s\n%s MAP DONE\n%s\n" \
    "$sep" "$(prnt_ts)" "$sep" >&2

# Define length of array.
arr_len=${#arr[@]}

# Print time at start of search.
printf "%s\n%s SEARCH BY INPUT: %s\n%s\n" \
    "$sep" "$(prnt_ts)" "$file_srch" "$sep" >&2

# Read filter file.
re_neg_srch='^[ '$'\t'$'\n'']*$'
debug=0
while IFS=$'\n'$'\t'-" " read -r ip time trash; do
    if ! [[ "$ip" =~ $re_neg_srch ]]; then
        ((debug)) && printf "%s\n%s SEARCH: %s\n%s\n" \
            "$sep" "$(prnt_ts)" "$ip" "$sep" >&2
        # Do the search.
        search "$ip"
    fi
done < "$file_srch"

# Print time at end of search.
printf "%s\n%s SEARCH DONE\n%s\n" \
    "$sep" "$(prnt_ts)" "$sep" >&2

# Print total time.
ts2=$(date +%s.%N)
echo $ts1 $ts2 | awk '{printf "TIME: %f\n", $2 - $1}' >&2
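Assuming the script is saved as bsearch.sh (a name chosen here, not from the answer) and the two file names at the top are adjusted, matches go to stdout and the debug/timing output to stderr:

$ bash bsearch.sh > times_out 2> debug_log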

Runium