Find pattern on multiple lines within BIG log files

Question

To investigate within logs, I am trying to find the very first time a vulnerability in a workflow has been exploited.

The pattern is on multiple lines.

The pattern would be

AAAAAAAAA
BBBBBBBBB
CCCCCCCCC

The problem is that

AAAAAAAAA

or

BBBBBBBBB

or

CCCCCCCCC

can each be found individually anywhere in the log without indicating the vulnerability; only the exact pattern, in this exact order, will help me.

For example

grep -Ei "AAAAAAAAA|BBBBBBBBB|CCCCCCCCC" logfile does not help me, since it also returns every line containing an individual occurrence of AAAAAAAAA, BBBBBBBBB, or CCCCCCCCC.

How can I solve this?

There are quite a number of "multiline match" questions; this one, for example, looks close to yours: Multiline Regexp (grep, sed, awk, perl) — steeldriver, Apr 03 '21 at 21:20
I don't think a multi-line regular expression is going to work in this context, since it would need to be AAAAA.*BBBBB.*CCCCC and each candidate AAAAA would force grep to span the rest of the "Big" file. — Philip Couling, Apr 03 '21 at 23:16

score 1 · Answer 1 · answered Apr 04 '21 at 06:36

Here's a way you can do it in Python. I added a bit to your example to show that you still get the matches you want even if single lines of AAAAAAAAA, BBBBBBBBB, or CCCCCCCCC are scattered throughout the logfile.

Below are the contents of find_log_vulns.py:

#! /usr/bin/python3
import re
test_string = """1234324
AAAAAAAAA
BBBBBBBBB
CCCCCCCCC
absdfjv4er4
AAAAAAAAA
BBBBBBBBB
CCCCCCCCC
123466666
AAAAAAAAA
ghrhvhhhfh
BBBBBBBBB
fjwjefjsjfjwjf
CCCCCCCCC
24wfsgggg
AAAAAAAAA
BBBBBBBBB
CCCCCCCCC
zzzz"""
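# The literal \n characters tie the three lines together. The pattern has no
# ^ or $ anchors, so the re.MULTILINE flag is not needed. Because of the
# trailing \n, a block on the very last line with no final newline is missed.
matches = re.findall('AAAAAAAAA\nBBBBBBBBB\nCCCCCCCCC\n', test_string)
print(matches)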

The result I get from running the above:

$ ./find_log_vulns.py
['AAAAAAAAA\nBBBBBBBBB\nCCCCCCCCC\n', 'AAAAAAAAA\nBBBBBBBBB\nCCCCCCCCC\n', 'AAAAAAAAA\nBBBBBBBBB\nCCCCCCCCC\n']

As shown above, each match will be returned as an element in a list.
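
The script above searches a string that is already in memory, so for a very large log it may be easier to stream the file instead. The following is a minimal sketch under that assumption, not part of the script above: the function name first_block_line and the file name in the usage comment are hypothetical. It keeps only the last three lines in memory and stops at the first place where they equal the pattern, which is what the question asks for.

```python
from collections import deque
from typing import Iterable, Optional, Sequence


def first_block_line(lines: Iterable[str], block: Sequence[str]) -> Optional[int]:
    """Return the 1-based number of the line where `block` first appears
    as consecutive whole lines, or None if it never does."""
    want = tuple(block)
    # The deque holds at most len(block) lines, so memory use stays small.
    window = deque(maxlen=len(want))
    for lineno, line in enumerate(lines, start=1):
        window.append(line.rstrip("\r\n"))  # compare without line endings
        if tuple(window) == want:
            return lineno - len(want) + 1   # line where the block starts
    return None


# Hypothetical usage with a file called "logfile":
# with open("logfile", errors="replace") as fh:
#     print(first_block_line(fh, ["AAAAAAAAA", "BBBBBBBBB", "CCCCCCCCC"]))
```

Since the window slides one line at a time, input such as AAAAAAAAA, AAAAAAAAA, BBBBBBBBB, CCCCCCCCC is still reported, starting at the second line.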

score 1 · Answer 2 · answered Apr 09 '21 at 13:09

Using ripgrep (here "in" is the input file):

rg -U 'A+\nB+\nC+' in
2:AAAAAAAAA
3:BBBBBBBBB
4:CCCCCCCCC
6:AAAAAAAAA
7:BBBBBBBBB
8:CCCCCCCCC
16:AAAAAAAAA
17:BBBBBBBBB
18:CCCCCCCCC

You can drop the line numbers with -N if you don't want them. If you need separators between the matches, you can do this:

rg -U 'A+\nB+\nC+' in | rg --passthru -e '(^A)' -r $'\n'A

AAAAAAAAA
BBBBBBBBB
CCCCCCCCC

AAAAAAAAA
BBBBBBBBB
CCCCCCCCC

AAAAAAAAA
BBBBBBBBB
CCCCCCCCC
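
If only the first occurrence matters, the same anchored expression can also be tried from Python on a memory-mapped file. This is a rough sketch and not something taken from ripgrep: the function names first_match and count_newlines are hypothetical, and the pattern is simply the rg expression above with ^ and $ added so that each part has to fill a whole line.

```python
import mmap
import os
import re
from typing import Optional, Tuple

# Bytes version of the rg expression, anchored to whole lines.
BLOCK = re.compile(rb"^A+\nB+\nC+$", re.MULTILINE)


def count_newlines(mm: mmap.mmap, end: int, chunk: int = 1 << 20) -> int:
    """Count b'\\n' in mm[0:end], one chunk at a time to avoid a huge copy."""
    total = 0
    for start in range(0, end, chunk):
        total += mm[start:min(start + chunk, end)].count(b"\n")
    return total


def first_match(path: str) -> Optional[Tuple[int, int]]:
    """Return (byte offset, 1-based line number) of the first match, or None."""
    with open(path, "rb") as fh:
        if os.fstat(fh.fileno()).st_size == 0:
            return None  # an empty file cannot be memory-mapped
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            m = BLOCK.search(mm)  # the re module accepts mmap objects
            if m is None:
                return None
            return m.start(), count_newlines(mm, m.start()) + 1
```

The search stops at the first hit, so the rest of the file is never read. Whether this is faster than rg or awk on a given machine is something to measure rather than assume.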

score 1 · Answer 3 · answered Apr 09 '21 at 15:50

Using awk:

awk -v ptrn="AAAAAAAAA\0BBBBBBBBB\0CCCCCCCCC\0" '
BEGIN{ split(ptrn, tmp, "\0"); lngth=gsub("\0", "", ptrn ) }
$0 ~ tmp[++fieldNr]{ buf=(buf==""?"": buf OFS) NR":"$0 ;
                     if ( fieldNr == lngth ) { print buf; exit }
                     next
                   }
{ fieldNr=0; buf="" }' infile

This prints the line number followed by the matched line content for the first complete match, then exits. Each line is tested with a partial regexp match against the patterns held in "ptrn"; see How do I find the text that matches a pattern? for other matching options.

We used the NUL character \0 to separate the patterns.

Sample input:

AAAAAAAAA
BBBBBBBBB
CCCCCCCCC
AAAAAAAAA
BBBBBBBBB
ccccccccc
123AAAAAAAAA
BBBBBBBBB123
123CCCCCCCCC3

Output:

8:123AAAAAAAAA 9:BBBBBBBBB123 10:123CCCCCCCCC3

score 1 · Answer 4 · bu5hman

Just for fun, with good old awk, on a file of this size:

cat file | wc -l
21287021

with more than 3,000,000 matches:

time awk 'BEGIN{getline; a=$0; getline; b=$0}
       $0~/^C+$/ && a~/^A+$/ && b~/^B+$/{print "match starting on line "NR-2 }{a=b;b=$0}' file
real    0m12.644s
user    0m7.149s
sys     0m4.314s

Compared with rg on my machine:

time rg -U 'A+\nB+\nC+' file
real    0m40.322s
user    0m16.503s
sys     0m17.246s

answered Apr 09 '21 at 16:15 · edited Apr 10 '21 at 06:15
