Get the last occurrence of a pattern before another pattern

Question

In a file like this one:

...
Pattern2:TheWrongBar
foo 
Pattern2:TheRightBar
foo 
First Pattern
foo
...

I need to find the last occurrence of Pattern2 that comes before First Pattern, which in this case is Pattern2:TheRightBar.

My first idea is to keep only the part of the file before First Pattern, reverse it, and take the first match:

sed -e '/First Pattern/,$d' myfile | tac | grep -m1 "Pattern I need to get"

Is there a way to optimise this?

I am going to try all the solutions and compare their speed, I will let you know my result and accept the answer accordingly — aze, Sep 29 '16 at 16:19
sed '/First Pattern/q' would be better than sed -e '/First Pattern/,$d' so sed stops reading at the first occurrence. — Stéphane Chazelas, Oct 04 '16 at 14:38
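
A minimal sketch of the question's pipeline with that suggestion applied. The temporary file and the Pattern2:TooLate line are made up for illustration; note that q prints the matching line before quitting, so the First Pattern line itself is still passed to tac.

```shell
# Build a small sample file shaped like the one in the question.
# (The temp file and the trailing "Pattern2:TooLate" line are illustrative only.)
sample=$(mktemp)
printf '%s\n' '...' 'Pattern2:TheWrongBar' 'foo' 'Pattern2:TheRightBar' \
    'foo' 'First Pattern' 'foo' 'Pattern2:TooLate' > "$sample"

# q prints the current line and quits, so sed stops reading at the first match
# instead of scanning (and deleting) the rest of the file.
result=$(sed '/First Pattern/q' "$sample" | tac | grep -m1 'Pattern2')
printf '%s\n' "$result"    # prints Pattern2:TheRightBar

rm -f "$sample"
```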

score 2 · Answer 1 · answered Sep 29 '16 at 14:05

With awk:

awk '/Pattern2/ {line=$0; next}; /First Pattern/ {print line; exit}' file.txt

/Pattern2/ {line=$0; next}: if the line matches Pattern2, save it in the variable line (overwriting any earlier match) and skip to the next input line.
/First Pattern/ {print line; exit}: if the line matches First Pattern, print the saved line and exit, so the rest of the file is never read.

Example:

% cat file.txt                                                                 
...
Pattern2:TheWrongBar
foo 
Pattern2:TheRightBar
foo 
First Pattern
foo
...

% awk '/Pattern2/ {line=$0; next}; /First Pattern/ {print line; exit}' file.txt
Pattern2:TheRightBar

answered Sep 29 '16 at 14:05 by heemayl

I tried using that in a script, replacing "First Pattern" with "$myvar", but that didn't work. With a literal value it works fine. Do you know why? – aze Sep 29 '16 at 16:17

@aze Pass the patterns as variables: awk -v p1="$myvar" -v p2="Pattern2" '$0 ~ p2 {line=$0; next}; $0 ~ p1 {print line; exit}' file.txt – heemayl Sep 29 '16 at 16:39
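
For context, a small sketch of why the -v form works: the shell does not expand $myvar inside the single-quoted awk program, so the value has to be handed to awk separately. The sample file and the variable names here are only illustrative.

```shell
# Illustrative sample file, same layout as in the question.
sample=$(mktemp)
printf '%s\n' '...' 'Pattern2:TheWrongBar' 'foo' 'Pattern2:TheRightBar' \
    'foo' 'First Pattern' 'foo' > "$sample"

myvar='First Pattern'

# -v copies the shell values into awk variables p1 and p2;
# "$0 ~ p1" then uses the variable's content as a regular expression.
result=$(awk -v p1="$myvar" -v p2='Pattern2' \
    '$0 ~ p2 {line=$0; next} $0 ~ p1 {print line; exit}' "$sample")
printf '%s\n' "$result"    # prints Pattern2:TheRightBar

rm -f "$sample"
```

Keep in mind that awk processes backslash escape sequences in -v assignments, so a value containing backslashes may need extra care.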

don_crissti · Answer 2 · 2016-09-29T14:23:08.313

You could run

sed '/PATTERN2/h;/PATTERN1/!d;x;/PATTERN2/!d;q' infile

How it works:

sed '/PATTERN2/h         # if line matches PATTERN2 save it to hold buffer 
/PATTERN1/!d             # if it doesn't match PATTERN1 delete it
x                        # exchange buffers
/PATTERN2/!d             # if current pattern space doesn't match delete it
q' infile                # quit (auto-printing the current pattern space)

This only exits once at least one line matching PATTERN2 has been seen before some line matching PATTERN1, so with an input like

1
2
PATTERN1
PATTERN2--1st
3
PATTERN2--2nd
PATTERN1
...

it will print

PATTERN2--2nd

If you wanted instead to exit on first match of PATTERN1 regardless, you would run

sed -n '/PATTERN2/h;/PATTERN1/!d;x;/PATTERN2/p;q' infile

which prints nothing with the above input (this one does exactly what your solution does).
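
To see the difference between the two variants, here is a sketch that recreates the sample input above in a temporary file (the file and variable names are arbitrary) and runs both commands on it:

```shell
# Recreate the sample input shown above.
infile=$(mktemp)
printf '%s\n' 1 2 PATTERN1 PATTERN2--1st 3 PATTERN2--2nd PATTERN1 ... > "$infile"

# Variant 1: keeps going until a PATTERN1 line is preceded by a PATTERN2 line.
first=$(sed '/PATTERN2/h;/PATTERN1/!d;x;/PATTERN2/!d;q' "$infile")

# Variant 2: quits at the very first PATTERN1 line, printing only if a
# PATTERN2 line was saved in the hold buffer before it.
second=$(sed -n '/PATTERN2/h;/PATTERN1/!d;x;/PATTERN2/p;q' "$infile")

printf 'variant 1: [%s]\n' "$first"     # variant 1: [PATTERN2--2nd]
printf 'variant 2: [%s]\n' "$second"    # variant 2: []

rm -f "$infile"
```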

I couldn't get it to work with "$var" instead of PATTERN1 either. Any idea on why? — aze, Sep 29 '16 at 16:24
@aze - variables don't expand inside single quotes; you have to use double quotes, e.g. sed "/PATTERN2/h;/$var/!d;x;/PATTERN2/!d;q" infile. This only works as-is if $var contains no characters that are special to sed (the / delimiter or regex metacharacters); any such characters have to be escaped first. – don_crissti, Sep 29 '16 at 16:29
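
One possible way to do that escaping, as a sketch rather than a complete solution: it assumes $var is a single line that should match literally as a basic regular expression, and the sample value and file are made up.

```shell
# A value containing characters that are special in a sed address.
var='a.b/c'

# Prefix ] [ \ . * ^ $ and / with a backslash so they match literally.
# "|" is used as the s-command delimiter to keep "/" unambiguous.
escaped=$(printf '%s\n' "$var" | sed 's|[][\.*^$/]|\\&|g')

# Illustrative input: "axb/c" would match an unescaped "." but must not match here.
infile=$(mktemp)
printf '%s\n' 'PATTERN2 x' 'axb/c' 'PATTERN2 y' 'a.b/c' 'tail' > "$infile"

result=$(sed "/PATTERN2/h;/${escaped}/!d;x;/PATTERN2/!d;q" "$infile")
printf '%s\n' "$result"    # prints PATTERN2 y

rm -f "$infile"
```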

score 0 · Answer 3 · answered Oct 04 '16 at 13:42

Turns out the most efficient way in my case was:

grep -m1 "First Pattern" my_file -B 10000000 | tac | grep -m1 "Pattern2"

Obviously a fixed -B value will not work for every input (the wanted line may be further back than that), but grep was so much faster than awk or sed that I went with this solution. The larger the value given to -B, the slower the search becomes.

answered Oct 04 '16 at 13:42 by aze

-B 10000000 (throwing a random number and hoping it works) isn't the right way to go about this. What happens if it occurs 10000001 lines after? – Miati Oct 04 '16 at 14:45
That's why I didn't choose my own answer as a proper solution, but awk and sed are too slow for this to be efficient. I haven't found a way to get the full text before the match with grep. – aze Oct 05 '16 at 08:07
It looks like you can, using head --lines=+n. I updated my answer to reflect this method; it seems to work well. – Miati Oct 05 '16 at 15:12
How much slower is "too slow"? How many lines are in a file on average? How did you test the speed of each solution? Did you drop caches and sync between runs? – don_crissti Oct 05 '16 at 15:40

Miati · Accepted Answer · 2016-10-05T15:12:08.930

Find the line number of the first "First Pattern" match, use head to output the file up to that line, reverse it with tac, and grep for the first Pattern2 line, which is the last one in the original order.

head --lines=+"$(grep -nm1 "First Pattern" file | cut -d\: -f1)" file | tac | grep -m1 "Pattern2"

E.g., with First Pattern on line 6:

head --lines=+6 file | tac | grep -m1 "Pattern2"
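
The same idea can be wrapped in a small function. This is only a sketch: the name last_before is made up, it returns non-zero when the first pattern is missing, and unlike the one-liner above it stops one line short so the First Pattern line itself is never searched.

```shell
# last_before FIRST WANTED FILE  (hypothetical helper, not from the answer)
# Prints the last line matching WANTED that occurs before the first FIRST line.
last_before() {
    first=$1 wanted=$2 file=$3
    # Line number of the first match; empty if there is none.
    n=$(grep -n -m1 -e "$first" "$file" | cut -d: -f1)
    [ -n "$n" ] || return 1
    # Everything strictly before line n, reversed, first hit wins.
    head -n "$((n - 1))" "$file" | tac | grep -m1 -e "$wanted"
}
```

Usage would look like: last_before "First Pattern" "Pattern2" file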

This is more reliable than using -B 10000000 in grep. Since speed is important to the OP, I checked the run time, and it also appears to be faster than all the other current answers (on my system):

wc -l file
25910209 file

time awk '/Pattern2/ {line=$0; next}; /First Pattern/ {print line; exit}' file
Pattern2:TheRightBar

real  0m2.881s
user  0m2.844s
sys 0m0.036s

time sed '/Pattern2/h;/First Pattern/!d;x;/Pattern2/!d;q' file
Pattern2:TheRightBar

real  0m5.218s
user  0m5.192s
sys 0m0.024s

time (grep -m1 "First Pattern" file -B 10000000 | tac | grep -m1 "Pattern2")

real  0m0.624s
user  0m0.552s
sys 0m0.124s

time (head --lines=+"$(grep -nm1 "First Pattern" file | cut -d\: -f1)" file | tac | grep -m1 "Pattern2")
Pattern2:TheRightBar

real  0m0.586s
user  0m0.528s
sys 0m0.160s

edited Oct 05 '16 at 15:12, answered Oct 04 '16 at 14:44 by Miati

Thanks a lot for the improvement, that really bothered me. And this also answers the "how much faster" question. If I have some time I will try to do some more tests. I wonder if sed and awk are really slower no matter what. – aze Oct 06 '16 at 09:31
Your solution is actually much faster than mine. In my worst case, my solution spent 2 secs searching for the pattern while yours took 0.6 secs. Fun fact: in that worst case, my solution couldn't even find the right result because the -B value wasn't large enough. – aze Oct 06 '16 at 15:25
Feel free to examine this url on why grep is fast - https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html . Essentially, "The key to making programs fast is to make them do practically nothing. ;-)" It may help explain why awk and sed are slower, they "do more", which makes them much more flexible - but also slower – Miati Oct 06 '16 at 15:52
