How to extract multiple values from a file in a single pass?

Question

I have a huge log file (about 6GB) from a simulation. Among the millions of lines in that file, there are two lines that are frequently repeating for a given time:

...
Max value of omega = 3.0355
Time = 0.000001
....
Max value of omega = 4.3644
Time = 0.000013
...
Max value of omega = 3.7319
Time = 0.000025
...
...
...
Max value of omega = 7.0695
Time = 1.32125
...
... etc.

I would like to extract both "Max value of omega" and "Time" and save them in a single file as columns:

#time max_omega
0.000001 3.0355
0.000013 4.3644
0.000025 3.7319
...etc.

I proceeded as follows:

# The following takes about 15 seconds
grep -F 'Max value of omega' logfile | cut -d "=" -f 2 > max_omega_file.txt

and the same for "Time":

# This also takes about 15 seconds
# Very important: match exactly 'Time =' because there are other lines that contain the word 'Time'
grep -F 'Time =' logfile | cut -d "=" -f 2 > time.txt

Then I need to use the paste command to create a two-column file, with time.txt as the first column and max_omega_file.txt as the second.
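For reference, a minimal sketch of the whole two-pass procedure including the paste step. The sample log contents and the temporary directory are made up for illustration; only the grep, cut and paste commands come from the steps above.

```shell
# Hypothetical miniature log standing in for the real 6GB file
tmp=$(mktemp -d)
cat > "$tmp/logfile" <<'EOF'
Solving for omega
Max value of omega = 3.0355
Time = 0.000001
Time step continuity errors
Max value of omega = 4.3644
Time = 0.000013
EOF

# Pass 1 and pass 2: each grep reads the whole log once
grep -F 'Max value of omega' "$tmp/logfile" | cut -d "=" -f 2 > "$tmp/max_omega_file.txt"
grep -F 'Time =' "$tmp/logfile" | cut -d "=" -f 2 > "$tmp/time.txt"

# paste joins the two files line by line, separated by a tab
result=$(paste "$tmp/time.txt" "$tmp/max_omega_file.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Each value keeps the leading space that cut leaves after the = sign.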

As you can see, the run time is doubled by the two passes above. Is there a way to get the same result in a single pass, so that I can save some time?

αғsнιη · Accepted Answer · 2020-11-22T05:45:18.877

10

sed -n '/^Max/ { s/^.*=\s*//;h; };
        /^Time/{ s/^.*=\s*//;G; s/\n/ /;p; }' infile

match-run syntax /.../{ ... }:
the commands within {...} run only on lines that match the regex within /.../.
s/^.*=\s*//:
deletes everything up to the last =, plus any whitespace (\s*) that follows it.
h:
copies the result into the hold space.
G:
appends the hold space to the pattern space, separated by an embedded newline.
s/\n/ /:
replaces that embedded newline with a space in the pattern space.
p:
prints the pattern space; the P command would also work here, since no newline remains at that point.

0.000001 3.0355
0.000013 4.3644
0.000025 3.7319
1.32125 7.0695

A similar approach, proposed by @stevesliva, uses s//<replace>/, where the empty regex is shorthand for "the last regex that was used", so the substitution applies to whatever the address just matched:

sed -n '/^Max.*=\s*/ { s///;h; };
        /^Time.*=\s*/{ s///;G; s/\n/ /;p; }' infile
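If the log also contains other lines starting with Max or Time, the addresses can be tightened to the literal prefixes. The following is only a sketch: the sample log and its distractor lines are invented, and [[:space:]] is used as a portable equivalent of the GNU-specific \s.

```shell
# Hypothetical sample log with lines that a bare /^Max/ or /^Time/ would also match
tmp=$(mktemp -d)
cat > "$tmp/logfile" <<'EOF'
Max iterations = 50
Max value of omega = 3.0355
Time = 0.000001
Time step = 1e-05
Max value of omega = 4.3644
Time = 0.000013
EOF

# Same hold-space technique, but anchored on the full literal prefixes
result=$(sed -n '/^Max value of omega =/ { s/^.*=[[:space:]]*//; h; }
                 /^Time =/ { s/^.*=[[:space:]]*//; G; s/\n/ /; p; }' "$tmp/logfile")
printf '%s\n' "$result"
rm -rf "$tmp"
```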

edited Nov 22 '20 at 05:45

answered Nov 21 '20 at 14:38

αғsнιη

41,407

The ^Max regex is not sufficient because there are also lines like 'Max iterations', etc. The same goes for Time: I want to match "Time =" literally to get the expected line. – adhrar_nmatrous Nov 21 '20 at 14:51
@adhrar_nmatrous you are free to change per your need – αғsнιη Nov 21 '20 at 14:53
This is the fastest solution so far. – adhrar_nmatrous Nov 21 '20 at 15:15
sed allows the use of // to repeat the last match, which might allow some shorthand here: /^Max.*=\s*/{s///;h;};/^Time.*=\s*/{s///;G;s/\n/ /;p;} – stevesliva Nov 22 '20 at 04:40
@stevesliva that's a cool feature! I was just wishing sed had something like this, but never expected that it really does. Amazing, I will update my answer with your suggestion. Thank you – αғsнιη Nov 22 '20 at 05:12

score 7 · Answer 2 · answered Nov 21 '20 at 13:59

I can't guarantee it will be faster, but you could do something like this in awk:

awk -F' = ' '$1=="Max value of omega" {omega = $2} $1=="Time" {print $2, omega}' file
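To illustrate how the field splitting makes the match exact, here is a small sketch on an invented sample log. The distractor lines are hypothetical, and the time is printed first to match the column order requested in the question.

```shell
# Hypothetical sample log; "Time step" and "Total Time elapsed" must not match
tmp=$(mktemp -d)
cat > "$tmp/logfile" <<'EOF'
Time step = 1e-05
Max value of omega = 3.0355
Time = 0.000001
Total Time elapsed
Max value of omega = 4.3644
Time = 0.000013
EOF

# With -F' = ' the first field is everything before " = ",
# so $1 == "Time" is true only for lines of the form "Time = <value>"
result=$(awk -F' = ' '$1 == "Max value of omega" { omega = $2 }
                      $1 == "Time" { print $2, omega }' "$tmp/logfile")
printf '%s\n' "$result"
rm -rf "$tmp"
```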

answered Nov 21 '20 at 13:59

steeldriver

81,074

Hi. For the time, I need to search literally for "Time =" (with the = sign) because the word "Time" exists elsewhere in the file. Could you please explain what the awk command is doing? – adhrar_nmatrous Nov 21 '20 at 14:03
@adhrar_nmatrous it splits lines into fields delimited by " = " and then tests the value of the first field. So effectively the second condition matches lines that start with exactly Time = – steeldriver Nov 21 '20 at 14:29
Thank you, it works, but it's relatively slow compared to sed. – adhrar_nmatrous Nov 21 '20 at 15:15

Ed Morton · Answer 3 · 2020-11-21T15:16:35.600

5

$ awk 'BEGIN{print "#time", "omega"} /^Max value of omega =/{omega=$NF; next} /^Time =/{print $NF, omega}' file
#time omega
0.000001 3.0355
0.000013 4.3644
0.000025 3.7319
1.32125 7.0695

but this will probably be faster:

$ grep -E '^(Max value of omega|Time) =' file |
    awk 'BEGIN{print "#time", "omega"} NR%2{omega=$NF; next} {print $NF, omega}'
#time omega
0.000001 3.0355
0.000013 4.3644
0.000025 3.7319
1.32125 7.0695

edited Nov 21 '20 at 15:16

answered Nov 21 '20 at 15:07

Ed Morton

31,617

Certainly, first reducing the data with a grep is the key. Better grep -A1 for one fixed string from the beginning (to get a few more seconds) and do your formatting based on NR%3 – thanasisp Nov 21 '20 at 16:31
@thanasisp Yeah, I considered that but then I'd have to introduce additional tests to awk to not just do NR%2 once but to do NR%3==1 and then NR%3==2 or introduce a variable to hold the NR%3 result first... it just seemed like any potential speed up from the grep would be lost by the additional awk comparisons and it just wasn't worth bothering with. – Ed Morton Nov 21 '20 at 16:49
It won't be lost, some lines of some MB will match from the total of 6GB. And yes the NR%3==1 or 2 hardcoded seems good. If 3 of 6 GB are matching, I wouldn't be sad, and think of one or two tests, awk is for formatting here, assuming real file inputs. – thanasisp Nov 21 '20 at 17:04
All I'm saying is that while the grep should be (but may not be, idk) faster, the awk will be slower, and overall I doubt there'll be much difference in execution speed between the 2 approaches. I could be wrong of course, but I don't care enough to test it. – Ed Morton Nov 21 '20 at 19:26
Overall, the second command (awk here) will spend a lot of its lifetime idle, waiting for the next line (for real cases, assuming 0-10% of the file lines matching). – thanasisp Nov 21 '20 at 19:34
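The grep -A1 variant discussed in these comments could look roughly like the sketch below. It rests on two assumptions that the comments do not spell out: the Time line directly follows each Max line, and at least one unrelated line separates consecutive pairs, so that grep prints a "--" separator between groups. The sample log is invented.

```shell
# Hypothetical sample log with unrelated lines between the pairs
tmp=$(mktemp -d)
cat > "$tmp/logfile" <<'EOF'
Solving for omega
Max value of omega = 3.0355
Time = 0.000001
Courant Number mean: 0.1
Max value of omega = 4.3644
Time = 0.000013
End
EOF

# grep emits groups of three lines: the match, one line of context, and "--"
result=$(grep -F -A1 'Max value of omega = ' "$tmp/logfile" |
  awk 'NR % 3 == 1 { omega = $NF } NR % 3 == 2 { print $NF, omega }')
printf '%s\n' "$result"
rm -rf "$tmp"
```

If two pairs were directly adjacent in the log, grep would print no separator between them and the NR%3 bookkeeping would drift, which is the main risk of this variant.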

score 0 · Answer 4 · answered Nov 21 '20 at 15:26

Something like

paste \
  <(<file awk -F= '$1 ~ /omega/ {print $2}') \
  <(<file awk -F= '$1 ~ /Time/ {print $2}')

I think even

<file grep -o '[[:digit:].]*' | paste - -

Or

<file cut -d= -f2 | paste - -

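would do, as long as the file contains nothing but those two kinds of lines (the last two commands do no filtering).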

answered Nov 21 '20 at 15:26

D. Ben Knoble

512

score 0 · Answer 5 · answered Nov 22 '20 at 17:17

grep can search for multiple patterns in one go:

-e PATTERNS, --regexp=PATTERNS
Use PATTERNS as the patterns. If this option is used multiple times or is combined with the -f (--file) option, search for all patterns given. This option can be used to protect a pattern beginning with “-”.

So

grep -F -e 'Max value of omega = ' -e 'Time = ' logfile

will reduce the size of the search space. Then you can post-process the result with one of the other suggestions.
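As one possible combination, the grep output can be piped into an awk filter like the one from another answer. This is a sketch on an invented sample log; note that with -F the pattern 'Time = ' can match anywhere in a line, so the awk stage still checks the first field exactly.

```shell
# Hypothetical sample log; "ExecutionTime = 0.5 s" passes grep but is rejected by awk
tmp=$(mktemp -d)
cat > "$tmp/logfile" <<'EOF'
Solving for omega
Max value of omega = 3.0355
Time = 0.000001
ExecutionTime = 0.5 s
Max value of omega = 4.3644
Time = 0.000013
EOF

# grep narrows the input in one pass; awk pairs each time with the last omega seen
result=$(grep -F -e 'Max value of omega = ' -e 'Time = ' "$tmp/logfile" |
  awk -F' = ' '$1 == "Max value of omega" { omega = $2 }
               $1 == "Time" { print $2, omega }')
printf '%s\n' "$result"
rm -rf "$tmp"
```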

score 0 · Answer 6 · answered Nov 22 '20 at 21:44

An alternative, perhaps simpler, sed solution would be:

sed -nr 'N;s/^Max value of omega = ([0-9.]+)\nTime = ([0-9.]+)$/\1 \2/p;D;' logfile

where N appends the next line to the pattern space, the s/pattern/string/p command looks for the two-line pattern and prints the two capture groups (\1 \2) separated by a space, and finally D discards the first line from the pattern space.

One advantage of this approach, which I've used in the past when looking for multi-line patterns, is that you can print the capture groups in an arbitrary order, not necessarily the order in which they appear in the file. So in your example, if you wanted "Time" in the first column, you could simply do this:

sed -nr 'N;s/^Max value of omega = ([0-9.]+)\nTime = ([0-9.]+)$/\2 \1/p;D;' logfile

Note it now says "\2 \1" rather than "\1 \2".
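A quick way to check the sliding two-line window is to run it on a small log that has unrelated lines between the pairs. The sample below is invented, and -E is used in place of -r (GNU sed treats the two options as the same).

```shell
# Hypothetical sample log with unrelated lines around each pair
tmp=$(mktemp -d)
cat > "$tmp/logfile" <<'EOF'
Solving for omega
Max value of omega = 3.0355
Time = 0.000001
Courant Number mean: 0.1
Max value of omega = 4.3644
Time = 0.000013
End
EOF

# N pulls in the next line, s///p prints only when both lines match,
# and D drops the first line so the window slides forward by one
result=$(sed -nE 'N;s/^Max value of omega = ([0-9.]+)\nTime = ([0-9.]+)$/\2 \1/p;D;' "$tmp/logfile")
printf '%s\n' "$result"
rm -rf "$tmp"
```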
