
I am currently working on a bash script that processes large log files from one of my programs. When I first started, the script took around 15 seconds to complete, which wasn't bad, but I wanted to improve it. I implemented a queue with mkfifo and reduced the parse time to 6 seconds. Is there any way to improve the parsing speed of the script further?
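For anyone unfamiliar with the pattern, here is a minimal, stripped-down sketch of the mkfifo queue idea (assuming GNU date and stdbuf are available): a single long-running date process converts timestamps to epoch milliseconds, instead of forking a new date process for every line.

```shell
#!/usr/bin/env bash
# Minimal sketch of the fifo "queue" idea: one long-running GNU date
# process converts timestamps, instead of one fork of date per line.
fifo=$(mktemp -u)
mkfifo "$fifo"

# date -f - reads timestamps from stdin; +%s%3N prints epoch milliseconds.
exec 5> >(exec stdbuf -o0 date -f - +%s%3N > "$fifo")
exec 6< "$fifo"
rm "$fifo"

echo '2022-12-07 07:36:18.209496' >&5
read -t 1 -u6 ms
echo "$ms"
```

The write to fd 5 and the read from fd 6 form one round trip through the background date process; this is exactly the round trip the script below performs once (or twice) per log line.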

The current version of the script:

#!/usr/bin/env bash
# $1 is server log file
# $2 is client logs file directory

declare -A orders_array

fifo=$HOME/.fifoDate-$$
mkfifo $fifo

# Queue for time conversion
exec 5> >(exec stdbuf -o0 date -f - +%s%3N >$fifo)
exec 6< $fifo

# Queue for ID extraction
exec 7> >(exec stdbuf -o0 grep -oP '[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*' >$fifo)
exec 8< $fifo
rm $fifo

while read line; do
    order_id=${line:52:36}
    echo >&5 "${line:1:26}"
    read -t 1 -u6 converted_time
    orders_array[$order_id]=$converted_time
done < <(grep -ah 'MarketOrderTransitions.*\[MarketMessages::OrderExecuted\]' $1)

while read line; do
    echo >&7 "$line"
    read -t 1 -u8 id
    echo >&5 "${line:1:26}"
    read -t 1 -u6 converted_time
    time_diff="$(($converted_time - orders_array[$id]))"
    echo "$id -> $time_diff ms"
done < <(grep -ah 'Event received (OrderExecuted)' $2/market.log)

The basic task of the script is to extract the timestamps of messages from the client and server log files, match them up by message id, and from that calculate the number of milliseconds that passed between the server sending a message and the client receiving it.

The first while loop finishes pretty fast (1.5 seconds), but the second part (my guess is because of grep) takes much longer.

The engine file in test has around 500k lines. I also have around 700 client log files (1.3 million lines in total).

The order id in server files is at a fixed location, but in the client log I have to grep for it.

Edit:

As suggested, I am adding examples of the input files. Server:

[2022-12-07 07:36:18.209496] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_process_event] [MarketMessages::OrderExecuted]
[2022-12-07 07:36:18.209558] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_guard] [[True] (lambda at ../subprojects/market_session/private_include/MarketSession/MarketOrderTransitions.hpp:81:24)]
[2022-12-07 07:36:18.209564] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_state_change] [GatewayCommon::States::New --> GatewayCommon::States::Executed]
[2022-12-07 07:36:18.209567] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_action] [(lambda at ../subprojects/market_session/private_include/MarketSession/MarketOrderTransitions.hpp:57:25) for event: MarketMessages::OrderExecuted]
[2022-12-07 07:36:18.209574] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_process_event] [boost::sml::v1_1_0::back::on_entry<boost::sml::v1_1_0::back::_, MarketMessages::OrderExecuted>]

the id is in square brackets after MarketOrderTransitions (a4ec2abf-059f-4452-b503-ae58da2ce1ff)
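As a side note, the fixed offsets the script relies on (`${line:1:26}` for the timestamp, `${line:52:36}` for the id) can be checked against the sample server line above; this is just a sketch using that one line:

```shell
#!/usr/bin/env bash
# Check of the fixed offsets against the sample server line: the
# timestamp is 26 chars starting at offset 1, the order id is 36 chars
# starting at offset 52.
line='[2022-12-07 07:36:18.209496] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_process_event] [MarketMessages::OrderExecuted]'
ts=${line:1:26}
id=${line:52:36}
echo "$ts"   # 2022-12-07 07:36:18.209496
echo "$id"   # a4ec2abf-059f-4452-b503-ae58da2ce1ff
```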

Client

[2022-12-07 07:38:47.545433] [twap_algohawk] [info] [] [Event received (OrderExecuted): {"MessageType":"MarketMessages::OrderExecuted","averagePrice":"49.900000","counterPartyIds":{"activeId":"dIh5wYd/S4ChqMQSKMxEgQ**","executionId":"2295","inactiveId":"","orderId":"3dOKjIoURqm8JjWERtInkw**"},"cumulativeQuantity":"1200.000000","executedPrice":"49.900000","executedQuantity":"1200.000000","executionStatus":"Executed","instrument":[["Symbol","5"],["Isin","5"],["SecurityIDSource","4"],["Mic","MARS"]],"lastFillMarket":"MARS","leavesQuantity":"0.000000","marketSendTime":"07:38:31.972000000","orderId":"a4ec2abf-059f-4452-b503-ae58da2ce1ff","orderPrice":"49.900000","orderQuantity":"1200.000000","propagationData":[],"reportId":"Qx2k73f7QqCqcT0LTEJIXQ**","side":"Buy","sideDetails":"Unknown","transactionTime":"00:00:00.000000000"}]

The id in the client log is inside the orderId tag (there are two of them and I use the second one)
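One way to pick out the second orderId without a per-line grep round trip is a single greedy match, which naturally skips past the first occurrence. A sketch, using an abbreviated version of the client line above (trimmed down but keeping both orderId fields):

```shell
#!/usr/bin/env bash
# Sketch: extract the *second* "orderId" value with one greedy sed match;
# the leading greedy .* skips past the first occurrence. The line below is
# an abbreviated version of the sample client log line.
line='[2022-12-07 07:38:47.545433] [twap_algohawk] [info] [] [Event received (OrderExecuted): {"counterPartyIds":{"orderId":"3dOKjIoURqm8JjWERtInkw**"},"orderId":"a4ec2abf-059f-4452-b503-ae58da2ce1ff","side":"Buy"}]'
id=$(printf '%s\n' "$line" | sed 's/.*"orderId":"\([^"]*\)".*/\1/')
echo "$id"   # a4ec2abf-059f-4452-b503-ae58da2ce1ff
```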

The wanted output is:

98ddcfca-d838-4e49-8f10-b9f780a27470 -> 854 ms
5a266ca4-67c6-4482-9068-788a3520b2f3 -> 18 ms
2e8d28de-eac0-4776-85ab-c75d9719b7c6 -> 58950 ms
409034eb-4e55-4e39-901a-eba770d497c0 -> 56172 ms
5b1dc7e8-fae0-43d2-86ea-d3df4dbe810b -> 52505 ms
5249ac24-39d2-40f5-8adf-dcf0410aebb5 -> 17446 ms
bef18cb3-8cef-4d8a-b244-47fed82f21ea -> 1691 ms
7c53c950-23fd-497e-a011-c07363d5fe02 -> 18194 ms

In particular, I am only concerned about the "order executed" messages in the log files.

Dzamba
  • It's difficult to say much about how this may be improved without seeing the input and the wanted output. I have a feeling this could be better implemented as a single awk program, but coming up with this program requires us to reverse engineer what the data looks like from your code. – Kusalananda Dec 12 '22 at 09:38
  • Edit: I added examples of input. Thanks for the suggestion. – Dzamba Dec 12 '22 at 10:15
  • If you want speed, don't write shell scripts. There's very little point in trying to squeeze a few seconds out of a shell script since pretty much by definition the shell will be slower than just about any other language. – terdon Dec 12 '22 at 12:26
  • Agree with @terdon. A 1M-line logfile is too much for a shell script. You should/could do it in Python. – dr_ Dec 12 '22 at 16:10
  • Python ain't really a champ regarding speed either, but it should beat bash easily. If you need to go even faster, you'll probably have to write something in a compiled language. – Panki Dec 12 '22 at 18:36
  • See why-is-using-a-shell-loop-to-process-text-considered-bad-practice to learn about the speed and some other issues with your script. – Ed Morton Dec 13 '22 at 01:06
  • If you [edit] your question to explain where those numbers like 854 ms in your expected output come from and make sure the expected output is output you expect from the same input you posted, then we can help you. I posted a script to show where we're at so far in being able to help given the example and explanation of requirements you've provided so far. – Ed Morton Dec 13 '22 at 01:23
  • You say I am in particular concerned only about the "order executed" messages in the log files but you don't say why those messages are a concern. – Ed Morton Dec 13 '22 at 01:29

1 Answer

To show where we're at so far: you've shown us 2 input files, Client and Server, and told us where to find the ID in each. Extracting those IDs would look like this, using any awk:

$ cat tst.sh
#!/usr/bin/env bash

awk '
    (NR == FNR) && match($0,/\[MarketOrderTransitions[^]]+]/) {
        id = substr($0,RSTART+23,RLENGTH-24)
        print FILENAME, id
    }
    (NR > FNR) && match($0,/.*"orderId":"/) {
        id = substr($0,RLENGTH+1)
        sub(/".*/,"",id)
        print FILENAME, id
    }
' "$@"

$ ./tst.sh Server Client
Server a4ec2abf-059f-4452-b503-ae58da2ce1ff
Server a4ec2abf-059f-4452-b503-ae58da2ce1ff
Server a4ec2abf-059f-4452-b503-ae58da2ce1ff
Server a4ec2abf-059f-4452-b503-ae58da2ce1ff
Server a4ec2abf-059f-4452-b503-ae58da2ce1ff
Client a4ec2abf-059f-4452-b503-ae58da2ce1ff

You've also said your expected output is a list of what looks like similar IDs with numbers next to them but those IDs don't seem to be related to the sample input you provided, and you haven't told us where those numbers came from.

Once you can express your requirements and provide a testable example in your question we can complete this script and it'll run orders of magnitude faster than your shell script.

One guess at what you might be trying to do would be the following, using GNU awk for time functions:

$ cat tst.sh
#!/usr/bin/env bash

awk '
    { time = substr($0,2,26) }
    (NR == FNR) && match($0,/\[MarketOrderTransitions[^]]+]/) {
        id = substr($0,RSTART+23,RLENGTH-24)
        orders_time[id] = time
    }
    (NR > FNR) && match($0,/.*"orderId":"/) {
        id = substr($0,RLENGTH+1)
        sub(/".*/,"",id)
        time_diff = time2ms(time) - time2ms(orders_time[id])
        print id " -> " time_diff " ms"
    }

    function time2ms(time,      t,secs) {
        gsub(/[-:]/," ",time)
        split(time,t,/[.]/)
        return ( mktime(t[1]) substr(t[2],1,3) )
    }
' "$@"

$ ./tst.sh Server Client
a4ec2abf-059f-4452-b503-ae58da2ce1ff -> 149336 ms

but since your posted expected output doesn't seem to be related to your posted sample input, I've no idea if that's correct or not.
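For what it's worth, that 149336 ms figure can be sanity-checked independently with GNU date, using the two sample timestamps (the difference is the same in any timezone, since both conversions use the same offset):

```shell
#!/usr/bin/env bash
# Independent check of the 149336 ms result using GNU date: convert both
# sample timestamps to epoch milliseconds and subtract.
server_ms=$(date -d '2022-12-07 07:36:18.209496' +%s%3N)
client_ms=$(date -d '2022-12-07 07:38:47.545433' +%s%3N)
diff_ms=$(( client_ms - server_ms ))
echo "$diff_ms"   # 149336
```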

Ed Morton