I am currently working on creating a bash script that is supposed to process large log files from one of my programs. When I first started the script took around 15 seconds to complete which wasn't bad but I wanted to improve it. I implemented a queue with mkfifo
and reduced the parse time to 6 seconds. I wanted to ask you guys is there any way to improve the parsing speed of the script.
The current version of the script:
#!/usr/bin/env bash
# $1 is server log file
# $2 is client logs file directory
declare -A orders_array
fifo=$HOME/.fifoDate-$$
mkfifo $fifo
Queue for time conversion
exec 5> >(exec stdbuf -o0 date -f - +%s%3N >$fifo)
exec 6< $fifo
Queue for ID extraction
exec 7> >(exec stdbuf -o0 grep -oP '[0-9a-f]-[0-9a-f]-[0-9a-f]-[0-9a-f]-[0-9a-f]*' >$fifo)
exec 8< $fifo
rm $fifo
while read line; do
order_id=${line:52:36};
echo >&5 "${line:1:26}"
read -t 1 -u6 converted_time
orders_array[$order_id]=$converted_time
done < <(grep -ah 'MarketOrderTransitions.*[MarketMessages::OrderExecuted]' $1)
while read line; do
echo >&7 "$line"
read -t 1 -u8 id
echo >&5 "${line:1:26}"
read -t 1 -u6 converted_time
time_diff="$(($converted_time - orders_array[$id]))"
echo "$id -> $time_diff ms"
done < <(grep -ah 'Event received (OrderExecuted)' $2/market.log)
The basic task of the script is to extract timestamps of messages from client and server log files, find their matching message ids, and based on that calculate the number of milliseconds that passed between the server sending them and client recieving.
The first while loop finishes pretty fast (1.5 seconds), but the second part (my guess is because of grep
) takes much longer.
The engine file in test has around 500k lines. I also have around 700 client log files (1.3 million lines in total).
The order id in server files is at a fixed location, but in the client log I have to grep for it.
Edit:
As suggested I will add examples of input files: Server:
[2022-12-07 07:36:18.209496] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_process_event] [MarketMessages::OrderExecuted]
[2022-12-07 07:36:18.209558] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_guard] [[True] (lambda at ../subprojects/market_session/private_include/MarketSession/MarketOrderTransitions.hpp:81:24)]
[2022-12-07 07:36:18.209564] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_state_change] [GatewayCommon::States::New --> GatewayCommon::States::Executed]
[2022-12-07 07:36:18.209567] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_action] [(lambda at ../subprojects/market_session/private_include/MarketSession/MarketOrderTransitions.hpp:57:25) for event: MarketMessages::OrderExecuted]
[2022-12-07 07:36:18.209574] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_process_event] [boost::sml::v1_1_0::back::on_entry<boost::sml::v1_1_0::back::_, MarketMessages::OrderExecuted>]
the id is in square brackets after MarketOrderTransitions (a4ec2abf-059f-4452-b503-ae58da2ce1ff)
Client
[2022-12-07 07:38:47.545433] [twap_algohawk] [info] [] [Event received (OrderExecuted): {"MessageType":"MarketMessages::OrderExecuted","averagePrice":"49.900000","counterPartyIds":{"activeId":"dIh5wYd/S4ChqMQSKMxEgQ**","executionId":"2295","inactiveId":"","orderId":"3dOKjIoURqm8JjWERtInkw**"},"cumulativeQuantity":"1200.000000","executedPrice":"49.900000","executedQuantity":"1200.000000","executionStatus":"Executed","instrument":[["Symbol","5"],["Isin","5"],["SecurityIDSource","4"],["Mic","MARS"]],"lastFillMarket":"MARS","leavesQuantity":"0.000000","marketSendTime":"07:38:31.972000000","orderId":"a4ec2abf-059f-4452-b503-ae58da2ce1ff","orderPrice":"49.900000","orderQuantity":"1200.000000","propagationData":[],"reportId":"Qx2k73f7QqCqcT0LTEJIXQ**","side":"Buy","sideDetails":"Unknown","transactionTime":"00:00:00.000000000"}]
The id in the client log is inside orderId tag (there is 2 of them and I use the second one)
The wanted output is:
98ddcfca-d838-4e49-8f10-b9f780a27470 -> 854 ms
5a266ca4-67c6-4482-9068-788a3520b2f3 -> 18 ms
2e8d28de-eac0-4776-85ab-c75d9719b7c6 -> 58950 ms
409034eb-4e55-4e39-901a-eba770d497c0 -> 56172 ms
5b1dc7e8-fae0-43d2-86ea-d3df4dbe810b -> 52505 ms
5249ac24-39d2-40f5-8adf-dcf0410aebb5 -> 17446 ms
bef18cb3-8cef-4d8a-b244-47fed82f21ea -> 1691 ms
7c53c950-23fd-497e-a011-c07363d5fe02 -> 18194 ms
I am in particular concerned only about the "order executed" messages in the log files
awk
program, but coming up with this program requires us to reverse engineer what the data looks like from your code. – Kusalananda Dec 12 '22 at 09:38854 ms
in your expected output come from and make sure the expected output is output you expect from the same input you posted, then we can help you. I posted a script to show where we're at so far in being able to help given the example and explanation of requirements you've provided so far. – Ed Morton Dec 13 '22 at 01:23I am in particular concerned only about the "order executed" messages in the log files
but you don't say why those messages are a concern. – Ed Morton Dec 13 '22 at 01:29