I have an awk script that processes very large files that look something like this:
K1353 SF3987.7PD833391.4 KARE
K1353 SF3987.2KD832231.4 MEAKE
K1332 IF4987.7RP832231.2 LEAOS
K1329 SF2787.7KD362619.3 NEDLE
K1353 SK3K84.3KD832231.3 PQAKM
The file is fixed-width: each field occupies a fixed range of columns.
The script extracts two fields from each line, runs an external program over them, and writes the results back into the line. It is much slower than a plain awk script of similar size; the bottleneck appears to be starting a new process for the command on every line.
For demonstration purposes I have used 'rev' here, but the real script runs a custom program that translates these fields. The program itself is quick; it takes just two values per line, which it reads either from STDIN or from a file. It is a third-party binary and I don't know how it works internally.
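BEGIN {
    csmok = "rev"    # stand-in for the real third-party translator
}
{
    type = substr($0, 1, 1)
    if (type == "K") {
        RX = substr($0, 6, 9)     # first field: columns 6-14
        RY = substr($0, 15, 9)    # second field: columns 15-23
        # A shell and a csmok process are started for every K line.
        cmd = sprintf("echo %s %s | %s", RX, RY, csmok)
        if ((cmd | getline output) > 0) {
            split(output, k, " ")
            # Splice the translated values back in by column position.
            $0 = substr($0, 1, 5) k[1] k[2] substr($0, 24)
        }
        close(cmd)
        print
    }
}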
and run like so:
$ awk -f process.awk file.dat
The files I process are sometimes large (around 900,000 lines) and processing them takes a long time. The time goes into the "cmd | getline" call, which starts a shell and the external command for every line.
How would I improve the run time?
I thought about running the external program only once, by concatenating all of the extracted fields into a single invocation, for example:
echo -e "SF3987.7 PD833391.4\nSF3987.2 KD832231.4\nIF4987.7 RP832231.2" | rev
OR
rev << EOF
SF3987.7 PD833391.4
SF3987.2 KD832231.4
IF4987.7 RP832231.2
EOF
I am not sure how to achieve that from awk. Even if I could, I would be left with the translated output and no obvious way to put each value back into the right columns of the right line.
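A rough sketch of that batching idea, in case it helps frame the question. It assumes the external program prints exactly one output line per input line, in the same order; that holds for rev, but I have not confirmed it for the real program. The function name translate_batch and the temporary file are made up for illustration.

```shell
# Sketch only: run the translator once, then merge the results by position.
# Assumes one output line per input line, in the same order.
translate_batch() {
    input=$1
    translator=${2:-rev}          # rev stands in for the real program
    tmp=$(mktemp) || return 1

    # Pass 1: extract both fields from every K line, run the translator once.
    awk 'substr($0, 1, 1) == "K" {
        $0 = substr($0, 6, 9) " " substr($0, 15, 9)
        $1 = $1                   # re-join with single spaces, as echo did
        print
    }' "$input" | "$translator" > "$tmp"

    # Pass 2: walk the input again, pulling one translated line per K line.
    awk -v xlat="$tmp" 'substr($0, 1, 1) == "K" {
        if ((getline out < xlat) > 0) {
            split(out, k, " ")
            # Rebuild the line by column position.
            $0 = substr($0, 1, 5) k[1] k[2] substr($0, 24)
        }
        print
    }' "$input"

    rm -f "$tmp"
}
```

It would be called as "translate_batch file.dat > file.out". The file is read twice, but only one translator process is started, however many lines there are.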
The output should look just like the input, except that the extracted fields are replaced by the external program's translation:
K1353.193338DP7.7893FS4 KARE
K1353.132238DK2.7893FS4 MEAKE
K1332.132238PR7.7894FI2 LEAOS
K1329.916263DK7.7872FS3 NEDLE
K1353.132238DK3.48K3KS3 PQAKM
Alternatively, I'd like to know other ways to accomplish this in a GNU/Linux environment, but without using awk.
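As a starting point for the awk-free route, here is a minimal sketch built from grep, cut, sed, tr and paste. It makes the same assumption as before (one translated line per input line, in order) and further assumes the translated values contain no spaces of their own, since tr deletes every space from the translator's output. The name translate_no_awk is hypothetical.

```shell
# Sketch only: cut each K line into head / two fields / tail, translate the
# fields in a single run, and paste the pieces back together.
translate_no_awk() {
    input=$1
    translator=${2:-rev}          # rev stands in for the real program
    dir=$(mktemp -d) || return 1

    grep '^K' "$input" > "$dir/k"            # keep only the K lines
    cut -c1-5 "$dir/k" > "$dir/head"         # columns before the fields
    cut -c24- "$dir/k" > "$dir/tail"         # columns after the fields

    # Columns 6-14 and 15-23, space-separated, through one translator run;
    # tr then removes the spaces so both results sit next to each other.
    sed 's/^.\{5\}\(.\{9\}\)\(.\{9\}\).*$/\1 \2/' "$dir/k" |
        "$translator" | tr -d ' ' > "$dir/mid"

    # '\0' tells paste to join the columns with no delimiter at all.
    paste -d '\0' "$dir/head" "$dir/mid" "$dir/tail"
    rm -rf "$dir"
}
```

Like the awk script, this prints only the K lines. Lines shorter than 23 columns are not handled specially here.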