Improve performance when using "system" call (shell escape) processing large files in awk

Question

I have an awk script that processes very large files that look something like this:

K1353 SF3987.7PD833391.4  KARE
K1353 SF3987.2KD832231.4 MEAKE
K1332 IF4987.7RP832231.2 LEAOS
K1329 SF2787.7KD362619.3 NEDLE
K1353 SK3K84.3KD832231.3 PQAKM

The file is a fixed column file.

The script currently runs an external program on some extracted fields and substitutes the results back into each line. The performance is not as good as that of a simpler awk script; the bottleneck appears to be spawning the external command once per line.

For demonstration purposes I have just included 'rev', but the real script runs a custom program that translates these fields. The command is typically very quick to run, but it takes only two arguments, which it reads either from STDIN or from a file. The real program is a third-party binary, and I don't know the details of how it works.

BEGIN {
  csmok="rev"
}

{
  type = substr($0,1,1)

  if (type == "K") {

    RX=substr($0,6,9)
    RY=substr($0,15,9)

    cmd=sprintf("echo %s %s | %s", RX, RY, csmok)
    cmd | getline output
    close(cmd)
    split(output,k," ")
    sub(RX,k[1])
    sub(RY,k[2])
    print
  }

}

and run like so:

$ awk -f process.awk file.dat

The files I process are sometimes large (900,000 lines), and processing them takes a long time. The slowdown occurs when the script shells out to the external command (the system()/exec call).

How would I improve the run time?

I thought about running the external program only once, for example by concatenating all of the extracted fields into a single command:

echo -e "SF3987.7 PD833391.4\nSF3987.2 KD832231.4\nIF4987.7 RP832231.2" | rev

OR

rev << EOF
SF3987.7 PD833391.4
SF3987.2 KD832231.4
IF4987.7 RP832231.2
EOF

I am not quite sure how to achieve that, and even then I would be left with the processed output and no obvious way to put the values back into the right columns of the file.

The output should look very much like the input; only the extracted fields are translated by the external program:

K1353.193338DP7.7893FS4  KARE
K1353.132238DK2.7893FS4 MEAKE
K1332.132238PR7.7894FI2 LEAOS
K1329.916263DK7.7872FS3 NEDLE
K1353.132238DK3.48K3KS3 PQAKM

Alternatively, I'd like to know other ways to accomplish this in a GNU/Linux environment, but without using awk.

Is it possible to create an awk function to do external translation task? — JJoao, May 10 '19 at 13:50
Assuming your external command can process several lines of input and writes the corresponding output immediately (or line buffered), you could use a coprocess either in a shell script or in GNU AWK. The coprocess is running permanently and connected to the main process via pipes to write and read data. — Bodo, May 10 '19 at 14:07
see https://unix.stackexchange.com/questions/86270/how-do-you-use-the-command-coproc-in-various-shells or https://www.gnu.org/software/gawk/manual/html_node/Two_002dway-I_002fO.html — Bodo, May 10 '19 at 14:47
the command I use doesn't wait / buffer in the foreground by default.. is there a trick to making it do this do you know? — Rendalf, May 10 '19 at 16:35
So, any response to my answer?   If it works for you, please click the checkmark. If it doesn’t work, please explain why not. — G-Man Says 'Reinstate Monica', May 15 '19 at 18:51

score 0 · Answer 1 · answered May 13 '19 at 21:02

Assuming that every line in your input (or at least every line that begins with “K”) is exactly 30 characters long, as in your sample data, I was able to duplicate your desired output with

rev filename | paste filename - | awk '
{
        if (substr($0,1,1) == "K") {
                print substr($0,1,5) substr($0,39,17) substr($0,24,7)
        }
}'

This

- Runs rev on your entire input file, all at once.
  - This, obviously, processes every line in the file through the external program. You express concern about the overhead of creating a pipe and invoking an external program once for every line. Based on that concern, I believe that this is a prudent approach. However, if only a few lines in your input begin with “K”, and the cost of processing the other lines is high, then this may need to be changed.
  - rev produces exactly one line of output for every line of input. My solution depends on that behavior in your external program.

- Combines (using paste) the input file with the output of rev, line for line. For your sample data, this looks like

K1353 SF3987.7PD833391.4  KARE  ERAK  4.193338DP7.7893FS 3531K
K1353 SF3987.2KD832231.4 MEAKE  EKAEM 4.132238DK2.7893FS 3531K
K1332 IF4987.7RP832231.2 LEAOS  SOAEL 2.132238PR7.7894FI 2331K
K1329 SF2787.7KD362619.3 NEDLE  ELDEN 3.916263DK7.7872FS 9231K
K1353 SK3K84.3KD832231.3 PQAKM  MKAQP 3.132238DK3.48K3KS 3531K

awk reads the above lines. Each one contains one line from the input file concatenated with the output of rev for that line. awk then combines the desired pieces of each.
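
If only some lines begin with “K”, one possible variant is to filter those lines first, so that the external program never sees the others. The following is a minimal sketch, not a tested solution for the real translator. It assumes that rev stands in for the external program, that the K lines have the same width as the sample lines in the question, and that the file names (sample.dat, k_lines.tmp) and the shell variables work and result are hypothetical.

```shell
#!/bin/sh
# Sketch: send only the "K" lines through the external program (rev here).
# Assumptions: sample data and file names are made up for illustration.
work=$(mktemp -d)
printf '%s\n' \
  'K1353 SF3987.7PD833391.4  KARE' \
  'H0000 a header line, left out.' \
  'K1332 IF4987.7RP832231.2 LEAOS' > "$work/sample.dat"

# Keep only the lines that need translating.
grep '^K' "$work/sample.dat" > "$work/k_lines.tmp"

# One rev invocation for all K lines; paste joins each original line,
# a tab, and the reversed line, and awk picks the pieces by position.
result=$(rev "$work/k_lines.tmp" | paste "$work/k_lines.tmp" - | awk '
{
    print substr($0, 1, 5) substr($0, 39, 17) substr($0, 24, 7)
}')
printf '%s\n' "$result"
rm -rf "$work"
```

Like the original script, this prints only the K lines; lines of other types are dropped rather than passed through unchanged.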

<rant>

Your question is a little incoherent. If I take your sample input data,

K1353 SF3987.7PD833391.4  KARE
K1353 SF3987.2KD832231.4 MEAKE
K1332 IF4987.7RP832231.2 LEAOS
K1329 SF2787.7KD362619.3 NEDLE
K1353 SK3K84.3KD832231.3 PQAKM

and feed it to this awk script:

{
    RX=substr($0,6,9)
    RY=substr($0,15,9)
    printf("/%s/%s/\n", RX, RY)
}

I get this output:

/ SF3987.7/PD833391./
/ SF3987.2/KD832231./
/ IF4987.7/RP832231./
/ SF2787.7/KD362619./
/ SK3K84.3/KD832231./

Note that the RX value includes the space between the first and second columns, and the RY value does not include the last character of the value in the second column (i.e., the digit after the second dot). This really doesn’t make sense, because the

        sprintf("echo %s %s | %s", RX, RY, csmok)

statement causes the initial space in RX to be lost.

Confusingly, this is consistent with the expected results at the bottom of your question, but not with the five paragraphs above that, where you talk about doing

echo -e "SF3987.7 PD833391.4\nSF3987.2 KD832231.4\nIF4987.7 RP832231.2" | rev

i.e., you include the digit after the second dot in the string that you send to rev.

And, you extract two non-overlapping (but contiguous) substrings from $0, and then you split the output from the rev command, all unnecessarily. I can duplicate your results with

BEGIN {
  csmok="rev"
}

{
  type = substr($0,1,1)

  if (type == "K") {

    RXY=substr($0,6,18)

    cmd=sprintf("echo %s | %s", RXY, csmok)
    cmd | getline output
    close(cmd)
    sub(RXY,output)
    print
  }

}

i.e., extracting one 18-character substring from $0 and not splitting the output string.
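
The single-substring idea can also be combined with a single run of the external program. The sketch below is one way this might look, under stated assumptions: rev stands in for the real translator, the translator emits exactly one output line per input line, and the names fields.txt, translated.txt, tfile, work and spliced are hypothetical. It extracts columns 7 to 23 (the 17 characters that survive echo in the script above), translates them all at once, and then splices each result back by position.

```shell
#!/bin/sh
# Two-pass sketch: run the external program once, then splice by position.
# Assumptions: rev is a stand-in; sample data and file names are made up.
work=$(mktemp -d)
printf '%s\n' \
  'K1353 SF3987.7PD833391.4  KARE' \
  'X0000 a line of another type.' \
  'K1329 SF2787.7KD362619.3 NEDLE' > "$work/file.dat"

# Pass 1: collect the 17-character field (columns 7-23) of every K line.
awk 'substr($0, 1, 1) == "K" { print substr($0, 7, 17) }' \
    "$work/file.dat" > "$work/fields.txt"

# One invocation of the external program for the whole file.
rev "$work/fields.txt" > "$work/translated.txt"

# Pass 2: read one translated line per K line and rebuild the record.
spliced=$(awk -v tfile="$work/translated.txt" '
substr($0, 1, 1) == "K" {
    if ((getline t < tfile) <= 0) {
        print "translated file is shorter than expected" > "/dev/stderr"
        exit 1
    }
    print substr($0, 1, 5) t substr($0, 24)
}' "$work/file.dat")
printf '%s\n' "$spliced"
rm -rf "$work"
```

Because the replacement is positional, the extracted field is never used as a regular expression, which is what sub() does with its first argument.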

Please try to make the data in your questions sensible and internally consistent.

That said, you seem to understand that it’s not always necessary to post every single detail of your exact problem, accurately, in order to get a reasonable answer. In that spirit, please try to make your problem easier to understand without compromising its integrity. Your data hurt my eyes:

- The first three characters of every line are “K13”. That makes it harder to see the characters that are different.
- In three of the five lines, the first five characters (i.e., the entire first column value) are “K1353”.
- The values in the second column are 18-character long nonsense jumbles of letters, digits and dots, so that makes them hard to read and comprehend.
- Looking at values in the second column:
  - In four lines out of five, it begins with “S”.
  - In three lines, it begins with “SF”.
  - In three lines, the third character is a “3”.
  - In four lines, the tenth character is a “D”.
  - In three lines, the ninth and tenth characters are “KD”.
  - In four lines, the 11th and 12th characters are “83” and the 16th character is a “1”.
  - In three lines, the 11th-16th characters are “832231”.

I would suggest that you post sample data like this:

ant 12345.hill  Adam
bat 31416.cave Bruce
cat 13579.meow Felix
dog 32768.bark Angus

With input data like this, your desired output could contain strings like “tac”, “97531”, “woem” and “xileF”, and it would be easy for a human being to look at them and see where they came from. Unlike “132238DK2”, which requires a person to spend six to eight minutes with a magnifying glass to find the source — almost like one of those “word search” puzzles. (And note that “132238DK” would not be uniquely traceable, because “KD832231” appears twice.)

</rant>
