5

I have a large input file containing 30M lines with \r\n line endings. I decided to do something silly and compare the speed of counting all lines via read -r versus xargs (stripping the \r first, because xargs does not seem to be able to split on a multi-character delimiter). Here are my two commands:

time tr -d '\r' < input.txt | xargs -P 1 -d '\n' -I {} echo "{}" | wc -l
time while read -r p || [ -n "$p" ]; do echo "$p"; done < input.txt | wc -l

Here, the second solution is much faster. Why is that?

Please note that I know that this is not a proper way to count lines of a file. This question is merely out of interest of my observation.

  • 3
    I just realized that this might have something to do with xargs spawning a /usr/bin/echo process for each line, while the second command is probably using the bash echo builtin instead of an external process. I'll check that. – Frazier Thien Jul 21 '21 at 13:52
  • You also have the overhead of the tr. And how are you measuring the time here? Timing a pipe is complicated. – terdon Jul 21 '21 at 14:03
  • I would have expected that the overhead of tr should be minimal. Regarding timing, I just write time in front of both commands, so including the overhead of tr indeed. – Frazier Thien Jul 21 '21 at 14:06
  • 2
    Well, you're launching tr, xargs and /bin/echo on each invocation while the shell one doesn't call any external tools at all. That will likely explain it, but I'm not sure. Can you also show the actual results you get? It's hard to understand what "much faster" means without them. – terdon Jul 21 '21 at 14:15
  • Right, will do. I am now running a test using /usr/bin/echo in the read loop, and it also seems to take ages. I will add these measurements later on. – Frazier Thien Jul 21 '21 at 14:17
  • The tr in the first version is not doing anything -- it is waiting for input from the terminal. The xargs is reading from the redirection <, not the pipe |. – Paul_Pedant Jul 21 '21 at 17:27
  • @Paul_Pedant I fixed that. – Frazier Thien Jul 22 '21 at 08:04

1 Answer

10

Yes, as you suspected, xargs -P 1 -d '\n' -I {} echo "{}" (which is actually the same as xargs -rd '\n' -n1, as echo is the default command) will, for each line of input, fork a process, execute the standalone echo in the child, and wait for its termination in the parent.

So that's a lot more work than even using the inefficient read builtin and the echo builtin all in the same shell process.
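One quick way to see the difference (a small sketch, in bash) is to ask the shell what each command name resolves to — a bare echo hits the builtin, while an explicit path forces the external binary that costs a fork+exec per call:

```shell
#!/usr/bin/env bash
# A bare "echo" resolves to the shell builtin: no fork, no exec
type echo         # -> echo is a shell builtin

# An explicit path forces the standalone binary, which xargs must
# fork+exec once per input line
type /bin/echo    # -> /bin/echo is /bin/echo
```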

If instead of GNU xargs, you use busybox xargs which (at least in some configurations and recent versions) will call the busybox echo internally without forking a process, you'll notice it will be significantly faster than the bash loop.

For a more relevant comparison, you should compare:

tr -d '\r' | xargs -rd'\n' -n1

with

tr -d '\r' |
  while IFS= read -r line || [ -n "$line" ]; do
    /bin/echo "$line"
  done

These are likely to give you similar results, as the bulk of the time will be spent forking processes and executing the standalone echo.
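As a scaled-down sketch of that comparison (assuming GNU xargs, and 10,000 lines instead of 30M so it finishes quickly):

```shell
#!/usr/bin/env bash
# Generate a small CRLF-terminated sample, scaled down from the 30M-line input
seq 10000 | sed 's/$/\r/' > sample.txt

# Both pipelines produce the same line count; the interesting part is the
# wall-clock difference reported by time, since both fork+exec /bin/echo
# once per line.
time tr -d '\r' < sample.txt | xargs -rd '\n' -n1 /bin/echo | wc -l

time tr -d '\r' < sample.txt |
  while IFS= read -r line || [ -n "$line" ]; do
    /bin/echo "$line"
  done | wc -l
```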

Here, on the output of seq 3e7 and measuring with pv -al > /dev/null (to measure throughput in average number of lines per second), I get around:

  • 1.12M/s for busybox xargs
  • 70k/s for bash loop with builtin echo
  • 860/s for GNU xargs
  • 850/s for bash loop with /bin/echo
  • Thank you Stéphane, it was indeed the overhead of spawning the echo process for each line that caused the performance drop. I learned some interesting stuff from your answer (pv -al to measure performance, busybox xargs <-> GNU xargs, ...) as well! – Frazier Thien Jul 21 '21 at 14:53