7

I have some SLES 12 SP5 machines with grep version 2.16 and on a single machine I'm heavily using scripts which contained the following grep --quiet condition:

# $pid_list contains the result of pstree and $script_pid equals $$
if echo "$pid_list" | grep -qF "($script_pid)"; then
  continue
fi
if echo "$pid_list" | grep -qF "($script_pid)"; then
  echo "Error: grep has a bug!"
  continue
fi

I doubled it, because by a chance of ~0.1% the first condition failed, while the second identical condition succeeded?!

After changing the condition as follows, it works flawlessly (Full code here):

if echo "$pid_list" | grep -F "($script_pid)" >/dev/null; then
  continue
fi

Regarding the manual the quiet option should behave as I would expect it. It should even return true if an error happens:

Exit immediately with zero status if any match is found, even if an error was detected

So I'm confused why it fails sometimes. RAM and filesystem of the machine is fine. The grep binary has the correct file hash, too.

I searched for a commit, but the only one I found is from 2001, which should be part of 2.16 as it is from 2014.

Update1

I tried to use a subshell as @kamil suggested, but it still fails (the "Race condition"-error is displayed sometimes):

if (echo "$pid_list"; true) | grep -qF "($script_pid)"; then
  continue
elif (echo "$pid_list"; true) | grep -qF "($script_pid)"; then
  echo "Error: Race condition!"
  continue
fi

This instead works:

if echo "$pid_list" | grep -qF "($script_pid)" || [[ $? -eq 141 ]]; then
  continue
fi
mgutt
  • 467
  • 5
    would you have a text file containing an example pid_list and a concrete example of $script_pid? I know it sounds like it's just as given, but we're looking at an "unlikely" bug, so being able to have a fully reproducible test case that we can a couple hundred times until you're sure it would have failed on your machine is necessary. – Marcus Müller Apr 18 '23 at 08:41
  • 3
    continue has no point outside of a loop, so please show the loop in question. – muru Apr 18 '23 at 08:44
  • 2
    BTW: You don't need echo and grep for substring matching, the shell has built-in means, e.g. case "$pid_list" in ; *"($script_pid)"*) do_something ;; esac – Bodo Apr 18 '23 at 09:03
  • 1
    Show a complete (but minimal) script that exhibits the issue, including the exact values of $pid_list and a $script_pid. – ilkkachu Apr 18 '23 at 09:32
  • @MarcusMüller Link to full code added to the question. – mgutt Apr 18 '23 at 11:04

1 Answers1

13

Hypothesis: in the script you did set -o pipefail (if the shell interpreting the script is Bash; other shells may provide similar functionality). From man 1 bash:

pipefail
If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default.

When there is a match, grep -q exits early and returns 0. Then echo may or may not return 141. It depends on if echo manages to write everything to the pipe buffer before grep closes its end of the pipe. Even for the same input things may go one way or another, it's a race condition. If 141 happens then the return value of the pipe will be 141 because of pipefail.

Assuming you really used set -o pipefail, I say the behavior you observed is absolutely not a bug of grep nor the shell. The bug is in your script.

My answer under the linked question provides solutions. The simplest one for you is:

if (/bin/echo "$pid_list"; true) | grep -qF "($script_pid)"; then

or

if ((echo "$pid_list"); true) | grep -qF "($script_pid)"; then

Why /bin/echo … or (echo …) instead of echo? Because your echo is most likely a shell builtin and when "it" gets SIGPIPE then it's really the whole subshell that gets SIGPIPE and gets killed. We don't want the subshell that is supposed to execute true to be killed. See the beginning of this answer.

Check the whole script; other conditionals may be buggy in the same way.

  • 1
    Yes, you are right. I'm using pipefail and then it should be as you explained. Thank you! – mgutt Apr 18 '23 at 10:44
  • 1
    It has been requested for shellcheck: https://github.com/koalaman/shellcheck/issues/1109 – mgutt Apr 18 '23 at 11:21
  • @mgutt Replacing echo ... | grep ... with shell builtins for substring matching would avoid the pipe and the resulting issue related to pipefail. – Bodo Apr 18 '23 at 12:12
  • @Bodo Thank you. I already read your above comment. But even if I use the binary operator =~ instead, I still could face this problem with other pipes to head or tail. – mgutt Apr 18 '23 at 14:22
  • @mgutt If your script depends on the exit code of the pipe in a similar way, then you might get the same problem. If you use head and/or tail in such a pipe, there are often ways to implement this in a different way. As these examples show, enabling pipefail for the whole script is not a good idea. See also https://mywiki.wooledge.org/BashPitfalls#set_-euo_pipefail – Bodo Apr 18 '23 at 14:37
  • @Kamil I tested (echo "$pid_list"; true) | grep -qF "($script_pid)", but it still fails randomly. Instead echo "$pid_list" | grep -qF "($script_pid)" || [[ $? -eq 141 ]] works. I will update my question, if you want to see the full code block. – mgutt Apr 18 '23 at 14:49
  • @mgutt OK, the answer now recognizes the problem. echo is a builtin and when "it" gets SIGPIPE, it's really the subshell that gets SIGPIPE and true is never executed. – Kamil Maciorowski Apr 18 '23 at 18:04
  • 3
    if errexit is also enabled, you'll need /bin/echo ... || true instead of /bin/echo ...; true. There is a misguided recommendation to use set -euo pipefail floating around that many people seem to be following these days even though errexit and pipefail introduce more problems than they fix. Here (set +o pipefail; echo ... | grep ...) would probably be better. But best would be not to set errexit and pipefail globally and do proper error handling and set pipefail only when needed. – Stéphane Chazelas Apr 18 '23 at 19:18
  • Yep, I disabled pipefail in the meantime. It's much easier to follow the principle "Does the most right pipe cmd do what I need?" instead of "Could there be a race condition anywhere in the pipe?". – mgutt Apr 19 '23 at 06:57