
I'm running a long-running pipeline from bash, in the background:

find / -size +500M -name '*.txt' -mtime +90 |
   xargs -n1 gzip -v9 &

The 2nd stage of the pipeline takes a long time to complete (hours) since there are several big+old files.

In contrast, the 1st stage of the pipeline completes almost immediately: its output never fills the pipe, so find finishes writing and exits successfully.

The parent bash process seems to wait properly for child processes.

I can tell this because there's no find (pid 20851) running according to either:

     ps alx | grep 20851
     pgrep -l find

There's no zombie process, nor is there any process with process-id 20851 to be found anywhere on the system.
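The same check can also be done from the shell itself. A minimal sketch, assuming the pid belongs to my own user (pid_exists is a name I made up for illustration, not an existing command): kill -0 sends no signal and only reports whether the pid can be signalled. It still succeeds for a zombie, so a failure means the process is gone from the process table entirely.

```shell
# pid_exists PID: hypothetical helper; succeeds if PID is still in the process table.
# kill -0 delivers no signal; it only checks that the target exists and can be signalled.
# Caveat: it also fails (permission denied) for processes owned by other users.
pid_exists() {
    kill -0 "$1" 2>/dev/null
}

sleep 30 &                  # stand-in for a long-running child
pid=$!
pid_exists "$pid" && echo "$pid exists"

kill "$pid"                 # terminate it ...
wait "$pid" 2>/dev/null     # ... and let the shell reap it
pid_exists "$pid" || echo "$pid is gone"
```

For a pid that was never a child of the current shell (or has already been reaped), only the kill -0 part applies; the wait is there just to make sure the example child is reaped before the second check.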

The bash builtin jobs correctly shows the job as a single line, without any process ids:

[1]+  Running find / -size +500M -name '*.txt' -mtime +90 | xargs -n1 gzip -v9 &

OTOH: I stumbled by accident on a separate job control command (/bin/jobs) which shows:

[1]+ 20851 Running         find / -size +500M -name '*.txt' -mtime +90
     20852 Running         | xargs -n1 gzip -v9 &

and which is (wrongly) showing the already exited 20851 find process as "Running".

This is on CentOS Linux (edit: more accurately, Amazon Linux 2 AMI). It turns out that /bin/jobs is a two-line /bin/sh script:

#!/bin/sh
builtin jobs "$@"

This is surprising to me. How can a separate process, started from another program (sh), know the details of a process which is managed by another (bash) after that process has already completed and exited and is NOT a zombie?

Further: how can it know details (including pid) about the already exited process, when other methods on the system (ps, pgrep) can't?

Edits:

(1) As Uncle Billy noted in the comments, on this system /bin/sh and /bin/bash are the same (/bin/sh is a symlink to /bin/bash), but /bin/jobs is a script with a shebang line, so it runs in a separate process.

(2) Also, thanks to Uncle Billy: an easier way to reproduce. /bin/jobs was a red herring. I mistakenly assumed it is the one producing the output. The surprising output apparently came from the bash builtin jobs when called with -l:

$ sleep 1 | sleep 3600 &
[1] 13616
$ jobs -l
[1]+ 13615 Running                 sleep 1
     13616 Running                 | sleep 3600 &
$ ls /proc/13615
ls: cannot access /proc/13615: No such file or directory

So process 13615 doesn't exist, yet it is shown as "Running" by the bash builtin job control, which looks like a bug in jobs -l.

The presence of /bin/jobs, which misled me into thinking it must be the culprit (it wasn't), seems confusing and questionable. I believe it should be removed from the system, as it is useless: it is a sh script running in a separate process, which can't show the jobs of the caller anyway.
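To illustrate that last point, here is a small sketch (my own illustration, not taken from /bin/jobs itself): a shell started as a separate process begins with its own, empty job table, so jobs run inside it should print nothing no matter how many background jobs the calling shell has. The sleep 30 job is just a placeholder, and this assumes no exported function named jobs or builtin is in the environment.

```shell
sleep 30 &                      # background job owned by *this* shell
bgpid=$!

# A freshly exec'ed shell has its own, empty job table, so the
# captured output is expected to be empty even though the caller has a job.
child_view=$(sh -c 'jobs -l')
echo "jobs seen by the child shell: '${child_view}'"

kill "$bgpid"                   # clean up the placeholder job
wait "$bgpid" 2>/dev/null
```

Running jobs -l directly in the calling shell, before the cleanup, would list the sleep 30 job; only the child's view is empty.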

arielf
  • 1
    The second (dubious) format is shown by jobs -l, even when a process has exited: sleep 1 | sleep 3600 & ... after 2 secs jobs -l will show both as running, though the first sleep has terminated. What version of bash is that? What does alias | grep jobs say? –  Jan 30 '21 at 22:44
  • @UncleBilly nice reproductions. $ bash --version shows GNU bash, version 4.2.46(2)-release (x86_64-koji-linux-gnu) – arielf Jan 30 '21 at 23:07
  • Sorry if that's not an answer, but it's impossible to format anything in comments. What does type /bin/jobs say? Notice that on centos/rhel /bin/sh is still bash under another name. –  Jan 30 '21 at 23:11
  • $ type /bin/jobs shows: /bin/jobs is /bin/jobs I have no alias for jobs, and indeed /bin/sh is a symlink to /bin/bash on this system, so same program, different processes. Great questions! – arielf Jan 30 '21 at 23:13
  • Of course another program cannot have access to its parent's memory (the jobs table in this case) after an execve(2), as running /bin/jobs should be -- unless there's some kind of trick. Another possibility would be that either jobs or builtin is an exported function ;-) –  Jan 30 '21 at 23:28
  • 1
    Cannot reproduce. I run a fresh Amazon Linux 2 in VirtualBox from a newly downloaded image. There is /bin/jobs, its content is what you wrote; but it behaves like we all expect, not like in the question. So if things work for you as you described then it's not because of the Amazon Linux 2 itself (phew!). – Kamil Maciorowski Jan 31 '21 at 05:01
  • 1
    Sharing knowledge: jobs is specified by POSIX and POSIX explicitly requires it as a standalone executable. In this matter Amazon Linux 2 is more POSIX-compliant than e.g. Debian. Note in AL2 there is /usr/bin/cd as well. – Kamil Maciorowski Feb 01 '21 at 19:15

1 Answer


FWIW, I can reproduce your case with:

rhel8$ /bin/jobs(){ jobs -l; }
rhel8$ sleep 1 | sleep 3600 &
[1] 2611
rhel8$ sleep 2
rhel8$ jobs
[1]+  Running                 sleep 1 | sleep 3600 &
rhel8$ /bin/jobs
[1]+  2610 Running                 sleep 1
      2611 Running                 | sleep 3600 &
rhel8$ pgrep 2610
    <nothing!>
rhel8$ ls /proc/2610
ls: cannot access '/proc/2610': No such file or directory
rhel8$ /bin/jobs
[1]+  2610 Running                 sleep 1
      2611 Running                 | sleep 3600 &
rhel8$ cat /bin/jobs
#!/bin/sh
builtin jobs "$@"

Or with (even lamer than the previous):

rhel8$ unset -f /bin/jobs
rhel8$ export JOBS=$(jobs -l)
rhel8$ builtin(){ echo "$JOBS"; }
rhel8$ export -f builtin
rhel8$ /bin/jobs
[1]+  2610 Running                 sleep 1
      2611 Running                 | sleep 3600 &
rhel8$ type /bin/jobs
/bin/jobs is /bin/jobs

Note: As already demonstrated, jobs -l in bash is displaying stale information, with pipeline processes which have already exited still shown as Running. IMHO this is a bug -- other shells like zsh, ksh or yash correctly show them as Done.

  • Seems like the bigger bug is the very existence of /bin/jobs (which made me so confused). What is the purpose of /bin/jobs if it can't show any jobs in the caller anyway? – arielf Feb 01 '21 at 18:11
  • 1
    Its only purpose is to fulfill some POSIX standard requirement -- but the only people who knew its original rationale have either died, gone mad or no longer remember it (to paraphrase Lord Palmerston). –  Feb 01 '21 at 22:14