
Quite often, when working on the command line, I find myself wanting to run the same operation on a number of instances delivered by some input stream, and then to recombine their outputs in some specific way.

Two use cases from yesterday:

  1. I wanted to see the Subjects of a bunch of SSL certificates bundled in a PEM file.

     cat mypemfile.crt |
     (openssl x509 -noout -text | fgrep Subject:)
    

    only shows me the first one. I need to split out the certificates and run the same command on each of them, then concatenate the results. csplit can split them, but to files only. This is a hassle.

    Why can't I just say

     cat mypemfile.crt |
     csplit-tee '/BEGIN/ .. /END/' |
     on-each ( openssl x509 -noout -text | fgrep Subject: ) |
     merge --cat
    

    ?

  2. We run a JupyterHub instance that spawns notebook servers as Docker containers. I want to watch their timestamped logs. It's easy enough for one container:

     sudo docker logs -t -f $container_id
    

    (The -t adds a timestamp, and the -f keeps the pipe open, like tail -f.)

    It's easy enough to list the logs of all containers, sorted by timestamps:

     sudo docker ps | awk 'NR>1 {print $1}' |
     while read container_id
     do
       sudo docker logs -t $container_id
     done |
     sort
    

    or

     sudo docker ps | awk 'NR>1 {print $1}' |
     xargs -n1 sudo docker logs -t |
     sort
    

    or

     sudo docker ps | awk 'NR>1 {print $1}' |
     parallel sudo docker logs -t {} |
     sort
    

    but none of that will let me use the -f option to watch the logs as they come in (the closest workaround I know of is sketched after these examples).

    Why can't I just use

     sudo docker ps | awk 'NR>1 {print $1}' |
     csplit-tee /./ |
     on-each (xargs sudo docker logs -t -f) |
     merge --line-by-line --lexicographically
    

    or

     sudo docker ps | awk 'NR>1 {print $1}' |
     parallel --multipipe sudo docker logs -t -f {} |
     merge --line-by-line --lexicographically
    

    ?
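
For the second use case, the closest I can get with standard tools is to follow each container's log in a background job and let the lines interleave as they arrive. This is only a sketch, not a real solution: it does not merge the lines by timestamp, which is exactly the part I'm missing:

     sudo docker ps | awk 'NR>1 {print $1}' |
     {
       # Start one follower per container; they all write to the shared stdout.
       while read -r container_id
       do
         sudo docker logs -t -f "$container_id" &
       done
       wait   # keep streaming until the followers are stopped, e.g. with Ctrl-C
     }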

Clearly, this requires

  1. Specific shell support. Maybe we'd need a distinct "multi-pipe" symbol.
  2. New tools (csplit-tee, on-each and merge) that split and merge pipelines.
  3. A fixed convention for how tools can expose arbitrarily many input and output file descriptors, so that such a shell can treat them as parallel pipelines.

Has this been done? Or something equivalent that I can apply to my use cases?

Is it feasible without specific kernel support? I know the kernel typically has a fixed maximum number of open file descriptors, but an implementation can work around that by not blindly trying to open them all at once.

Is it feasible, given that GNU parallel is?

  • I'm not entirely sure what csplit-tee, oneach and merge are supposed to do, but your openssl example seems to be doing awk -v cmd="openssl x509 -noout -text | fgrep Subject:" '/BEGIN/ { p = 1 } p { print | cmd } /END/{close (cmd)}' mypemfile.crt – muru Oct 28 '21 at 12:00
  • This is well above kernel level if I understand you correctly. Programs write to stdout (handle 1) and other programs read from stdin (handle 0). A pipe connects stdout of one process to stdin of a different one. Handling that is usually done by the shell (parent process). BTW lsof can help visualise this. – Ned64 Oct 28 '21 at 12:00
  • Also, PowerShell with its object-oriented system can do some complex things easily compared to your run-of-the-mill text-based shells. – muru Oct 28 '21 at 12:01
  • That doesn't need any kernel extension, and a T+Y program which splits an input, filters it through n filter processes and then joins the results back together could easily be written in C, Perl, etc. But it cannot be generalized beyond some stupid demos, because it will run into the same deadlock and buffering issues that coprocesses (and similar features) have run into. Just do the first part with tee outputting to temporary files, and after all the parallel writers have finished, merge the temporary files. For the examples you're giving, that's more than enough. – Oct 28 '21 at 12:16
  • Yes, this is helpful, thanks - the pee utility is a bit like my imaginary merge. I'm still hoping to find something more like my imaginary examples. – reinierpost Oct 31 '21 at 21:11
  • I don't see why deadlock would be a problem. Temporary files don't cut it in general because they aren't streams, and dealing with files creates additional complexity I don't want to bother with. The answers given show that I can already avoid that, e.g. with GNU parallel. – reinierpost Oct 31 '21 at 21:23

2 Answers


With GNU Parallel:

cat mypemfile.crt |
  parallel --pipe -N1 --recstart '-----BEGIN' 'openssl x509 -noout -text | fgrep Subject:'
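
Here --pipe tells parallel to split its standard input into records and feed one record to each job's stdin, --recstart '-----BEGIN' marks where a record begins, and -N1 hands each openssl invocation exactly one record (one certificate). If the Subjects should come out in the same order as the certificates in the file, adding -k (--keep-order) should take care of that; an untested variant:

cat mypemfile.crt |
  parallel --pipe -k -N1 --recstart '-----BEGIN' 'openssl x509 -noout -text | fgrep Subject:'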

Untested:

sudo docker ps | awk 'NR>1 {print $1}' |
  sudo parallel -j0 --lb docker logs -t -f {}

sudo docker ps | awk 'NR>1 {print $1}' | sudo parallel -j0 --tag --lb docker logs -t -f {}

Ole Tange
  • Yes, I guess parallel would resolve the multiple parallel process creation. The tricky part would be the output pipe. In theory, nothing prevents lines from different processes from being interleaved. If the bottleneck is in the last process, the parallel processes will be paused waiting for the pipe to be read, and what is read could be only part of what one process wrote, with whole lines interleaved with others. But I guess that isn't troublesome for casual needs. – Frédéric Loyer Oct 29 '21 at 21:07
  • @FrédéricLoyer --lb = interleaved - but only complete lines. Remove --lb to avoid the interleaving: It will only output when the job completes. -u = mix lines - half lines may come from different processes. – Ole Tange Oct 29 '21 at 21:44
  • I think shell support would be more intuitive and "Unixy", but this is very compact; clearly I should explore GNU parallel more. – reinierpost Oct 31 '21 at 21:04
  • @reinierpost May I recommend https://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html or https://doi.org/10.5281/zenodo.1146014 – Ole Tange Oct 31 '21 at 22:03
  • Thanks, I'll consider it! That link is broken now, but this one works for me: https://www.lulu.com/en/us/shop/ole-tange/gnu-parallel-2018/paperback/product-1jq6qq4v.html – reinierpost Nov 03 '21 at 07:43

No, the shell can't do that. A pipe is just a stream from a source to a destination, usually in different processes.

Let's look at your first example:

cat mypemfile.crt |
(openssl x509 -noout -text | fgrep Subject:)

You want the input file to be split at lines containing BEGIN and END, but cat doesn't care about the content, and pipes don't have any out-of-band indication of record separators anyway.

You could achieve this with specific conventions in the source and destination of a pipe, but it would remove the main advantage of a pipe: the ability to combine programs as building blocks for tasks that were not anticipated by their authors.

In this specific example, it would be much easier to modify openssl to process multiple certificates in one stream than to modify openssl to process multiple streams AND modify cat to provide those multiple streams.
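
That said, the existing building blocks already get you most of the way if you accept a round trip through temporary files. A rough sketch (it assumes GNU csplit for -z and '{*}', and that the bundle contains nothing before the first certificate):

tmpdir=$(mktemp -d)
# Split the bundle into one file per certificate; each piece starts at a BEGIN line.
csplit -s -z -f "$tmpdir/cert" mypemfile.crt '/-----BEGIN CERTIFICATE-----/' '{*}'
for f in "$tmpdir"/cert*
do
  openssl x509 -noout -text -in "$f" | fgrep Subject:
done
rm -r "$tmpdir"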

RalfFriedl
  • I know the shell can't do it - hence my question whether anyone has extended a shell so it can. Generic support for this fits the "Unix toolbox philosophy". – reinierpost Oct 30 '21 at 17:27