How can a "two-pass" script support reading input either from a file or from stdin?

Question

The following is a very simple example of what I mean by a "two-pass script":

#!/bin/bash

INPUTFILE=$1

grep    '^#' "$INPUTFILE"
grep -v '^#' "$INPUTFILE" | sort

This script (let's call it twopass.sh) takes the path, INPUTFILE, to a file as its sole argument. It will then, first, print out all the lines in INPUTFILE that begin with #, in their original order. And, second, it will print in sorted order all the lines in INPUTFILE that do not begin with #.

For example, if the file example.txt contains the following lines

# foo comes first
# bar comes second
# baz comes third
wobble
quux
wibble
frobozz

...then applying the twopass.sh script to it should produce the following:

% ./twopass.sh example.txt
# foo comes first
# bar comes second
# baz comes third
frobozz
quux
wibble
wobble

How can I modify this script so that it can also perform the same operation on stdin?

In other words, with the desired new version of the script, the line below should produce the same output as shown above:

./twopass.sh < example.txt

I am interested in answers to this question for both bash and zsh.

A two-pass script would be a script that has two passes (is run twice). Did you mean a script that makes two passes over a file? — ctrl-alt-delor, Apr 25 '20 at 09:33

Stéphane Chazelas · Accepted Answer · 2020-04-28T15:06:18.037

In the general case, to process stdin more than once, you need either to seek back to where you started after the first read (which is not possible for every type of file, e.g. pipes, sockets, terminals), or to store the input in a regular file or in memory, where you know you can read it more than once.
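As a rough illustration of that distinction, the sketch below reports what kind of file stdin is. The function name stdin_kind is made up for this example, and it assumes that /dev/stdin refers to file descriptor 0 (bash handles that path itself in [ tests; other shells rely on the OS providing it). It is only a proxy for seekability: of the cases it reports, only the regular-file one can be expected to allow seeking back.

```shell
# Hypothetical helper: classify stdin so a script can decide whether
# it may rewind it or has to store a copy first.
stdin_kind() {
  if [ -f /dev/stdin ]; then
    echo regular      # regular file: can be read again after seeking back
  elif [ -p /dev/stdin ]; then
    echo pipe         # pipe/FIFO: the data is gone once it has been read
  elif [ -t 0 ]; then
    echo terminal
  else
    echo other        # socket, other device, ...
  fi
}
```

For example, stdin_kind < example.txt should report regular, while cat example.txt | stdin_kind should report pipe.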

It's easier with shells with builtin seeking and temporary file management support like zsh or ksh93.

#! /bin/zsh -
zmodload zsh/system || exit

if (($#)); then
  # arguments are provided. They are assumed to be file arguments
  # to process (use ./- for the file called -)
  grep -h -- '^#' "$@"
  grep -vh -- '^#' "$@" | sort
else
  # process stdin
  if (( (pos = systell(0)) >= 0 )); then
    # input is seekable
    grep '^#'
    sysseek $pos || {
      syserror -p "Cannot go back: "
      exit 1
    }
    grep -v '^#' | sort
  else
    # not seekable, store input in a temporary file using =(cat)
    () {
      grep -- '^#' $1
      grep -v -- '^#' $1 | sort
    } =(cat)
  fi
fi

(note that -h to skip outputting file names is a GNU grep extension; if your grep doesn't support it, you can replace that with cat -- "$@" | grep ...).

bash has no builtin support for seeking or for creating temporary files, but you could have it call zsh, ksh93 or perl/python for that.
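That said, if the external mktemp utility is available (an assumption: it is widespread but not specified by POSIX), a plain sh or bash script can store stdin in a temporary file itself. A minimal sketch, with twopass as an illustrative function name:

```shell
# Sketch only: always copies stdin to a temporary file, even when stdin
# is a seekable regular file. Relies on mktemp, which POSIX does not specify.
twopass() (
  # The body runs in a subshell, so the tmp variable stays local to it.
  if [ "$#" -gt 0 ]; then
    # File operands: concatenate them so grep never prefixes file names.
    cat -- "$@" | grep '^#'
    cat -- "$@" | grep -v '^#' | sort
  else
    tmp=$(mktemp) || exit
    cat > "$tmp"                   # store all of stdin
    grep '^#' "$tmp"               # first pass: comments, in original order
    grep -v '^#' "$tmp" | sort     # second pass: everything else, sorted
    rm -f -- "$tmp"                # not reached if the script is interrupted
  fi
)
```

With twopass "$@" as the last line of the script, both ./twopass.sh example.txt and ./twopass.sh < example.txt should give the output shown in the question.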

For your particular use case though, you could also do:

#! /bin/sh -
gawk -e '
  /^#/ {print; next}
  {print | "sort"}' -E /dev/null "$@"

The -e + -E trick is needed to be able to process file names that contain = characters, which gawk would otherwise treat as variable assignments (note that a - argument is still interpreted by gawk as meaning stdin, not a file called -).
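To illustrate the problem this works around: awk treats an operand of the form name=value as a variable assignment rather than as a file name. Prefixing ./ to relative paths is another way to avoid that. A sketch of it, with awk_safe_path as a hypothetical helper name:

```shell
# Hypothetical helper: make a path safe to pass to awk as a file operand.
# Only non-empty relative paths are changed; note that "-" becomes "./-",
# i.e. the file called "-" rather than stdin.
awk_safe_path() {
  case $1 in
    '' | /*) printf '%s\n' "$1" ;;    # empty or absolute: leave alone
    *)       printf './%s\n' "$1" ;;  # relative: can no longer look like name=value
  esac
}
```

For instance, awk '/^#/' "$(awk_safe_path 'a=b.txt')" reads the file ./a=b.txt instead of assigning to a variable called a. Command substitution strips trailing newlines, so file names ending in newline characters would need more care.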

The sorted output above is guaranteed to be displayed after the comments as sort needs to have read all its input before it can start outputting anything. sort holds the data in memory or temp files.

Approaches like:

#! /bin/zsh -
{ cat -- "$@" > >(grep '^#' 4>&1 >&3) | grep -v '^#' | sort; } 3>&1

Or to be compatible with ksh93 or bash:

{
  cat -- "$@" |
   { tee >(grep '^#' 4>&1 >&3); } |
   grep -v '^#' |
   sort
} 3>&1

These approaches, where the output of cat is teed to both grep and grep -v | sort, should also work. The 4>&1 is there to guarantee that sort doesn't start outputting before grep has finished writing: for as long as it runs, grep also holds the pipe to grep -v open, so grep -v and sort cannot see end of input before grep exits.

Why do you need any trick? Just prepend ./ to any filename and it will not be processed as an assignment. — , Apr 25 '20 at 07:29
@Isaac, it's still more effort as you need to only do that for non-empty relative paths. — Stéphane Chazelas, Apr 25 '20 at 07:31
Thank you. I'm learning tons from your answer. Can I ask you a couple of follow-up questions about the last approach, the one starting with { cat -- "$@" > ...?
The first one is about the "associativity" of the >(grep '^#' 4>&1 >&3) subexpression. Does it bind more tightly to the expression on its left, or to the one on its right? (I hope this question even makes sense!)

The second one is, is there something analogous to this approach for bash? If I try the code straight from your post with bash (instead of zsh), only the lines beginning with # show up on the screen. — kjo, Apr 28 '20 at 14:30
@kjo, yes it relies on zsh's multios feature. In bash, you'd need to use tee — Stéphane Chazelas, Apr 28 '20 at 14:55
Thank you again! Even though it's very rude of me to keep pestering you with questions, my curiosity is greater than my manners, so please forgive me for asking you one more question about your code. I noticed that you terminate commands with ; when followed by a } on the same line. Why is that? If I run { date | wc } and { date | wc; }, the behavior appears to be the same. Is there a reason to prefer the latter? — kjo, Apr 28 '20 at 16:18
@kjo, { date | wc } (or {date | wc}) will only work in zsh, and not always. In POSIX shells, { and } are keywords. They need to be delimited and, like for, do or done, are not recognised as keywords everywhere. In POSIX shells, echo } is required to output }, just as echo do is required to output do. That's not the case in zsh unless you're in sh emulation. — Stéphane Chazelas, Apr 28 '20 at 16:36

score -2 · Answer 2 · answered Apr 25 '20 at 02:23

Simply sort the part of the output that you want sorted:

grep -E '^#' "$INPUTFILE";(grep -E -v '^#' "$INPUTFILE" | sort )

waltinator

You simply rewrote the script from the question as one line, and added unnecessary parentheses. – ctrl-alt-delor Apr 25 '20 at 09:35
