6

Currently I'm using the following zsh snippet to select small batches of files for further processing:

for f in $(ls /some/path/*.txt | head -2) ; do
  echo unpacking $f
  ./prepare.sh $f && rm -v $f
done

Is there a better alternative to $(ls ... | head -2) in zsh?

General overview of my task: I'm creating a data set to train a neural network; the details of that ML task are not important here. Creating the dataset requires me to manually process a large batch of files. To do it, I've copied them to a separate directory. Then I randomly choose several files (the first two from the ls output in this example), call some preprocessing routine, review its results, move some of them to the data set being created, and remove the rest. After this cleanup I execute the command above again.

Additionally, I'd like to improve my skills in shell programming and learn something new :)

The order in which these "first" files are chosen does not matter, since all of them will be processed in the end.

In other words, I'm working together with the PC inside a for loop, and I want it to pause after several iterations and wait for me.

Pseudocode.

for f in /some/path/*.txt ; do
  echo unpacking $f
  ./prepare $f

  if human wants to review ; then
    human is reviewing, then cleans, and PC waits
  fi
done

The reason for this odd procedure is that preprocessing one "source" .txt file creates several dozen other files; I need to view all of them and select a few samples (usually 1-2) suitable for training a network.

I could run for f in /some/path/*.txt ; do ./prepare $f ; done, but this command would create several hundred files, and that amount is overwhelming.

wl2776
    What is the "first" file in your reading - how is that defined? The sorting of file lists may vary by program. Also parsing ls has its flaws - better avoid it. – FelixJN Jul 26 '22 at 13:48
  • Doesn't matter, any ordering will do the job. I've updated the question – wl2776 Jul 26 '22 at 13:50
  • Do I then correctly understand that your actual target is to run the process in parallel but with maximum two parallel instances at once? – FelixJN Jul 26 '22 at 13:55
  • No, my actual target is to make the loop for f in /some/path*.txt to pause after several iterations and wait while I allow it to continue. – wl2776 Jul 26 '22 at 13:58
  • Please add all information to your question. Please explain in more detail what you mean with "to make the loop for f in /some/path*.txt to pause after several iterations". Do you want to continue with the next two files etc. later? – Bodo Jul 26 '22 at 14:00

5 Answers

11

Glob qualifiers

Glob qualifiers can replace most uses of ls or find to enumerate files. They're a unique feature of zsh.

For example, $(ls /some/path/*.txt | head -2) (enumerate files in lexicographic order, keep only the first two files) is equivalent¹ to /some/path/*.txt(N[1,2]) in zsh. The N qualifier ensures that the list is empty if there are no matches, and the [from,to] qualifier limits the matches to the specified range.

Without the N qualifier, under default options, your script will exit with an error message if there are no matching files.

You can use the o or O qualifier to control the order of the files. For example, /some/path/*.txt(Nom[1,2]) takes the two newest files.

¹ There are some slight differences, generally to zsh's benefit. Using ls tends to be problematic with file names containing special characters such as spaces, newlines, or invalid byte sequences, whereas zsh's built-in features work reliably on all file names. Error management differs in corner cases. Here, since you forgot the -d option to ls, you'd also have problems if some of those *.txt files were directories, as ls would list their contents instead.


I don't see how taking two files helps with your overall objective, though. If you want to have a way to process all files, but allow a human to review the first few, you could show a step/continue/abort prompt. Something like this:

pause=1
for f in /some/path/*.txt ; do
  print -ru2 unpacking $f
  ./prepare $f

  if ((pause)); then
    print -ru2 -- "$f output is ready for review."
    c=
    while [[ $c != [anq] ]]; do
      read -k1 "c?Process (N)ext, (A)ll, (Q)uit? " && c=${c:l}
    done
    echo
    case $c in
      a) pause=0;;
      q) break;;
    esac
  fi
done

  • I like this solution very much. Sometimes I preprocess one file, sometimes 5, because the amount of output differs. – wl2776 Jul 27 '22 at 11:31
  • As for showing a step/continue/abort prompt: I wanted to stay at the shell's prompt, instead of the script's, to be able to do something in the same console without launching another one. Therefore, I prefer to process a few files, then press the up arrow to retrieve the command and repeat it. – wl2776 Jan 19 '24 at 07:01
6

You could use a counter with the for loop. This should work in any POSIX-compatible shell.

This is an equivalent to the first code snippet in the question.

i=0
for f in /some/path/*.txt ; do
    if [ "$((i += 1))" -gt 2 ] ; then
        break
    fi
    echo "unpacking $f"
    ./prepare.sh "$f" && rm -v "$f"
done

As written in Gilles' answer, zsh has features to implement this in a simpler way. See this answer for an explanation of the glob qualifiers.

for f in /some/path/*.txt(N[1,2]) ; do
    echo "unpacking $f"
    ./prepare.sh "$f" && rm -v "$f"
done

Instead of breaking the loop you can also wait for some input.

i=0
for f in /some/path/*.txt ; do
    if [ "$((i += 1))" -gt 2 ] ; then
        i=1
        printf "press Enter to continue"
        read dummy
    fi
    printf "unpacking %s\n" "$f"
    ./prepare.sh "$f" && rm -v "$f"
done

Note 1: In the conditional branch I use i=1 instead of i=0 because the script will already process the first file of the next group in this iteration.

Note 2: I put the counting and condition at the beginning of the loop instead of at the end, because at this point it is clear that there is another file to be processed. This avoids a pause after the last file.

Note 3: I edited the scripts and added quotes around $((i += 1)) because Stéphane Chazelas mentioned in a comment that most POSIX-compatible shells perform globbing and splitting on the result of the arithmetic expansion, which can lead to unwanted effects, especially when IFS contains decimal digits (an unusual case, in my view). Furthermore, if the pattern does not match any file, a loop iteration runs with f set to the literal file name pattern, which can result in error messages or unexpected results. To avoid this, make the loop body conditional with e.g. if [ -f "$f" ] .... If these special cases are not relevant for the intended (interactive) use, the script can be kept simpler.
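The splitting pitfall from Note 3 can be reproduced directly (a contrived sketch for a POSIX shell such as bash or dash; the digit in IFS is artificial):

```shell
#!/bin/sh
# Unquoted arithmetic expansion undergoes field splitting in most
# POSIX shells (zsh is an exception). A digit in IFS mangles the result.
IFS=5
i=14
set -- $((i + 1))     # expands to 15, which is then split on "5"
echo "$1"             # prints: 1
set -- "$((i + 1))"   # quoted: no splitting
echo "$1"             # prints: 15
```

This is why the quoted form "$((i += 1))" is the safe one in portable scripts.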

Bodo
    This is an adequate sh answer, but the question is about zsh, which has better tools. – Gilles 'SO- stop being evil' Jul 26 '22 at 19:10
  • Note that your POSIX variants are still zsh-specific, as zsh is one of the few remaining shells where split+glob is not performed upon unquoted arithmetic expansions. In other shells, you'd need to quote those $(( ... )) especially if you can't guarantee $IFS won't contain ASCII decimal digits. If /some/path/*.txt won't match any file, in shells other than zsh, ./prepare.sh will be called with a literal /some/path/*.txt instead of aborting the shell, which may also be problematic. – Stéphane Chazelas Jul 28 '22 at 06:40
4

In Bash you could store the filenames in an array and loop over a slice of it, repeating with the next slice until the array is exhausted:

#!/bin/bash
shopt -s nullglob
files=(/some/path/*.txt)
for (( i = 0; i < "${#files[@]}"; i += 2)); do
    for f in "${files[@]:i:2}"; do
        printf "processing %s\n" "$f";
    done
    read -p "press enter to continue with the next set (or end)..."
done 

Or restructured to avoid a possible prompt at the very end, at the cost of losing simplicity:

#!/bin/bash
shopt -s nullglob
files=(/some/path/*.txt)
i=0 
while true; do
    for f in "${files[@]:i:2}"; do
        printf "processing %s\n" "$f";
    done
    (( i += 2 ))
    (( i >= "${#files[@]}" )) && break
    read -p "press enter to continue with the next set..."
done
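For reference, ${files[@]:i:2} expands to at most two elements starting at index i, and a slice that runs past the end of the array is silently truncated. A tiny illustration with made-up names:

```shell
#!/bin/bash
files=(a.txt b.txt c.txt d.txt e.txt)
echo "${files[@]:0:2}"   # a.txt b.txt
echo "${files[@]:2:2}"   # c.txt d.txt
echo "${files[@]:4:2}"   # e.txt -- slice past the end is simply shorter
```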
ilkkachu
2

For completeness, one way to process files in batches in zsh is with:

files=( /some/path/*.txt(N) )
while (( $#files )) {
  process-up-to-5-files $files[1,5]
  files[1,5]=()
}

files[1,5]=() can also be written shift 5 files but with the caveat that shift fails if $files contains fewer than 5 elements.
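A minimal sketch of this slice-deletion behavior (assuming zsh is installed; the array contents are made up):

```shell
#!/usr/bin/env zsh
# files[1,5]=() removes the first five elements; unlike "shift 5 files"
# it does not fail when the array holds fewer than five elements.
files=(a b c d e f g)
files[1,5]=()
print -r -- $files      # f g
short=(x y)
short[1,5]=()           # no error despite fewer than 5 elements
print -r -- ${#short}   # 0
```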

1

I'd suggest using gnu-parallel

  1. Place your task into a script
#!/bin/bash
echo "unpacking $1"
/full/path/to/prepare.sh "$1" && rm -v "$1"
  2. Run it via gnu-parallel (the actual name of the executable may vary depending on your distribution and how you installed the program)
parallel --halt soon,success=5 /path/to/script {} ::: *.txt

parallel by default runs one job per CPU, --halt soon,success=5 means "stop processing after 5 jobs ran successfully, but let running jobs finish". {} replaces the filename and ::: is the separator to the argument list.

It will not wait for a "continue" from your side, but since you remove the original files, you can simply restart the process without processing duplicates.

FelixJN
    Processing all files defeats the purpose of the question. – Gilles 'SO- stop being evil' Jul 26 '22 at 19:10
    @Gilles'SO-stopbeingevil' But this solution stops after 5 successful jobs. That gets quite close. – glglgl Jul 27 '22 at 08:39
    @Gilles'SO-stopbeingevil' Well, the number of parallel jobs can be reduced to 2 via -j 2 and with --halt soon,success=1 it should stop after 2 (maybe 3 if there is a race condition between starting a job and counting successful finishes, I cannot tell). My numbers were just dummy suggestions, as I assumed the 2 files was just a generic number from OP's side. – FelixJN Jul 27 '22 at 10:59