277

Is using a while loop to process text generally considered bad practice in POSIX shells?

As Stéphane Chazelas pointed out, the reasons for not using shell loops fall under conceptual, reliability, legibility, performance and security concerns.

This answer explains the reliability and legibility aspects:

while IFS= read -r line <&3; do
  printf '%s\n' "$line"
done 3< "$InputFile"

For performance, the while loop and read are tremendously slow when reading from a file or a pipe, because the read shell built-in reads one character at a time.

What about the conceptual and security aspects?

cuonglm
  • 153,898
  • 2
    Related (the other side of the coin): How does yes write to file so quickly? – Wildcard Jan 25 '16 at 12:49
  • 13
    The read shell built-in doesn't read a single character at a time, it reads a single line at a time. http://wiki.bash-hackers.org/commands/builtin/read – Adam D. Jul 13 '16 at 01:30
  • 1
    @A.Danischewski: It depends on your shell. In bash, it reads one buffer size at a time, try dash for example. See also http://unix.stackexchange.com/q/209123/38906 – cuonglm Jul 13 '16 at 02:34
  • 5
    @cuonglm That's why you should not make claims like "tremendously slow" so positively. It depends. On a bunch of systems it is fast enough. – ᄂ ᄀ Jun 11 '21 at 07:58
  • 5
    @A.Danischewski, read reads words from a logical line one byte at a time though in the special case where the input is seekable, with some implementations, can read a block of bytes at a time and seek back to after the first unescaped newline character as an optimisation. – Stéphane Chazelas Sep 27 '21 at 10:10

5 Answers

387

Yes, we see a number of things like:

while read line; do
  echo $line | cut -c3
done

Or worse:

for line in $(cat file); do
  foo=$(echo $line | awk '{print $2}')
  bar=$(echo $line | awk '{print $3}')
  doo=$(echo $line | awk '{print $5}')
  echo $foo whatever $doo $bar
done

(don't laugh, I've seen many of those).

Generally from shell scripting beginners. Those are naive literal translations of what you would do in imperative languages like C or Python, but that's not how you do things in shells, and those examples are very inefficient (the latter one spawns six child processes for each input line), completely unreliable (potentially leading to security issues), and if you ever manage to fix most of the bugs, your code becomes illegible.
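Assuming the intent of that second loop was to print fields 2, 5 and 3 of each line (file is just the placeholder name from the examples), the whole thing collapses into a single awk invocation:

awk '{print $2, "whatever", $5, $3}' < file

One process for the whole file instead of half a dozen per line, and none of the quoting and splitting pitfalls discussed below.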

Conceptually

In C or most other languages, building blocks are just one level above computer instructions. You tell your processor what to do and then what to do next. You take your processor by the hand and micro-manage it: you open that file, you read that many bytes, you do this, you do that with it.

Shells are a higher-level language. One may say it's not even a language: they're first and foremost command-line interpreters. The job is done by the commands you run, and the shell is only meant to orchestrate them.

One of the great things that Unix introduced was the pipe and those default stdin/stdout/stderr streams that all commands handle by default.

In 50 years, we haven't found a better API than that to harness the power of commands and have them cooperate on a task. That's probably the main reason why people are still using shells today.

You've got a cutting tool and a transliterating tool, and you can simply do:

cut -c4-5 < in | tr a b > out

The shell is just doing the plumbing (opening the files, setting up the pipes, invoking the commands), and once it's all ready, the data just flows without the shell doing anything. The tools do their job concurrently and efficiently at their own pace, with enough buffering so that neither blocks the other; it's beautiful and yet so simple.

Invoking a tool, though, has a cost (and we'll expand on that under the performance point). Those tools may be written with thousands of instructions in C. A process has to be created, the tool has to be loaded and initialised, then cleaned up, and the process destroyed and waited for.

Invoking cut is like opening the kitchen drawer, taking the knife, using it, washing it, drying it, and putting it back in the drawer. When you do:

while read line; do
  echo $line | cut -c3
done < file

It's like, for each line of the file, getting the read tool from the kitchen drawer (a very clumsy one, because it wasn't designed for that), reading a line, washing your read tool and putting it back in the drawer. Then scheduling a meeting for the echo and cut tools, getting them from the drawer, invoking them, washing them, drying them, putting them back in the drawer, and so on.

Some of those tools (read and echo) are built into most shells, but that hardly makes a difference here, since echo and cut still need to be run in separate processes.

It's like cutting an onion, but washing your knife and putting it back in the kitchen drawer between each slice.

Here the obvious way is to get your cutting tool from the drawer, slice your whole onion, and put the knife back in the drawer once the whole job is done.

IOW, in shells, especially to process text, you invoke as few utilities as possible and have them cooperate on the task, rather than running thousands of tools in sequence, waiting for each one to start, run and clean up before running the next one.

Further reading in Bruce's fine answer. The low-level text-processing tools built into shells (except maybe zsh's) are limited, cumbersome, and generally not fit for general text processing.

Performance

As said earlier, running a command has a cost: a huge cost if that command is not built in, and still a significant one even if it is.

And shells have not been designed to run like that; they have no pretension to being performant programming languages. They are not: they're just command-line interpreters. So, little optimisation has been done on this front.

Also, shells run commands in separate processes. Those building blocks don't share a common memory or state. When you do an fgets() or fputs() in C, that's a function in stdio. stdio keeps internal buffers for input and output for all the stdio functions, to avoid doing costly system calls too often.

The corresponding shell utilities, even the built-in ones (read, echo, printf), can't do that. read is meant to read just one line: if it read past the newline character, the next command you run would miss that data. So read has to read the input one byte at a time (some implementations have an optimisation when the input is a regular file: they read chunks and seek back to just after the newline, but that only works for regular files, and bash for instance only reads 128-byte chunks, which is still a lot less than text utilities typically read).
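A quick way to see why read must not read ahead (a sketch; the file name is made up):

printf 'one\ntwo\nthree\n' > file
{
  IFS= read -r first   # read consumes only the first line...
  cat                  # ...so cat still sees "two" and "three"
} < file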

Same on the output side: echo can't just buffer its output, it has to write it straight away, because the next command you run will not share that buffer.

Obviously, running commands sequentially means you have to wait for them; it's a little scheduler dance that passes control from the shell to the tools and back. It also means (as opposed to using long-running instances of tools in a pipeline) that you cannot harness several processors at the same time, even when they're available.

Between that while read loop and the (supposedly) equivalent cut -c3 < file, there's a CPU time ratio of around 40000 in my quick tests (one second versus half a day). But even if you use only shell builtins:

while read line; do
  echo ${line:2:1}
done

(here with bash), that's still around 1:600 (one second vs 10 minutes).
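If you want to reproduce that kind of measurement yourself, here is a rough sketch (numbers will vary wildly with system and shell; keep the file smallish or you'll wait a long time on the loop version):

seq 100000 > file      # seq is a common convenience, not POSIX

time cut -c3 < file > /dev/null

time bash -c '
  while IFS= read -r line; do
    echo "${line:2:1}"
  done < file
' > /dev/null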

Reliability/legibility

It's very hard to get that code right. The examples I gave are seen too often in the wild, but they have many bugs.

read is a handy tool that can do many different things: it can read input from the user and split it into words to store in different variables. But read line does not read a line of input; or rather, it reads a line in a very special way. It actually reads words from the input, words separated by $IFS, where backslash can be used to escape the separators or the newline character.

With the default value of $IFS, on an input like:

   foo\/bar \
baz
biz

read line will store "foo/bar baz" into $line, not "   foo\/bar \" as you'd expect.

To read a line, you actually need:

IFS= read -r line

That's not very intuitive, but that's the way it is; remember, shells were not meant to be used like that.
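You can check the difference on the sample input above (the input file name is just for illustration):

printf '   foo\\/bar \\\nbaz\nbiz\n' > input

read line < input;         printf '[%s]\n' "$line"   # [foo/bar baz]
IFS= read -r line < input; printf '[%s]\n' "$line"   # [   foo\/bar \]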

Same for echo: echo expands escape sequences (and, depending on the shell, treats some leading arguments as options), so you can't use it for arbitrary content like the content of a random file. You need printf here instead.
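A quick illustration (exactly how echo behaves here varies between shells and settings, which is the problem):

var='-n hello\nworld'
echo $var              # may drop -n, expand \n and word-split: it depends
printf '%s\n' "$var"   # reliably prints: -n hello\nworld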

And of course, there's the typical forgetting of quoting your variable which everybody falls into. So it's more:

while IFS= read -r line; do
  printf '%s\n' "$line" | cut -c3
done < file

Now, a few more caveats:

  • except in zsh, that doesn't work if the input contains NUL characters, whereas GNU text utilities at least would not have that problem.
  • if there's data after the last newline, it will be skipped
  • inside the loop, stdin is redirected from the file, so you need to make sure the commands in the loop don't also read from stdin.
  • for the commands within the loop, we're not paying attention to whether they succeed or not. Usually, error conditions (disk full, read errors...) will be poorly handled, usually more poorly than with the correct equivalent. Many commands, including several implementations of printf, also don't reflect their failure to write to stdout in their exit status.

If we want to address some of those issues above, that becomes:

while IFS= read -r line <&3; do
  {
    printf '%s\n' "$line" | cut -c3 || exit
  } 3<&-
done 3< file
if [ -n "$line" ]; then
    printf '%s' "$line" | cut -c3 || exit
fi

That's becoming less and less legible.

There are a number of other issues with passing data to commands via the arguments or retrieving their output in variables:

  • the limitation on the size of arguments (some text utility implementations have limits there as well, though the effects of reaching those are generally less problematic)
  • the NUL character (also a problem with text utilities).
  • arguments taken as options when they start with - (or sometimes +); a sketch of that problem follows this list
  • various quirks of various commands typically used in those loops like expr, test...
  • the (limited) text manipulation operators of various shells that handle multi-byte characters in inconsistent ways.
  • ...
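To illustrate the option-lookalike point above, take a value that happens to start with a dash (file is a placeholder name):

line='--count'          # data that merely looks like an option
grep "$line" file       # grep tries to parse --count as an option
grep -e "$line" file    # -e forces the next argument to be the pattern
grep -- "$line" file    # "--" marks the end of options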

Security considerations

When you start working with shell variables and arguments to commands, you're entering a mine-field.

If you forget to quote your variables, forget the end-of-options marker, or work in locales with multi-byte characters (the norm these days), you're sure to introduce bugs that sooner or later will become vulnerabilities.

When you may want to use loops

Using a shell loop to process text may make sense when your task involves what the shell is good at: launching external programs.

E.g. a loop like the following might make some sense:

while IFS= read -r line; do
    someprog -f "$line"
done < file-list.txt

though the simple case above, where the input is passed unmodified to someprog, could also be done with e.g. xargs:

<file-list.txt tr '\n' '\0' | xargs -r0 -n1 someprog -f 

Or with GNU xargs:

xargs -rd '\n' -n1 -a file-list.txt someprog -f
ilkkachu
  • 138,973
  • 37
    Clear (vividly), readable and extremely helpful. Thank you once again. This is actually the best explanation I have seen anywhere on the internet for the fundamental difference between shell scripting and programming. – Wildcard Oct 24 '15 at 18:05
  • 3
    It's posts like these that help beginners learn about Shell Scripts and see it's subtle differences. Should add referencing variable as ${VAR:-default_value} to ensure you don't get a null. and set -o nounset to yell at you when referencing a non-defined value. – unsignedzero Dec 04 '15 at 17:49
  • One small point missed in the answer is that the cut command should be piped from the entire loop, rather than from each individual print inside the loop. (And in the example where you process data not ending in a newline, you can pipe both the loop and the final print into the cut as one group of commands). I don't know if I'm missing some greater point here though. – Score_Under Jul 26 '16 at 15:24
  • @StéphaneChazelas: Umm, what is the appropriate substitute for for i in $(cat file); do ... done? – countermode Sep 13 '16 at 08:35
  • 2
    @countermode, cat file, awk < file '{...}', perl -ne '...' < file, sed '...' < file... cut ... < file | tr ... | paste .... for i in $(...) is even worse than a while read loop as that's invoking the split+glob operator which most people forget to tune before using and also means storing the whole file content in memory (several times, and in a very inefficient way with some shell implementations like bash). – Stéphane Chazelas Sep 13 '16 at 08:39
  • 5
    @ivan_pozdeev. Don't know what PowerShell does, but I can't see how the shell would be involved there. The structured data would have to be produced and understood by the applications that the shell interconnects. A byte stream like used by pipes that shells like bash use to interconnect applications are well capable of carrying any structured data. The problem here is the only widely supported structure is the line (and one implemented with serious limations: like using a byte delimiter, sometimes with limited length, sometimes with byte values not allowed). Or am I missing something? – Stéphane Chazelas Nov 11 '16 at 16:25
  • 13
    "In 45 years, we've not found better than that API to harness the power of commands and have them cooperate to a task." - actually, PowerShell, for one, has solved the dreaded parsing problem by passing around structured data rather than byte streams. The sole reason shells don't use it yet (the idea has been there for quite a while and has basically crystallized sometime around Java when the now-standard list & dictionary container types became mainstream) is their maintainers couldn't yet agree on the common structured data format to use (. – ivan_pozdeev Nov 11 '16 at 16:25
  • 3
    And it's because of that lack of any agreed-upon standards that programs have so much difficulty reliably understanding one another when arbitrary data are involved. What PowerShell does is "cut the middleman", providing that standard itself, with no encoding and decoding on the programs' part needed (and also provides dedicated facilities to parse/encode to interact with programs that still use unstructured streams). – ivan_pozdeev Nov 11 '16 at 17:05
  • 4
    It's basically what every programming language does with container data types (and Perl was specifically designed with the goal and core syntax features to parse unstructured data into structured to be able to process it more effectively and reliably than shell could), but for every language, this efficiently is always being limited to code blocks in that language linked together and nothing else. – ivan_pozdeev Nov 11 '16 at 17:06
  • 2
    PowerShell's designers were essentially the first ones to successfully apply that for arbitrary programs (there was an earlier effort to duplicate POSIX tools with XML as the interchange format but it didn't catch on due to XML not being a suitable format for streamed/binary/loosely structured data). – ivan_pozdeev Nov 11 '16 at 17:13
  • 3
    @StéphaneChazelas: Bump because of the end of this answer saying: "When you may want to use loops. TBD" : I've come back regularly on this answer (and others) and this section begs to be Determined ... I hope you find the time one day to complete it :) – Olivier Dulac Jan 20 '17 at 14:16
  • 3
    You can pipe a loop into something else, which is occasionally useful for things like for foo in *.bar; do some_command "$foo"; done | followed by some reasonable text processing pipeline, where some_command is a thing that doesn't like getting multiple arguments so you can't just hand it the glob directly. You could also do some_command "${foo%.bar}" if your command is afraid of file extensions for some bizarre reason. Most of these are design flaws in some_command, but the whole point of the shell is to provide the duct tape to work around those flaws. – Kevin Feb 24 '17 at 04:16
  • 16
    @OlivierDulac I think that's a bit of humour. That section will be forever TBD. – muru May 13 '18 at 14:25
  • 4
    But if you have a small text without any special character and you are alone in a room without any stack exchange user beside you, i think you should still write "while read line" there is no problem, ;-) that will make you a beginner for some seconds... But it will probably works! – Luciano Andress Martini Sep 06 '18 at 16:15
  • 4
    "It's like cutting an onion but washing your knife and put it back in the kitchen drawer between each slice." Great metaphor. I sit gobsmacked. After 20 years, I still code the shell like a beginner. Because "just get it done" was the imperative. No elegance, no efficiency in my work. – Ken Ingram Jan 25 '20 at 23:59
  • While the idea of the article is correct (for pure text processing), there are cases when loops are the best / acceptable solution even in the text processing. For example time-stamping output in bash is only 2-2.5 times slower than in perl or awk for 6M, 42000 lines file. – Alek May 31 '20 at 12:45
  • 2
    I also noticed that author made a mess of the shell loop. Don't know for the other shells, but in bash it's simply while IFS= read -r line || [ -n "$line" ]; do echo "$line" ...; done. Last condition is not needed. As for echo expands sequences - in bash it is false (without '-e'). File handle 3 introduced unnecessarily, it's done with cut .. </dev/null if at all needed. – Alek May 31 '20 at 12:53
  • 1
    @Alek, your || [ -n ... ] approach adds a trailing newline on output which wasn't there on input (it may or may not be desired). Whether bash's echo expands sequences depends on the environment and how it was built; for some other bash deployments, $lines like -Een would be a problem. You simply can't use echo for arbitrary data. See Why is printf better than echo? for details. Using a fd other than 0-2 is what you need in the general case, so that 0-2 are preserved within the loop, that covers one of the points discussed here. – Stéphane Chazelas May 31 '20 at 13:26
  • @StéphaneChazelas, || [ -n ... ] I know, but I think no one cares, in all my usage cases it is even desired. "fd other than 0-2 .." - bash preserves all fds when you run command <... >... 2>... the fds only changed for THAT command and then restored (just use cut <&-) . No need to use other fds. I can only agree with you on echo, as it lacks --. The statement would be: while IFS= read -r line || [ -n "$line" ]; do printf ... "$line" ...; done. As I said, sometimes it's suitable or even the best way; it CAN be reliable, safe, legible, performant enough for some text processing tasks. – Alek Jun 01 '20 at 18:00
  • 1
    if there's data after the last newline, it will be skipped - POSIX defines text files as consisting of "zero or more lines" and lines as "A sequence of zero or more non- characters plus a terminating character". A file which is not newline-terminated is thus not a text file as per this definition, so this point can be ignored. – Vilinkameni Aug 16 '23 at 18:21
  • 1
    In a 35 year programming career, I'd be struggling to think of a time that I wrote a shell script, but then was dissatisfiled because it was too slow. Since I'm not an awk or perl guru, I won't hesitate to use read, if it gets the job done. Sure, it's good to know what makes shell programs slow, but when was the last time any of us ran a shell program, was annoyed because it was slow, and it was slow because of the above supposed guidelines? If speed was important I probably wouldn't be using plain text files and/or scripting anyway. – xpusostomos Nov 24 '23 at 05:01
  • Also, awk is not a panacea either, in Kernigan & Pike's famous "UNIX Programming environment" he describes how awk was too slow for processing of his book. – xpusostomos Nov 24 '23 at 05:01
56

As far as the conceptual and legibility aspects go, shells typically deal in files. Their "addressable unit" is the file, and the "address" is the file name. Shells have all kinds of methods for testing file existence, file type, and file name formatting (beginning with globbing). Shells have very few primitives for dealing with file contents; shell programmers have to invoke another program to deal with those.

Because of the file and file name orientation, doing text manipulation in the shell is really slow, as you've noted, but also requires an unclear and contorted programming style.
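A small sketch of that split (the paths and pattern are made up): names and metadata are the shell's home turf, while contents are handed off to another program:

for f in /var/log/*.log; do
  [ -f "$f" ] || continue                       # type test: a shell primitive
  [ -s "$f" ] && printf 'non-empty: %s\n' "$f"  # size test: also a primitive
done

grep -l ERROR /var/log/*.log                    # contents: delegated to grep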

mikeserv
  • 58,310
35

There are some complicated answers, giving a lot of interesting details for the geeks among us, but it's really quite simple - processing a large file in a shell loop is just too slow.

I think the questioner is interested in a typical kind of shell script, which may start with some command-line parsing, environment setting, checking files and directories, and a bit more initialization, before getting on to its main job: going through a large line-oriented text file.

For the first parts (initialization), it doesn't usually matter that shell commands are slow – it's only running a few dozen commands, maybe with a couple of short loops. Even if we write that part inefficiently, it's usually going to take less than a second to do all that initialization, and that's fine - it only happens once.

But when we get on to processing the big file, which could have thousands or millions of lines, it is not fine for the shell script to take a significant fraction of a second (even if it's only a few dozen milliseconds) for each line, as that could add up to hours.

That's when we need to use other tools, and the beauty of Unix shell scripts is that they make it very easy for us to do that.

Instead of using a loop to look at each line, we need to pass the whole file through a pipeline of commands. This means that, instead of calling the commands thousands or millions of times, the shell calls them only once. It's true that those commands will have loops to process the file line by line, but they are not shell scripts, and they are designed to be fast and efficient.

Unix has many wonderful built-in tools, ranging from the simple to the complex, that we can use to build our pipelines. I would usually start with the simple ones and only use more complex ones when necessary.

I would also try to stick with standard tools that are available on most systems, and try to keep my usage portable, although that's not always possible. And if your favourite language is Python or Ruby, maybe you won't mind the extra effort of making sure it's installed on every platform your software needs to run on :-)

Simple tools include head, tail, grep, sort, cut, tr, sed, join (when merging 2 files), and awk one-liners, among many others. It's amazing what some people can do with pattern-matching and sed commands.
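As a hypothetical example of such a pipeline, here is one way to list the ten most frequent clients in a log file, assuming the client address is the first space-separated field of access.log:

cut -d' ' -f1 access.log | sort | uniq -c | sort -rn | head -n 10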

When it gets more complex, and you really have to apply some logic to each line, awk is a good option - either a one-liner (some people put whole awk scripts in 'one line', although that's not very readable) or in a short external script.

As awk is an interpreted language (like your shell), it's amazing that it can do line-by-line processing so efficiently, but it's purpose-built for this and it's really very fast.

And then there's Perl and a huge number of other scripting languages that are very good at processing text files, and also come with lots of useful libraries.

And finally, there's good old C, if you need maximum speed and high flexibility (although text processing is a bit tedious). But it's probably a very bad use of your time to write a new C program for every different file-processing task you come across. I work with CSV files a lot, so I have written several generic utilities in C that I can re-use in many different projects. In effect, this expands the range of 'simple, fast Unix tools' that I can call from my shell scripts, so I can handle most projects by only writing scripts, which is much faster than writing and debugging bespoke C code each time!

Some final hints:

  • don't forget to start your main shell script with export LANG=C, or many tools will treat your plain-old-ASCII files as Unicode, making them much, much slower
  • also consider setting export LC_ALL=C if you want sort to produce consistent ordering regardless of the environment (see the example after this list)
  • if you need to sort your data, that will probably take more time (and resources: CPU, memory, disk) than everything else, so try to minimize the number of sort commands and the size of the files they're sorting
  • a single pipeline, when possible, is usually most efficient – running multiple pipelines in sequence, with intermediate files, may be more readable and debug-able, but will increase the time that your program takes
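For example, following the locale hints above (the file names are made up), you can set the locale for the whole script or only for the one command that needs it:

export LANG=C LC_ALL=C                        # at the top of the script

LC_ALL=C sort -u wordlist.txt > sorted.txt    # or per command only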
ᄂ ᄀ
  • 364
  • 12
    Pipelines of many simple tools (specifically the mentioned ones, like head, tail, grep, sort, cut, tr, sed, ...) are often used unnecessarily, specifically if you already also have an awk instance in that pipeline which can do the tasks of those simple tools as well. Another issue to be considered is that in pipelines you cannot simply and reliably pass state information from processes at the front side of a pipeline to processes that appear on the rear side. If you use for such pipelines of simple programs an awk program you have a single state space. – Janis Mar 03 '15 at 05:54
  • 2
    Setting LC_ALL is over-used, as it overrides all possible locale settings, not just the ones that affect input parsing. To make sure that sort works consistently, LC_COLLATE=C is the appropriate setting. That said, LC_ALL=C.UTF-8 has identical sorting as C, and doesn't break tools that need to treat Unicode characters as such. – Martin Kealey Jun 30 '22 at 12:21
17

Yes, but...

Stéphane Chazelas's correct answer is based on the concept of delegating every text operation to specific binaries, like grep, awk, sed and others.

As bash is capable of doing a lot of things by itself, dropping forks may make things quicker (even quicker than running another interpreter to do the whole job).

For example, have a look at this post:

https://stackoverflow.com/a/38790442/1765658

and

https://stackoverflow.com/a/7180078/1765658

test and compare...
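For instance, a fork-free way to pull fields out of a small string is plain POSIX parameter expansion (the sample data is made up):

line='alice:x:1000:1000:Alice:/home/alice:/bin/sh'

# with forks: a subshell plus a cut process per field
user=$(echo "$line" | cut -d: -f1)
shell=$(echo "$line" | cut -d: -f7)

# without forks: the shell does the work itself
user=${line%%:*}
shell=${line##*:}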

Of course

No consideration is given here to user input or security!

Don't write web applications in bash!!

But for a lot of server administration tasks, where bash could be used in place of another interpreter, using bash built-ins can be very efficient.

My opinion:

Writing tools like the bin utils is not the same kind of work as system administration.

So, not the same people!

Since sysadmins have to know the shell, they can write prototypes using their preferred (and best-known) tool.

If such a new utility (prototype) turns out to be really useful, other people can develop a dedicated tool in a more appropriate language.

ᄂ ᄀ
  • 364
  • 2
    Good example. Your approach is certainly more efficient than lololux one, but note how tensibai's answer (the right way to do this IMO, that is without using shell loops) is orders of magnitude faster than yours. And yours is a lot faster if you don't use bash. (over 3 times as fast with ksh93 in my test on my system). bash is generally the slowest shell. Even zsh is twice as fast on that script. You also have a few issues with unquoted variables and the usage of read. So you are actually illustrating a lot of my points here. – Stéphane Chazelas Aug 05 '16 at 14:58
  • 1
    @StéphaneChazelas I agree, [tag:bash] is probably the slowest [tag:shell] people could use today, but the most widely used anyway. – F. Hauri - Give Up GitHub Aug 05 '16 at 16:10
  • @StéphaneChazelas I've posted a [tag:perl] version on my answer – F. Hauri - Give Up GitHub Aug 05 '16 at 16:26
  • Thanks to @StéphaneChazelas I just spent 1h reading the answers here :) The drawback I personally have with perl (or anything else than bash at all) is that I may not have the interpreter on some servers, whereas I'm pretty sure I'll find bash, so I tend to use bash by default. I agree you should not develop a 'normal' user interface in shell btw. – Tensibai Aug 22 '16 at 12:05
  • 3
    @Tensibai, you will find POSIX sh, Awk, Sed, grep, ed, ex, cut, sort, join...all with more reliability than Bash or Perl. – Wildcard Nov 30 '16 at 09:48
  • @Wildcard I slightly disagree for Awk (and which one, mawk, gawk ?). I agree sh will be present, but I fail to see any major distribution where it is not provided by bash. (And side note If you have a look at the link in first comment, it just use grep, awk and take advantage of the path expansion of the shell, I think sh will do as well as bash). – Tensibai Nov 30 '16 at 10:00
  • @Tensibai, Ubuntu uses dash as the default /bin/sh. There are others. As for Awk, stick to the POSIX features (which is what I linked to) and the particular implementation won't matter. And yes, that answer with grep and awk uses no Bashisms and could have a #!/bin/sh shebang. – Wildcard Nov 30 '16 at 16:21
  • 2
    @Tensibai, of all systems concerned by U&L, most of them (Solaris, FreeBSD, HP/UX, AIX, most embedded Linux systems...) don't come with bash installed by default. bash is mostly found only on Apple macOS and GNU systems (I suppose that's what you call major distributions), though many systems also have it as an optional package (like zsh, tcl, python...) – Stéphane Chazelas Dec 01 '16 at 16:56
  • 2
    @Stephane well, Cisco nexus does use bash, checkpoint, f5, beware, bluecoat also (what I call embedded systems indeed) I was scoping distribution on Linux, but I hardly remember a hp/ux or Aix without bash, even the Aix emulation within os400 I've seen ad it. But anyway, the point was more 'there's a better chance to get bash than perl', I fully agree that staying posix compatible should be the main goal when you write 'portable' code. – Tensibai Dec 01 '16 at 17:51
  • @StéphaneChazelas I'm proudly to present my newConnector function, for reducing forks, present on github – F. Hauri - Give Up GitHub Feb 12 '18 at 14:01
-1

The accepted answer is good as it states clearly the drawbacks of parsing text files in the shell, but people have been cargo culting the main idea (mainly, that shell scripts deal poorly with text processing tasks) to criticize anything that uses a shell loop.

There is nothing inherently wrong with shell loops to the extent there is nothing wrong with loops in shell scripts or command substitutions outside of loops. It is certainly true that in most cases, you can replace them with more idiomatic constructs. For example, instead of writing

for i in $(find . -iname "*.txt"); do
...
done

write this:

for i in *.txt; do
...
done

In other scenarios, it is better to rely on more specialized tools such as awk, sed, cut, join, paste, datamash, miller, general-purpose programming languages with good text processing capabilities (e.g. perl, python, ruby), or parsers for specific file types (XML, HTML, JSON).

Having said that, using a shell loop is the right call as long as you know:

  1. Performance is not a priority. Is it important that your script runs fast? Are you running a task once every few hours as a cron job? Then maybe performance is not an issue. Or if it is, run benchmarks to make sure your shell loop is not a bottleneck. Intuition or preconceptions about which tool is "fast" or "slow" cannot serve as a replacement for accurate benchmarks.
  2. Legibility is maintained. If you're adding too much logic in your shell loop that it becomes hard to follow, then you may need to rethink this approach.
  3. Complexity does not increase substantially.
  4. Security is preserved.
  5. Testability doesn't become an issue. Properly testing shell script is already difficult. If using external commands makes it more difficult to know when you have a bug in your code or you're working under incorrect assumptions about return values, then that's a problem.
  6. The shell loop has the same semantics as the alternative, or the differences don't matter for what you're doing at the moment. For example, the find command above recurses into subdirectories and matches files whose names start with a dot, while the glob version does neither. (The find version will also mangle file names containing whitespace, since the output of the command substitution is word-split.) A find-based alternative that keeps those semantics is sketched below.
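For cases where find's recursion and dot-file semantics do matter, a sketch that keeps them while also surviving awkward file names is to let find run the per-file work itself (-name is used here because -iname is a common but non-POSIX extension):

find . -name '*.txt' -exec sh -c '
  for f in "$@"; do
    printf "processing: %s\n" "$f"   # replace with the real per-file work
  done
' sh {} +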

As an example demonstrating it is not an impossible task to satisfy the previous statements, this is the pattern used in an installer for a well-known commercial software:

i=1
MD5=... # embedded checksum
for s in $sizes
do
    checksum=`echo $VAR | cut -d" " -f $i`
    if <checksum condition>; then
       md5=`echo $MD5 | cut -d" " -f $i`
       ...
    fi
done

This runs only a very small number of times; its purpose is clear; it is concise and doesn't increase complexity unnecessarily; no user-controlled input is used, and therefore security is not a concern. Does it matter that it invokes additional processes in a loop? Not at all.

r_31415
  • 516
  • 3
    Your first two examples are not equivalent. find recurses into subdirectories, and for file in *.txt skips files whose names start with a dot. – Keith Thompson Dec 23 '22 at 20:17
  • You have to appreciate the irony of people downvoting, precisely making the point I warned about in the answer. If the reason is the previous comment, I didn't reply because that @KeithThompson's objection was not the point of this answer. – r_31415 Mar 15 '23 at 20:20
  • 1
    I fail to see the irony. You listed several things to consider when replacing find with a shell loop, but you didn't consider the behavior on files and directories whose name starts with a dot. There is a real difference in the behavior between your two code snippets, and that's very relevant to the point of your answer. (BTW, I didn't downvote.) – Keith Thompson Mar 15 '23 at 20:40
  • Yes, I know you didn't downvote. My answer was not a reply to your comment, which is much older than the recent downvote. Maybe you're nitpicking a little bit on the differences in those code snippets, but as I said, that was really not the point of my answer, so I didn't think it was necessary to change it. – r_31415 Mar 16 '23 at 08:58
  • 1
    You presented a code snippet using find, then you presented a second code snippet using *.txt. The two snippets are not equivalent, and you refuse to acknowledge that in your answer. Yes, the second snippet can be useful if you happen to know that there are no files with names starting with . and ending with .txt, but users should be aware of that issue. I didn't downvote before, but I will now. – Keith Thompson Mar 16 '23 at 19:19
  • I acknowledged both are not the same. Does it matter in the context of this answer? No. You can downvote all you want. I'm not even participating in the site anymore, so you're free to waste your time here on my behalf :) – r_31415 Mar 16 '23 at 20:30
  • 1
    You didn't acknowledge it in your answer, which is what matters. You list 5 things to consider, but you don't include the fact that *.txt doesn't match files starting with .. – Keith Thompson Mar 16 '23 at 20:40
  • I don't understand why you continue arguing about this irrelevant issue. I didn't acknowledge that in my answer or as a reply to your comment precisely because it was not worth my time, and I'm sure it is not worth your time either, so just let it go. – r_31415 Mar 16 '23 at 20:48
  • There is nothing inherently wrong with shell loops, but there is with using them for parsing files. The list of things which are wrong about using them for parsing files, along with the explanation of why they are wrong, is a part of the accepted answer. Sure, they might work, but with listed drawbacks. Often there is a simpler solution, which also avoids the drawbacks. – Vilinkameni Aug 18 '23 at 11:32