> it seems Bash is a Turing-complete language
The concept of Turing completeness is entirely separate from many other concepts useful in a language for programming in the large: usability, expressiveness, understandability, speed, etc.
If Turing-completeness were all we required, we wouldn't have any programming languages at all, not even assembly language. Computer programmers would all just write in machine code, since our CPUs are also Turing-complete.
> why is Bash used almost exclusively to write relatively simple scripts?
Large, complex shell scripts — such as the `configure` scripts output by GNU Autoconf — are atypical for many reasons:
Until relatively recently, you couldn't count on having a POSIX-compatible shell everywhere.
Many systems, particularly older ones, do technically have a POSIX-compatible shell somewhere on the system, but it may not be in a predictable location like `/bin/sh`. If you're writing a shell script and it has to run on many different systems, how then do you write the shebang line? One option is to go ahead and use `/bin/sh`, but choose to restrict yourself to the pre-POSIX Bourne shell dialect in case it gets run on such a system.
Pre-POSIX Bourne shells don't even have built-in arithmetic; you have to call out to `expr` or `bc` to get that done.
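To make that concrete, here's a minimal sketch of what writing to that restricted dialect looks like; the variable and the values are made up for illustration:

```sh
#!/bin/sh
# Restricting ourselves to the pre-POSIX Bourne dialect: no $(( )) arithmetic,
# so every calculation forks an external program.
count=0
count=`expr $count + 1`      # old-school: call out to expr(1)
echo "count is $count"

# A POSIX shell would let you do the same thing without a fork:
#   count=$((count + 1))
```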
Even with a POSIX shell, you're missing out on associative arrays and other features we've expected to find in Unix scripting languages since Perl first became popular in the early 1990s.
That fact of history means there is a decades-long tradition of ignoring many of the powerful features in modern Bourne family shell script interpreters purely because you can't count on having them everywhere.
In fact, this continues to this day: Bash didn't get associative arrays until version 4, but you might be surprised how many systems still in use are based on Bash 3. Apple still ships Bash 3 with macOS in 2017 — apparently for licensing reasons — and Unix/Linux servers often run all but untouched in production for a very long time, so you might have a stable old system still running Bash 3, such as a CentOS 5 box. If you have such systems in your environment, you can't use associative arrays in shell scripts that have to run on them.
If your answer to that problem is that you only write shell scripts for "modern" systems, you then have to cope with the fact that the last common reference point for most Unix shells is the POSIX shell standard, which is largely unchanged since it was introduced in 1989. There are many different shells based on that standard, but they've all diverged to varying degrees from that standard. To take associative arrays again, `bash`, `zsh`, and `ksh93` all have that feature, but there are multiple implementation incompatibilities. Your choice, then, is to only use Bash, or only use Zsh, or only use `ksh93`.
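As a rough illustration of the kind of divergence I mean (treat this as a sketch rather than a compatibility reference; exact syntax varies by shell version, and the array and keys here are made up):

```bash
# bash 4+ and ksh93: declare/typeset -A, keys listed with ${!arr[@]}
declare -A color            # ksh93 spells this: typeset -A color
color[apple]=red
color[sky]=blue
for k in "${!color[@]}"; do echo "$k is ${color[$k]}"; done

# zsh: typeset -A, traditionally initialized from key/value pairs, with keys
# listed via a parameter expansion flag instead of ${!...}
#   typeset -A color
#   color=(apple red sky blue)
#   for k in "${(k)color[@]}"; do echo "$k is $color[$k]"; done
```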
If your answer to that problem is, "so just install Bash 4," or `ksh93`, or whatever, then why not "just" install Perl or Python or Ruby instead? That is unacceptable in many cases; defaults matter.
None of the Bourne family shell scripting languages support modules.
The closest you can come to a module system in a shell script is the `.` command — a.k.a. `source` in more modern Bourne shell variants — which fails on multiple levels relative to a proper module system, the most basic of which is namespacing.
Regardless of programming language, human understanding starts to flag when any single file in a larger overall program exceeds a few thousand lines. The very reason we structure large programs into many files is so that we can abstract their contents to a sentence or two at most. File A is the command line parser, file B is the network I/O pump, file C is the shim between library Z and the rest of the program, etc. When your only method for assembling many files into a single program is textual inclusion, you put a limit on how large your programs can reasonably grow.
For comparison, it would be as if the C programming language had no linker, only `#include` statements. Such a C-lite dialect would not need keywords such as `extern` or `static`. Those features exist to allow modularity.
POSIX doesn't define a way to scope variables to a single shell script function, much less to a file.
This effectively makes all variables global, which again hurts modularity and composability.
There are solutions to this in post-POSIX shells — certainly in `bash`, `ksh93`, and `zsh` at least — but that just brings you back to point 1 above.
You can see the effect of this in style guides on GNU Autoconf macro writing, where they recommend that you prefix variable names with the name of the macro itself, leading to very long variable names purely in order to reduce the chance of collision to acceptably near zero.
Even C is better on this score, by a mile. Not only are most C programs written primarily with function-local variables, C also supports block scoping, allowing multiple blocks within a single function to reuse variable names without cross-contamination.
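A small sketch of the scoping problem; the function and variable names are made up, and the `local` shown in the comment is a post-POSIX extension, not something the standard guarantees:

```sh
count_files() {
    # POSIX gives us no way to scope this variable to the function...
    n=$(ls | wc -l)
    echo "$n"
}

n=42
count_files >/dev/null
echo "$n"            # no longer 42: the function clobbered the caller's n

# bash, ksh93, and zsh offer function-local variables, e.g. in bash:
#   count_files() { local n; n=$(ls | wc -l); echo "$n"; }
```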
Shell programming languages have no standard library.
It is possible to argue that a shell scripting language's standard library is the contents of `PATH`, but that just says that to get anything of consequence done, a shell script has to call out to another whole program, probably one written in a more powerful language to begin with.
Neither is there a widely-used archive of shell utility libraries as with Perl's CPAN. Without a large available library of third-party utility code, a programmer must write more code by hand, so she is less productive.
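To put that concretely, here's a hedged sketch of what a "standard library call" amounts to in portable shell: even trivial string handling means forking a program from the `PATH`. The variable and its value are made up:

```sh
name="warren"

# Portable shell: uppercase a string by forking tr(1) from the PATH
upper=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')
echo "$upper"

# bash 4+ can do it in-process, but then you're back to the portability
# problem in point 1 above:
#   upper=${name^^}
```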
Even ignoring the fact that most shell scripts rely on external programs typically written in C to get anything useful done, there's the overhead of all those `pipe()`→`fork()`→`exec()` call chains. That pattern is fairly efficient on Unix, compared to IPC and process launching on other OSes, but here it's effectively replacing what you'd do with a subroutine call in another scripting language, which is far more efficient still. That puts a serious cap on the upper limit of shell script execution speed.
Shell scripts have little built-in ability to increase their performance via parallel execution.
Bourne shells have `&`, `wait`, and pipelines for this, but that's largely only useful for composing multiple programs, not for achieving CPU or I/O parallelism. You're not likely to be able to peg the cores or saturate a RAID array solely with shell scripting, and if you do, you could probably achieve much higher performance in other languages.
Pipelines in particular are a weak way to increase performance via parallel execution: a pipeline only lets two programs run in parallel, and one of the two will likely be blocked on I/O to or from the other at any given point in time.
There are latter-day ways around this, such as `xargs -P` and GNU `parallel`, but this just devolves to point 4 above.
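For what it's worth, here's a sketch of the `xargs -P` workaround; note that `-P` and `-0` are common GNU/BSD extensions rather than POSIX requirements, and the `gzip` workload is just a stand-in:

```sh
# Compress every .log file, running up to 4 gzip processes at once.
# The parallel workers are external programs; the shell itself is only glue.
printf '%s\0' *.log | xargs -0 -n 1 -P 4 gzip
```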
With effectively no built-in ability to take full advantage of multi-processor systems, shell scripts are always going to be slower than a well-written program in a language that can use all the processors in the system. To take that GNU Autoconf `configure` script example again, doubling the number of cores in the system will do little to improve the speed at which it runs.
Shell scripting languages don't have pointers or references.
This prevents you from doing a bunch of things easily done in other programming languages.
For one thing, the inability to refer indirectly to another data structure in the program's memory means you're limited to the built-in data structures. Your shell may have associative arrays, but how are they implemented? There are several possibilities, each with different tradeoffs: red-black trees, AVL trees, and hash tables are the most common, but there are others. If you need a different set of tradeoffs, you're stuck, because without references, you don't have a way to hand-roll many types of advanced data structures. You're stuck with what you were given.
Or, it may be the case that you need a data structure that doesn't even have an adequate alternative built into your shell script interpreter, such as a directed acyclic graph, which you might need in order to model a dependency graph. I've been programming for decades, and the only way I can think of to do that in a shell script would be to abuse the file system, using symlinks as faux references. That's the sort of solution you get when you rely merely on Turing-completeness, which tells you nothing about whether the solution is elegant, fast, or easy to understand.
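To make that symlink hack concrete, here's a purely illustrative sketch with made-up package names:

```sh
# Model "app depends on libfoo and libbar; libfoo depends on libbar"
# by abusing the file system as the data structure.
mkdir -p deps/app deps/libfoo deps/libbar
ln -s ../libfoo deps/app/libfoo
ln -s ../libbar deps/app/libbar
ln -s ../libbar deps/libfoo/libbar

# "Dereferencing" a node's outgoing edges is a directory listing:
ls deps/app        # -> libbar libfoo
```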
Advanced data structures are merely one use for pointers and references. There are piles of other applications for them, which simply can't be done easily in a Bourne family shell scripting language.
I could go on and on, but I think you're getting the point here. Simply put, there are many more powerful programming languages for Unix type systems.
> This is a huge advantage, that could compensate for the mediocrity of the language itself in some cases.
Sure, and that's precisely why GNU Autoconf uses a purposely-restricted subset of the Bourne family of shell script languages for its `configure` script outputs: so that its `configure` scripts will run pretty much everywhere.
You will probably not find a larger group of believers in the utility of writing in a highly-portable Bourne shell dialect than the developers of GNU Autoconf, yet their own creation is written primarily in Perl, plus some `m4`, and only a little bit of shell script; only Autoconf's output is a pure Bourne shell script. If that doesn't raise the question of how useful the "Bourne everywhere" concept is, I don't know what will.
> So, is there a limit to how complex such programs can get?
Technically speaking, no, as your Turing-completeness observation suggests.
But that is not the same thing as saying that arbitrarily-large shell scripts are pleasant to write, easy to debug, or fast to execute.
> Is it possible to write, say, a file compressor/decompressor in pure bash?
"Pure" Bash, without any calls out to things in the PATH
? The compressor is probably doable using echo
and hex escape sequences, but it would be fairly painful to do. The decompressor may be impossible to write that way due to the inability to handle binary data in shell. You'd end up calling out to od
and such to translate binary data to text format, shell's native way of handling data.
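Here's a small sketch of that asymmetry; the file name is made up, and the bytes happen to resemble a gzip header purely for illustration:

```sh
# Writing arbitrary bytes is doable with printf escape sequences:
printf '\037\213\010' > header.bin      # 0x1f 0x8b 0x08

# Reading them back is the problem: shell variables can't hold NUL bytes,
# so "pure" shell can't round-trip binary data. In practice you call out
# to od(1) to turn the bytes into text, shell's native data format:
od -An -tu1 header.bin                  # -> 31 139 8
```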
Once you start talking about using shell scripting the way it was intended, as glue to drive other programs in the `PATH`, the doors open up, because now you're limited only to what can be done in other programming languages, which is to say you don't have limits at all. A shell script that gets all of its power by calling out to other programs in the `PATH` doesn't run as fast as monolithic programs written in more powerful languages, but it does run.
And that's the point. If you need a program to run fast, or if it needs to be powerful in its own right rather than borrowing power from others, you don't write it in shell.
> A simple video game?
Here's Tetris in shell. Other such games are available, if you go looking.
> there are only very limited debugging tools
I would put debugging tool support down about 20th place on the list of features necessary to support programming in the large. A whole lot of programmers rely much more heavily on `printf()` debugging than proper debuggers, regardless of language.
In shell, you have `echo` and `set -x`, which together are sufficient to debug a great many problems.
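A minimal sketch of those two in action, with a made-up script fragment:

```sh
#!/bin/sh
set -x                           # trace each command as it runs, post-expansion

user=${1:-root}                  # "root" is just a default for this sketch
echo "DEBUG: user='$user'" >&2   # plain old print-style debugging
grep "^$user:" /etc/passwd

set +x                           # turn tracing back off
```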