49

The purpose of this question is to satisfy a curiosity, not to solve a particular computing problem. The question is: Why are POSIX mandatory utilities not commonly built into shell implementations?

For example, I have a script that basically reads a few small text files and checks that they are properly formatted, but it takes 27 seconds to run, on my machine, due to a significant amount of string manipulation. This string manipulation makes thousands of new processes by calling various utilities, hence the slowness. I am pretty confident that if some of the utilities were built in, namely grep, sed, cut, tr, and expr, then the script would run in a second or less (based on my experience in C).

It seems there would be a lot of situations where building these utilities in would make the difference between whether or not a solution in shell script has acceptable performance.

Obviously, there is a reason it was chosen not to make these utilities built in. Maybe having one version of a utility at the system level avoids having multiple unequal versions of that utility being used by various shells. I really can't think of many other reasons to justify the overhead of creating so many new processes, and POSIX defines enough about the utilities that having different implementations does not seem like much of a problem, so long as they are each POSIX compliant. At least not as big a problem as the inefficiency of having so many processes.

Kyle
    If 27 seconds is too slow you could use Python, Perl or some other semi-compiled language. Alternatively post the slow parts of your script and ask for improvements. It might be that you're using three or four commands where one (faster one) might do. – Chris Davies Feb 23 '17 at 21:48
  • 9
    Shells weren't really made for heavy-duty tasks, unfortunately and the world has changed a lot since the times when you could get away with just a shell script. I agree with roaima - every reasonable sysadmin should go for Python or Perl and not expect the shell to handle everything – Sergiy Kolodyazhnyy Feb 23 '17 at 22:08
  • As for your question why they aren't built in, well, the only reason I can think of is compatibility with older systems, but that's just speculation – Sergiy Kolodyazhnyy Feb 23 '17 at 22:09
  • 3
    Some string manipulation is built in to bash. –  Feb 23 '17 at 22:18
  • 17
    The primary purpose of the shell is to run other programs, not manipulate data directly. Over the years, some external programs or features provided by them (globbing, arithmetic, printf, etc) have been incorporated into shells when they were deemed useful enough. – chepner Feb 23 '17 at 22:22
  • 8
    If you post your script to codereview.stackexchange.com, I'm sure the reviewers could make some suggestions to speed your script up drastically (or at least point out why it should be written in Python/etc instead of shell). – chepner Feb 23 '17 at 22:33
  • 2
    This is indirectly addressed here: What is the difference between a builtin command and one that is not?. Also relevant: Are BusyBox commands truly built in? But for your specific case, the most relevant is Why is using a shell loop to process text considered bad practice? I'd love to see your script, whether here or on http://codereview.stackexchange.com. I'm sure it can be easily made performant. – Wildcard Feb 24 '17 at 00:28
  • 5
    @Kyle: awk is a mandatory utility in POSIX, and especially well suited (that is, very fast) to implement scripts that you might otherwise implement using sed, cut, tr, grep, and expr in a shell script. – Nominal Animal Feb 24 '17 at 01:09
  • 3
    What about OTHER shells? Some of my friends use zsh instead of bash. So you'd expect zsh developers to duplicate ALL of unix as well? I personally sometimes use tclsh (yes, it's a programming language but is also a shell just like bash which is also a programming language but is also a shell). What about people who use C? Do you expect the C compiler itself to implement all of UNIX? (OK C is not a shell but it is a programming language that can exec all of the standard unix commands). Do you expect all programming languages to do this? Do you expect all shells to do this? – slebetman Feb 24 '17 at 07:58
  • 1
    @slebetman: Or (t)csh? You'd be replacing one standard, well-tested (one hopes!) implementation of say awk with multiple implementations. I'd also suggest that it's more than possible that the time taken for the OP's problem might be due to inefficient programming, rather than any defect in calling. Certainly I've seen the same task accomplished in sub-seconds rather than minutes, simply by using resources efficiently. – jamesqf Feb 24 '17 at 18:31
  • @jamesqf In fact, there are times when I know I don't want the built-in implementation of a command to be used, because of some subtle difference between the shell and standalone behavior. So I have to explicitly specify the path to the standalone command to be sure the shell doesn't screw me up. – Monty Harder Feb 24 '17 at 19:48
  • 1
    @Kyle Overcoming this limitation of shells is one of the design goals of Perl. It will seem familiar if you're already used to awk, sed, sh and friends. – jpaugh Feb 24 '17 at 20:00
  • I think all this comes down to modularity, as MTilsted more-or-less stated. Your choice of engine, whether shell script or compiled or semi-compiled program, depends on whether you require speed or modularity. – can-ned_food Feb 25 '17 at 07:07
  • I'm going to toss in PHP into the mix of suggested languages for command-line scripting. The PHP CLI is absolutely underutilized for shell scripts. If you are operating in a web server environment, then PHP is likely already on the host and, if not, it's available via your friendly package manager. PHP's built-in opcode cache puts most scripts on-par with natively compiled application code in terms of performance. Also, PHP code is generally easy to read, comprehend, and maintain. I can't say the same for some of the suggested languages. – CubicleSoft Feb 25 '17 at 15:03
  • Shells and shell commands are excellent for I/O bound computations. If you have CPU bound computations you should look into your choice of commands invoked. – Thorbjørn Ravn Andersen Feb 26 '17 at 22:24
  • 1
    Also in the old days memory was so precious that you only put the bare necessities in the shell. Many Unix design decisions make much better sense when you consider that computers were much slower and smaller back then. – Thorbjørn Ravn Andersen Feb 26 '17 at 22:47

8 Answers

69

Why are POSIX mandatory utilities not built into shell?

Because to be POSIX compliant, a system is required¹ to provide most utilities as standalone commands.

Having them builtin would imply they have to exist in two different locations, inside the shell and outside it. Of course, it would be possible to implement the external version by using a shell script wrapper around the builtin, but that would disadvantage non-shell applications calling the utilities.
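To illustrate the wrapper idea, here is a minimal sketch of what a standalone test could look like as a script around the builtin (purely hypothetical; real systems ship compiled binaries instead):

#!/bin/sh
# hand all arguments to the shell's builtin "test";
# the script exits with the builtin's exit status
test "$@"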

Note that BusyBox took the path you suggested by implementing many commands internally, and providing the standalone variants as links to itself. One issue is that while the command set can be quite large, the implementations are often a subset of the standard, so they aren't fully compliant.
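For instance, on a typical BusyBox-based system (paths are illustrative):

ls -l /bin/grep        # usually a symlink or hard link to /bin/busybox
busybox grep foo file  # the same applet can also be invoked through the main binary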

Note also that at least ksh93, bash and zsh go further by providing custom methods for the running shell to dynamically load builtins from shared libraries. Technically, nothing then prevents all POSIX utilities from being implemented and made available as builtins.
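For example, bash can load such a builtin at runtime with enable -f. The library path below is where Debian/Ubuntu's bash-builtins package installs the example loadables; that path is an assumption, so adjust it for your system:

enable -f /usr/lib/bash/head head   # load the loadable "head" builtin
type head                           # now reports that head is a shell builtin
enable -n head                      # disable it again so the external head is used once more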

Finally, spawning new processes has become quite a fast operation on modern OSes. If you are really hit by a performance issue, there are probably improvements that would make your scripts run faster.
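If you want a rough feel for what the remaining fork/exec cost amounts to, a comparison along these lines can help (a sketch; the /bin/echo path and loop count are arbitrary):

time bash -c 'for i in $(seq 1 1000); do /bin/echo hi; done' >/dev/null   # fork+exec every iteration
time bash -c 'for i in $(seq 1 1000); do echo hi; done' >/dev/null        # builtin echo, no new processes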

¹ POSIX.1-2008

However, all of the standard utilities, including the regular built-ins in the table, but not the special built-ins described in Special Built-In Utilities, shall be implemented in a manner so that they can be accessed via the exec family of functions as defined in the System Interfaces volume of POSIX.1-2008 and can be invoked directly by those standard utilities that require it (env, find, nice, nohup, time, xargs).

jlliagre
  • 4
    This is the right answer, but I would just add that, since the interface of these utilities is generally via stdin/stdout anyway, even if every one of them were also implemented as a built-in routine in bash, the shell would effectively still need to fork itself and create pipes for each command in a pipeline, so there would be only marginal gains. – Chunko Feb 24 '17 at 01:32
  • 2
    @Chunko Yes. subshells are lighter than fork/exec'ed processes though. – jlliagre Feb 24 '17 at 02:47
  • @jlliagre: Depends on the OS. On Linux there's very, very little difference between threads and processes. Indeed, unlike other Unixes, Linux doesn't have separate implementations of processes and threads. On Linux you can start with a process and close enough resources (stdio etc.) to become a thread, and you can also start with a thread and open enough resources to become a process. They're the same thing, only with different default start states. – slebetman Feb 24 '17 at 07:53
  • 3
    @slebetman You are missing my point. Subshells are neither threads nor exec'ed processes, regardless of whether they are running on Linux or not. Subshells are just their parent's clone, created by a fork not followed by exec; fork is nowadays a very lightweight operation compared to exec. – jlliagre Feb 24 '17 at 08:30
  • 3
    I measured busybox nofork builtins as having on the order of 10x less overhead than noexec builtins, which in turn had ~5x less overhead than fork+exec of a separate binary. Definitions as per http://unix.stackexchange.com/a/274322/29483 It's interesting that busybox doesn't nofork everything, although I know some busybox code is shortened by not cleaning up memory, and just relies on being a short-lived process. – sourcejedi Feb 24 '17 at 09:57
  • @Chunko note my test compared echo >/dev/null with cat </dev/null. Busybox was able to nofork the echo command, despite the redirection. (It backed up stdout to another FD, then restored it afterwards). – sourcejedi Feb 24 '17 at 10:08
  • 1
    @jlliagre: On Linux a fork creates a process. The point you're perhaps missing is that on Linux they've optimised processes so much that the developers have determined that there is no further advantage in creating anything more lightweight. Basically, in Linux a process is as lightweight as a thread. – slebetman Feb 24 '17 at 12:26
  • 1
    @slebetman Why are you again introducing threads into this discussion when they are irrelevant here? As far as I know, no mainstream shell exposes or uses multi-threading. Not only on Linux but on whatever OS implements it, a successful fork creates a process. This process is running the very same command as its parent, which is often useless outside the specific case of subshells. That's the reason why most child processes quickly call exec after their birth. – jlliagre Feb 24 '17 at 13:04
  • @jlliagre: OK. Then I don't understand why you'd say subshells are more lightweight than processes when a subshell spawns a process. – slebetman Feb 24 '17 at 19:45
  • 1
    @slebetman In modern OSes, fork without exec can take advantage of Copy-on-Write, which can be much faster. – jpaugh Feb 24 '17 at 19:56
  • 1
    @slebetman Creating a subshell doesn't really spawn a new process. It just clones the existing one without using exec. This is much lighter due to the copy on write fork used by Linux and Unix implementations, as jpaugh already commented. – jlliagre Feb 24 '17 at 21:50
  • 1
    In poking around HP-UX's /sbin; a lot of "standalone" utilities were shell scripts that just invoked the builtin. – Joshua Feb 26 '17 at 01:50
  • 1
    @Joshua, do you have examples? I'm not expecting POSIX utilities to be in /sbin. – jlliagre Feb 26 '17 at 12:52
  • @jlliagre: HP-UX has /bin -> /usr/bin so all POSIX utilities needed for early boot are in /sbin. I'm pretty sure that everything POSIX says must live in /bin is duplicated there. Particular examples: echo, false, true, test, ulimit, :, [. There were quite a few more I can't remember offhand that were rather extensive shell scripts too. – Joshua Feb 26 '17 at 15:15
  • @Joshua I guess they are needed for single-user mode and/or maybe /sbin/sh is statically linked so is more resilient to disk issues. I have no HP-UX system to check. – jlliagre Feb 26 '17 at 17:43
  • @jlliagre: Indeed. /sbin/sh is not statically linked for disk issues. /sbin/sh is statically linked because /lib -> /usr/lib and the need to keep / small. (The stuff in /sbin uses only a small subset of libc ...). Also, HP-UX had no concept of rescue boot. – Joshua Feb 26 '17 at 21:03
  • @Joshua You misunderstood why I talk about disk issue resiliency. Standalone binaries can run directly without any prerequisite, like an extra file to load or dynamic loader configuration settings, so they are inherently more resilient to disk issues. Statically linked binaries do not help keep a file system small. On the contrary, statically linked binaries are bigger because they need to embed the libraries they use. Not sure what you mean by "no rescue boot". You can rescue the system from single-user mode, can't you? – jlliagre Feb 26 '17 at 21:51
  • 1
    @jlliagre: It takes several hundred statically linked binaries to weigh the same as the standard library. The fact that they're duplicates of part of the shared library is irrelevant because the shared library is in /usr not /. Rescue boot = boot from removable media to repair the existing system rather than copy a new system over it. – Joshua Feb 27 '17 at 01:34
  • 1
    @Joshua Your first statement doesn't make sense, so either you are confusing terms or the /sbin binaries are statically linked with a stripped-down libc. Quoting the HP-UX admin guide: /sbin: Contains statically linked versions of critical programs needed at boot time or when important shared libraries have become corrupted. Rescue boot doesn't generally require external media on HP-UX and so is done using single-user mode, but when that fails, CD/DVD, tape and network boot are obviously possible with HP-UX like any other Unix, though it might require an extra piece of software (Ignite-UX?). – jlliagre Feb 27 '17 at 08:50
  • @jlliagre: ld garbage collects when linking. – Joshua Feb 27 '17 at 16:15
  • 1
    @Joshua Got it, stripped down at link time. Thanks! Being smaller also makes them withstand disk corruption even better. – jlliagre Mar 03 '17 at 21:13
11

Shell scripts are not expected to run with that type of speed. If you want to improve the speed of your script, try it in Perl. If that is still too slow, then you'll have to move to a statically typed language such as Java or C, or write a C module for Perl that runs the parts which are too slow.

Shell is the first level of prototyping: if you can prove the concept with shell, then move to a better scripting language that can do the kind of bounds checking that would take acres of shell code.

A Unix OS is expected to include many small programs that do well-defined tasks which make up a larger picture. This is a good thing as it compartmentalises bigger programs. Take a look at qmail, for example, and compare it with sendmail. qmail is made of many programs:

http://www.nrg4u.com/qmail/the-big-qmail-picture-103-p1.gif

Exploiting the network daemon would not help you exploit the queue manager.

Ed Neville
  • The OP specifically did NOT ask for suggestions on improving the speed of the code. The question was why certain utilities are not built-ins like cd or pwd. – Stephen C Feb 23 '17 at 22:07
  • 4
    True. The answer was to express the difference between monolithic and compartmentalised and show a reason in this favour. – Ed Neville Feb 23 '17 at 22:13
  • Related: https://askubuntu.com/a/291926/11751 – user Feb 24 '17 at 19:25
  • 1
    @StephenC cd is a builtin – and it actually has to be, because changing the working directory in a subprocess doesn't affect parent processes. – Jonas Jan 30 '19 at 07:59
9

From the BASH reference manual,

Builtin commands are necessary to implement functionality impossible or inconvenient to obtain with separate utilities.

As I'm sure you've heard, the UNIX philosophy relies heavily on multiple applications that each have limited functionality. Each built-in is built in for a very good reason; everything else is not. I think a more interesting class of questions is along the lines of, "why exactly is pwd built-in?"
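A quick sketch of why a command like cd can only work as a builtin (pwd is subtler, since a standalone /bin/pwd does exist): an external command runs in its own process, and a process can only change its own working directory.

sh -c 'cd /tmp'   # the directory change happens in the child shell...
pwd               # ...and vanishes with it; the parent's directory is unchanged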

Stephen C
8

The guys at AT&T asked themselves the same thing

If you look at the history of the AT&T Software Toolkit (currently lying dormant on github since the core team left), this is exactly what they did with the AT&T Korn shell, a.k.a. ksh93.

Performance was always part of the motivation for the ksh93 maintainers, and when building ksh you can choose to build many common POSIX utilities as dynamically loaded libraries. By binding these commands to a directory name like /opt/ast/bin, you could control which version of the command would be used, based on the position of that directory name in $PATH.

Examples:

cat chmod chown cksum cmp cp cut date expr fmt head join ln
mkdir mkfifo mktemp mv nl od paste rm tail tr uniq uuencode wc

The full list can be found in the github ast repository.

Note that most of the ast tools have their own provenance and would differ strongly from the more common GNU implementations. The AT&T Research team abided by official standards, which was the way to achieve interoperability when you could not share code.
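A sketch of how this looks in an AST-enabled ksh93 session (the exact output wording and the /opt/ast/bin prefix depend on how ksh was built):

builtin                    # list the builtins currently available
PATH=/opt/ast/bin:$PATH    # put the AST directory ahead of /usr/bin
whence -v cat              # reports something like: cat is a shell builtin version of /opt/ast/bin/cat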

6

So we didn't marshal resources into optimizing the original tool, to meet every specific desire. I guess what we need to explain is how much this specific desire would have cost to implement.

POSIX defines enough about the utilities that it does not seem like much of a problem to have different implementations.

This is a bad assumption :-P.

Post-POSIX systems continue to become more powerful and convenient for good reasons; as an after-the-fact standard, POSIX never actually catches up.

Ubuntu started an effort to switch to a stripped-down POSIX shell for scripts, to optimize the old System V init boot process. I'm not saying it failed, but it did trigger many bugs that had to be cleaned up: "bashisms", scripts which ran under /bin/sh while assuming that bash features were available.

POSIX sh is not a good general-purpose programming language. Its primary purpose is to work well as an interactive shell. As soon as you start to save your commands to a script, be aware that you approach a Turing tarpit. E.g. it's not possible to detect failures in the middle of a normal pipeline. bash added set -o pipefail for this, but this is not in POSIX.
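A quick illustration (bash-specific, since pipefail is an extension):

false | cat; echo $?     # prints 0: the failure of false is hidden by cat's success
set -o pipefail
false | cat; echo $?     # prints 1: the pipeline now reports the failure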

Similar useful but unstandardized features are provided by almost every utility more complex than true.

For the class of task you outline, you can draw a rough line through Awk, Perl, and nowadays Python. Different tools were created, and evolved independently. Would you expect e.g. GNU Awk to be subsumed into a libutilposixextended?

I'm not saying we now have one universally better approach I can point you to. I have a soft spot for Python. Awk is surprisingly powerful, although I've been frustrated by some features being specific to GNU Awk. But the point is that processing large numbers of strings individually (presumably from lines of the files) was not a design goal of the POSIX shell.
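To make the contrast concrete, here is a sketch of the same made-up task (hypothetical file and fields) done the slow per-line way and then as a single process:

# one subshell plus one cut process per input line: thousands of fork+exec calls
while IFS= read -r line; do
    printf '%s\n' "$line" | cut -d, -f2
done < input_file.txt

# the same work done by a single awk process
awk -F, '{ print $2 }' input_file.txt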

sourcejedi
  • I wonder if there would be any difficulty with a shell which would assume that any command executed from a configurable list of locations would be treated as a built-in in cases where the shell understood everything about the command? If a script performs cat -@fnord foo the shell should decide that since it doesn't know what -@ means it would need to invoke the actual command, but given just cat <foo >bar the shell shouldn't need to spawn another process. – supercat Feb 26 '17 at 17:48
  • 1
    @supercat complexity. – sourcejedi Feb 26 '17 at 18:09
2

There is also the question of: Which shell would you build it into?

Most Unix/Linux systems have multiple different shells which are developed independently (sh/bash/korn/...). If you build the tools into the shell, you would end up with a different implementation of these tools for each shell. This would cause overhead, and you might end up with different features/bugs in, for example, grep, depending on which shell you used to invoke it.

MTilsted
  • zsh is pretty popular in some circles these days. csh/tcsh has historically had a large following, but I don't think you see much of it today. And there's a whole bundle of lesser-known shells... – user Feb 24 '17 at 19:28
  • Modularity. With builtins, you'd need to recompile or re-install the shell each time a change was made to one of those builtins. – can-ned_food Feb 25 '17 at 07:02
1

Many have answered well. I intend only to complement those answers. I think the UNIX philosophy is that a tool should do one thing and do it well. If one tries to make an all-encompassing tool, that leaves a lot more places for failure. Limiting functionality in this way makes for a tool set that's reliable.

Also, consider, if functionality like sed or grep were built into the shell, would it be as easy to invoke from the command line when you'd like it?

In closing, consider that some of the functionality you want in Bash is already in Bash. For example, regular-expression matching is available through the =~ binary operator (see Shell Grammar in the manual page, specifically the discussion of the [[ ]] construct used with if). As a very quick example, say I'm searching a file for 0x followed by two hex digits:

while IFS= read -r line; do
    if [[ $line =~ 0x[[:xdigit:]]{2} ]]; then
        : # do something important with it, e.g. with "${BASH_REMATCH[0]}"
    fi
done < input_file.txt

As for sed-like functionality, look under Parameter Expansion in the Expansion heading of the same man page. You'll see a wealth of things you can do that are reminiscent of sed. I most often use sed to make some substitution type change to text. Building off of the above:

# this does not save the substituted text anywhere;
# it only shows how to perform the substitution
while IFS= read -r line; do
    printf '%s\n' "${line/pattern/substitution}"
done < input_file.txt

In the end, though, is the above "better" than the following?

grep -E "0x[[:xdigit:]]{2}" input_file.txt
sed -e 's/pattern/substitution/' input_file.txt
  • An argument against the last question can be found under https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice – phk Mar 18 '17 at 19:02
1

This is, I guess, a historical accident.

When UNIX was created in the late 1960s and early 1970s, computers did not have nearly as much memory as they do today. It would have been possible, at the time, to implement all this functionality as shell builtins, but due to memory limitations they would have had to limit the amount of functionality they could implement, or risk out-of-memory and/or swap-thrashing problems.

On the other hand, by implementing the given functionality as separate programs, and by making the two system calls required for starting a new process (fork and exec) as light as possible, they could create a scripting environment that does not have those problems and that still runs at reasonable speed.

Of course, once those things are implemented as separate processes, people will start them from programs that are not shells, and then they have to remain like that, or suddenly all this software starts breaking.

That's not to say you can't implement some functionality twice, however, and indeed some shells implement as a builtin some functionality that's supposed to be an external program; e.g., bash implements the echo command as a builtin, but there's also a /usr/bin/echo.
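You can see both versions side by side (the external path may be /bin/echo on some systems):

type -a echo              # lists the builtin first, then the external binary
builtin echo builtin      # explicitly run the bash builtin
/usr/bin/echo external    # explicitly run the standalone program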