What is word splitting? Why is it important in shell programming?

Question

I'm getting confused about the role word splitting plays in zsh. I have not come across this concept when programming in C, Python or MATLAB, which has made me curious about why word splitting seems to be specific to shell programming.

I have read about word splitting on this and other sites before, but haven't found a clear explanation of the concept. Wikipedia has a definition of word splitting but does not seem to have references on how it applies to Unix shells.

Here's an example of my confusion in zsh:

In the Z Shell FAQ, I read the following:

3.1: Why does $var where var="foo bar" not do what I expect?

In most Bourne-shell derivatives, multiple-word variables such as var="foo bar" are split into words when passed to a command or used in a for foo in $var loop. By default, zsh does not have that behaviour: the variable remains intact. (This is not a bug! See below.) The option SH_WORD_SPLIT exists to provide compatibility.

However, in the Z Shell Manual, I read the following:

SH_WORD_SPLIT (-y) <K> <S>

Causes field splitting to be performed on unquoted parameter expansions. Note that this option has nothing to do with word splitting. (See Parameter Expansion.)

Why does it say that SH_WORD_SPLIT has nothing to do with word splitting? Isn't word splitting precisely what this is all about?

Gilles 'SO- stop being evil' · Accepted Answer · 2021-12-29T18:49:25.220

Early shells had only a single data type: strings. But it is common to manipulate lists of strings, typically when passing multiple file names as arguments to a program. Another common use case for splitting is when a command outputs a list of results: the command's output is a string, but the desired data is a list of strings. To store a list of file names in a variable, you would put spaces between them. Then a shell script like this

files="foo bar qux"
myprogram $files

called myprogram with three arguments, as the shell split the string $files into words. At the time, spaces in file names were either forbidden or widely considered Not Done.
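To see the splitting directly, here is a minimal sketch for a Bourne-style shell such as sh or bash (not zsh in its default mode). count_args is a hypothetical helper, not part of the answer above, that just prints how many arguments it was given.

```shell
# count_args is a hypothetical helper: it prints the number of arguments it received.
count_args() { echo "$#"; }

files="foo bar qux"

unquoted=$(count_args $files)    # unquoted expansion: split on IFS into three words
quoted=$(count_args "$files")    # quoted expansion: passed as one single word

echo "unquoted: $unquoted, quoted: $quoted"
```

With the default IFS, the unquoted call sees three arguments and the quoted call sees one.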

The Korn shell introduced arrays: you could store a list of strings in a variable. The Korn shell remained compatible with the then-established Bourne shell, so bare variable expansions kept undergoing word splitting, and using arrays required some syntactic overhead. You would write the snippet above as

files=(foo bar qux)
myprogram "${files[@]}"
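As a rough illustration of why the quoted array form matters, the sketch below assumes a shell with Korn-style arrays (bash or ksh93; plain POSIX sh has none). count_args and the file names are hypothetical, chosen only to show that elements containing spaces survive.

```shell
# count_args is a hypothetical helper: it prints the number of arguments it received.
count_args() { echo "$#"; }

# Two elements, one of which contains a space.
files=("my report.txt" notes.txt)

as_array=$(count_args "${files[@]}")   # each element stays one argument: 2
unquoted=$(count_args ${files[@]})     # elements are split again on IFS: 3

echo "quoted array: $as_array, unquoted array: $unquoted"
```

The quoted form hands each element to the command unchanged, which is the point of the extra syntax.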

Zsh had arrays from the start, and its author opted for a saner language design at the expense of backward compatibility. In zsh (under the default expansion rules) $var does not perform word splitting; if you want to store a list of words in a variable, you are meant to use an array; and if you really want word splitting, you can write $=var.

files=(foo bar qux)
myprogram $files

These days, spaces in file names are something you need to cope with, both because many users expect them to work and because many scripts are executed in security-sensitive contexts where an attacker may be in control of file names. So automatic word splitting is often a nuisance; hence my general advice to always use double quotes, i.e. write "$foo", unless you understand why you need word splitting in a particular use case. (Note that bare variable expansions undergo globbing as well.)
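The two hazards mentioned here, splitting and globbing, can be sketched as follows for a Bourne-style shell such as sh or bash. Everything in it is illustrative: count_args is a hypothetical helper, the file names are made up, and it assumes mktemp -d is available to create a scratch directory.

```shell
# count_args is a hypothetical helper: it prints the number of arguments it received.
count_args() { echo "$#"; }

# Use a scratch directory so the glob result does not depend on where this runs.
start_dir=$PWD
scratch=$(mktemp -d) || exit 1
cd "$scratch" || exit 1
touch "my report.txt" notes.txt

name="my report.txt"
pattern="*"

bare_name=$(count_args $name)            # split at the space: 2 arguments
quoted_name=$(count_args "$name")        # kept intact: 1 argument
bare_pattern=$(count_args $pattern)      # expanded to the 2 file names in the directory
quoted_pattern=$(count_args "$pattern")  # the literal string "*": 1 argument

echo "name: $bare_name vs $quoted_name, pattern: $bare_pattern vs $quoted_pattern"

# Clean up the scratch directory.
cd "$start_dir" && rm -rf "$scratch"
```

In both cases the double-quoted expansion is passed through as exactly one argument, whatever the value contains.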

In my answer, I used the term “word splitting”. This is also called “field splitting”, because what constitutes a word (also called field) can be configured by setting the IFS variable: any character in IFS is considered a word separator, and a word is a sequence of characters that are not word separators. By default, IFS contains basic whitespace characters (ASCII space, tab and newline — not carriage return, unbreakable space, etc.). The zsh manual uses “word splitting” only to refer to a step in parsing shell code, which has nothing to do with the field/word splitting that is part of the expansion that happens after variable and command substitutions.
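A small sketch of how IFS decides what counts as a field, again for a Bourne-style shell such as sh or bash. count_args and the colon-separated value are hypothetical; IFS is changed only inside the command substitution, so the rest of the script keeps the default.

```shell
# count_args is a hypothetical helper: it prints the number of arguments it received.
count_args() { echo "$#"; }

dirs="/usr/bin:/bin:/usr/local/bin"

# Default IFS (space, tab, newline): the value has no separators, so one field.
default_fields=$(count_args $dirs)

# IFS set to a colon inside the subshell: the same value now splits into three fields.
colon_fields=$(IFS=:; count_args $dirs)

echo "default IFS: $default_fields, IFS=':' gives: $colon_fields"
```

The string is the same in both calls; only the separator set differs, which is why "field splitting" describes the step more accurately than "word splitting".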

Thanks Gilles, this is really helpful! Is it correct to say that roughly speaking word splitting converts strings of the form "word1 word2 word3" into lists/arrays of the form "word1" "word2" "word3"? I have also updated the OP with a specific source of confusion in zsh. — Amelio Vazquez-Reina, Dec 13 '11 at 01:56
@intrpc "Word splitting" is not splitting on natural language words but on $IFS characters. Hence "field splitting" is a better name. But "word splitting" is often used for this concept in shell literature. The zsh documentation is quibbling on words. — Gilles 'SO- stop being evil', Dec 14 '11 at 10:11
See also rc (the plan9 shell, also ported to Unix) for an even better design than zsh when it comes to variables and arrays. — Stéphane Chazelas, Feb 08 '13 at 23:17
This answer addresses the question as originally asked, but not in its current form. It could use updating. — pyrocrasty, Dec 29 '21 at 16:01

score 6 · Answer 2 · answered Mar 11 '18 at 09:10

In this specific case of Zsh, word splitting is defined slightly differently than field splitting.

Consider prog a b c: it passes three arguments no matter how you set IFS. This is word splitting.

If you do A="a b c"; prog $A (in a Bourne-style shell, or in zsh with SH_WORD_SPLIT set), it passes three arguments if IFS includes the space character, or one argument otherwise. This is field splitting.

The definitions here are subtle. What the Zsh documentation is trying to say is that, even if you disable that option, prog a b c will still get separate arguments (which is what people always expect).
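The distinction can be checked in a Bourne-style shell such as sh or bash by emptying IFS, which turns off field splitting but leaves command-line parsing untouched. This is only a sketch; count_args is a hypothetical stand-in for prog that prints its argument count.

```shell
# count_args is a hypothetical stand-in for prog: it prints its argument count.
count_args() { echo "$#"; }

A="a b c"

# Words typed literally on the command line are separated by the parser, not by IFS.
literal_no_ifs=$(IFS=; count_args a b c)   # still 3 arguments

# The result of an unquoted expansion is separated using IFS.
expanded_no_ifs=$(IFS=; count_args $A)     # empty IFS: no splitting, 1 argument
expanded_default=$(count_args $A)          # default IFS: 3 arguments

echo "literal: $literal_no_ifs, expanded (empty IFS): $expanded_no_ifs, expanded (default IFS): $expanded_default"
```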

Bart Schaefer, a long-time zsh developer, confirms it is indeed the intended meaning of that text. — Stéphane Chazelas, Mar 11 '18 at 20:57

score 3 · Answer 3 · answered Dec 12 '11 at 23:33

Word splitting is not really shell specific.

Most programs that need to parse text input use some form of word splitting as a first step. This happens before they identify, from these "words", the numbers, operators, strings, tokens and similar entities they need to process.

What is specific to shells is that they have to properly build the argument list of the commands they call (C argc/argv, Python sys.argv), including passing arguments with embedded spaces, empty arguments, custom delimiters and so on. Many shells use the IFS variable to allow some flexibility there.
