267

I consistently see answers quoting this link stating definitively "Don't parse ls!" This bothers me for a couple of reasons:

  1. It seems the information in that link has been accepted wholesale with little question, though I can pick out at least a few errors in casual reading.

  2. It also seems as if the problems stated in that link have sparked no desire to find a solution.

From the first paragraph:

...when you ask [ls] for a list of files, there's a huge problem: Unix allows almost any character in a filename, including whitespace, newlines, commas, pipe symbols, and pretty much anything else you'd ever try to use as a delimiter except NUL. ... ls separates filenames with newlines. This is fine until you have a file with a newline in its name. And since I don't know of any implementation of ls that allows you to terminate filenames with NUL characters instead of newlines, this leaves us unable to get a list of filenames safely with ls.

Bummer, right? However can we handle a newline-delimited list of data that might itself contain newlines? Well, if the people answering questions on this website didn't do this kind of thing on a daily basis, I might think we were in some trouble.

The truth is, though, most ls implementations actually provide a very simple API for parsing their output, and we've all been doing it all along without even realizing it. Not only can you end a filename with NUL, you can begin one with NUL as well, or with any other arbitrary string you might desire. What's more, you can assign these arbitrary strings per file type. Please consider:

LS_COLORS='lc=\0:rc=:ec=\0\0\0:fi=:di=:' ls -l --color=always | cat -A
total 4$
drwxr-xr-x 1 mikeserv mikeserv 0 Jul 10 01:05 ^@^@^@^@dir^@^@^@/$
-rw-r--r-- 1 mikeserv mikeserv 4 Jul 10 02:18 ^@file1^@^@^@$
-rw-r--r-- 1 mikeserv mikeserv 0 Jul 10 01:08 ^@file2^@^@^@$
-rw-r--r-- 1 mikeserv mikeserv 0 Jul 10 02:27 ^@new$
line$
file^@^@^@$
^@

See this for more.

Now it's the next part of this article that really gets me though:

$ ls -l
total 8
-rw-r-----  1 lhunath  lhunath  19 Mar 27 10:47 a
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a?newline
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a space

The problem is that from the output of ls, neither you nor the computer can tell what parts of it constitute a filename. Is it each word? No. Is it each line? No. There is no correct answer to this question other than: you can't tell.

Also notice how ls sometimes garbles your filename data (in our case, it turned the \n character in between the words "a" and "newline" into a ?question mark...

...

If you just want to iterate over all the files in the current directory, use a for loop and a glob:

for f in *; do
    [[ -e $f ]] || continue
    ...
done

The author calls it garbling filenames when ls returns a list of filenames containing shell globs, and then recommends using a shell glob to retrieve a file list!

Consider the following:

printf 'touch ./"%b"\n' "file\nname" "f i l e n a m e" |
    . /dev/stdin
ls -1q

f i l e n a m e  
file?name

IFS="
" ; printf "'%s'\n" $(ls -1q)

'f i l e n a m e'
'file
name'

POSIX defines the -1 and -q ls options like so:

-q - Force each instance of non-printable filename characters and <tab>s to be written as the question-mark ( '?' ) character. Implementations may provide this option by default if the output is to a terminal device.

-1 - (The numeric digit one.) Force output to be one entry per line.

Globbing is not without its own problems - the ? matches any character, so multiple ? patterns in a list can match the same file more than once. That's easily handled, though.
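Here is a minimal sketch of both the problem and the fix (the filenames are my own examples): two files that ls -1q renders identically produce duplicate ? patterns, and each pattern matches every such file, so naive expansion yields duplicates - collapsing the patterns with sort -u first gives each file exactly once.

```shell
cd "$(mktemp -d)" || exit
touch "a$(printf '\t')b" 'a
b'                           # two files that ls -1q both prints as a?b
set -- $(ls -1q)             # two a?b patterns, each matching both files
echo "$#"                    # 4 - every file turns up twice
set -- $(ls -1q | sort -u)   # collapse duplicate patterns first
echo "$#"                    # 2 - each file exactly once
```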

Though how to do this thing is not the point - it doesn't take much to do, after all, and is demonstrated below - I was interested in why not. As I consider it, the best answer to that question has been accepted. I would suggest you try to focus more often on telling people what they can do than on what they can't. You're a lot less likely, I think, to be proven wrong.

But why even try? Admittedly, my primary motivation was that others kept telling me I couldn't. I know very well that ls output is as regular and predictable as you could wish, so long as you know what to look for. Misinformation bothers me more than most things do.

The truth is, though, with the notable exception of both Patrick's and Wumpus Q. Wumbley's answers (despite the latter's awesome handle), I regard most of the information in the answers here as mostly correct - a shell glob is both simpler to use and generally more effective for searching the current directory than parsing ls is. That is not, however, at least in my regard, reason enough to justify propagating the misinformation quoted in the article above, nor is it acceptable justification to "never parse ls."

Please note that the inconsistent results in Patrick's answer are mostly a result of him using zsh and then bash. zsh - by default - does not word-split $(command substituted) results in a portable manner. So when he asks where did the rest of the files go? the answer to that question is your shell ate them. This is why you need to set the SH_WORD_SPLIT option when using zsh and dealing with portable shell code. I regard his failure to note this in his answer as awfully misleading.
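A minimal illustration of the difference (my own example, not Patrick's): in a POSIX shell an unquoted expansion is field-split on $IFS, while zsh by default leaves it as a single word unless SH_WORD_SPLIT is set.

```shell
list=$(printf '%s\n' one two three)
set -- $list     # unquoted: POSIX shells field-split the expansion on $IFS
echo "$#"        # 3 in sh/bash; zsh without SH_WORD_SPLIT would give 1
```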

Wumpus's answer doesn't compute for me - in a list context the ? character is a shell glob. I don't know how else to say that.

To handle the multiple-results case you need to restrict the glob's greediness. The following will just create a test base of awful file names and display it for you:

{ printf %b $(printf \\%04o `seq 0 127`) |
sed "/[^[-b]*/s///g
        s/\(.\)\(.\)/touch '?\v\2' '\1\t\2' '\1\n\2'\n/g" |
. /dev/stdin

echo '`ls` ?QUOTED `-m` COMMA,SEP'
ls -qm
echo ; echo 'NOW LITERAL - COMMA,SEP'
ls -m | cat
( set -- * ; printf "\nFILE COUNT: %s\n" $# )
}

OUTPUT

`ls` ?QUOTED `-m` COMMA,SEP
??\, ??^, ??`, ??b, [?\, [?\, ]?^, ]?^, _?`, _?`, a?b, a?b

NOW LITERAL - COMMA,SEP
?
 \, ?
     ^, ?
         `, ?
             b, [       \, [
\, ]    ^, ]
^, _    `, _
`, a    b, a
b

FILE COUNT: 12

Now I'll safe away every character that isn't a /slash, -dash, :colon, or alphanumeric character behind a shell glob, then sort -u the list for unique results. This is safe because ls has already safed away any non-printable characters for us. Watch:

for f in $(
        ls -1q |
        sed 's|[^-:/[:alnum:]]|[!-\\:[:alnum:]]|g' |
        sort -u | {
                echo 'PRE-GLOB:' >&2
                tee /dev/fd/2
                printf '\nPOST-GLOB:\n' >&2
        }
) ; do
        printf "FILE #$((i=i+1)): '%s'\n" "$f"
done

OUTPUT:

PRE-GLOB:
[!-\:[:alnum:]][!-\:[:alnum:]][!-\:[:alnum:]]
[!-\:[:alnum:]][!-\:[:alnum:]]b
a[!-\:[:alnum:]]b

POST-GLOB:
FILE #1: '?
           \'
FILE #2: '?
           ^'
FILE #3: '?
           `'
FILE #4: '[     \'
FILE #5: '[
\'
FILE #6: ']     ^'
FILE #7: ']
^'
FILE #8: '_     `'
FILE #9: '_
`'
FILE #10: '?
            b'
FILE #11: 'a    b'
FILE #12: 'a
b'

Below I approach the problem again, but with a different methodology. Remember that - besides \0null - the / ASCII character is the only byte forbidden in a filename. I put globs aside here and instead combine the POSIX-specified -d option for ls with the also POSIX-specified -exec $cmd {} + construct for find. Because find will only ever naturally emit one / in sequence, the following easily produces a recursive and reliably delimited filelist including all dentry information for every entry. Just imagine what you might do with something like this:

# note: to do this fully portably substitute an actual newline
# for 'n' in the first sed invocation
cd ..
find ././ -exec ls -1ldin {} + |
sed -e '\| *\./\./|{s||\n.///|;i///' -e \} |
sed 'N;s|\(\n\)///|///\1|;$s|$|///|;P;D'

###OUTPUT

152398 drwxr-xr-x 1 1000 1000        72 Jun 24 14:49
.///testls///

152399 -rw-r--r-- 1 1000 1000         0 Jun 24 14:49
.///testls/?
            \///

152402 -rw-r--r-- 1 1000 1000         0 Jun 24 14:49
.///testls/?
            ^///

152405 -rw-r--r-- 1 1000 1000         0 Jun 24 14:49
.///testls/?
        `///
...

ls -i can be very useful - especially when result uniqueness is in question.

ls -1iq | 
sed '/ .*/s///;s/^/-inum /;$!s/$/ -o /' | 
tr -d '\n' | 
xargs find

These are just the most portable means I can think of. With GNU ls you could do:

ls --quoting-style=WORD
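For example (a sketch of my own, assuming GNU coreutils 8.25+ for the shell-escape style and a shell with $'...' support, such as bash): because shell-escape encodes newlines as $'\n' instead of emitting them literally, every output line is one complete quoted name, so the lines can be joined with spaces and the argument list rebuilt safely with eval.

```shell
#!/usr/bin/env bash
cd "$(mktemp -d)" || exit
touch 'a b' 'a
b'
# shell-escape leaves no literal newlines inside a quoted name,
# so joining the output lines with spaces and eval-ing is safe
eval "set -- $(ls --quoting-style=shell-escape | tr '\n' ' ')"
echo "$#"               # 2
printf '<%s>\n' "$@"    # the original names, newline intact
```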

And last, here's a much simpler method of parsing ls that I happen to use quite often when in need of inode numbers:

ls -1iq | grep -o '^ *[0-9]*'

That just returns inode numbers - -i being another handy POSIX-specified option.

mikeserv
  • 58,310
  • 4
    "ls is fast". Shell globbing is even faster :-). And you can resolve globs without a loop too. echo *. Works perfectly fine. – phemmer May 12 '14 at 02:34
  • @Patrick - shell globbing is NOT faster. Compare the time it takes you to get output identical to ls -R with a shell glob and the time it takes ls to do it. – mikeserv May 12 '14 at 04:02
  • 15
    @mikeserv Ok I did. Shell glob is 2.48 times faster. time bash -c 'for i in {1..1000}; do ls -R &>/dev/null; done' = 3.18s vs time bash -c 'for i in {1..1000}; do echo **/* >/dev/null; done' = 1.28s – phemmer May 12 '14 at 04:05
  • 4
    @mikeserv Nobody tells anyone not to parse find because find has a -print0 argument which uses a null character to delimit the files. A null character cannot be in a filename, thus there's no possibility of ever confusing it. – phemmer May 12 '14 at 04:12
  • @Patrick - GNU find has a -print0 but its use is not portable code. And I demonstrate above that it is not necessary. – mikeserv May 12 '14 at 04:13
  • 3
    @mikeserv you are absolutely correct as -print0 is not defined in POSIX. However I have never seen anyone saying it is common practice to use newline-delimited find output as reliable file delimitation. – phemmer May 12 '14 at 04:17
  • 32
    In regards to your most recent update, please stop relying on visual output as determining that your code works. Pass your output to an actual program and have the program try and perform an operation on the file. This is why I was using stat in my answer, as it actually checks that each file exists. Your bit at the bottom with the sed thing does not work. – phemmer May 12 '14 at 04:20
  • 71
    You can't be serious. How can jumping through all the hoops your question describes be easier or simpler or in any way better than simply not parsing ls in the first place? What you're describing is very hard. I'll need to deconstruct it to understand all of it and I'm a relatively competent user. You can't possibly expect your average Joe to be able to deal with something like this. – terdon May 12 '14 at 04:40
  • @Patrick you cannot pass a\nb to stat! – mikeserv May 12 '14 at 04:40
  • @terdon - what hoops? set -- $(ls -1q | uniq) is all it takes. – mikeserv May 12 '14 at 04:41
  • 4
    @mikeserv Uh, yes you can. touch foo$'\n'bar; stat --format '<%n>' foo*. – phemmer May 12 '14 at 04:49
  • 4
    Not it ain't: touch a$'\n'b a$'\t'b 'a b'; set -- $(ls -1q | uniq); for i; do ls "$i"; done. That will match the a b file twice because of the shell glob issues. – terdon May 12 '14 at 04:49
  • @terdon - you do still have to handle $IFS of course - just like I said in the beginning. IFS="$(printf \\n)" touch a$'\n'b a$'\t'b 'a b'; set -- $(ls -1q | uniq); for i; do ls "$i"; done though what your shell might do to the IFS I don't know - it's better to do an actual newline - as I demonstrate. – mikeserv May 12 '14 at 04:56
  • @Patrick - that's not \n that's an actual newline. You can printf 'stat "%b"\n' "$@" |. /dev/stdin – mikeserv May 12 '14 at 04:57
  • 8
    That makes no difference. Nor does using an actual newline. The dupes happen at the shell globbing level. In any case, even if this did work, it is still one hell of a hoop to go through just cause you don't want to do for f in *; do ...; done. – terdon May 12 '14 at 05:06
  • @terdon - I know where they happen. And I'll update it again to show that - I don't know what you're doing. – mikeserv May 12 '14 at 05:09
  • 1
    I'm copy/pasting your suggestion directly into my terminal and hitting enter. Did you try it? – terdon May 12 '14 at 05:10
  • What happens if there are additional files hard linked to those files? That might mess up the "find by inode" trickery. – sth May 12 '14 at 11:38
  • 60
    -1 for using a question to pick an argument. All of the reasons parsing ls output is wrong were covered well in the original link (and in plenty of other places). This question would have been reasonable if OP were asking for help understanding it, but instead OP is simply trying to prove his incorrect usage is ok. – R.. GitHub STOP HELPING ICE May 12 '14 at 13:05
  • @sth - I don't want to see the same file twice. – mikeserv May 12 '14 at 22:23
  • 3
    @R. - you've got a valid point, but somebody had to say it - and I didn't know how else to do it. And this wasn't just 'picking an argument' I just really dislike misinformation. – mikeserv May 12 '14 at 22:24
  • 1
    @Patrick and you tried your shell globbing after a reboot to cut out file system cache influence? And did so for the first command as well, right? – 0xC0000022L May 12 '14 at 23:28
  • @Patrick - that time thing is pretty good. I personally prefer the shells array and a glob over ls or find or any of the rest but recursive stuff is more difficult. I guess youre using some special shell option or something for recursive globbing - i try to do things as portably as i might so that didnt occur to me. Well done. – mikeserv May 13 '14 at 02:23
  • 10
    You've got to be kidding me. Why would anyone parse text when they can get a list of files/properties directly? – Navin May 13 '14 at 05:15
  • 11
    @mikeserv: It's not misinformation. It's 100% correct. The fact that you refuse to believe it does not change the fact that it's correct. – R.. GitHub STOP HELPING ICE May 13 '14 at 10:33
  • No, @R.. - it is misinformation. I guess i will have to show you once and for all. It is, in fact, incorrect. – mikeserv May 13 '14 at 11:40
  • @R.. It is misinformation - I do not merely refuse to believe it - I demonstrably disprove it. – mikeserv May 13 '14 at 14:07
  • 19
    @mikeserv It's not just that parsing ls is bad. Doing for something in $(command) and relying on word-splitting to get accurate results is bad for the large majority of commands which don't have simple output. –  May 13 '14 at 14:53
  • 1
    @BroSlow - agreed. But that is a far different statement than those I often read. – mikeserv May 13 '14 at 15:01
  • 10
    @mikeserv: You have a stream of text produced by ls that can contain any bytes except the null byte or the slash. There is fundamentally no way to recover that into a list of filenames. The transformation ls does is non-reversible, even if it doesn't replace any nonprintable characters. When it replaces non-printable characters, you're in an even worse situation. – R.. GitHub STOP HELPING ICE May 13 '14 at 16:04
  • @R.. That is a very good point about [:print:]'s complement - for which you've got my vote. I considered the same but didn't care to include it as I could have done by adding that to $IFS temporarily and/or simply setting another variable and adding to the [$glob]. The point is though - ls provides the marker reliably - and I can't see what more you'd need. – mikeserv May 13 '14 at 16:09
  • 11
    @mikeserv: Your claim is simply wrong. If you have a sequence of strings and you concatenate them using a separator that can appear in the individual strings, there is no way to recover the original list of strings. To solve this problem you would need a reversible form of escaping, which ls does not provide. If it did provide such a feature you could write a very complex script to recover the filenames, but it's utter nonsense when the shell gives you a trivially-correct way to do the same thing with no danger of misinterpreting the results. – R.. GitHub STOP HELPING ICE May 13 '14 at 16:11
  • @R.. The shell recovers the strings when the glob is resolved - so long as the pathname exists and the marker is there it is a trivial thing to do. I agree that a shell glob is probably a better way to go about it - but that is a non-trivial thing to do so recursively and/or with any of ls's sort options - which is not to mention retrieving inode numbers. Certainly you must agree that ls -1i | grep -o '^ *[0-9]*' is a simple and non-complex way to parse ls anyway. – mikeserv May 13 '14 at 16:16
  • 9
    @mikeserv: No it doesn't. I think this is the core of your misunderstanding. The shell never concatenates the filenames to begin with. In the shell, globs expand each result to a separate shell word. Thus usages like for i in * ; do ... ; done are safe, whereas usages like for i in $(echo *) ; do ... ; done are not (the latter has a concatenation step followed by a separate word-splitting step). – R.. GitHub STOP HELPING ICE May 13 '14 at 16:18
  • @R.. I don't do $(echo *) I do: set -- 'string'["$glob"]'string' - there is 0 concatenation done by anything but the shell. It is essentially the same - the -vx output is included above. It appears perhaps you've misunderstood? – mikeserv May 13 '14 at 16:20
  • 7
    @mikeserv: No, I just gave the most trivial example to explain the point that concatenation does not occur. The correct usage of set -- with globs also avoids any concatenation and word splitting. The incorrect usage of set -- with the output of ls does involve concatenation (inherent in the way ls writes output: as a stream of bytes, not a list of strings) and word splitting. – R.. GitHub STOP HELPING ICE May 13 '14 at 16:28
  • @R.. Can you demonstrate this? I think you're wrong, but I'd be interested to see otherwise. It does involve concatenation - the shell's own. Admittedly, and for that reason, it does succumb under ARGLEN limits - but that can be handled with xargs - or even just with a heredocument. It is a stream of bytes that ls writes - and for each non-printable we're provided the marker for a glob. I am very curious about your specifying it an incorrect usage of set -- though. It seems to me its as correct as any other. – mikeserv May 13 '14 at 16:32
  • @R.. You know, the shell comes prepackaged with a means of wordsplitting via set -- - it's $IFS and $* for parsing argument arrays. – mikeserv May 13 '14 at 16:38
  • 5
    @mikeserv: The set command, like all commands to the shell, receives a list of arguments (ala argv[]) that come from shell words on the original command line. set itself does not do any word splitting. This is all described in POSIX XCU Chapter 2. Word-splitting is applied to the command line for set, like any other command, but it happens before glob expansion. – R.. GitHub STOP HELPING ICE May 13 '14 at 17:48
  • @R.. It's the shell that does the splitting - and set - as a builtin - is the shell. It is also set that is specifically designed to parse arguments - split or not - according to those handed it by $*. I've read all of that, by the way. There are a lot of topics for which my knowledge is lacking, but this isn't among them. Regardless, I don't see how that is relevant to set -- 'string'["$glob"]'string' – mikeserv May 13 '14 at 17:59
  • 3
    @mikeserv: Then tell me what you want me to demonstrate. Patrick already gave you examples of directory contents which your method fails to parse (because they are indistinguishable by it). For any method you're using (please pick one for the sake of being specific) I'm happy to provide you a trivial example of a directory it fails to properly parse. The fact that you can construct specific examples which you think are "hard" and successfully parse them has no bearing on whether your method works in general. – R.. GitHub STOP HELPING ICE May 13 '14 at 20:00
  • @R.. No, he didn't. What Patrick demonstrates first of all is his shell failing to glob portably, which is its default setting and which is why is I say his answer is misleading. In the Bash portion he demonstrates the greedy properties of the globs matching more than once if possible, which I've handled. It does generally work - above include output from searching the whole tree. So I'm not sure to what it is you refer. http://zsh.sourceforge.net/FAQ/zshfaq03.html – mikeserv May 13 '14 at 21:10
  • 6
    Why parse ls? Seriously, the amount of work you have to do indicates that this is a bad idea. This is what find -print0, xargs -r0, stat, bash while IFS= read -rd $'\0' loops, etc. are for. – Aaron Davies May 15 '14 at 16:55
  • @AaronDavies - if we're not talking about portable stuff, why wouldn't I just do ls --quoting-style=shell-always? I do show a portable xargs 0-delim method above - it works for find as well. – mikeserv May 16 '14 at 00:24
  • 1
    @mikesev what am i supposed to do with the quoted output? what shell tools can read it? – Aaron Davies May 16 '14 at 17:14
  • 2
    @mikeserv i don't see him actually doing anything with the output in his answer, he just mentions the option. what i mean is, something like ls $(ls --quoting-style=shell-always) doesn't work at all. did you have something ls --quoting-style=shell-always|xargs ls in mind? – Aaron Davies May 16 '14 at 21:51
  • @AaronDavies - Well, xargs is what I had in mind - though I think I prefer the c-style escapes. Something like the following could be used with xargs printf %b\\0 - though I think I'd still have to backslash protect 'single-quotes - to recursively return a zero-delimited array of only the largest file in all child directories: ls -1bpRS ././ | sed -n ':d;\|^[.]*/\./|{s|..||;h;n;:sd;\|/$|{n;bsd};\|^$|b;G;s|\(.*\)\n\(.*[^/]\)/*:|\2/\1|p}' – mikeserv May 16 '14 at 22:02
  • @AaronDavies - here's a much simpler version of that - ls -1bpRS ././ | sed -n '\|^\.*/\./|{s/..//;h;:sd;n;\|/$|bsd;/./{H;g;s|:\n|/|p}}' – mikeserv May 16 '14 at 23:25
  • 8
    This is all too insanely complicated to be reliable or maintainable. You have a chance with things like find . -print0 | xargs -0, but little else. Don’t use the shell for complicated things, or not only will you later hate yourself for having done this, so will everyone else, too. – tchrist May 17 '14 at 19:44
  • @tchrist - strange that one with your moniker should be so concerned with popularity. In any case - please follow the POSIX link and look yourself at the ls output specs - as it seems to me, ls is designed to be parsed. You might also consider changing IFS=<tab> since <tab> in filenames is already protected. – mikeserv May 17 '14 at 19:50
  • 4
    "Never do X" does not imply that "X can never be done unambiguously". That said, with respect to Wumpus's answer, you're sorely misunderstanding it. It isn't refuting that ? is a glob character; it's refuting your unstated assumption that, absent any ?s inserted, all filenames will match themselves (and only themselves) when interpreted as glob expressions. – Charles Duffy Jul 29 '15 at 16:35
  • 4
    That is to say: A file named [x], with literal square brackets, is a counterexample to this claim, because the filename [x] is not matched by the glob expression [x]. Thus, the glob expression [x]? will not match the filename $'[x]\n'. – Charles Duffy Jul 29 '15 at 16:36
  • 1
    If you want to see the effects of this magnified, by the way, you might consider working with the nullglob shell option enabled. – Charles Duffy Jul 29 '15 at 16:40
  • @mikeserv : Just a minor issue: The -1 option for ls is unnecessary in your examples, because it is the default for those cases you are using. Compare for instance a plain ls (multi-column output) vs. ls|cat (single-column output). – user1934428 Jul 10 '20 at 08:06
  • 2
    As I tell my juniors... if you need to break the rules the do it and add an extensive comment explaining why you needed to break them. If writing your code without breaking the rules is quicker then writing the comment then don't break the rules. -1 because you want to justify breaking the rules by claiming they were not a rule in the first place. – Philip Couling Oct 06 '21 at 17:57
  • I can't find the previous bounty notice from muru at the moment, but it appears that GNU coreutils' ls will soon have a --zero option: https://fossies.org/linux/coreutils/ChangeLog – Jeff Schaller Oct 20 '21 at 12:37
  • 1
    @JeffSchaller it already does, version 9.0 was released on September 24 with that feature (but isn’t available in many distributions yet). – Stephen Kitt Oct 20 '21 at 12:38
  • (Available in Arch Linux already, though) – muru Oct 20 '21 at 12:58

10 Answers

233

I am not at all convinced of this, but let's suppose for the sake of argument that you could, if you're prepared to put in enough effort, parse the output of ls reliably, even in the face of an "adversary" — someone who knows the code you wrote and is deliberately choosing filenames designed to break it.

Even if you could do that, it would still be a bad idea.

Bourne shell1 is a bad language. It should not be used for anything complicated, unless extreme portability is more important than any other factor (e.g. autoconf).

I claim that if you're faced with a problem where parsing the output of ls seems like the path of least resistance for a shell script, that's a strong indication that whatever you are doing is too complicated to be a shell script and you should rewrite the entire thing in Perl, Python, Julia, or any of the other good scripting languages that are readily available. As a demonstration, here's your last program in Python:

import os, sys
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
        ino = os.lstat(os.path.join(subdir, f)).st_ino
        sys.stdout.write("%d %s %s\n" % (ino, subdir, f))

This has no issues whatsoever with unusual characters in filenames -- the output is ambiguous in the same way the output of ls is ambiguous, but that wouldn't matter in a "real" program (as opposed to a demo like this), which would use the result of os.path.join(subdir, f) directly.

Equally important, and in stark contrast to the thing you wrote, it will still make sense six months from now, and it will be easy to modify when you need it to do something slightly different. By way of illustration, suppose you discover a need to exclude dotfiles and editor backups, and to process everything in alphabetical order by basename:

import os, sys
filelist = []
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
        if f[0] == '.' or f[-1] == '~': continue
        lstat = os.lstat(os.path.join(subdir, f))
        filelist.append((f, subdir, lstat.st_ino))

filelist.sort(key = lambda x: x[0])
for f, subdir, ino in filelist:
    sys.stdout.write("%d %s %s\n" % (ino, subdir, f))


1 Yes, extended versions of the Bourne shell are readily available nowadays: bash and zsh are both considerably better than the original. The GNU extensions to the core "shell utilities" (find, grep, etc.) also help a lot. But even with all the extensions, the shell environment is not improved enough to compete with scripting languages that are actually good, so my advice remains "don't use shell for anything complicated" regardless of which shell you're talking about.

"What would a good interactive shell that was also a good scripting language look like?" is a live research question, because there is an inherent tension between the conveniences required for an interactive CLI (such as being allowed to type cc -c -g -O2 -o foo.o foo.c instead of subprocess.run(["cc", "-c", "-g", "-O2", "-o", "foo.o", "foo.c"])) and the strictures required to avoid subtle errors in complex scripts (such as not interpreting unquoted words in random locations as string literals). If I were to attempt to design such a thing, I'd probably start by putting IPython, PowerShell, and Lua in a blender, but I have no idea what the result would look like.

zwol
  • 7,177
  • 5
    This is good. Does that for in | for in speak of recursion? I'm not sure. Even if it is it can't be more than one, right? This is the only answer that makes sense to me so far. – mikeserv May 12 '14 at 22:58
  • 15
    No recursion, just nested for-loops. os.walk is doing some seriously heavy lifting behind the scenes, but you don't have to worry about it any more than you have to worry about how ls or find work internally. – zwol May 12 '14 at 23:04
  • Here's the documentation for os.walk. – zwol May 12 '14 at 23:08
  • 10
    Technically, os.walk returns a generator object. Generators are Python's version of lazy lists. Every time the outer for-loop iterates, the generator is invoked and "yields" the contents of another subdirectory. Equivalent functionality in Perl is File::Find, if that helps. – zwol May 12 '14 at 23:12
  • 8
    You should be aware that I 100% agree with the document you are criticizing and with Patrick and Terdon's answers. My answer was intended to provide an additional, independent reason to avoid parsing ls output. – zwol May 13 '14 at 17:02
  • 2
    I should note that in various comments, mikeserv's primary reason for parsing ls was that he can do some additional preprocessing (like sorting or filtering with grep) before the traversal. This alternative does not currently do that. – Izkata May 13 '14 at 18:36
  • 1
    @Izkata That sort of thing is easy to add. For instance, to exclude all Emacs editor backups from processing: if f.endswith("~"): continue – zwol May 13 '14 at 20:00
  • What'd you do, Zack! When I accepted this answer it was not a "I completely agree answer" - just so everyone knows. That was added after my acceptance of it. There was no agreement before. I didn't accept it because it flattered me - I accepted it because I believe it's correct. If I wanna do the stuff above, maybe string slinging isn't the best approach. Also I like @terdon's answer - it is informative, and it was influential to my figuring it out. Patrick's was also, but it is misleading because it represents, in the majority, a basic zsh compatibility issue more than otherwise. – mikeserv May 13 '14 at 20:23
  • 27
    This is very misleading. Shell isn't a good programming language, but only because it isn't a programming language. It's a scripting language. And it's a good scripting language. – mrr May 13 '14 at 21:38
  • @zwol, this might sound stupid but what do you recommend as an alternative to the shell? If I need to list files or change a file name, I open a terminal and run ls or mv. Maybe I'm misunderstanding something in your last comment? – user1717828 Sep 30 '15 at 18:21
  • 1
    @user1717828 It's OK for interactive use-- it's not great, but it'll do. I am only talking about scripting here. – zwol Sep 30 '15 at 19:32
  • 2
    By the way, why sys.stdout.write rather than print? – k_g Jun 11 '17 at 09:18
  • @k_g Because I prefer not to use print. Just a personal thing, not a Statement. – zwol Jun 11 '17 at 13:22
  • 2
    Any algorithm is complicated when the process it is modeling is not completely or correctly understood. That is a specious and subjective conclusion to make regarding the usefulness of Bourne Shell or BASH. Shell scripting is important for working around the command line. I don't think it was intended to be comparable with an environment created totally for algorithmic processing. – Ken Ingram Sep 19 '19 at 21:25
  • 2
    And we keep the meme alive, when asking howing to do a thing in X get told you should be using Y. What about platforms that cannot run python, eg android? – unixandria Nov 13 '20 at 16:10
  • @unixandria If you can run bash, you can run the interpreter for at least one other language that is less terrible than bash. Python is only an example. – zwol Apr 06 '21 at 17:41
  • 4
    @MilesRout Is there a single principle that can be used to distinguish between a "programming language" and a "scripting language", which holds true across time, space, and users? I doubt it very much. If I enter git some_command, I don't care if it is implemented in C or Perl or Bash. And in fact some git commands started life as one and were rewritten as another. Some languages can be either interpreted or compiled. Is Lisp a "scripting language" when interpreted and a "programming language" when compiled, even if the code is identical? – iconoclast Sep 06 '21 at 19:29
  • 2
    @MilesRout Also, the claim that Bash is good doesn't address the criticisms raised in this answer. Can you support that claim? What makes it good, despite the problems described? – iconoclast Sep 06 '21 at 19:31
  • "What would a good interactive shell that was also a good scripting language look like?" - It would look like eshell. – Vlatko Šurlan Oct 28 '21 at 06:54
  • 1
    @iconoclast It's a comment, not an essay. It's pretty well established what a 'scripting language' is in my opinion. :) – mrr Nov 29 '21 at 05:04
  • 3
    @MilesRout if it's well established then give me a single principle that distinguishes between them. And anyway, why would putting it in a different category shield it from all criticism. That makes no sense. – iconoclast Nov 29 '21 at 05:43
  • 7
    @iconoclast For the record, I assert that Bourne shell is both a bad programming language and a bad scripting language, no matter how you choose to define those terms. It's just bad, period. – zwol Nov 29 '21 at 14:00
  • @zwol Yes, I can't argue with you. I found your answer very helpful. I'm not disagreeing with you about the flaws in the Bourne etc. shells. I'm disagreeing with Miles that (1) there's a clear & settled distinction between programming languages and scripting languages, and (2) that classifying the shells as "scripting languages" somehow shields them from your criticism. – iconoclast Nov 30 '21 at 03:05
  • 1
    @iconoclast Firstly, I have no idea why you're nitpicking over a 7-year-old comment. That out of the way, I think it's pretty clear what the distinction is and I think you're just pretending not to know it. You know what shells are used for, you know what scripting is, you know what scripts are, you know what a scripting language is and you can clearly see the difference between writing a program and writing a script to kludge the results of other programs together. Nobody is "shielding it from all criticism", just inappropriate criticism. Criticising bash because it's not good for writing ... – mrr Nov 30 '21 at 06:49
  • 1
    @iconoclast ... large programs of the type you'd write in C++ is completely missing the point. Criticism of bash should be directed at what it's actually used for and what it's intended to be used for. Criticising bash for not being a good language to write large programs in is like criticising Python for not being a good language to write device drivers in: it's not false but it is totally missing the point of the language. Saying 'bash is a bad language' because it's bad for writing complicated programs is simply unfair and not very useful. The problem with the question is not that the ... – mrr Nov 30 '21 at 06:52
  • @iconoclast ... asker is using bash at all, but that he's trying to do something complex in Bash. There are lots of uses for bash where it absolutely shines, where the equivalent Python would be 5x as long and where the C equivalent would still be getting CVEs 7 years after the question was asked. :) – mrr Nov 30 '21 at 06:53
  • 5
    @MilesRout I don't particularly wish to continue this argument any further, but just to make my position absolutely clear, I will acknowledge that there are plenty of programs that would be 5x longer in Python than in any variant of Bourne shell, but I assert that most of them should be written in Python (or some other non-terrible scripting language) even so. Because they will, despite being longer, be easier to write correctly, easier to read and confirm their correctness, and easier to modify in the future. – zwol Nov 30 '21 at 14:55
221

That link is referenced a lot because the information is completely accurate, and it has been there for a very long time.


Yes, ls replaces non-printable characters with glob characters, but those characters aren't in the actual filename. Why does this matter? Two reasons:

  1. If you pass that displayed name to a program, no file by that name actually exists. The program would have to expand the glob to get the real filename.
  2. The file glob might match more than one file.

For example:

$ touch a$'\t'b
$ touch a$'\n'b
$ ls -1
a?b
a?b

Notice how we have 2 files which look exactly the same. How are you going to distinguish them if they both are represented as a?b?


The author calls it garbling filenames when ls returns a list of filenames containing shell globs and then recommends using a shell glob to retrieve a file list!

There is a difference here. When you get a glob back, as shown, that glob might match more than one file. However when you iterate through the results matching a glob, you get back the exact file, not a glob.

For example:

$ for file in *; do printf '%s' "$file" | xxd; done
0000000: 6109 62                                  a.b
0000000: 610a 62                                  a.b

Notice how the xxd output shows that $file contained the raw characters \t and \n, not ?.

If you use ls, you get this instead:

$ for file in $(ls -1q); do printf '%s' "$file" | xxd; done
0000000: 613f 62                                  a?b
0000000: 613f 62                                  a?b

"I'm going to iterate anyway, why not use ls?"

The example you gave doesn't actually work. It looks like it works, but it doesn't.

I'm referring to this:

 for f in $(ls -1q | tr " " "?") ; do [ -f "$f" ] && echo "./$f" ; done

I've created a directory with a bunch of file names:

$ for file in *; do printf '%s' "$file" | xxd; done
0000000: 6120 62                                  a b
0000000: 6120 2062                                a  b
0000000: 61e2 8082 62                             a...b
0000000: 61e2 8083 62                             a...b
0000000: 6109 62                                  a.b
0000000: 610a 62                                  a.b

When I run your code, I get this:

$ for f in $(ls -1q | tr " " "?") ; do [ -f "$f" ] && echo "./$f" ; done
./a b
./a b

Where'd the rest of the files go?

Let's try this instead:

$ for f in $(ls -1q | tr " " "?") ; do stat --format='%n' "./$f"; done
stat: cannot stat ‘./a?b’: No such file or directory
stat: cannot stat ‘./a??b’: No such file or directory
./a b
./a b
stat: cannot stat ‘./a?b’: No such file or directory
stat: cannot stat ‘./a?b’: No such file or directory

Now let's use an actual glob:

$ for f in *; do stat --format='%n' "./$f"; done
./a b
./a  b
./a b
./a b
./a b
./a
b

With bash

The above example was with my normal shell, zsh. When I repeat the procedure with bash, I get another completely different set of results with your example:

Same set of files:

$ for file in *; do printf '%s' "$file" | xxd; done
0000000: 6120 62                                  a b
0000000: 6120 2062                                a  b
0000000: 61e2 8082 62                             a...b
0000000: 61e2 8083 62                             a...b
0000000: 6109 62                                  a.b
0000000: 610a 62                                  a.b

Radically different results with your code:

$ for f in $(ls -1q | tr " " "?") ; do stat --format='%n' "./$f"; done
./a b
./a b
./a b
./a b
./a
b
./a  b
./a b
./a b
./a b
./a b
./a b
./a b
./a
b
./a b
./a b
./a b
./a b
./a
b

With a shell glob, it works perfectly fine:

$ for f in *; do stat --format='%n' "./$f"; done
./a b
./a  b
./a b
./a b
./a b
./a
b

The reason bash behaves this way goes back to one of the points I made at the beginning of the answer: "The file glob might match more than one file".

ls is returning the same glob (a?b) for several files, so each time we expand this glob, we get every single file that matches it.
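This re-expansion is easy to reproduce in a throwaway directory (a sketch; the mktemp scratch directory and the file names are only for illustration):

```shell
# Two files whose names both display as a?b under ls -q
cd "$(mktemp -d)"
touch a$'\t'b a$'\n'b

# ls -1q emits the word a?b twice; left unquoted, each copy is then
# glob-expanded by the shell, and each one matches BOTH files:
set -- $(ls -1q)
echo "$#"    # 4 words, although only 2 files exist
```

Quoting the substitution ("$(ls -1q)") suppresses the re-expansion, but then you just get one useless multi-line string instead of filenames.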


How to recreate the list of files I was using:

touch 'a b' 'a  b' a$'\xe2\x80\x82'b a$'\xe2\x80\x83'b a$'\t'b a$'\n'b

The hex-code ones are the UTF-8 EN SPACE (U+2002) and EM SPACE (U+2003) characters.

phemmer
  • 71,831
  • Of course it might match more than one file - it's a glob! but so is his recommended solution! – mikeserv May 12 '14 at 02:01
  • 6
    @mikeserv actually his solution doesn't return a glob. I just updated my answer to clarify that point. – phemmer May 12 '14 at 02:02
  • His solution is a glob - he suggests iterating over a glob. His exact words are: "If you just want to iterate over all the files in the current directory, use a for loop and a glob:" – mikeserv May 12 '14 at 02:03
  • Yeah - the filename is resolved after you resolve it in the shell. Of course - that's what you have to do with a glob. But if I have to do that anyway, I'd much rather have ls generate the data - sorted certain ways, recursively, and *very* fast - than I would otherwise. And this answer only distracts from the actual question - *why not parse ls*? for f in $(ls -1q)... is faster and more reliable than is for f in GLOB... – mikeserv May 12 '14 at 02:06
  • 2
    @mikeserv updated again. See the section: In regards to just "I'm going to iterate anyway, why not use ls?" – phemmer May 12 '14 at 02:16
  • 1
    What is your shell? I ask because the only shell in which I get a similar result is zsh - all others such as bash, sh, and dash correctly resolve the pathnames as is POSIX specified. – mikeserv May 12 '14 at 02:21
  • 2
    @mikeserv zsh. When I switch to bash it's just as bad. I've updated for bash. – phemmer May 12 '14 at 02:26
  • 1
    Ahh. So there are multiples and that, at least, might be a reason. Not the rest. Add a copy/paste of the command that creates these file names. 10/1 odds I can handle it easily. If not, I'll accept the answer. – mikeserv May 12 '14 at 02:28
  • 23
    "Not the rest"? It's inconsistent behavior, and unexpected results, how is that not a reason? – phemmer May 12 '14 at 02:32
  • 4
    @mikeserv You could avoid the duplicates with something like for f in $(ls -1q | tr " " "?" | sed 's/^/"/; s/$/"/') ; do echo "$f"; done. But why not just for f in *; do echo "$f"; done? – terdon May 12 '14 at 02:32
  • @mikeserv sorry, didn't see your updated comment. I've added a command at the bottom of the answer that will recreate the full list of files I was using. – phemmer May 12 '14 at 03:20
  • 2
    Well, @terdon I can avoid the duplicates like set $(ls -1q | uniq) and mostly I would use a shell glob - but I do not appreciate the spread of misinformation. And what if I want to do a recursive ls? Doing the same in the shell is slow. I still don't see a real reason not to parse ls - and no one has shown me differently. – mikeserv May 12 '14 at 04:17
  • 1
  • @Patrick - and no, zsh's failing to follow a prescribed standard such as word splitting and shell glob expansion is not my failure. – mikeserv May 12 '14 at 04:20
  • 14
    @mikeserv Did you not see my comment on your question? Shell globbing is 2.5 times faster than ls. I also requested that you test your code as it does not work. What does zsh have to do with any of this? – phemmer May 12 '14 at 04:29
  • @Patrick - what are you talking about? What code doesn't work? And zsh is the shell with which you demonstrated more than half of your answer. So far the only argument you've come up with that I can see that is worth regarding is the multiples in the output. But I handled that with uniq. As for shell globs being faster, I'm inclined to agree. But it's not easy to get a recursive output or any of the other things ls can do - and it is not an argument in favor of *Don't parse ls*. – mikeserv May 12 '14 at 04:37
  • 1
    @Patrick - you need to note in your answer that 75% of it is generated with zsh without the compatibility variable SH_WORD_SPLIT set. It's misleading as is - and that's the whole problem in the first place. http://zsh.sourceforge.net/FAQ/zshfaq03.html – mikeserv May 12 '14 at 05:06
  • 31
    @mikeserv No, it all still applies even to bash. Though I'm done with this question as you're not listening to what I'm saying. – phemmer May 12 '14 at 05:37
  • 9
    You know what, I think I'll upvote this answer and clarify in mine that I agree with everything it says. ;-) – zwol May 13 '14 at 17:03
  • Note that to make things worse, most implementations of ls do not quote strange file names or replace weird characters with question marks. And even GNU ls didn't do so for a long time. So even if you rely on this behaviour, your code is wrong. – FUZxxl Aug 17 '21 at 17:13
67

The output of ls -q isn't a glob at all. It uses ? to mean "There is a character here that can't be displayed directly". Globs use ? to mean "Any character is allowed here".

Globs have other special characters (* and [] at least, and inside the [] pair there are more). None of those are escaped by ls -q.

$ touch x '[x]'
$ ls -1q
[x]
x

If you treat the ls -1q output as a set of globs and expand them, not only will you get x twice, you'll miss [x] completely. As a glob, it doesn't match itself as a string.
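A sketch of that failure mode, in a throwaway directory:

```shell
cd "$(mktemp -d)"
touch x '[x]'

# ls -1q prints both names literally, since nothing is unprintable...
ls -1q
# ...but treated as globs, [x] is a character class that matches "x",
# so the loop sees x twice and the file named [x] never appears:
for f in $(ls -1q); do echo "$f"; done
```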

ls -q is meant to save your eyes and/or terminal from crazy characters, not to produce something that you can feed back to the shell.

55

Let's try and simplify a little:

$ touch a$'\n'b a$'\t'b 'a b'
$ ls
a b  a?b  a?b
$ IFS="
"
$ set -- $(ls -1q | uniq)
$ echo "Total files in shell array: $#"
Total files in shell array: 4

See? That's already wrong right there. There are 3 files but bash is reporting 4. This is because set is being given the globs generated by ls, which are expanded by the shell before being passed to set. Which is why you get:

$ for x ; do
>     printf 'File #%d: %s\n' $((i=$i+1)) "$x"
> done
File #1: a b
File #2: a b
File #3: a    b
File #4: a
b

Or, if you prefer:

$ printf ./%s\\0 "$@" |
> od -A n -c -w1 |
> sed -n '/ \{1,3\}/s///;H
> /\\0/{g;s///;s/\n//gp;s/.*//;h}'
./a b
./a b
./a\tb
./a\nb

The above was run on bash 4.2.45.

terdon
  • 242,166
  • 2
    I upvoted this. It's good to see your own code bite you. But just because I got it wrong doesn't mean it can't be done right. I showed you a very simple way to do it this morning with ls -1qRi | grep -o '^ *[0-9]*' - that's parsing ls output, man, and it's the fastest and best way of which I know to get a list of inode numbers. – mikeserv May 12 '14 at 22:56
  • 48
    @mikeserv: It could be done right, if you have the time and patience. But the fact is, it is inherently error-prone. You yourself got it wrong. while arguing about its merits! That's a huge strike against it, if even the one person fighting for it fails to do it correctly. And chances are, you'll probably spend still more time getting it wrong before you get it right. I dunno about you, but most people have better to do with their time than fiddle around for ages with the same line of code. – cHao May 13 '14 at 01:06
  • @cHao - i didnt argue its merits - i protested its propaganda. – mikeserv May 13 '14 at 01:47
  • 20
    @mikeserv: The arguments against it are well-founded and well-deserved. Even you have shown them to be true. – cHao May 13 '14 at 01:50
  • 2
    @cHao - i disagree. There is a not-so-fine line between a mantra and a wisdom. – mikeserv May 13 '14 at 01:51
42

The answer is simple: The special cases of ls you have to handle outweigh any possible benefit. These special cases can be avoided if you don't parse ls output.

The mantra here is never trust the user filesystem (the equivalent of never trust user input). If there's a method that will work always, with 100% certainty, it should be the method you prefer, even if ls does the same but with less certainty. I won't go into technical details since those were covered by terdon and Patrick extensively. I know that due to the risks of using ls in an important (and maybe expensive) transaction where my job/prestige is on the line, I will prefer any solution that doesn't have a degree of uncertainty if it can be avoided.
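One method with that kind of certainty is NUL-delimited find output (a sketch; it assumes GNU or BSD find for -mindepth/-maxdepth, and bash for read -d '' and process substitution; the loop body is just a placeholder):

```shell
# NUL is the one byte that can never occur in a pathname, so
# NUL-delimited output is unambiguous for ANY set of filenames.
while IFS= read -r -d '' f; do
    printf 'got: %s\n' "$f"    # placeholder: do real work here
done < <(find . -mindepth 1 -maxdepth 1 -print0)
```

Unlike parsing ls, this handles names containing spaces, newlines, globs, and any other byte without special cases.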

I know some people prefer some risk over certainty, but I've filed a bug report.

Braiam
  • 35,991
36

The reason people say never do something isn't necessarily because it absolutely positively cannot be done correctly. We may be able to do so, but it may be more complicated or less efficient, space- or time-wise. For example it would be perfectly fine to say "Never build a large e-commerce backend in x86 assembly".

So now to the issue at hand: As you've demonstrated you can create a solution that parses ls and gives the right result - so correctness isn't an issue.

Is it more complicated? Yes, but we can hide that behind a helper function.

So now to efficiency:

Space-efficiency: Your solution relies on uniq to filter out duplicates, so we cannot generate the results lazily. It's O(1) vs. O(n) space at best, or both end up O(n).

Time-efficiency: Best case, uniq uses a hashmap approach, so we still have an O(n) algorithm in the number of elements procured; more probably it's O(n log n).

Now the real problem: while your algorithm is still not looking too bad, I was really careful to say elements procured and not n elements, because that makes a big difference. Say you have a file \n\n: that results in the glob ??, which matches every 2-character file in the listing. Funnily, if you have another file \n\r, that also results in ?? and also returns all 2-character files... see where this is going? Quadratic instead of linear behavior certainly qualifies as "worse runtime behavior"; it's the difference between a practical algorithm and one you write papers in theoretical CS journals about.
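The blowup is easy to reproduce at small scale (a sketch in a throwaway directory; the names are only for illustration):

```shell
# Four 2-character names, two of them made of control characters,
# so ls -1q emits ?? for each of those two:
cd "$(mktemp -d)"
touch $'\n\n' $'\n\r' ab cd

# Each unquoted ?? word re-expands to ALL four 2-character names:
set -- $(ls -1q)
echo "$#"    # 10 words (4 + 4 + 1 + 1) for only 4 files
```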

Everybody loves examples right? Here we go. Make a folder called "test" and use this python script in the same directory where the folder is.

#!/usr/bin/env python3
import itertools
import os

dirname = "test/"
filename_length = 3
options = "\a\b\t\n\v\f\r"

os.makedirs(dirname, exist_ok=True)
for filename in itertools.product(options, repeat=filename_length):
    # create an empty file whose name is this 3-character combination
    open(dirname + ''.join(filename), "a").close()

All this does is generate every product of length 3 from 7 characters. High-school math tells us that ought to be 7³ = 343 files. Well, that ought to be really quick to print, so let's see:

time for f in *; do stat --format='%n' "./$f" >/dev/null; done
real    0m0.508s
user    0m0.051s
sys 0m0.480s

Now let's try your first solution, because I really can't get this

eval set -- $(ls -1qrR ././ | tr ' ' '?' |
sed -e '\|^\(\.\{,1\}\)/\.\(/.*\):|{' -e \
        's//\1\2/;\|/$|!s|.*|&/|;h;s/.*//;b}' -e \
        '/..*/!d;G;s/\(.*\)\n\(.*\)/\2\1/' -e \
        "s/'/'\\\''/g;s/.*/'&'/;s/?/'[\"?\$IFS\"]'/g" |
uniq)

thing here to work on Linux Mint 16 (which I think speaks volumes for the usability of this method).

Anyhow, since the above pretty much only filters the result after it gets it, the earlier solution should be at least as quick as the latter (no inode tricks in that one, but those are unreliable anyway, so you'd be giving up correctness).

So now how long does

time for f in $(ls -1q | tr " " "?") ; do stat --format='%n' "./$f" >/dev/null; done

take? Well, quite a while: each of the 343 ??? globs re-expands to all 343 files, so the loop runs stat 343 × 343 = 117,649 times, spawning a process for each.

Voo
  • 825
  • 7
    Of course, as mentioned in comments under another answer, the statement "...you've demonstrated you can create a solution that parses ls and gives the right result..." is actually not true. – Wildcard Jan 10 '16 at 11:12
  • uniq only removes (or counts or selects) adjacent duplicate lines, so any implementation that isn't O(1) space and O(n) time is brain-dead. – dave_thompson_085 Aug 26 '23 at 05:22
  • As @dave_thompson_085 notes, uniq only removes adjacent duplicates. In practice that means the input has to be sorted lexicographically, which obviates one of the few advantages of ls over a simple glob - that you can sort by other fields, such as mtime or inode-type. – Martin Kealey Oct 21 '23 at 03:57
29

OP's Stated Intention Addressed

preface and original answer's rationale (updated on 2015-05-18)

mikeserv (the OP) stated in latest update to his question: "I do consider it a shame though that I first asked this question to point out a source of misinformation, and, unfortunately, the most upvoted answer here is in large part misleading."

Well, okay; I feel it was rather a shame that I spent so much time trying to figure out how to explain my meaning, only to find as I re-read the question that it had "[generated] discussion rather than answers" and ended up weighing in at ~18K of text (for the question alone, just to be clear), which would be long even for a blog post.

But StackExchange is not your soapbox, and it's not your blog. However, in effect, you have used it as at least a bit of both. People ended up spending a lot of time answering your "To-Point-Out" instead of answering people's actual questions. At this point I will be flagging the question as not a good fit for our format, given that the OP has stated explicitly that it wasn't even intended to be a question at all.

At this point I'm not sure whether my answer was to the point or not; probably not, but it was directed at some of your questions, and maybe it can be a useful answer to someone else. Beginners, take heart: some of those "do not"s turn into "do sometimes" once you get more experienced. :)

As a General Rule...

please forgive remaining rough edges; i have spent far too much time on this already... rather than quote the OP directly (as originally intended) i will try to summarize and paraphrase.

[largely reworked from my original answer]
upon consideration, i believe that i mis-read the emphasis that the OP was placing on the questions i answered; however, the points addressed were brought up, and i have left the answers largely intact as i believe them to be to-the-point and to address issues that i've seen brought up in other contexts as well regarding advice to beginners.

The original post asked, in several ways, why various articles gave advice such as «Don't parse ls output» or «You should never parse ls output», and so forth.

My suggested resolution to the issue is that instances of this kind of statement are simply examples of an idiom, phrased in slightly different ways, in which an absolute quantifier is paired with an imperative [e.g., «don't [ever] X», «[you should] always Y», «[one should] never Z»] to form statements intended to be used as general rules or guidelines, especially when given to those new to a subject, rather than being intended as absolute truths, the apparent form of those statements notwithstanding.

When you're beginning to learn new subject matter, and unless you have some good understanding of why you might need to do else-wise, it's a good idea to simply follow the accepted general rules without exception—unless under guidance from someone more experienced than yourself. With rising skill and experience you become further able to determine when and if a rule applies in any particular situation. Once you do reach a significant level of experience, you will likely understand the reasoning behind the general rule in the first place, and at that point you can begin to use your judgement as to whether and to what level the reasons behind the rule apply in that situation, and also as to whether there are perhaps overriding concerns.

And that's when an expert, perhaps, might choose to do things in violation of "The Rules". But that wouldn't make them any less "The Rules".

And, so, to the topic at hand: in my view, just because an expert might be able to violate this rule without getting completely smacked down, i don't see any way that you could justify telling a beginner that "sometimes" it's okay to parse ls output, because: it's not. Or, at least, certainly it's not right for a beginner to do so.

You always put your pawns in the center; in the opening one piece, one move; castle at the earliest opportunity; knights before bishops; a knight on the rim is grim; and always make sure you can see your calculation through to the end! (Whoops, sorry, getting tired, that's for the chess StackExchange.)

Rules, Meant to Be Broken?

When reading an article on a subject that is targeted at, or likely to be read by, beginners, often you will see things like this:

  • "You should not ever do X."
  • "Never do Q!"
  • "Don't do Z."
  • "One should always do Y!"
  • "C, no matter what."

While these statements certainly seem to be stating absolute and timeless rules, they are not; instead this is a way of stating general rules [a.k.a. "guidelines", "rules of thumb", "the basics", etc.] that is at least arguably one appropriate way to state them for the beginners that might be reading those articles. However, just because they are stated as absolutes, the rules certainly don't bind professionals and experts [who were likely the ones who summarized such rules in the first place, as a way to record and pass on knowledge gained as they dealt with recurring issues in their particular craft.]

Those rules certainly aren't going to reveal how an expert would deal with a complex or nuanced problem, in which, say, those rules conflict with each other; or in which the concerns that led to the rule in the first place simply don't apply. Experts are not afraid to (or should not be afraid to!) simply break rules that they happen to know don't make sense in a particular situation. Experts are constantly balancing various risks and concerns in their craft, and must frequently use their judgement to break those kinds of rules, weighing various factors rather than just relying on a table of rules to follow. Take goto as an example: there's been a long, recurring debate on whether it is harmful. (Yeah, don't ever use gotos. ;D)

A Modal Proposition

An odd feature, at least in English, and I imagine in many other languages, of general rules, is that they are stated in the same form as a modal proposition, yet the experts in a field are willing to give a general rule for a situation, all the while knowing that they will break the rule when appropriate. Clearly, therefore, these statements aren't meant to be equivalent to the same statements in modal logic.

This is why i say they must simply be idiomatic. Rather than truly being a "never" or an "always" situation, these rules usually serve to codify general guidelines that tend to be appropriate over a wide range of situations, and that, when beginners follow them blindly, are likely to produce far better results than if the beginner chose to go against them without good reason. Sometimes the rules merely guard against substandard results, rather than the outright failures that accompany incorrect choices made in going against them.

So, general rules are not the absolute modal propositions they appear to be on the surface, but instead are a shorthand way of giving the rule with a standard boilerplate implied, something like the following:

unless you have the ability to tell that this guideline is incorrect in a particular case, and prove to yourself that you are right, then ${RULE}

where, of course you could substitute "never parse ls output" in place of ${RULE}. :)

Oh Yeah! What About Parsing ls Output?

Well, so, given all that... i think it's pretty clear that this rule is a good one. First of all, the real rule has to be understood to be idiomatic, as explained above...

But furthermore, it's not just that you have to be very good with shell scripting to know whether the rule can be broken in some particular case. It's also that it takes just as much skill to tell that you got it wrong when you are trying to break it in testing! And I say confidently that a very large majority of the likely audience of such articles (giving advice like «Don't parse the output of ls!») can't do those things, and those that do have such skill will likely realize it, figure things out on their own, and ignore the rule anyway.

But... just look at this question, and how even people that probably do have the skill thought it was a bad call to do so; and how much effort the author of the question spent just getting to the current best example! I guarantee you that on a problem that hard, 99% of the people out there would get it wrong, with potentially very bad results! Even if the method that is decided on turns out to be a good one: until it (or another) ls-parsing idea becomes adopted by IT/developer folk as a whole, withstands a lot of testing (especially the test of time), and finally manages to graduate to 'common technique' status, it's likely that a lot of people will try it and get it wrong... with disastrous consequences.

So, I will reiterate one last time.... that, especially in this case, that is why "never parse ls output!" is decidedly the right way to phrase it.

[UPDATE 2014-05-18: clarified reasoning for answer (above) to respond to a comment from OP; the following addition is in response to the OP's additions to the question from yesterday]

[UPDATE 2014-11-10: added headers and reorganized/refactored content; and also: reformatting, rewording, clarifying, and um... "concise-ifying"... i intended this to simply be a clean-up, though it did turn into a bit of a rework. i had left it in a sorry state, so i mainly tried to give it some order. i did feel it was important to largely leave the first section intact; so only two minor changes there, redundant 'but' removed, and 'that' emphasized.]

† I originally intended this solely as a clarification on my original; but decided on other additions upon reflection

‡ see https://unix.stackexchange.com/tour for guidelines on posts

  • 2
    Never isn't idiomatic. This is not an answer to anything. – mikeserv May 17 '14 at 17:52
  • 1
    Hmm. Well, I didn't know whether this answer would be satisfying but I absolutely didn't expect it to be controversial. And, I didn't (mean to) argue that 'never' was per se idiomatic; but that "Never do X!" is an idiomatic use. I see two general cases that can show that 'Never/don't parse ls!' is correct advice: 1. demonstrate (to your satisfaction) that every use-case where one might parse ls output has another available solution, superior in some way, without doing so. 2. show that, in the cited cases, the statement is not a literal one. – shelleybutterfly May 18 '14 at 06:50
  • Looking at your question again, I see that you first mention "don't ..." rather than "never ..." which is well into your analysis, so I'll clarify on that point as well. At this point there's already a solution of the first type, which is apparently demonstrated/explained to your satisfaction, so I won't delve into there much. But I'll try and clarify my answer a bit: like I say, I wasn't trying to be controversial (or confrontational!) but to point out how those statements are generally intended. – shelleybutterfly May 18 '14 at 06:53
  • I actually posted it since I didn't see it given as an explanation; and, since I have more than once had someone hung up on my choice of a solution because 'but "everyone" says never to do that!' when I made a (non-conventional) judgement that it was nonetheless appropriate given the circumstances. [example: "You can't/shouldn't use C++ in real-time safety-critical software!"] The more thoroughly you understand something, and the more experience you have with it, the more often you will see a superior solution that 'breaks the rules', but that doesn't mean a beginner should do so willy-nilly. – shelleybutterfly May 18 '14 at 07:10
  • 1
    I should clean up that post. Still, never is not the right way to phrase it. It's a little ridiculous that people think theyre qualified to tell others never or dont - just tell them you dont think it will work and why, but you do know what will work and why. ls is a computer utility - you can parse computer output. – mikeserv May 18 '14 at 14:28
  • it's not that I don't see what you're saying, I've seen people really hung up on 'rules' before; but it's almost with a religious fervor that they will defend the rule. as for the 'never/don't' thing, I think I figured out pretty young that when people say things like that, you just add "without good reason" to the end. [Kinda like the "under the sheets" of imperative statements, I guess. :)] So, "Never parse ls output... (without good reason)". And for people that still don't know what they don't know, sometimes 'never' is in a limited sense true: never do that until you know better. :) <3 – shelleybutterfly May 18 '14 at 15:08
  • but I have a purely speculative hunch that if you found 10, say, 'very experienced' shell scripters that put up a blog post or article explaining why not to parse ls output, and asked people that were at least 'very' experienced with shell scripting whether 'never' meant absolutely never, or just as a general rule, that fewer than half would argue very hard with someone experienced that they shouldn't do it. I think at the level of being able to figure out what you've figured out, once they could tell you know what you were doing, they'd say "hey, you've got enough skill to judge yourself". – shelleybutterfly May 18 '14 at 15:15
  • 1
    Well, I reversed my downvote because, at the very least, you're right about the flagging thing. I'll try to clean it up tonight or tomorrow. My thought is I'll move most of the code examples to an answer, I guess. But it still doesn't, as far as I'm concerned, excuse the inaccuracies in that oft-cited blog post. I wish people would stop citing the bash manual altogether - at least not till after they've cited the POSIX specs... – mikeserv May 18 '14 at 17:44
  • 1
    Great points about getting caught up in dogma. The most helpful answers for such questions on Stack Exchange are "Here's how, if you want to. And here's why you generally shouldn't, and some examples of how to do it better." I see a lot of "why would you ever" or "don't do that" answers and comments, which are fine but not OP answers. There are many times people stumble upon questions with a similar issue, due to a very special case scenario, and don't need the sanctimony. – Beejor Dec 22 '18 at 22:10
  • 7
    This wonderful comment explains this conversation in a nutshell. I'll quote it: Why do you think they say "don't look down the barrel of a gun" instead of "don't look down the barrel of a gun unless it's empty"? Or "don't try to stick your fingers into the power outlet" instead of "don't try to stick your fingers into the power outlet unless they're too big to go in or unless you've shut off the power"? etc. – Wildcard May 22 '19 at 05:35
  • 1
    @Wildcard -- I think your comment almost perfectly captures the overriding principle of this entire behemoth post in one short quote. – DryLabRebel Jun 23 '23 at 00:14
20

Is it possible to parse the output of ls in certain cases? Sure. The idea of extracting a list of inode numbers from a directory is a good example - if you know that your implementation's ls supports -q, and therefore each file will produce exactly one line of output, and all you need are the inode numbers, parsing them out of ls -Rai1q output is certainly a possible solution. Of course, if the author hadn't seen advice like "Never parse the output of ls" before, he probably wouldn't think about filenames with newlines in them, and would probably leave off the 'q' as a result, and the code would be subtly broken in that edge case - so, even in cases where parsing ls's output is reasonable, this advice is still useful.
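That inode-extraction case can be sketched as follows. This is an illustrative sketch in a hypothetical throw-away directory (GNU ls assumed); -q is what guarantees exactly one output line per file, and the inode number is the first field of each entry line:

```shell
# Throw-away demo directory with hypothetical names, including one
# containing an embedded newline:
cd "$(mktemp -d)"
touch a "$(printf 'b\nc')"

# -q maps nonprintable characters to '?', so every file occupies
# exactly one line; keep only lines whose first field is numeric
# (this skips the ".:" headers and blank lines that -R produces).
ls -Rai1q | awk '$1 ~ /^[0-9]+$/ { print $1 }'
```

With -a included, that prints four inode numbers: one each for ., .., a, and the newline-named file (rendered as b?c).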

The broader point is that, when a newbie to shell scripting tries to have a script figure out (for instance) what's the biggest file in a directory, or what's the most recently modified file in a directory, his first instinct is to parse ls's output - understandable, because ls is one of the first commands a newbie learns.

Unfortunately, that instinct is wrong, and that approach is broken. Even more unfortunately, it's subtly broken - it will work most of the time, but fail in edge cases that could perhaps be exploited by someone with knowledge of the code.

The newbie might think of ls -s | sort -n | tail -n 1 | awk '{print $2}' as a way to get the biggest file in a directory. And it works, until you have a file with a space in the name.
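The breakage is easy to reproduce. A sketch, using a hypothetical throw-away directory with file sizes chosen so the block counts differ:

```shell
cd "$(mktemp -d)"
head -c 8192 /dev/zero > 'big file'   # the largest file, space in name
printf x > small

ls -s | sort -n | tail -n 1 | awk '{print $2}'
# awk prints only the second whitespace-separated field ("big"),
# so the full name "big file" can never come out of this pipeline
```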

OK, so how about ls -s | sort -n | tail -n 1 | sed 's/[^ ]* *[0-9]* *//'? Works fine until you have a file with a newline in the name.
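Same sandbox idea for the newline case (hypothetical names again): the name spans two lines of ls output, and tail -n 1 only ever sees the text after the last newline, so the pipeline can never reassemble the whole name:

```shell
cd "$(mktemp -d)"
head -c 8192 /dev/zero > 'big
file'                                 # the largest file, newline in name
printf x > small

ls -s | sort -n | tail -n 1 | sed 's/[^ ]* *[0-9]* *//'
# at best this prints one fragment of the name -- the full
# "big<newline>file" is never produced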

Does adding -q to ls's arguments help when there's a newline in the filename? It might look like it does, until you have 2 different files that contain a non-printable character in the same spot in the filename, and then ls's output doesn't let you distinguish which of those was biggest. Worse, in order to expand the "?", he probably resorts to his shell's eval - which will cause problems if he hits a file named, for instance,

foo`/tmp/malicious_script`bar

Does --quoting-style=shell help (if your ls even supports it)? Nope, still displays ? for nonprintable characters, so it's still ambiguous which of multiple matches was the biggest. --quoting-style=literal? Nope, same. --quoting-style=locale or --quoting-style=c might help if you just need to print the name of the biggest file unambiguously, but probably not if you need to do something with the file afterwards - it would be a bunch of code to undo the quoting and get back to the real filename so that you can pass it to, say, gzip.
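That ambiguity is simple to demonstrate. A sketch with two hypothetical names that differ only in which control character they contain:

```shell
cd "$(mktemp -d)"
touch "$(printf 'a\tb')" "$(printf 'a\rb')"   # TAB vs CR in the name

ls -q
# both entries render as "a?b" -- the -q output cannot
# distinguish the two files
```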

And at the end of all that work, even if what he has is safe and correct for all possible filenames, it's unreadable and unmaintainable, and could have been done much more easily, safely, and readably in python or perl or ruby.

Or even using other shell tools - off the top of my head, I think this ought to do the trick:

find . -type f -printf "%s %f\0" | sort -nz | awk 'BEGIN{RS="\0"} END{sub(/[0-9]* /, "", $0); print}'

And ought to be at least as portable as --quoting-style is.
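A quick sanity check of that pipeline in a throw-away directory (GNU find/sort and gawk assumed; RS="\0" as a NUL record separator is a GNU extension, not portable awk):

```shell
cd "$(mktemp -d)"
head -c 100 /dev/zero > 'name with
newline'                              # the biggest file, newline in name
printf x > tiny

find . -type f -printf "%s %f\0" | sort -nz |
  awk 'BEGIN{RS="\0"} END{sub(/[0-9]* /, "", $0); print}'
# prints the biggest file's name intact, embedded newline and all
```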

godlygeek
  • 8,053
  • Oh true about the size - I probably could do that if I tried - should I? I'm kinda tired of this whole thing - I like your answer because you don't say can't or don't or never but actually give examples of maybe why not and comparable how else - thank you. – mikeserv May 16 '14 at 16:44
  • I think if you tried, you'd discover it's much harder than you think. So, yes, I'd recommend trying. I'll be happy to keep giving filenames that will break for you as long as I can think of them. :) – godlygeek May 16 '14 at 16:50
  • Comments are not for extended discussion; this conversation has been moved to chat. – terdon Aug 23 '14 at 10:24
  • @mikeserv and godlygeek, I have moved this comment thread to chat. Please don't have long discussions like this in the comments, that's what chat is for. – terdon Aug 23 '14 at 10:28
  • @mikeserv, ...insofar as that chat room is frozen, I can't do this there, but one comment you made deserves followup. Re: You can't grep the internet, but you can grep ls -- one certainly cannot grep ls. One can grep one implementation of one specific vendor's take on ls, but if one wants one's script to still work a decade later (on whatever operating system one has migrated to by then), it can't be dependent on only that one vendor's one version; hence, coding to the POSIX standard, not to any given implementation. – Charles Duffy Apr 28 '21 at 13:45
3

I "don't parse ls" because:

  • Filenames can contain any character except / and NUL (0x00) - not just ASCII. ls outputs a multi-character representation of each strange character, which must be reversed (undone) before the filename can be passed to another program.

  • ls outputs SPACE (" "), NEWLINE (^J), and other control characters in filenames literally. Special care must be taken in any subsequent processing, and all variables must be quoted.

  • After a certain age (about six months), ls -l's date representation changes from "mmm dd HH:MM" to "mmm dd yyyy", so any parser that assumes a single fixed time format is subtly broken.

And, the #1 reason NOT to get information about files by "parsing ls": There is a better way!

The find command can be used to select files and, with the -print0 option, to produce a list of filenames (strange and control characters intact) separated by NUL (0x00) bytes.

The xargs command, with the "-0" option consumes the list of NUL separated filenames and passes them (again, intact) to the command specified on the xargs command line. The command could even be a bash script.

The stat command, given a list of filenames, can output any file information, in a format you can specify.

Read man find xargs stat.
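A minimal sketch of that pipeline, in a hypothetical throw-away directory (-print0 and -0 are GNU/BSD extensions rather than POSIX; the inline sh script here stands in for the real consumer - stat, a bash script, etc.):

```shell
cd "$(mktemp -d)"
touch 'plain' 'with space' 'with
newline'

# Every name -- space, newline and all -- arrives as one intact,
# NUL-delimited argument; the inline script just counts what it receives:
find . -type f -print0 | xargs -0 sh -c 'echo "$#"' sh
# → 3
```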

For giggles, read man ls and try to see how you could guarantee parsability.

waltinator
  • 4,865
1

Added for completeness' sake

There's a trick called “slash dot saves the day” which consists of using /./ as a magic anchor for determining whether a newline was produced as a record separator or by an embedded newline in the path. When you use it in a glob given as the argument of ls -d, you'll be able to parse the output accurately.

Limitations

  • Requires the use of a glob, so ARG_MAX might kick in at some point.

  • Not all options of ls can be parsed. For example, ls -dl adds symlink -> target to the output, which makes the boundary between the symlink name and its target ambiguous.

  • Might be useless for non-POSIX compliant implementations of ls.


Here's an example that converts the output of ls -dnL, ls -drt, etc... into a TSV with C-style escaping:

{ command -p ls -dnL ././* /./*; echo /./; } |
LANG=C command -p awk -F '/[.]/' '
    function tsv_escape(s) {
        gsub(/\\/, "&&", s)
        gsub(/\n/, "\\n", s)
        gsub(/\t/, "\\t", s)
        gsub(/\r/, "\\r", s)
        return s
    }
    {
        if ( NF == 2 ) {
            if ( NR > 1 )
                print fields "/./" tsv_escape(filename)
            fields = $1
            filename = $2
            gsub(/[[:space:]]+/, "\t", fields)
        } else
            filename = filename "\n" $0
    }
'

note: I added an echo /./ so that awk doesn't need an END block

As you can see, you need to start a relative path with ././, and an absolute path with /./.


I have to say, this trick would only be useful on systems that have neither GNU tools nor perl/python/zsh/etc., which is probably a non-existent use case nowadays.

Fravadona
  • 561
  • I understand the "just don't do it" downvote(s), but the OP was desperately trying to find a solution for parsing ls, which is what I provide here. – Fravadona Aug 25 '23 at 17:43
  • 4
    honestly, the original "question" reads more like the author trying to prove a point, even if that means descending to the deepest abyss of insanity. That is to say, they're not looking to find a solution, but they have something they want to sell as one. And that something is a Rube Goldberg machine containing psychedelic drugs. And it's radioactive. For a careful reader, it might work as a warning, but honestly I'd rather never point this Q&A to any novice reader, just to be sure no-one gets the mistaken idea that doing what they present in the question is a good idea... – ilkkachu Aug 25 '23 at 18:52
  • 1
    That, combined with the fact that rather more reasonable answers exist, much better solutions exist, and that the whole thing has basically been beaten to death multiple times over already, means it's not really surprising one would get a negative response with any answer... – ilkkachu Aug 25 '23 at 18:55
  • 1
    You yourself basically said that using e.g. Perl would be better, and also using a shell glob there gets really close to the usual solution of just using a shell glob to get the filenames in the first place. Though one still can't get e.g. the oldest/newest file with a glob in Bash, but then again, I don't think it's exactly obvious how ls -td ././* could be applied to that either (I suppose you could pipe the output of your awk to tail/head, and then use another awk or whatever to undo the escaping, but that starts to sound "a bit" complicated too.) – ilkkachu Aug 25 '23 at 19:02
  • 1
    But you're right in that ././, while valid as part of a pathname, is a byte string that will never be produced from a shell glob or other such directory listing! So points for that. (I may have seen something like .// before, same idea.) But still, the output format you get is basically filenames separated by \n././, with a ././ at the start and a \n at the end, and while unambiguous, it's a bit messy, and doesn't seem to help in getting extra fields from ls -l along without breaking. (and you need the shell glob). But I guess it's not as insane as the question, so have a upvote. – ilkkachu Aug 25 '23 at 19:08
  • @ilkkachu Thank you for the comments. I would say that ls -ld ././* is impossible to parse due to the -> displayed for symbolic links, but there's no issue with ls -dlnH ././* or ls -dlnL ././*. About the serializing/unserializing, awk will probably be able to do most of the required logic and output the results in a format that the shell or xargs can understand. – Fravadona Aug 25 '23 at 20:42
  • It comes down to, I believe, which solution is easiest to remember and most difficult to get wrong. I don't know about you, but reading the output from ls -dlnH ././* seems a bit harder to remember and to get right than just iterating over the list resulting from expanding the glob as usual. – Kusalananda Aug 26 '23 at 12:29
  • @Kusalananda You're right, there's no point in using ls just to replace a glob; but getting the last modified file on an HP-UX or AIX that doesn't have perl is another matter (6 years ago we still had an HP-UX like that at work...) – Fravadona Aug 26 '23 at 13:34
  • I'm pretty sure zsh exists on AIX, so you could use a glob there too, *(.om[1]). – Kusalananda Aug 26 '23 at 14:31
  • Well if you have a system like that, and a job to do, then the question is also: Which is easier: a) to install Perl or zsh, b) write a C program to do it, c) change the filenames so that they include the timestamp in a format that makes lexical sorting give the time order, d) change the filenames so that they don't include newlines and just decide that ls -t | tail -1 works well enough, or e) try to work on persuading ls to give parseable output. – ilkkachu Aug 26 '23 at 15:34
  • Also, a few versions of ls print control chars as question marks when printing to a terminal. But I just noticed I had a copy of Busybox which did it even for output going to a pipe, and well, if there are other systems where ls does that, it blows any chances of unambiguous output from it right out of the water. (Whatever the "question" there above seems to think.) – ilkkachu Aug 26 '23 at 15:42
  • @ilkkachu there sure exist quite a few non-POSIX-compliant implementations of ls; on such systems even calling sh might yield a few surprises… – Fravadona Aug 26 '23 at 16:21