I consistently see answers quoting this link stating definitively "Don't parse ls!" This bothers me for a couple of reasons:
It seems the information in that link has been accepted wholesale with little question, though I can pick out at least a few errors in casual reading.
It also seems as if the problems stated in that link have sparked no desire to find a solution.
From the first paragraph:
...when you ask [ls] for a list of files, there's a huge problem: Unix allows almost any character in a filename, including whitespace, newlines, commas, pipe symbols, and pretty much anything else you'd ever try to use as a delimiter except NUL. ... ls separates filenames with newlines. This is fine until you have a file with a newline in its name. And since I don't know of any implementation of ls that allows you to terminate filenames with NUL characters instead of newlines, this leaves us unable to get a list of filenames safely with ls.
Bummer, right? How ever can we handle a newline-terminated list of data that might itself contain newlines? Well, if the people answering questions on this website didn't do this kind of thing on a daily basis, I might think we were in some trouble.
The truth is, though, that most ls implementations actually provide a very simple API for parsing their output, and we've all been using it all along without even realizing it. Not only can you end a filename with null, you can begin one with null as well, or with any other arbitrary string you might desire. What's more, you can assign these arbitrary strings per file type. Please consider:
LS_COLORS='lc=\0:rc=:ec=\0\0\0:fi=:di=:' ls -l --color=always | cat -A
total 4$
drwxr-xr-x 1 mikeserv mikeserv 0 Jul 10 01:05 ^@^@^@^@dir^@^@^@/$
-rw-r--r-- 1 mikeserv mikeserv 4 Jul 10 02:18 ^@file1^@^@^@$
-rw-r--r-- 1 mikeserv mikeserv 0 Jul 10 01:08 ^@file2^@^@^@$
-rw-r--r-- 1 mikeserv mikeserv 0 Jul 10 02:27 ^@new$
line$
file^@^@^@$
^@
See this for more.
Now it's the next part of this article that really gets me though:
$ ls -l
total 8
-rw-r----- 1 lhunath lhunath 19 Mar 27 10:47 a
-rw-r----- 1 lhunath lhunath 0 Mar 27 10:47 a?newline
-rw-r----- 1 lhunath lhunath 0 Mar 27 10:47 a space
The problem is that from the output of ls, neither you or the computer can tell what parts of it constitute a filename. Is it each word? No. Is it each line? No. There is no correct answer to this question other than: you can't tell. Also notice how ls sometimes garbles your filename data (in our case, it turned the \n character in between the words "a" and "newline" into a ?question mark... ...
If you just want to iterate over all the files in the current directory, use a for loop and a glob:
for f in *; do
[[ -e $f ]] || continue
...
done
The author calls it garbling filenames when ls returns a list of filenames containing shell globs and then recommends using a shell glob to retrieve a file list!
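For reference, here is a minimal sketch of the glob loop the article recommends, wrapped so its result can be checked. The helper name count_with_glob and the scratch directory are assumptions made for illustration, and mktemp -d is a widely available extension rather than a POSIX utility.

```shell
#!/bin/sh
# count_with_glob DIR: count the (non-hidden) entries of DIR using only a glob.
# Hypothetical helper, not taken from the article or the answers.
count_with_glob() {
    (
        cd "$1" || exit 1
        n=0
        for f in *; do
            # An unmatched '*' stays literal, so skip names that do not exist.
            [ -e "$f" ] || [ -L "$f" ] || continue
            n=$((n + 1))
        done
        printf '%s\n' "$n"
    )
}

# Demo: three awkward names in a scratch directory.
demo=$(mktemp -d) || exit 1
nl='
'
tab=$(printf '\t')
: > "$demo/a b"
: > "$demo/a${tab}b"
: > "$demo/a${nl}b"
count_with_glob "$demo"    # prints 3
rm -rf "$demo"
```

The loop never serializes the names into a stream, which is why embedded newlines cost it nothing.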
Consider the following:
printf 'touch ./"%b"\n' "file\nname" "f i l e n a m e" |
. /dev/stdin
ls -1q
f i l e n a m e
file?name
IFS="
" ; printf "'%s'\n" $(ls -1q)
'f i l e n a m e'
'file
name'
POSIX defines the -1 and -q ls options so:
-q - Force each instance of non-printable filename characters and <tab>s to be written as the question-mark ('?') character. Implementations may provide this option by default if the output is to a terminal device.
-1 - (The numeric digit one.) Force output to be one entry per line.
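One consequence of those two definitions can be sketched directly: with -1 and -q together, each entry occupies exactly one output line, so counting lines counts entries. The helper name count_with_ls_q is hypothetical, and the sketch assumes an ls that honors -q when writing to a pipe, as POSIX describes.

```shell
#!/bin/sh
# count_with_ls_q DIR: count entries of DIR by counting the lines of `ls -1q`.
# Hypothetical helper for illustration only.
count_with_ls_q() {
    # tr strips the padding some wc implementations put before the number.
    ls -1q -- "$1" | wc -l | tr -d ' '
}

# Demo: a newline in a name no longer produces an extra line.
demo=$(mktemp -d) || exit 1
nl='
'
: > "$demo/plain"
: > "$demo/a b"
: > "$demo/a${nl}b"
count_with_ls_q "$demo"    # prints 3
rm -rf "$demo"
```

Without -q the same directory would yield four lines when ls writes to a pipe, since the embedded newline is passed through as-is.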
Globbing is not without its own problems - the ? matches any character, so multiple matching ? results in a list will match the same file multiple times. That's easily handled.
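To make the duplicate-match problem concrete, the sketch below counts how many names a single pattern expands to. The helper name expand_count is an assumption for illustration; it relies on an empty IFS to suppress field splitting, so that only pathname expansion acts on the unquoted pattern (POSIX sh behavior; zsh does not glob unquoted parameters by default).

```shell
#!/bin/sh
# expand_count PATTERN DIR: how many names PATTERN expands to inside DIR.
# Hypothetical helper for illustration only.
expand_count() {
    (
        cd "$2" || exit 1
        IFS=                # empty IFS: no field splitting, globbing only
        set -- $1           # unquoted on purpose so the shell expands it
        # An unmatched pattern is left as-is; treat that as zero matches.
        [ -e "$1" ] || [ -L "$1" ] || set --
        printf '%s\n' "$#"
    )
}

demo=$(mktemp -d) || exit 1
nl='
'
tab=$(printf '\t')
: > "$demo/a b"
: > "$demo/a${tab}b"
: > "$demo/a${nl}b"
# `ls -1q` reports these as 'a b', 'a?b' and 'a?b'.
for pat in 'a b' 'a?b'; do
    printf '%s -> %s\n' "$pat" "$(expand_count "$pat" "$demo")"
done
# prints: a b -> 1
#         a?b -> 3
rm -rf "$demo"
```

Even after removing the repeated 'a?b' line, the remaining two patterns expand to four names for three files, because 'a?b' also matches 'a b'. That is the over-matching a narrower character class has to prevent.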
Though how to do this thing is not the point - it doesn't take much to do after all and is demonstrated below - I was interested in why not. As I consider it, the best answer to that question has been accepted. I would suggest you try to focus more often on telling people what they can do than on what they can't. You're a lot less likely, as I think, to be proven wrong at least.
But why even try? Admittedly, my primary motivation was that others kept telling me I couldn't. I know very well that ls output is as regular and predictable as you could wish it so long as you know what to look for. Misinformation bothers me more than do most things.
The truth is, though, with the notable exception of both Patrick's and Wumpus Q. Wumbley's answers (despite the latter's awesome handle), I regard most of the information in the answers here as mostly correct - a shell glob is both simpler to use and generally more effective when it comes to searching the current directory than is parsing ls. That is not, however, at least in my regard, reason enough to justify propagating the misinformation quoted in the article above, nor is it acceptable justification to "never parse ls."
Please note that Patrick's answer's inconsistent results are mostly a result of him using zsh then bash. zsh - by default - does not word-split $(command substituted) results in a portable manner. So when he asks where did the rest of the files go? the answer to that question is your shell ate them. This is why you need to set the SH_WORD_SPLIT option when using zsh and dealing with portable shell code. I regard his failure to note this in his answer as awfully misleading.
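A minimal sketch of the guard this implies, for a script that might be run by zsh as well as by a POSIX sh. The helper name count_fields is hypothetical. As far as I know, SH_WORD_SPLIT governs field splitting of unquoted parameter expansions in zsh, and whether expansion results are also globbed there is controlled separately (GLOB_SUBST), so treat this as a starting point rather than a complete compatibility shim.

```shell
#!/bin/sh
# Ask zsh to field-split unquoted parameter expansions the way sh does.
# Under any other shell ZSH_VERSION is unset and nothing happens.
if [ -n "${ZSH_VERSION-}" ]; then
    setopt SH_WORD_SPLIT
fi

# count_fields STRING: number of fields STRING splits into when IFS is a
# newline only. Hypothetical helper for illustration.
count_fields() {
    (
        IFS='
'
        set -f          # no globbing, so only field splitting is observed
        set -- $1       # unquoted on purpose
        printf '%s\n' "$#"
    )
}

count_fields "$(printf 'one\ntwo words\nthree')"    # prints 3
```

Without the guard, zsh would hand the whole string to set as a single word and the count would be 1.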
Wumpus's answer doesn't compute for me - in a list context the ? character is a shell glob. I don't know how else to say that.
In order to handle a multiple results case you need to restrict the glob's greediness. The following will just create a test base of awful file names and display it for you:
{ printf %b $(printf \\%04o `seq 0 127`) |
sed "/[^[-b]*/s///g
s/\(.\)\(.\)/touch '?\v\2' '\1\t\2' '\1\n\2'\n/g" |
. /dev/stdin
echo '`ls` ?QUOTED `-m` COMMA,SEP'
ls -qm
echo ; echo 'NOW LITERAL - COMMA,SEP'
ls -m | cat
( set -- * ; printf "\nFILE COUNT: %s\n" $# )
}
OUTPUT
`ls` ?QUOTED `-m` COMMA,SEP
??\, ??^, ??`, ??b, [?\, [?\, ]?^, ]?^, _?`, _?`, a?b, a?b
NOW LITERAL - COMMA,SEP
?
\, ?
^, ?
`, ?
b, [ \, [
\, ] ^, ]
^, _ `, _
`, a b, a
b
FILE COUNT: 12
Now I'll make safe every character that isn't a /slash, -dash, :colon, or alphanumeric character by replacing it with a shell glob, then sort -u the list for unique results. This is safe because ls has already safed away any non-printable characters for us. Watch:
for f in $(
ls -1q |
sed 's|[^-:/[:alnum:]]|[!-\\:[:alnum:]]|g' |
sort -u | {
echo 'PRE-GLOB:' >&2
tee /dev/fd/2
printf '\nPOST-GLOB:\n' >&2
}
) ; do
printf "FILE #$((i=i+1)): '%s'\n" "$f"
done
OUTPUT:
PRE-GLOB:
[!-\:[:alnum:]][!-\:[:alnum:]][!-\:[:alnum:]]
[!-\:[:alnum:]][!-\:[:alnum:]]b
a[!-\:[:alnum:]]b
POST-GLOB:
FILE #1: '?
\'
FILE #2: '?
^'
FILE #3: '?
`'
FILE #4: '[ \'
FILE #5: '[
\'
FILE #6: '] ^'
FILE #7: ']
^'
FILE #8: '_ `'
FILE #9: '_
`'
FILE #10: '?
b'
FILE #11: 'a b'
FILE #12: 'a
b'
Below I approach the problem again but with a different methodology. Remember that - besides \0 null - the / ASCII character is the only byte forbidden in a filename. I put globs aside here and instead combine the POSIX-specified -d option for ls and the also POSIX-specified -exec $cmd {} + construct for find. Because find will only ever naturally emit one / in sequence, the following easily procures a recursive and reliably delimited file list including all dentry information for every entry. Just imagine what you might do with something like this:
#v#note: to do this fully portably substitute an actual newline \#v#
#v#for 'n' for the first sed invocation#v#
cd ..
find ././ -exec ls -1ldin {} + |
sed -e '\| *\./\./|{s||\n.///|;i///' -e \} |
sed 'N;s|\(\n\)///|///\1|;$s|$|///|;P;D'
###OUTPUT
152398 drwxr-xr-x 1 1000 1000 72 Jun 24 14:49
.///testls///
152399 -rw-r--r-- 1 1000 1000 0 Jun 24 14:49
.///testls/?
\///
152402 -rw-r--r-- 1 1000 1000 0 Jun 24 14:49
.///testls/?
^///
152405 -rw-r--r-- 1 1000 1000 0 Jun 24 14:49
.///testls/?
`///
...
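The delimiter idea behind that pipeline can be shown in a much smaller form. The sketch below is a variant, not the pipeline above: it starts find at .//. so that every record begins with a doubled slash, which find does not generate on its own and which no filename can contain. The helper name count_tree is an assumption for illustration.

```shell
#!/bin/sh
# count_tree DIR: count every entry under DIR, DIR itself included, by
# counting record markers instead of lines. Hypothetical helper.
count_tree() {
    (
        cd "$1" || exit 1
        # Each path is printed as './/.' or './/./name...'. A newline inside
        # a name starts a new line, but never one that begins with './/'.
        find .//. | grep -c '^\.//'
    )
}

demo=$(mktemp -d) || exit 1
nl='
'
mkdir "$demo/d"
: > "$demo/a b"
: > "$demo/a${nl}b"
: > "$demo/d/x${nl}y"
count_tree "$demo"    # prints 5: the directory itself plus four entries
rm -rf "$demo"
```

Counting plain lines of find . on the same tree gives 7, which is the ambiguity the marker removes.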
ls -i can be very useful - especially when result uniqueness is in question.
ls -1iq |
sed '/ .*/s///;s/^/-inum /;$!s/$/ -o /' |
tr -d '\n' |
xargs find .
These are just the most portable means I can think of. With GNU ls you could do:
ls --quoting-style=WORD
And last, here's a much simpler method of parsing ls that I happen to use quite often when in need of inode numbers:
ls -1iq | grep -o '^ *[0-9]*'
That just returns inode numbers - and -i is another handy POSIX-specified option.
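Since grep -o is, as far as I know, a common extension rather than something POSIX specifies, here is a sketch of the same extraction using only sed. The helper name inodes_in is hypothetical.

```shell
#!/bin/sh
# inodes_in DIR: print the inode number of each entry in DIR, one per line.
# Hypothetical helper for illustration only.
inodes_in() {
    # -q keeps every entry on a single line; sed keeps only the leading
    # number and tolerates any padding ls puts in front of it.
    ls -1iq -- "$1" | sed 's/^ *\([0-9][0-9]*\).*/\1/'
}

demo=$(mktemp -d) || exit 1
nl='
'
: > "$demo/plain"
: > "$demo/a${nl}b"
inodes_in "$demo"    # two lines, one inode number each
rm -rf "$demo"
```

The numbers can then be handed to find -inum, as in the xargs example above, to get back to the files themselves.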
ls is fast". Shell globbing is even faster :-). And you can resolve globs without a loop too. echo *. Works perfectly fine. – phemmer May 12 '14 at 02:34
ls -R with a shell glob and the time it takes ls to do it. – mikeserv May 12 '14 at 04:02
time bash -c 'for i in {1..1000}; do ls -R &>/dev/null; done' = 3.18s vs time bash -c 'for i in {1..1000}; do echo **/* >/dev/null; done' = 1.28s – phemmer May 12 '14 at 04:05
find because find has a -print0 argument which uses a null character to delimit the files. A null character cannot be in a filename, thus there's no possibility of ever confusing it. – phemmer May 12 '14 at 04:12
find has a -print0 but its use is not portable code. And I demonstrate above that it is not necessary. – mikeserv May 12 '14 at 04:13
-print0 is not defined in POSIX. However I have never seen anyone saying it is common practice to use newline-delimited find output as reliable file delimitation. – phemmer May 12 '14 at 04:17
stat in my answer, as it actually checks that each file exists. Your bit at the bottom with the sed thing does not work. – phemmer May 12 '14 at 04:20
ls in the first place? What you're describing is very hard. I'll need to deconstruct it to understand all of it and I'm a relatively competent user. You can't possibly expect your average Joe to be able to deal with something like this. – terdon May 12 '14 at 04:40
a\nb to stat! – mikeserv May 12 '14 at 04:40
set -- $(ls -1q | uniq) is all it takes. – mikeserv May 12 '14 at 04:41
touch foo$'\n'bar; stat --format '<%n>' foo*. – phemmer May 12 '14 at 04:49
touch a$'\n'b a$'\t'b 'a b'; set -- $(ls -1q | uniq); for i; do ls "$i"; done. That will match the a b file twice because of the shell glob issues. – terdon May 12 '14 at 04:49
$IFS of course - just like I said in the beginning. IFS="$(printf \\n)" touch a$'\n'b a$'\t'b 'a b'; set -- $(ls -1q | uniq); for i; do ls "$i"; done though what your shell might do to the IFS I don't know - it's better to do an actual newline - as I demonstrate. – mikeserv May 12 '14 at 04:56
\n that's an actual newline. You can printf 'stat "%b"\n' "$@" |. /dev/stdin – mikeserv May 12 '14 at 04:57
for f in *; do ...; done. – terdon May 12 '14 at 05:06
ls output is wrong were covered well in the original link (and in plenty of other places). This question would have been reasonable if OP were asking for help understanding it, but instead OP is simply trying to prove his incorrect usage is ok. – R.. GitHub STOP HELPING ICE May 12 '14 at 13:05
parsing ls is bad. Doing for something in $(command) and relying on word-splitting to get accurate results is bad for the large majority of command's which don't have simple output. – May 13 '14 at 14:53
ls that can contain any bytes except the null byte or the slash. There is fundamentally no way to recover that into a list of filenames. The transformation ls does is non-reversible, even if it doesn't replace any nonprintable characters. When it replaces non-printable characters, you're in an even worse situation. – R.. GitHub STOP HELPING ICE May 13 '14 at 16:04
[:print:]'s complement - for which you've got my vote. I considered the same but didn't care to include it as I could have done by adding that to $IFS temporarily and/or simply setting another variable and adding to the [$glob]. The point is though - ls provides the marker reliably - and I can't see what more you'd need. – mikeserv May 13 '14 at 16:09
ls does not provide. If it did provide such a feature you could write a very complex script to recover the filenames, but it's utter nonsense when the shell gives you a trivially-correct way to do the same thing with no danger of misinterpreting the results. – R.. GitHub STOP HELPING ICE May 13 '14 at 16:11
ls's sort options - which is not to mention retrieving inode numbers. Certainly you must agree that ls -1i | grep -o '^ *[0-9]*' is a simple and non-complex way to parse ls anyway. – mikeserv May 13 '14 at 16:16
for i in * ; do ... ; done are safe, whereas usages like for i in $(echo *) ; do ... ; done are not (the latter has a concatenation step followed by a separate word-splitting step). – R.. GitHub STOP HELPING ICE May 13 '14 at 16:18
$(echo *) I do: set -- 'string'["$glob"]'string' - there is 0 concatenation done by anything but the shell. It is essentially the same - the -vx output is included above. It appears perhaps you've misunderstood? – mikeserv May 13 '14 at 16:20
set -- with globs also avoids any concatenation and word splitting. The incorrect usage of set -- with the output of ls does involve concatenation (inherent in the way ls writes output: as a stream of bytes, not a list of strings) and word splitting. – R.. GitHub STOP HELPING ICE May 13 '14 at 16:28
xargs - or even just with a heredocument. It is a stream of bytes that ls writes - and for each non-printable we're provided the marker for a glob. I am very curious about your specifying it an incorrect usage of set -- though. It seems to me it's as correct as any other. – mikeserv May 13 '14 at 16:32
set -- - it's $IFS and $* for parsing argument arrays. – mikeserv May 13 '14 at 16:38
set command, like all commands to the shell, receives a list of arguments (ala argv[]) that come from shell words on the original command line. set itself does not do any word splitting. This is all described in POSIX XCU Chapter 2. Word-splitting is applied to the command line for set, like any other command, but it happens before glob expansion. – R.. GitHub STOP HELPING ICE May 13 '14 at 17:48
set - as a builtin - is the shell. It is also set that is specifically designed to parse arguments - split or not - according to those handed it by $*. I've read all of that, by the way. There are a lot of topics for which my knowledge is lacking, but this isn't among them. Regardless, I don't see how that is relevant to set -- 'string'["$glob"]'string' – mikeserv May 13 '14 at 17:59
ls? Seriously, the amount of work you have to do indicates that this is a bad idea. This is what find -print0, xargs -r0, stat, bash while IFS= read -rd $'\0' loops, etc. are for. – Aaron Davies May 15 '14 at 16:55
ls --quoting-style=shell-always? I do show a portable xargs 0-delim method above - it works for find as well. – mikeserv May 16 '14 at 00:24
ls $(ls --quoting-style=shell-always) doesn't work at all. did you have something ls --quoting-style=shell-always|xargs ls in mind? – Aaron Davies May 16 '14 at 21:51
xargs is what I had in mind - though I think I prefer the c-style escapes. Something like the following could be used with xargs printf %b\\0 - though I think I'd still have to backslash protect 'single-quotes - to recursively return a zero-delimited array of only the largest file in all child directories: ls -1bpRS ././ | sed -n ':d;\|^[.]*/\./|{s|..||;h;n;:sd;\|/$|{n;bsd};\|^$|b;G;s|\(.*\)\n\(.*[^/]\)/*:|\2/\1|p}' – mikeserv May 16 '14 at 22:02
ls -1bpRS ././ | sed -n '\|^\.*/\./|{s/..//;h;:sd;n;\|/$|bsd;/./{H;g;s|:\n|/|p}}' – mikeserv May 16 '14 at 23:25
find . -print0 | xargs -0, but little else. Don’t use the shell for complicated things, or not only will you later hate yourself for having done this, so will everyone else, too. – tchrist May 17 '14 at 19:44
ls output specs - as it seems to me, ls is designed to be parsed. You might also consider changing IFS=<tab> since <tab> in filenames is already protected. – mikeserv May 17 '14 at 19:50
? is a glob character; it's refuting your unstated assumption that, absent any ?s inserted, all filenames will match themselves (and only themselves) when interpreted as glob expressions. – Charles Duffy Jul 29 '15 at 16:35
[x], with literal square brackets, is a counterexample to this claim, because the filename [x] is not matched by the glob expression [x]. Thus, the glob expression [x]? will not match the filename $'[x]\n'. – Charles Duffy Jul 29 '15 at 16:36
nullglob shell option enabled. – Charles Duffy Jul 29 '15 at 16:40
-1 option for ls is unnecessary in your examples, because it is the default for those cases you are using. Compare for instance a plain ls (multi-column output) vs. ls|cat (single-column output). – user1934428 Jul 10 '20 at 08:06
ls will soon have a --zero option: https://fossies.org/linux/coreutils/ChangeLog – Jeff Schaller Oct 20 '21 at 12:37