Is it now safe to parse the output of GNU ls? (with --zero)
--zero does help, a lot, but it's still not safe the way it was used here. There are issues with both the output format of ls itself, and the commands used in the question to parse the output.
--zero is actually mentioned in the ParsingLs wiki page, but they don't use the long format in the examples there (perhaps because of the issues here!). A number of the issues in this answer were brought up by Stéphane Chazelas in the comments.
To start, ls -l is a problem as it still happily prints user/group names that contain white space as-is, messing up the column count (--zero doesn't matter here):
$ ls -l --time-style=long-iso foo.txt
-rw-rw-r-- 1 foo bar users 0 2023-08-16 16:45 foo.txt
At the least, you need --numeric-uid-gid/-n, which prints UIDs and GIDs as numbers, or -go, which omits them completely. Both include the other long format fields too.
ls will also list the contents of any directories that appear among the arguments, so you probably want -d, also.
I don't think the other columns can contain spaces or NULs, so
ls -dgo --time-style=long-iso --zero -- *
might be safe. Maybe.
It's still not the easiest to parse, since if there are multiple files, it'll pad the columns with spaces, instead of using just one as a field separator, so you can't use e.g. cut on the output. That happens even when outputting to a pipe with --zero, and omitting the UID and GID doesn't help since the file size and link count can vary in width:
$ ls -dgo --zero --time-style=long-iso -- *.txt |tr '\0' '\n'
-rw-rw-r-- 21    0 2023-08-16 17:24 bar.txt
-rw-rw-r--  1 1234 2023-08-16 17:30  leading space.txt
The filename isn't padded to the right (and doing that would be odd), so it's probably safe to assume there's only one space between the timestamp and filename.
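As a rough illustration of coping with the padding, a Bash regex can absorb the variable runs of spaces while treating exactly one space before the filename as the separator. This is only a sketch under assumptions: parse_plain_record, print_sizes and the rec_* variables are made-up names, and the input is assumed to be records from ls -dgo --time-style=long-iso --zero for files that are neither device nodes nor symlinks.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split one NUL-terminated record of
#   ls -dgo --time-style=long-iso --zero -- *
# into fields. Assumes no device nodes and no symlinks.
parse_plain_record() {
    local LC_ALL=C   # match bytes, so odd encodings in names don't break '.'
    local re='^([^ ]+) +([0-9]+) +([0-9]+) +([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}) (.*)$'
    [[ $1 =~ $re ]] || return 1
    rec_mode=${BASH_REMATCH[1]}
    rec_links=${BASH_REMATCH[2]}
    rec_size=${BASH_REMATCH[3]}
    rec_date=${BASH_REMATCH[4]}
    rec_time=${BASH_REMATCH[5]}
    rec_name=${BASH_REMATCH[6]}   # everything after the single separator space
}

# Read NUL-terminated records from stdin; print the size and the quoted name.
print_sizes() {
    local rec
    while IFS= read -r -d '' rec; do
        if parse_plain_record "$rec"; then
            printf '%s %q\n' "$rec_size" "$rec_name"
        else
            printf 'unparsed record: %q\n' "$rec" >&2
        fi
    done
}
```

It could be fed with something like ls -dgo --time-style=long-iso --zero -- * | print_sizes. Records that don't fit the pattern are reported on stderr rather than guessed at.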
--time-style=long-iso doesn't include the UTC offset, meaning the dates could be ambiguous. At worst, two files created at around the time daylight saving ends could be shown with dates that would appear to be in the wrong order. (ls would still sort them correctly if asked to, but the output would be confusing.) --full-time/--time-style=full-iso (or a custom format) would be better here, and explicitly setting TZ=UTC0 would make the dates easier to compare as strings:
$ TZ=Europe/Helsinki ls -dgo --time-style=long-iso -- *
-rw-rw-r-- 1 0 2023-10-29 03:30 first
-rw-rw-r-- 1 0 2023-10-29 03:20 second
$ TZ=UTC0 ls -dgo --full-time -- *
-rw-rw-r-- 1 0 2023-10-29 00:30:00.000000000 +0000 first
-rw-rw-r-- 1 0 2023-10-29 01:20:00.000000000 +0000 second
$ TZ=UTC0 ls -dgo --time-style=+%FT%T.%NZ -- *
-rw-rw-r-- 1 0 2023-10-29T00:30:00.000000000Z first
-rw-rw-r-- 1 0 2023-10-29T01:20:00.000000000Z second
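To show why the UTC form is convenient, here is a small sketch that wraps the last command and compares two of its timestamps as plain strings. The function names utc_listing and stamp_before are made up for illustration; the ls options are the ones shown above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the custom-format listing shown above.
utc_listing() {
    TZ=UTC0 ls -dgo --time-style=+%FT%T.%NZ -- "$@"
}

# stamp_before A B: succeed if timestamp string A sorts before B.
# Fixed-width UTC stamps sort chronologically when compared byte by byte.
stamp_before() {
    local LC_ALL=C   # plain byte order, independent of the user's collation
    [[ $1 < $2 ]]
}
```

With the files from the example, stamp_before 2023-10-29T00:30:00.000000000Z 2023-10-29T01:20:00.000000000Z succeeds, whereas the local-time strings 03:30 and 03:20 would compare the other way around.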
It gets worse if you have anything but regular files. Might not be an issue in many cases, but anyway:
For device files, ls doesn't print their size, but instead the major/minor device numbers, separated by a comma and a space, which makes the column count different from that of other files. You can tell the two variants apart by the comma, but it makes the parsing more painful.
$ ls -dgo --zero --time-style=long-iso -- /dev/null somefile.txt |tr '\0' '\n'
crw-rw-rw- 1 1, 3 2023-07-16 15:37 /dev/null
-rw-rw-r-- 1 12345 2023-08-17 06:14 somefile.txt
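One way to handle both variants in a single pattern is an alternation on the size column, keyed on the comma. The following is a sketch only: parse_size_field and the rec_* variables are hypothetical names, the input is assumed to come from ls -dgo --time-style=long-iso --zero, and symlinks are still not dealt with.

```bash
#!/usr/bin/env bash
# Hypothetical helper: accept either "SIZE" or "MAJOR, MINOR" in the size column.
parse_size_field() {
    local LC_ALL=C   # byte-wise matching
    local re='^([^ ]+) +([0-9]+) +(([0-9]+), +([0-9]+)|([0-9]+)) +([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}) (.*)$'
    [[ $1 =~ $re ]] || return 1
    if [[ -n ${BASH_REMATCH[4]} ]]; then
        # The comma variant matched: this is a device node.
        rec_kind=device
        rec_major=${BASH_REMATCH[4]}
        rec_minor=${BASH_REMATCH[5]}
        rec_size=
    else
        rec_kind=other
        rec_major=
        rec_minor=
        rec_size=${BASH_REMATCH[6]}
    fi
    rec_name=${BASH_REMATCH[9]}
}
```

Checking the first character of the mode string (c or b) would be another way to spot device nodes; the regex above only relies on the comma.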
Then there's symlinks, which in long format are printed as link name -> link target, but there's nothing to say the link or target name can't themselves contain ->...
$ ls -dgo --zero --time-style=long-iso -- how* what* |tr '\0' '\n'
lrwxrwxrwx 1 14 2023-08-17 06:05 how -> about -> this?
lrwxrwxrwx 1 5 2023-08-17 05:54 what -> is -> this?
Well, I guess technically the size field tells the length (in bytes, not characters) of the link target, so you could use it to find where the target starts...
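A sketch of that idea, assuming the size column really is the byte length of the target (true for the examples above, 14 and 5, but not guaranteed everywhere; symlinks under /proc, for instance, can report a size of 0). split_symlink, link_name and link_target are made-up names, and the first argument is the part of the record after the timestamp.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split "NAME -> TARGET" using the byte length of TARGET.
#   $1 = text after the timestamp, $2 = value of the size column
split_symlink() {
    local LC_ALL=C                 # make ${#...} and offsets count bytes
    local rest=$1 size=$2
    local cut=$(( ${#rest} - size ))
    (( cut >= 5 )) || return 1     # need at least a 1-byte name plus " -> "
    local head=${rest:0:cut}
    [[ $head == *' -> ' ]] || return 1   # the size did not line up with an arrow
    link_target=${rest:cut}
    link_name=${head%' -> '}
}
```

If the arrow isn't found where the size says it should be, the function fails instead of guessing.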
This is a case where --quoting-style=shell-escape-always would actually be better than --zero, as it prints the two individually quoted, with some special or non-printable characters escaped inside $'':
$ ls -dgo --quoting-style=shell-escape-always --time-style=long-iso -- how* what* |cat
lrwxrwxrwx 1 14 2023-08-17 06:05 'how' -> 'about -> this?'
lrwxrwxrwx 1 5 2023-08-17 05:54 'what -> is' -> 'this?'
Not that it's fun to parse that either, even with a shell.
It would be nicer if we could just explicitly select the fields we do want, but I don't see an option in ls for that. GNU find has -printf, which I think could be made to produce safe output, and if you only want ls to sort by time, you don't need to print the timestamp, just ls --zero with -t/-u/-c should do. See below. (zsh could do that itself, but Bash isn't so nice.)
If you want the timestamps and filenames, something like
find ./* -printf '%TY-%Tm-%Td %TT %p\0'
should do, though of course it'll recurse into subdirectories by default, so you'll have to do something about that if you don't want it. Maybe just add -prune to the end. Also, -- doesn't help with find, so you need the ./ prefix.
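Put together, reading those records back in Bash might look like the sketch below. list_records and split_record are hypothetical helper names; the -printf format is the one above, placed after -prune so that directories given as arguments are listed but not descended into.

```bash
#!/usr/bin/env bash
# List the given paths (not their contents) as NUL-terminated
# "DATE TIME PATH" records. Pass paths with a ./ prefix.
list_records() {
    find "$@" -prune -printf '%TY-%Tm-%Td %TT %p\0'
}

# Split one record on its first two spaces only, so spaces and
# newlines inside the path survive. Sets rec_date, rec_time, rec_path.
split_record() {
    local rec=$1 rest
    rec_date=${rec%% *}
    rest=${rec#* }
    rec_time=${rest%% *}
    rec_path=${rest#* }
}

# Example consumer: print each entry with the path shell-quoted.
show_records() {
    local rec
    while IFS= read -r -d '' rec; do
        split_record "$rec"
        printf '%s %s %q\n' "$rec_date" "$rec_time" "$rec_path"
    done < <(list_records "$@")
}
```

Called as show_records ./*, this prints one line per directory entry. Parameter expansion is used rather than read with several variables, since read would strip trailing whitespace from the last field.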
Maybe stat --printf would be easier.
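For instance, a sketch with GNU stat that prints the modification time as seconds since the epoch (%Y) followed by the name (%n), which sidesteps the time zone question entirely. newest_by_stat and the newest variable are made-up names, and the loop picks the newest entry itself instead of relying on sorted output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set "newest" to the most recently modified of the
# given paths, using whole-second mtimes from GNU stat.
newest_by_stat() {
    local rec mtime name best_time=0 found=0
    newest=
    while IFS= read -r -d '' rec; do
        mtime=${rec%% *}     # digits before the first space
        name=${rec#* }       # everything after it, newlines included
        if (( !found || mtime > best_time )); then
            best_time=$mtime
            newest=$name
            found=1
        fi
    done < <(stat --printf='%Y %n\0' -- "$@")
    (( found ))
}
```

Usage would be along the lines of newest_by_stat ./* && printf '%q\n' "$newest". Files with the same whole-second mtime are not distinguished here; the first one seen wins.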
Is there a case that would fail in one of the two examples above? Perhaps some locale oddity?
Out of the commands used in the question, last=$(ls -tr --zero | tail -z -n1) by itself is unsafe in Bash, since the command substitution removes trailing newlines, after ignoring the final NUL. And as Ed Morton points out, at least that particular AWK command is just broken regardless of how safe the output of ls itself is.
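The command substitution problem can be reproduced without ls at all. In this sketch printf stands in for ls --zero, and the variable names are arbitrary:

```bash
#!/usr/bin/env bash
name=$'ends with newline\n'

# Command substitution: Bash drops the NUL (with a warning on stderr,
# silenced here) and then strips the newline that belonged to the name.
{ via_subst=$(printf '%s\0' "$name" | tail -z -n1); } 2>/dev/null

# read -d '' stops at the NUL and keeps the newline.
IFS= read -r -d '' via_read < <(printf '%s\0' "$name")

printf 'substitution: %q\n' "$via_subst"
printf 'read:         %q\n' "$via_read"
```

Only the second variable still names the original file.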
I don't think AWK is that well suited for inputs where there's a fixed number of fields where the last one can itself contain field separators. Perl's split() has an extra argument to limit the number of fields to produce, except that it's not too easy to use that either when some (not all) of the field separators can be multiple spaces. A naive split/ +/, $_, 6 would eat leading spaces from filenames. You could construct a regex to deal with that and the device node issue, but that's starting to be like forcing a round peg into a square hole and doesn't fix the symlink output issue.
Without the long format output, ls --zero should give just raw filenames terminated by NULs, so the output should be safe and simpler to parse.
For the $n oldest files, the wiki page has:
readarray -t -d '' -n 5 sorted < <(ls --zero -tr)
# check the number of elements you got
and for only one, read -rd '' would do, as was mentioned in a comment:
IFS= read -rd '' newest < <(ls -t --zero)
# check the exit status or make sure "$newest" is not empty
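The checks hinted at in the comments could be spelled out roughly as follows. This is a sketch with made-up function names (oldest_files, newest_file); it needs Bash 4.4 or later for readarray -d and a GNU ls that supports --zero, and it works on the current directory.

```bash
#!/usr/bin/env bash
# oldest_files N: fill the array "sorted" with the N oldest names.
# Fails if fewer than N names were read.
oldest_files() {
    local n=$1
    readarray -t -d '' -n "$n" sorted < <(ls --zero -tr)
    (( ${#sorted[@]} == n ))
}

# newest_file: set "newest" to the most recently modified name.
# Fails if nothing was read (empty directory, or ls failed).
newest_file() {
    IFS= read -r -d '' newest < <(ls -t --zero) && [[ -n $newest ]]
}
```

A caller would then write something like newest_file || { echo 'no files' >&2; exit 1; } before using "$newest".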
ls is not available, or is not up to date. (Yes, I purposely misread "locale".) – Kusalananda Aug 16 '23 at 16:13
ls with these options, specifically. I am not suggesting we should recommend parsing ls in general. I am thinking it might be reasonable, however, on Ask Ubuntu which is only about Ubuntu and only recent ones, at that. Plus, I'm curious about this in general and whether we can find an example that breaks these specific commands. – terdon Aug 16 '23 at 16:16
-Q, --quote-name option, and look the time location using the position of the first "? (I assume there can't be any " in the owner/group ... not entirely sure) and then remove the first and last ". Or even simpler: don't change the ls, and just use the location of the first : which should be right in the middle of the time? (There are so many bad side effects on so many regular commands (especially non-gnu), I wish there was a well-defined dedicated "separatorcharacter" to make life easier) – Olivier Dulac Nov 16 '23 at 14:24