Is it now safe to parse the output of GNU ls?

Question

The accepted wisdom for the past few decades has been that it is never a good idea to parse the output of ls ([1],[2]). For example, if I want to save a file's modification date along with its name into a shell variable, this is not the right way to do it:

$ ls -l file
-rw-r--r-- 1 terdon terdon 0 Aug 15 19:16 file
$ foo=$(ls -l file | awk '{print $9,$6,$7,$8}')
$ echo "$foo"
file Aug 15 19:16

As soon as the file name is even slightly different, the approach fails:

$ ls -l file*
-rw-r--r-- 1 terdon terdon 0 Aug 15 19:16 'file with spaces'
$ foo=$(ls -l file* | awk '{print $9,$6,$7,$8}')
$ echo "$foo"
file Aug 15 19:16

It gets worse if the file's modification date isn't close to today's, since that can change the time format:

$ ls -l
total 0
-rw-r--r-- 1 terdon terdon 0 Aug 15 19:21  file
-rw-r--r-- 1 terdon terdon 0 Aug 15  2018 'file with spaces'

However, newer versions of GNU coreutils ls have two options that can be combined to set a specific time format and to produce NUL-delimited output:

      --time-style=TIME_STYLE
              time/date format with -l; see TIME_STYLE below
[...]
     --zero end each output line with NUL, not newline
[...]
       The TIME_STYLE argument can be full-iso,  long-iso,  iso,  locale,  or
       +FORMAT.   FORMAT  is  interpreted like in date(1).  If FORMAT is FOR‐
       MAT1<newline>FORMAT2, then FORMAT1 applies  to  non-recent  files  and
       FORMAT2  to recent files.  TIME_STYLE prefixed with 'posix-' takes ef‐
       fect only outside the POSIX locale.  Also the  TIME_STYLE  environment
       variable sets the default style to use.

Here are the files again, with these options set (the NUL at the end of each record is replaced with # and a newline here for marginally improved readability):

$ ls -l --zero --time-style=long-iso -- *
-rw-r--r--+ 1 terdon terdon 0 2023-08-16 21:35 a file with a
newline#
-rw-r--r--+ 1 terdon terdon 0 2023-08-15 19:16 file#
-rw-r--r--+ 1 terdon terdon 0 2018-08-15 12:00 file with spaces#

With these options available, I can do many of the things that ls is traditionally bad for. For example:

Get the most recently modified file's name into a variable:

$ touch 'a file with a'$'\n''newline'
$ last=$(ls -tr --zero | tail -z -n1)
bash: warning: command substitution: ignored null byte in input
$ printf -- 'LAST: "%s"\n' "$last"
LAST: "a file with a 
newline"

The example that prompted this question: another question, on Ask Ubuntu, where the OP wanted to print the file name and modification date. Someone posted an answer using ls and a clever awk trick and, if we add --zero to ls, it seems to be quite robust:
```
$ output=$(ls -l --zero --time-style=long-iso -- * | 
           awk 'BEGIN{RS="\0"}{ t=index($0,$7); print substr($0,t+6), $6 }')
$ printf 'Output: "%s"\n' "$output"
Output: "a file with a
newline 2023-08-16"
```

I can't find a name that breaks either of those two examples. So, my questions are:

Is there a case that would fail in one of the two examples above? Perhaps some locale oddity?
If not, does this mean that modern versions of GNU ls can actually be used safely with arbitrary file names?

The "local oddity" would obviously be running on a system where GNU ls is not available, or is not up to date. (Yes, I purposely misread "locale".) — Kusalananda, Aug 16 '23 at 16:13
@Kusalananda of course, but I am asking about GNU ls with these options, specifically. I am not suggesting we should recommend parsing ls in general. I am thinking it might be reasonable, however, on Ask Ubuntu which is only about Ubuntu and only recent ones, at that. Plus, I'm curious about this in general and whether we can find an example that breaks these specific commands. — terdon, Aug 16 '23 at 16:16
To address @ikkachu's concerns about possible spaces in owner/group: maybe add the -Q, --quote-name option to ls, find the time using the position of the first " (I assume there can't be any " in the owner/group ... not entirely sure), and then remove the first and last "? Or even simpler: don't change the ls, and just use the location of the first :, which should be right in the middle of the time? (There are so many bad side effects on so many regular commands (especially non-GNU), I wish there was a well-defined dedicated "separator character" to make life easier) — Olivier Dulac, Nov 16 '23 at 14:24

ilkkachu · Accepted Answer · 2023-08-18T09:06:48.230

Is it now safe to parse the output of GNU ls? (with --zero)

--zero does help, a lot, but it's still not safe the way it was used here. There are issues with both the output format of ls itself, and the commands used in the question to parse the output. --zero is actually mentioned in the ParsingLs wiki page, but they don't use the long format in the examples there (perhaps because of the issues here!). A number of the issues in this answer were brought up by Stéphane Chazelas in the comments.

To start, ls -l is a problem as it still happily prints user/group names that contain white space as-is, messing up the column count (--zero doesn't matter here):

$ ls -l --time-style=long-iso foo.txt
-rw-rw-r-- 1 foo bar users 0 2023-08-16 16:45 foo.txt

At the least, you need --numeric-uid-gid/-n, which prints UIDs and GIDs as numbers, or -go, which omits them completely. Both imply the long format, so the other long-format fields are still printed.

ls will also list the contents of any directories that appear among the arguments, so you probably want -d, also.

I don't think the other columns can contain spaces or NULs, so

ls -dgo --time-style=long-iso --zero -- *

might be safe. Maybe.

It's still not the easiest to parse, since if there are multiple files, it'll pad the columns with spaces, instead of using just one as a field separator, so you can't use e.g. cut on the output. That happens even when outputting to a pipe with --zero and omitting the UID and GID doesn't help since the file size and link count can vary in width:

$ ls -dgo --zero --time-style=long-iso -- *.txt |tr '\0' '\n'
-rw-rw-r-- 21    0 2023-08-16 17:24 bar.txt
-rw-rw-r--  1 1234 2023-08-16 17:30  leading space.txt

The filename isn't padded to the right (and doing that would be odd), so it's probably safe to assume there's only one space between the timestamp and filename.
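A minimal sketch of what parsing one such record could look like in Bash, assuming the `ls -dgo --time-style=long-iso --zero` format shown above and regular files only. The function name `parse_ls_record` and the `rec_*` variables are made up for illustration; device nodes do not match the pattern, and for symlinks the "name" would still include the `-> target` part.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split one record of
#   ls -dgo --time-style=long-iso --zero
# into fields. Padding before the link count and size may be several
# spaces, but exactly one space is consumed before the file name, so a
# leading space in the name survives.
parse_ls_record() {
    local re='^([^ ]+) +([0-9]+) +([0-9]+) ([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}) (.*)$'
    [[ $1 =~ $re ]] || return 1      # e.g. device nodes: "1, 3" is not a size
    rec_mode=${BASH_REMATCH[1]}
    rec_links=${BASH_REMATCH[2]}
    rec_size=${BASH_REMATCH[3]}
    rec_date=${BASH_REMATCH[4]}
    rec_time=${BASH_REMATCH[5]}
    rec_name=${BASH_REMATCH[6]}      # may contain spaces and newlines
}

# Possible use with a new enough GNU ls (left commented out here):
#   while IFS= read -rd '' rec; do
#       parse_ls_record "$rec" && printf '%s %q\n' "$rec_date" "$rec_name"
#   done < <(ls -dgo --time-style=long-iso --zero -- *)
```

The pattern is kept in a variable so that Bash treats it as a regex rather than a quoted literal.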

--time-style=long-iso doesn't include the UTC offset, meaning the dates could be ambiguous. At worst, two files created at around the time daylight saving ends could be shown with dates that would appear to be in the wrong order. (ls would still sort them correctly if asked to, but the output would be confusing.) --full-time/--time-style=full-iso (or a custom format) would be better here, and explicitly setting TZ=UTC0 would make the dates easier to compare as strings:

$ TZ=Europe/Helsinki ls -dgo --time-style=long-iso -- *
-rw-rw-r-- 1 0 2023-10-29 03:30 first
-rw-rw-r-- 1 0 2023-10-29 03:20 second
$ TZ=UTC0 ls -dgo --full-time -- *
-rw-rw-r-- 1 0 2023-10-29 00:30:00.000000000 +0000 first
-rw-rw-r-- 1 0 2023-10-29 01:20:00.000000000 +0000 second
$ TZ=UTC0 ls -dgo --time-style=+%FT%T.%NZ -- *
-rw-rw-r-- 1 0 2023-10-29T00:30:00.000000000Z first
-rw-rw-r-- 1 0 2023-10-29T01:20:00.000000000Z second
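The same ambiguity can be reproduced without any files, using GNU date on the two instants from the listing above. This is only a sketch: the epoch values are the UTC times shown above converted by hand, and the Helsinki lines assume the system has that time zone's data installed.

```shell
#!/usr/bin/env bash
# Two instants 50 minutes apart, straddling the end of DST in Helsinki.
first=1698539400    # 2023-10-29 00:30:00 UTC
second=1698542400   # 2023-10-29 01:20:00 UTC

# Local wall-clock time without an offset: the later instant can look earlier.
TZ=Europe/Helsinki date -d "@$first"  +'%F %H:%M'
TZ=Europe/Helsinki date -d "@$second" +'%F %H:%M'

# UTC timestamps in ISO 8601 form sort correctly as plain strings.
utc_first=$(TZ=UTC0 date -d "@$first"  +'%FT%TZ')
utc_second=$(TZ=UTC0 date -d "@$second" +'%FT%TZ')
if [[ $utc_first < $utc_second ]]; then
    echo "string order matches time order"
fi
```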

It gets worse if you have anything but regular files. Might not be an issue in many cases, but anyway:

For device files, ls doesn't print the size but the major/minor device numbers instead, separated by a comma and a space, which makes the column count different from that of other files. You can tell the two variants apart by the comma, but it makes the parsing more painful.

$ ls -dgo --zero --time-style=long-iso -- /dev/null somefile.txt |tr '\0' '\n'
crw-rw-rw- 1  1, 3 2023-07-16 15:37 /dev/null
-rw-rw-r-- 1 12345 2023-08-17 06:14 somefile.txt
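Telling the two variants apart by the comma might look roughly like this in Bash. It is a sketch that assumes the `-go` long format with `--time-style=long-iso` as above; `classify_size_field` and the `rec_*` variables are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether the third column of an
# "ls -dgo --time-style=long-iso" record is a size or "major, minor".
classify_size_field() {
    local re='^[^ ]+ +[0-9]+ +(([0-9]+), +([0-9]+)|[0-9]+) [0-9]{4}-'
    [[ $1 =~ $re ]] || return 1
    if [[ -n ${BASH_REMATCH[2]} ]]; then
        # The comma form matched: a character or block device.
        rec_kind=device
        rec_major=${BASH_REMATCH[2]}
        rec_minor=${BASH_REMATCH[3]}
        rec_size=
    else
        rec_kind=file
        rec_size=${BASH_REMATCH[1]}
        rec_major=
        rec_minor=
    fi
}

classify_size_field 'crw-rw-rw- 1  1, 3 2023-07-16 15:37 /dev/null' &&
    printf '%s major=%s minor=%s\n' "$rec_kind" "$rec_major" "$rec_minor"
```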

Then there's symlinks, which in long format are printed as link name -> link target, but there's nothing to say the link or target name can't themselves contain ->...

$ ls -dgo --zero --time-style=long-iso -- how* what* |tr '\0' '\n'
lrwxrwxrwx 1 14 2023-08-17 06:05 how -> about -> this?
lrwxrwxrwx 1  5 2023-08-17 05:54 what -> is -> this?

Well, I guess technically the size field tells the length (in bytes, not characters) of the link target...

This is a case where --quoting-style=shell-escape-always would actually be better than --zero, as it prints the two individually quoted with some special or non-printable characters escaped inside $'':

$ ls -dgo --quoting-style=shell-escape-always --time-style=long-iso -- how* what*  |cat
lrwxrwxrwx 1 14 2023-08-17 06:05 'how' -> 'about -> this?'
lrwxrwxrwx 1  5 2023-08-17 05:54 'what -> is' -> 'this?'

Not that it's fun to parse that either, even with a shell.

It would be nicer if we could just explicitly select the fields we do want, but I don't see an option in ls for that. GNU find has -printf which I think could be made to produce safe output, and if you only want ls to sort by time, you don't need to print the timestamp, just ls --zero with -t/-u/-c should do. See below. (zsh could do that itself, but Bash isn't so nice.)

If you want the timestamps and filenames, something like find ./* -printf '%TY-%Tm-%Td %TT %p\0' should do, though of course it'll recurse to subdirectories by default, so you'll have to do something about that if you don't want it. Maybe just add -prune to the end. Also -- doesn't help with find, so you need the ./ prefix.

Maybe stat --printf would be easier.
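As a rough sketch of the find approach, assuming GNU find (for -printf) and Bash: each NUL-terminated record is split at the first two spaces, which is safe because neither the date nor the time part contains one. The function name `list_mtimes` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print date, time and name for each argument,
# without recursing into directories (-prune at the end, as suggested).
list_mtimes() {
    local rec date rest time name
    while IFS= read -rd '' rec; do
        date=${rec%% *}      # up to the first space
        rest=${rec#* }
        time=${rest%% *}     # up to the second space
        name=${rest#* }      # everything else, spaces and newlines included
        printf 'date=%s time=%s name=%q\n' "$date" "$time" "$name"
    done < <(find "$@" -printf '%TY-%Tm-%Td %TT %p\0' -prune)
}

# Call it with the ./ prefix so that names starting with - are not
# mistaken for find options:
#   list_mtimes ./*
```

Printing the name with %q is only for display; a script would normally use "$name" directly.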

Is there a case that would fail in one of the two examples above? Perhaps some locale oddity?

Out of the commands used in the question, last=$(ls -tr --zero | tail -z -n1) by itself is unsafe in Bash, since the command substitution removes trailing newlines, after ignoring the final NUL. And as Ed Morton points out, at least that particular AWK command is just broken regardless of how safe the output of ls itself is.
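The command substitution problem can be seen without ls at all. In this sketch, printf stands in for any producer of a NUL-terminated name, and the name is a made-up one that ends in a newline:

```shell
#!/usr/bin/env bash
name=$'ends with newline\n'

# Command substitution: Bash drops the NUL (with a warning on stderr),
# then strips the trailing newline that was part of the name.
via_subst=$(printf '%s\0' "$name")

# read -d '' stops at the NUL and keeps everything before it.
IFS= read -rd '' via_read < <(printf '%s\0' "$name")

printf 'substitution: %q\nread:         %q\n' "$via_subst" "$via_read"
```

Only the read variant gives back the name byte for byte.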

I don't think AWK is that well suited for inputs where there's a fixed number of fields where the last one can itself contain field separators. Perl's split() has an extra argument to limit the number of fields to produce, except that it's not too easy to use that either when some (not all) of the field separators can be multiple spaces. A naive split/ +/, $_, 6 would eat leading spaces from filenames. You could construct a regex to deal with that and the device node issue, but that's starting to be like forcing a round peg in a square hole and doesn't fix the symlink output issue.

Without the long format output, ls --zero should give just raw filenames terminated by NULs, so the output should be safe and simpler to parse.

For $n oldest files, the wiki page has:

readarray -t -d '' -n 5 sorted < <(ls --zero -tr)
# check the number of elements you got

and for only one, read -rd '' would do, as was mentioned in a comment:

IFS= read -rd '' newest < <(ls -t --zero)
# check the exit status or make sure "$newest" is not empty
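The "check the number of elements" step could be wrapped up along these lines. This is a sketch, not what the wiki page does: `take_first_n` is a hypothetical helper that runs whatever command it is given, and readarray -d needs Bash 4.4 or later.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the first N NUL-terminated records produced
# by a command into the global array "sorted". Succeeds only if exactly
# N records were available.
take_first_n() {
    local n=$1
    shift
    readarray -t -d '' -n "$n" sorted < <("$@")
    (( ${#sorted[@]} == n ))
}

# With a GNU ls that supports --zero, the five oldest files would be:
#   take_first_n 5 ls --zero -tr || echo "fewer than 5 files" >&2
```

Passing the producer as arguments keeps the helper usable with find -print0 or any other NUL-delimited source.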

Worth noting that GNU stat doesn't work for a file called - and doesn't let you specify the timestamp format (contrary to other implementations such as zsh's stat builtin or BSD stat). GNU find's -printf (which predates GNU stat by decades) is also better in that regard, though you need recent versions for it to take arbitrary file paths (via its -read0-from predicate). — Stéphane Chazelas, Aug 16 '23 at 18:51
Hah! Thanks, @ilkkachu, I haven't read through that wiki page in a few years, so I hadn't realized they had added a section on this. User names, at least on my Linux, cannot contain spaces. You and Ed though both mention group names with spaces and that's precisely the kind of edge case I was hoping to find. Are those a thing on linux or only on other systems? — terdon, Aug 17 '23 at 17:51
@terdon, that mention on the wiki page is one the things Stéphane pointed out, I didn't look too closely there either... I'm not sure user and group names are that clear an issue either way. E.g. useradd and groupadd on Ubuntu reject names with spaces, and I would expect many admins would agree with me to forbid them. But if the data source contains names like that, they seem to get printed here such as they are. — ilkkachu, Aug 17 '23 at 20:59
If the names come from e.g. some LDAP directory that's perhaps also used by other systems, it might be harder to rule out awkward names from appearing there. Ed's answer had "Domain Users", which sounds plausibly Windows-like. But I don't really know what environments would make names like that possible or likely; just the user/group names were the first thing that came to mind as a possible source of ambiguous data in the ls output. Then the story got a bit out of hand... — ilkkachu, Aug 17 '23 at 20:59
If you know you don't ever have user/group names like that, then that part is not an issue. Then again, if you know you don't ever have filenames with newlines, then --zero isn't needed either... Of course, filenames can be created by any user, user/group names require admins to allow them somehow, but I took all this as an exercise in looking for corner cases. :) — ilkkachu, Aug 17 '23 at 21:06
@Deduplicator, that would only work to break filenames that start with spaces or contain repeated spaces. — ilkkachu, Aug 19 '23 at 18:14

Kaz · Answer 2 · 2023-08-17T18:22:44.020

10

If you're going to depend on the output of GNU ls specifically, that means you're dependent on the GNU Coreutils package. That means you can instead use another Coreutils utility, namely stat, which has format strings for getting exactly the information you need about a file.

E.g. print the modification time of the current directory in the form MMM DD HH:MM:

$ date -d "@$(stat --format="%Y" .)" +"%b %d %H:%M"
Aug 08 07:57

The command stat --format=%Y . gets us the modification time of the . object as a decimal integer representing the familiar seconds since the Epoch.

We interpolate that with a @ prefix as the -d argument of date (a feature of GNU Coreutils date), and then use strftime codes to get the time in the desired format.
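Wrapped into a function, the stat-plus-date combination might look like this. It is a sketch assuming GNU stat and GNU date; `mtime_fmt` is a hypothetical name, and as noted in the comments under the accepted answer, GNU stat treats a file literally named - specially, so pass ./- for that one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a file's modification time using an
# arbitrary strftime format, via GNU stat and GNU date.
mtime_fmt() {
    local file=$1 fmt=$2 epoch
    epoch=$(stat --format=%Y -- "$file") || return
    date -d "@$epoch" "+$fmt"
}

mtime_fmt . '%b %d %H:%M'
```

GNU date's -r FILE option, mentioned in the comments below, is a shorter route when symlink resolution is acceptable.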

It's too bad that stat doesn't have a built-in way to format dates using strftime. If we want to get several fields, including the modification time, without making multiple calls to stat, we have to get it to print a multi-field line which we then have to parse. This is still a measure better than scraping the output of ls. If utmost efficiency isn't important (and if it is, why are we coding in Bash?) we can suffer several invocations of stat.

A claim was made in the comments that stat cannot be used to discover the file with the oldest modification time. It is true that stat alone cannot do it, but in fact stat combined with shell wildcard expansion can do it about as well as relying on ls -1t.

$ for x in *.txt ; do stat --format="%Y %n" "$x" ; done | sort -n | head -1
1328379315 readme-mt.txt

That file goes back a fair bit:

$ date -d @1328379315
Sat Feb  4 10:15:15 PST 2012

Now we have the problem that if the name contains newlines, it will mess up the sort. We could work around that in ways that are not easy with ls.

For instance, we could read the names into a Bash array, and then instead of names, we print the time stamps together with array indices. From the output of sort -n | head -1 we obtain an item whose second field gives us the array index of the name of the least recently modified file.

We can entirely sidestep the issue of dealing with the output of ls which has encoded spaces and newlines in some way that we have to parse.

$ array=(*.txt)
$ for x in ${!array[@]}; do 
>   printf "%s %s\n" $(stat --format="%Y" "${array[$x]}") $x 
> done | sort -n | head -1
1328379315 29
$ echo "${array[29]}"
readme-mt.txt

array[29] will hold the 30th file that was encountered by *.txt, no matter what characters that name is made out of. Our sort job is impervious to that because it doesn't see that name.
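The index trick can be packaged as a function with every expansion quoted. This is a sketch under the same assumptions as above (GNU stat, Bash arrays); `oldest_of` and the result variable `oldest` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set the global variable "oldest" to the least
# recently modified of the given files. sort only ever sees
# "mtime index" pairs, never a file name.
oldest_of() {
    local -a files=("$@")
    local i mtime best
    best=$(
        for i in "${!files[@]}"; do
            mtime=$(stat --format=%Y -- "${files[$i]}") || continue
            printf '%s %s\n' "$mtime" "$i"
        done | sort -n | head -n 1
    )
    [[ -n $best ]] || return 1       # no arguments, or stat failed for all
    oldest=${files[${best#* }]}      # the index is the second field
}

# Example: oldest_of ./*.txt && printf '%q\n' "$oldest"
```

Files that stat cannot read are skipped rather than being sorted as if their time were zero.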

So, to answer the question, GNU ls has some features that make it safer to parse its output, but it's still not easy to parse output safely in the shell language.

GNU ls can be used safely by, say, a C program which does popen("ls ...", "r") with the right options to ls, and correct parsing logic.

The rule "don't scrape the output of ls" is in the context of scripting.

Kaz · answered Aug 16 '23 at 16:53 · edited Aug 17 '23 at 18:22

The first example doesn’t extract metadata, it uses ls to find the most recently modified file. There’s no way to do that with stat. – Stephen Kitt Aug 16 '23 at 17:54
There are many ways of doing this, but the question is not about how to do it but whether GNU ls is safe in the specific examples in the question. – terdon Aug 16 '23 at 20:24
@StephenKitt Updated again. – Kaz Aug 16 '23 at 21:30
The first example asks for the most recently modified file, not the least recently modified file (but that’s a trivial change). It also shows how to do it with ls while handling newlines (apart from newlines at the end of a file name) and “encoded spaces”. But as terdon points out you’re not answering the question... – Stephen Kitt Aug 16 '23 at 21:40
@StephenKitt My latest example is not thwarted by any characters that can occur in any position of a name, and gets you that name into a Bash variable. – Kaz Aug 16 '23 at 21:43
GNU date has a -r option to print the last modification time of a file (note: after symlink resolution): date -r file +%FT%T.%N%::z for instance. – Stéphane Chazelas Aug 17 '23 at 03:03
@Kaz I know, I wasn’t commenting on your array-based solution, rather your remarks on the limitations of ls. – Stephen Kitt Aug 17 '23 at 06:23
@ikkachu Thanks. I missed the quote because my brain went into non-quoting mode from the ${!array[@]} expansion that doesn't require it due to producing decimal array indices. I added it to the echo too so it reproduces the whitespace exactly. – Kaz Aug 17 '23 at 18:25
@Kaz, which is why it's safer to always quote, even when you don't need to. Less chance of slipping somewhere. :) Technically, ${!array[@]} would also need the quotes if someone went and set IFS to contain digits. Quoting that one too would remove a corner case. (Same for the output from stat) Anyway, sorry for the tone in that last comment, it might have come off a bit too hard. – ilkkachu Aug 17 '23 at 21:15
@ikkachu You should never write shell code that codes against buggered values of IFS, except code which itself has changed it in that scope. Code which changes IFS and then invokes unknown other code (such as yours) without first restoring it is doing it wrong; 100% of the blame is in that code. If IFS contains digits, then digits will disappear due to being treated as separators; the indices will be garbage. It's like writing C code against the possibility that someone has an inline __asm statement that trashes the stack pointer. – Kaz Aug 18 '23 at 00:00
@Kaz, well, I'm going to disagree a bit here. Using quotes even if the data is (expected to be) safe doesn't make the code work wrong, so it seems a bit strong to say you should "never" do it. You're right in that usually one has an IFS that's set to the default, but even if we ignore the possibility of someone changing it, I would still suggest it's easier and safer to just quote every time you don't explicitly want splitting or globbing. – ilkkachu Aug 18 '23 at 07:52
As we saw here, doing it only when necessary is prone to mistakes, and then if you reuse or repurpose some piece of code, you need to go through every single expansion and reconsider if it needs quoting now. That takes time and mental effort. As a practical example, the loop over the array indexes here would work as-is with an associative array with string indexes (I think), except that then you'd need to quote them. – ilkkachu Aug 18 '23 at 07:52
The command substitution is similar-ish in that there might be cases where it's plausible to change the output format of the command inside to include spaces, and then quotes would be needed on the outside of the command substitution. Here, that would add fields to the output of the printf and need changes elsewhere too, so it's not that clear in this particular case. But in something like printf "%s:%s:%s\n" "$(stat --format=... "$f")" something whatever you might plausibly change the format to include whitespace without needing to touch anything else – ilkkachu Aug 18 '23 at 07:57
Then again, there's always the option of switching to zsh, where you don't need to quote :) (except for the command substitution, sigh.) – ilkkachu Aug 18 '23 at 08:06

Ed Morton · Answer 3 · 2023-08-17T11:57:53.323

4

Given this code from the final example in the question:

ls -l --zero --time-style=long-iso -- * | 
    awk 'BEGIN{RS="\0"}{ t=index($0,$7); print substr($0,t+6), $6 }'

and this posted sample output of that ls command (with #<newline> replacing the NULs for visibility):

$ ls -l --zero --time-style=long-iso -- *
-rw-r--r--+ 1 terdon terdon 0 2023-08-16 21:35 a file with a
newline#
-rw-r--r--+ 1 terdon terdon 0 2023-08-15 19:16 file#
-rw-r--r--+ 1 terdon terdon 0 2018-08-15 12:00 file with spaces#

It looks like $7 is intended to be the timestamp. If so, then t=index($0,$7) would fail for user names/groups that are more than 1 word, e.g.:

-rw-r--r--+ 1 terdon Domain Users 0 2023-08-15 19:16 file#

since then your timestamp would be in $8 (or some higher number depending on how many words are in user name and/or group), not $7.
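The field shift is easy to see with plain awk and default field splitting, using the two sample lines above (newline-terminated here for simplicity):

```shell
#!/usr/bin/env bash
normal='-rw-r--r--+ 1 terdon terdon 0 2023-08-15 19:16 file'
spaced='-rw-r--r--+ 1 terdon Domain Users 0 2023-08-15 19:16 file'

# With a one-word group name, $7 is the time of day.
printf '%s\n' "$normal" | awk '{ print $7 }'    # prints 19:16

# With "Domain Users", every later field moves one to the right,
# so $7 is now the date and the time is in $8.
printf '%s\n' "$spaced" | awk '{ print $7 }'    # prints 2023-08-15
```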

Given that user names/groups can't include :, you could solve that by just looking for the first : in the line instead of looking for a particular field:

ls -l --zero --time-style=long-iso -- * | 
    awk -v RS='\0' 'p=index($0,":") { print substr($0,p+4), substr($0,p-13,10) }'

or with GNU awk (which you probably are using anyway for RS='\0') for the 3rd arg to match():

ls -l --zero --time-style=long-iso -- * | 
    awk -v RS='\0' 'match($0,/(.{10}) ..:.. (.*)/,a) { print a[2], a[1] }'

Ed Morton · answered Aug 16 '23 at 17:09 · edited Aug 17 '23 at 11:57

If you need owner/group name for a file one safe way I can see of handling that is to collect the uid/gid as numbers and then use id -nu UID and getent group GID | cut -d: -f1 to transform them. – Chris Davies Aug 17 '23 at 09:15
@roaima I wasn't particularly trying to get owner/group, just discuss the issues around trying to parse the output of terdon's ls command which included them. – Ed Morton Aug 17 '23 at 10:16
Yes, me too. Like you I have encountered names and groups that contain spaces (Active Directory, specifically) and I was musing on how best to get the detail having parsed ls --zero – Chris Davies Aug 17 '23 at 10:25
Ah, I see, understood the context now, thanks. – Ed Morton Aug 17 '23 at 10:27
Ah-ha! I didn't realize group names can have whitespace. I thought that they, like user names, had to be one "word". That's exactly the kind of case I was looking for. Any idea what kind of systems support such group names? – terdon Aug 17 '23 at 17:48
@terdon POSIX doesn’t allow them, but Active Directory does, so systems which import groups from AD often end up with such group names. – Stephen Kitt Aug 17 '23 at 20:43
FWIW I've also had my user name show up as "Ed Morton" (with a space in it) on some Unix systems too. I don't recall which system it was on. I see a question related to user names with spaces and AD at how-to-add-to-group-when-name-has-a-space and other places so maybe my experience was also an Active Directory consequence, idk, just a surprise when it happened. – Ed Morton Aug 18 '23 at 00:35
