I had a directory with around 5 million files. When I tried to run the ls command from inside this directory, my system consumed a huge amount of memory and hung after some time. Is there an efficient way to list the files other than using the ls command?

4 Answers
Avoid sorting by using:
ls --sort=none # "do not sort; list entries in directory order"
Or, equivalently:
ls -U
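If you also want to avoid per-file work and column buffering, here is a minimal sketch based on the comments below (assuming GNU coreutils ls, where -f implies -a and disables sorting, -l and --color, and -1 prints one name per line instead of buffering for columns):
ls -f -1           # unsorted, one name per line, no per-file stat
ls -f -1 | wc -l   # just count the entries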

- I wonder how much overhead the column layout adds, too. Adding the -1 flag could help. – Mikel Mar 18 '14 at 14:03
- @Mikel Is that just a guess, or have you measured that? To me it seems that -1 takes even longer. – Hauke Laging Mar 18 '14 at 19:32
- "-1" helps quite a bit. "ls -f -1" will avoid any stat calls and print everything immediately. The column output (which is the default when sending to a terminal) makes it buffer everything first. On my system, using btrfs in a directory with 8 million files (as created by "seq 1 8000000 | xargs touch"), "time ls -f -1 | wc -l" takes under 5 seconds, while "time ls -f -C | wc -l" takes over 30 seconds. – Scott Lamb Dec 15 '15 at 16:08
- @ScottLamb Your two commands are not a good comparison, because -C forces columns; it overrides the default -1 that is used when piping the output of ls. AFAIK, time ls -f | wc -l will run just as fast as the -1 version. Nevertheless I upvoted your comment, because when displaying straight to the terminal it is useful: you immediately start seeing some filenames. – ToolmakerSteve Apr 01 '19 at 14:15
- @ToolmakerSteve The default behavior (-C when stdout is a terminal, -1 when it's a pipe) is confusing. When you're experimenting and measuring, you flip between seeing the output (to ensure the command is doing what you expect) and suppressing it (to avoid the confounding factor of the terminal application's throughput). Better to use commands that behave the same way in both modes, so explicitly define the output format via -1, -C, -l, etc. – Scott Lamb Apr 01 '19 at 16:17
- @ScottLamb I understand. I realized later that you were deliberately doing the test with | wc -l as a convenience for timing, but that you were really discussing the underlying performance with or without the pipe (without a pipe, -C is the default behavior, as you say); you were showing that if columns were being formed, the command was much slower for many files. Thank you. – ToolmakerSteve Apr 03 '19 at 11:24
- @ScottLamb's ls -f -1 command resulted in actually seeing output instead of either waiting for a very long time or getting an out-of-memory error. Thanks! – Christian Apr 10 '23 at 20:00
ls actually sorts the files and tries to list them, which becomes a huge overhead when we are trying to list more than a million files inside a directory. As mentioned in this link, we can use strace or find to list the files. However, those options also seemed infeasible for my problem since I had 5 million files. After a bit of googling, I found that listing the directory with getdents() is supposed to be faster, because ls, find, and the Python libraries use readdir(), which is slower (but uses getdents() underneath).
We can find the C code to list the files using getdents() from here:
/*
 * List directories using getdents() because ls, find and Python libraries
 * use readdir(), which is slower (but uses getdents() underneath).
 *
 * Compile with
 * ]$ gcc getdents.c -o getdents
 */
#define _GNU_SOURCE
#include <dirent.h>     /* Defines DT_* constants */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

/* Layout of the records returned by the raw getdents() system call. */
struct linux_dirent {
    long           d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    char           d_name[];
};

#define BUF_SIZE 1024*1024*5

int
main(int argc, char *argv[])
{
    int fd, nread;
    char buf[BUF_SIZE];
    struct linux_dirent *d;
    int bpos;
    char d_type;

    /* Open the directory given as the first argument, or "." by default. */
    fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        handle_error("open");

    for ( ; ; ) {
        /* Fetch a batch of raw directory entries straight from the kernel. */
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1)
            handle_error("getdents");
        if (nread == 0)
            break;

        /* Walk the variable-length records in the buffer. */
        for (bpos = 0; bpos < nread;) {
            d = (struct linux_dirent *) (buf + bpos);
            /* The entry type is stored in the last byte of each record. */
            d_type = *(buf + bpos + d->d_reclen - 1);
            if (d->d_ino != 0 && d_type == DT_REG) {
                printf("%s\n", (char *) d->d_name);
            }
            bpos += d->d_reclen;
        }
    }

    exit(EXIT_SUCCESS);
}
Copy the C program above into the directory in which the files need to be listed. Then execute the commands below.
gcc getdents.c -o getdents
./getdents
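You can also pass the target directory as an argument instead of copying the program into it, and, as noted in the comments, piping to wc -l stays fast because the names go into the pipe rather than to the terminal. A small usage sketch, with /my/huge/dir as a placeholder path:
./getdents /my/huge/dir           # print regular files in directory order
./getdents /my/huge/dir | wc -l   # just count them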
Timings example: getdents can be much faster than ls -f, depending on the system configuration. Here are some timings demonstrating a 40x speed increase for listing a directory containing about 500k files over an NFS mount in a compute cluster. Each command was run 10 times in immediate succession, first getdents, then ls -f. The first run is significantly slower than all the others, probably due to NFS caching page faults. (Aside: over this mount, the d_type field is unreliable, in the sense that many files appear as "unknown" type.)
command: getdents $bigdir
usr:0.08 sys:0.96 wall:280.79 CPU:0%
usr:0.06 sys:0.18 wall:0.25 CPU:97%
usr:0.05 sys:0.16 wall:0.21 CPU:99%
usr:0.04 sys:0.18 wall:0.23 CPU:98%
usr:0.05 sys:0.20 wall:0.26 CPU:99%
usr:0.04 sys:0.18 wall:0.22 CPU:99%
usr:0.04 sys:0.17 wall:0.22 CPU:99%
usr:0.04 sys:0.20 wall:0.25 CPU:99%
usr:0.06 sys:0.18 wall:0.25 CPU:98%
usr:0.06 sys:0.18 wall:0.25 CPU:98%
command: /bin/ls -f $bigdir
usr:0.53 sys:8.39 wall:8.97 CPU:99%
usr:0.53 sys:7.65 wall:8.20 CPU:99%
usr:0.44 sys:7.91 wall:8.36 CPU:99%
usr:0.50 sys:8.00 wall:8.51 CPU:100%
usr:0.41 sys:7.73 wall:8.15 CPU:99%
usr:0.47 sys:8.84 wall:9.32 CPU:99%
usr:0.57 sys:9.78 wall:10.36 CPU:99%
usr:0.53 sys:10.75 wall:11.29 CPU:99%
usr:0.46 sys:8.76 wall:9.25 CPU:99%
usr:0.50 sys:8.58 wall:9.13 CPU:99%
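The answer does not say how these usr/sys/wall/CPU lines were produced; one way to get output in this shape is GNU time's format string (a sketch, assuming /usr/bin/time is GNU time and $bigdir holds the directory path, with output discarded to avoid measuring the terminal):
/usr/bin/time -f "usr:%U sys:%S wall:%e CPU:%P" ./getdents "$bigdir" > /dev/null
/usr/bin/time -f "usr:%U sys:%S wall:%e CPU:%P" /bin/ls -f "$bigdir" > /dev/null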
- Could you add a small timing benchmark showing how your case performs with ls? – Bernhard Mar 17 '14 at 16:34
- Sweet. And you could add an option to simply count the entries (files) rather than listing their names (saving millions of calls to printf, for this listing). – ChuckCottrill Mar 17 '14 at 20:50
- Since the directory has millions of files, you could use puts((char*)d->d_name) rather than printf, to save some processing; see: http://bytes.com/topic/c/answers/527094-puts-vs-printf – ChuckCottrill Mar 17 '14 at 21:26
- You know your directory is too big when you have to write custom code to list its contents... – casey Mar 18 '14 at 01:45
- @casey Except you don't have to. All this talk about getdents vs readdir misses the point. – Mikel Mar 18 '14 at 13:35
- Come on! It's already got 5 million files in there. Put your custom "ls" program into some other directory. – Johan Mar 19 '14 at 08:35
- @ChuckCottrill Because of the way piping works, that's not really necessary. You could just run ./getdents /my/huge/dir | wc -l and it will still be pretty fast. That's because you are giving the output of getdents to wc instead of to stdout (the terminal in most cases). – anu Nov 04 '16 at 18:58
- Not a C programmer... any chance somebody can update this so it has glob support? I'm currently piping results through grep, but that seems awfully suboptimal. – mlissner May 04 '17 at 17:46
- Is it expected that this would do very weird things in an sshfs-mounted directory? I'm getting back a fraction of the results I expect. – mlissner May 05 '17 at 22:48
- @Ramesh Good one!! Is there any way to get file properties like date modified, size, etc.? – Joby Wilson Mathews Nov 03 '17 at 10:21
The most likely reason why it is slow is file-type colouring; you can avoid this with \ls or /bin/ls, which bypass any colouring alias and turn off the colour options.
If you really have that many files in a directory, using find instead is also a good option.

- I don't think this should have been downvoted. Sorting is one problem, but even without sorting, ls -U --color would take a long time since it would stat each file. So both are correct. – Mikel Mar 18 '14 at 13:59
- Turning coloring off has a huge impact on the performance of ls, and it is aliased on by default in many, many .bashrc files out there. – Victor Schröder Aug 07 '18 at 13:59
- Yup, I did a /bin/ls -U and got output in no time, compared to waiting for a very long time before. – khebbie Oct 11 '19 at 07:03
I find that echo * works much faster than ls. YMMV.

- The shell will sort the *. So this way is probably still very slow for 5 million files. – Mikel Mar 18 '14 at 13:52
- @Mikel More than that, I'm pretty sure that 5 million files is over the point where globbing will break entirely. – evilsoup Mar 18 '14 at 15:26
- Minimum file name length (for 5 million files) is 3 characters (maybe 4 if you stick to more common characters) plus delimiters = 4 chars per file, i.e. 20 MB of command arguments. That is well over the common 2 MB expanded command line length. Exec (and even the builtins) would baulk. – Johan Mar 19 '14 at 08:49
- … ls that uses --color or -F, as that would mean doing an lstat(2) for each file. – Stéphane Chazelas Mar 17 '14 at 21:37
- … ls call or did you use options? – Hauke Laging Mar 18 '14 at 03:27