44

I have a 30 TB directory containing billions of files, practically all of them JPEG files. I am deleting each folder of files like this:

sudo rm -rf bolands-mills-mhcptz

This command just runs and doesn't show anything, so I can't tell whether it's working or not.

I want to see the files as they are being deleted, or at least the current status of the command.

Junaid Farooq
  • 21
    Not answers: Sometimes it is faster to backup the stuff you want to keep, format, and restore the stuff you want to keep. Other answers: http://unix.stackexchange.com/questions/37329/efficiently-delete-large-directory-containing-thousands-of-files – Eric Towers Dec 01 '16 at 17:59
  • 2
    If you just want an idea of progress, rather than knowing which particular files have been removed, you could run "df /dev/sd_whatever_the_drive_is". – jamesqf Dec 01 '16 at 18:21
  • 13
    How did you end up with billions of files in a single directory?? – Lightness Races in Orbit Dec 02 '16 at 16:44
  • In future, I suggest using a filesystem which can delete faster than ext4, such as xfs, or better yet, a next-gen filesystem such as ZFS or btrfs, where you can simply delete an entire dataset at once. – Michael Hampton Dec 03 '16 at 03:54
  • 1
    @MichaelHampton But if the files aren't a separate dataset, it may take a long time. (on ZFS) https://serverfault.com/questions/801074/delete-10m-files-from-zfs-effectively – v7d8dpo4 Dec 03 '16 at 15:28
  • @v7d8dpo4 If that happens, then the system wasn't planned well enough at the start. – Michael Hampton Dec 03 '16 at 18:11
  • @MichaelHampton, the last time I tried XFS, it was significantly slower than ext3 when deleting files. – Simon Richter Dec 04 '16 at 15:42
  • 5
    Billions of files, huh? Try rm -ri. It will be fun! – AAM111 Dec 04 '16 at 21:52
  • @MichaelHampton using your logic all directories should be different datasets to avoid the chance that a program fills a directory by mistake. – Ed Neville Feb 25 '17 at 16:16
  • @EdNeville Huh? Not all directories, just the ones that need to be. – Michael Hampton Feb 25 '17 at 17:42
  • @MichaelHampton, that's the problem, it's a great feature to destroy a FS easily, but it is useless if you cannot predict the directories that will suffer a problem. – Ed Neville Feb 25 '17 at 20:10
  • @EdNeville In the OP's case, it's very likely that it was predictable well in advance. In fact it's predictable in most cases. – Michael Hampton Feb 25 '17 at 20:13

6 Answers

107

You can use rm -v to have rm print one line per deleted file. This way you can see that rm is indeed working to delete files. But if you have billions of files then all you will see is that rm is still working; you will have no idea how many files have already been deleted and how many are left.

The tool pv can help you with a progress estimation.

http://www.ivarch.com/programs/pv.shtml

Here is how you would invoke rm with pv, with example output:

$ rm -rv dirname | pv -l -s 1000 > logfile
562  0:00:07 [79,8 /s] [====================>                 ] 56% ETA 0:00:05

In this contrived example I told pv that there are 1000 files. The output from pv shows that 562 have already been deleted, the elapsed time is 7 seconds, and the estimated time to completion is 5 seconds.

Some explanation:

  • pv -l makes pv count lines (newlines) instead of bytes
  • pv -s number tells pv the total so that it can give you an estimate
  • The redirect to logfile at the end is for clean output: otherwise the status line from pv gets mixed up with the output from rm -v. As a bonus, you will have a logfile of what was deleted, but beware: the file will get huge. You can also redirect to /dev/null if you don't need a log.

To get the number of files you can use this command:

$ find dirname | wc -l

This can also take a long time if there are billions of files. You can use pv here as well to see how far the count has got:

$ find dirname | pv -l | wc -l
278k 0:00:04 [56,8k/s] [     <=>                                              ]
278044

Here it says that it took 4 seconds to count 278k files. The exact count at the end (278044) is the output from wc -l.

If you don't want to wait for the counting then you can either guess the number of files or use pv without estimation:

$ rm -rv dirname | pv -l > logfile

This way you will have no estimate of when it will finish, but at least you will see how many files have already been deleted. Redirect to /dev/null if you don't need the logfile.
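
If you want both the log and the estimate in a single command, you can combine the two steps (the same combination appears in the comments below); note that find still has to walk the entire tree before any deleting starts:

$ rm -rv dirname | pv -l -s $(find dirname | wc -l) > logfile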


Nitpick:

  • Do you really need sudo?
  • Usually rm -r is enough to delete recursively; there is no need for rm -f.
Lesmana
  • 6
    Nice use of pv, assuming it's not too expensive to count the billions of files ;-). (It might take nearly as much time as the rm it's supposed to measure!) – Stephen Kitt Dec 01 '16 at 14:36
  • 7
    @StephenKitt This is what really annoys me (and many other people) about the Windows file utility: it always, without fail, counts the number and sizes of files before deleting which, unless the drive is much slower than the processor, takes almost as long as the actual deletion! – wizzwizz4 Dec 01 '16 at 17:37
  • @wizzwizz4 Indeed! There's more to it than that though IIRC — it checks that it can delete everything before deleting anything, to increase the chances of deletions being "all or nothing". Many years ago I wrote a filesystem driver for Windows, there were quite a few oddities we had to deal with, including some related to the way Explorer goes about deleting, but I can't remember the details. (I do remember that creating a folder involves writing and deleting a file in the new folder!) – Stephen Kitt Dec 01 '16 at 17:43
  • @StephenKitt That would be why, in folders with certain permissions, I can create a folder from the command line but not Explorer. I do prefer chmod to the Windows file permissions system, even though I've been working with the Windows one for longer. – wizzwizz4 Dec 01 '16 at 17:46
  • 7
@StephenKitt Maybe I'm mistaken, but isn't the bottleneck, besides the disk access, the terminal output? I believe pv refreshes the progress bar only once per second, despite its input. So, the terminal only needs to display one line instead of a ton each second. pv only needs to increment a counter for each newline it encounters; that's got to be faster than doing line wrapping, and whatnot for displaying a line in a terminal. I think running with pv like this causes the file removals to be faster than simply rm -rv. – JoL Dec 01 '16 at 19:25
  • 1
    @jlmg you're right, the output will take time. My comment stemmed from the fact that the OP's silent rm is taking a very long time already, so simply counting the files is also going to take a very long time. (Since the OP's current command is silent, terminal output isn't a bottleneck in the question.) – Stephen Kitt Dec 01 '16 at 19:39
  • @StephenKitt I see. You're right that it will take slightly longer than the silent rm, though that's hardly avoidable to show progress as well. – JoL Dec 01 '16 at 19:50
  • @jlmg with billions of files it will take significantly longer. – Stephen Kitt Dec 01 '16 at 20:01
  • @StephenKitt to count them? I don't think pv has a reason to slow down rm; it can probably process the lines faster than rm can write them. Do you mean simply because rm has to write them? Can the time taken to write a line to a pipe really be that comparable to that of removing a file from disk? – JoL Dec 01 '16 at 20:10
  • @StephenKitt the file name already comes with the dirent structure when listing out the contents of a directory, so there doesn't seem to be a need for additional disk access to write out the filepath... – JoL Dec 01 '16 at 20:16
  • @jglm I mean the whole operation will take longer: counting the files, then deleting them. I agree rm and pv shouldn't be slower than rm on its own; but here we're talking about adding a lengthy operation so we can track another lengthy operation, so it's worth pointing out the cost of the measurement (as lesmana has done in an edit to this answer). – Stephen Kitt Dec 01 '16 at 20:30
  • @StephenKitt You were talking about doing the find operation first. I'm sorry; that's twice I misunderstood you. I was just thinking of pv -l without pv -s. – JoL Dec 01 '16 at 20:39
  • @StephenKitt and jlmg: Watching df -ih output change is definitely efficient, and doesn't require a count before-hand to show you meaningful numbers. (And BTW, I agree writing filenames into a pipe to pv is very cheap. It should be full-buffered, and counting lines in pv should be cheap (and happen on another core in a multi-threaded system). You will almost certainly still bottleneck on disk I/O, except maybe on a slow embedded NAS CPU with fast SSDs. Spewing lines to the terminal is what's potentially slow, even with a smart terminal emulator that skips when it falls behind. – Peter Cordes Dec 02 '16 at 01:32
re: last points of this answer: rm -f stops it from pausing for confirmation before deleting read-only files. POSIX requires rm (without -f) to be interactive in that case. I agree that usually you don't need it. I used to use alias rm=rm -i (and use \rm -r foo after I was sure I'd typed the right thing). But now I use alias rm=rm -I (capital-i), which only queries if deleting 4 or more things, or deleting anything recursively. – Peter Cordes Dec 02 '16 at 01:37
  • Hey, bash guru! Is there any way to count files and put it after -s in one line (single command?)? – skywinder Aug 12 '17 at 11:37
  • 2
    @skywinder rm -rv dirname | pv -l -s $(find dirname | wc -l) > logfile – Lesmana Aug 12 '17 at 14:23
  • ls -1 -U is also a good, fast way to get number of files – JJS Oct 30 '20 at 12:41
  • On AIX using ksh, the option is -e, rm -e file_to_delete – yamex5 Sep 23 '21 at 17:13
29

Check out lesmana's answer; it's much better than mine — especially the last pv example, which won't take much longer than the original silent rm if you specify /dev/null instead of logfile.

Assuming your rm supports the option (it probably does since you're running Linux), you can run it in verbose mode with -v:

sudo rm -rfv bolands-mills-mhcptz

As has been pointed out by a number of commenters, this could be very slow because of the amount of output being generated and displayed by the terminal. You could instead redirect the output to a file:

sudo rm -rfv bolands-mills-mhcptz > rm-trace.txt

and watch the size of rm-trace.txt.
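
A rough sketch of how to watch it (the filename is the one used above; the 10-second interval is arbitrary):

$ watch -n 10 'ls -lh rm-trace.txt; wc -l < rm-trace.txt'

ls -lh shows the growing size of the trace, and wc -l counts its lines, i.e. the number of entries deleted so far. Keep in mind that wc -l re-reads the whole file on every refresh, which gets expensive once the trace reaches many gigabytes.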

Stephen Kitt
23

Another option is to watch the number of files on the filesystem decrease. In another terminal, run:

watch df -ih pathname

The used-inodes count will decrease as rm makes progress. (Unless the files mostly had multiple links, e.g. if the tree was created with cp -al). This tracks deletion progress in terms of number-of-files (and directories). df without -i will track in terms of space used.

You could also run iostat -x 4 to see I/O operations per second (as well as kiB/s, but that's not very relevant for pure metadata I/O).


If you get curious about which files rm is currently working on, you can attach strace to it and watch as the unlink() (and getdents) system calls spew onto your terminal, e.g. sudo strace -p $(pidof rm). You can ^C the strace to detach from rm without interrupting it.

I forget if rm -r changes directory into the tree it's deleting; if so you could look at /proc/<PID>/cwd. Its /proc/<PID>/fd might often have a directory fd open, so you could look at that to see what your rm process is currently looking at.
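
A sketch of peeking at those /proc entries (this assumes a single rm process is running; pidof returns several PIDs if there are more, and sudo is needed because the rm was started as root):

$ sudo ls -l /proc/$(pidof rm)/cwd
$ sudo ls -l /proc/$(pidof rm)/fd

Both cwd and the entries under fd are symlinks, so ls -l shows which directory or file they currently point at.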

Peter Cordes
5

While the above answers all use rm, rm can actually be quite slow at deleting large numbers of files, as I recently observed when extracting ~100K files from a .tar archive took less time than deleting them. Although this does not directly answer the question you asked, a better solution to your problem might be to use a different method to delete your files, such as one of the upvoted answers to this question.

My personal favorite method is to use rsync -a --delete. I find that it performs fast enough that the extra ease of use is worth it compared to the most upvoted answer to that question, in which the author wrote a C program that you would need to compile. (Note that this will output every file being processed to stdout, much like rm -rv; this can slow down the process by a surprising amount. If you do not want this output, use rsync -aq --delete or redirect the output to a file instead.)
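
For reference, the usual way to apply that rsync trick is to sync an empty directory over the one you want to clear, then remove the leftover empty directories. A sketch, using the directory name from the question (add sudo if you need it):

$ mkdir empty
$ rsync -a --delete empty/ bolands-mills-mhcptz/
$ rmdir bolands-mills-mhcptz empty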

The author of that answer says:

The program will now (on my system) delete 1000000 files in 43 seconds. The closest program to this was rsync -a --delete which took 60 seconds (which also does deletions in-order, too but does not perform an efficient directory lookup).

I have found that this is good enough for my purposes. Also potentially important from that answer, at least if you're using ext4:

As a forethought, one should remove the affected directory and remake it after. Directories only ever increase in size and can remain poorly performing even with a few files inside due to the size of the directory.

  • huh, I would have expected rm and/or find --delete to be efficient. Interesting point about deleting in sort order to avoid b-tree rebalances while deleting. Not sure how much of that applies to other filesystems. XFS is also not great with millions of files per directory. IDK about BTRFS, but I'm under the impression that it might be good for that sort of thing. – Peter Cordes Dec 03 '16 at 19:18
  • Doesn't that second quote depend on the type of filesystem... – Menasheh Dec 04 '16 at 05:14
  • @Menasheh Good point, I edited that into my answer. – Hitechcomputergeek Dec 05 '16 at 07:15
3

One thing you could do would be to start the rm process in the background (with no output, so it won't be slowed down) and then monitor it in the foreground with a simple(a) command:

pax> ( D=/path/to/dir ; rm -rf $D & while true ; do
...>   if [[ -d $D ]] ; then
...>     echo "$(find $D | wc -l) items left"
...>   else
...>     echo "No items left"
...>     break
...>   fi
...>   sleep 5
...> done )

27912 items left
224 items left
No items left

pax> _

The find/wc combo could be replaced with any tool able to give you the units you want.
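
For example, to report remaining disk usage instead of an item count, the echo line inside the loop could be swapped for something like this (just a sketch; the 2>/dev/null hides errors from files that vanish while du is scanning):

echo "$(du -sh "$D" 2>/dev/null | cut -f1) left"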


(a) Well, relatively simple, compared to, say, nuclear physics, the Riemann hypothesis, or what to buy my wife for Xmas :-)

0

A while ago I wrote something to print the rate at which lines are printed. You can run rm -rfv dirname | ./counter and it will print lines per second/minute. While not a direct progress indicator, it will give you some feedback on the rate of progress; perhaps the rm has wandered into a network filesystem or similar?

Link to the code is here:

http://www.usenix.org.uk/code/counter-0.01.tar.gz
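
If the link is ever unavailable, a rough stand-in can be cobbled together with GNU awk (systime() is a gawk extension); this is only a sketch, not the counter from the tarball:

$ rm -rfv dirname | gawk '{ n++ } systime() > last { printf "%d lines/s, %d total\n", n - prev, n; prev = n; last = systime() }'

It counts every input line and prints, roughly once per second, how many lines arrived since the last report along with the running total.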

Ed Neville