13

I have a directory where lots of cached files are getting generated very quickly. Since these are very small files, they are consuming all my inodes very quickly.

Currently I am running the following command to find all the files older than 12 hours and delete them.

$ find ./cache -mtime +0.5 -exec rm {} \;

But the rate at which this command deletes files is slower than the rate at which files are being generated. Can someone suggest an alternative way to remove a large number of files quickly?

  • Maybe the problem is that it's not deleting enough files. You are aware that your command deletes files whose mtime is more than 12 hours old, right? If the files are generated very quickly, then you probably need to delete more recent files than that. – Joseph R. Oct 21 '13 at 09:20
  • Actually the command which @Gnouc mentioned worked faster. I don't know why it works faster. – pradeepchhetri Oct 21 '13 at 09:43
  • 6
    Replace the \; with a + or better use -delete as you seem to be using GNU find. – Stéphane Chazelas Oct 21 '13 at 12:48

8 Answers

26

find … -exec rm {} \; executes the rm command for each file. Even though starting a new process is pretty fast, it's still a lot slower than the mere act of deleting a file.

find … -exec rm {} + would call rm in batches, which is a lot faster: you pay the cost of running rm once per batch, and each batch performs many deletions.
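
For example, the command from the question only needs its terminator changed:

find ./cache -mtime +0.5 -exec rm {} +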

Even faster is to not invoke rm at all. The find command on Linux has an action -delete to delete a matching file.

find ./cache -type f -mtime +0.5 -delete

However, if you're producing files at such a rate that find … -exec rm {} \; can't keep up, there's probably something wrong with your setup. If cache contains millions of files, you should split it into subdirectories for faster access.
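
A rough sketch of the bucketing idea, assuming bash and a hypothetical cache key and staging path: new entries are spread over subdirectories named after the first two characters of the key, so no single directory grows huge.

key="0f3acb19"                       # hypothetical cache key / file name
mkdir -p "./cache/${key:0:2}"        # bucket directory, e.g. ./cache/0f
mv "/tmp/$key" "./cache/${key:0:2}/$key"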

  • 1
    Nearly 7 million very small DHCP lease files consumed all the inodes. find -delete saved the day; removed about 6.5 million files in the same time as rm took to remove about 20 thousand. – studog Feb 27 '18 at 18:50
  • There is a subtle and possibly sinister problem with all these answers. find descends into subdirectories, and by default it visits a directory before its contents. If ./cache contains subdirectories, their contents will ALSO be removed, but as written, find will try to delete the directories BEFORE deleting their content. There are solutions here: 1. Add -depth if deletion of subdirectories is desired, and 2. Add -type f to avoid attempting sub-directory deletion. Another way to limit to the CURRENT directory only is to use -prune – Steven the Easily Amused Jul 25 '20 at 17:49
  • @SteventheEasilyAmused -delete automatically turns on -prune. But skipping directories is a good idea: find won't delete non-empty directories, but it will signal an error. – Gilles 'SO- stop being evil' Jul 25 '20 at 19:38
19

Try using xargs:

find ./cache -mtime +0.5 -print0 | xargs -0 rm -f

Update: explanation for @pradeepchhetri

If you use find with -exec, rm is called once for every file that find finds. So if find matches a huge number of files, e.g. 10000 files, rm is invoked 10000 times.

xargs treats the output of find as the argument list for rm, so it passes as many arguments as rm can handle at once, i.e. rm -f file1 file2 .... That means far fewer fork() calls, which makes the whole job run faster.
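
As a small illustration of the batching (echo stands in for the real rm here, so nothing is actually deleted):

printf '%s\0' file1 file2 file3 | xargs -0 echo rm -f
# prints: rm -f file1 file2 file3   (one invocation instead of three)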

cuonglm
  • 153,898
  • 1
    Can you explain why this runs faster than the command I mentioned? I noticed that it works faster. – pradeepchhetri Oct 21 '13 at 09:34
  • @pradeepchhetri With your method, find starts a new rm process for every file that meets the criteria. With Gnouc's method, xargs starts only one instance of rm for a whole bunch of files. Starting far fewer programs makes it faster. – kurtm Oct 21 '13 at 11:31
  • @kurtm: Thank you for the explanation. But I saw a remarkable difference between the two commands; is the number of fork() syscalls the only difference between the two? – pradeepchhetri Oct 21 '13 at 12:06
  • @pradeepchhetri Yes, that is the only difference, the number of rm processes spawned. – kurtm Oct 21 '13 at 13:34
  • if you use find with + instead of ; it will bundle up the execs and will only launch an exec when the command line gets "big" – kdubs Feb 17 '17 at 18:31
  • @cuonglm does this work better than find with the -delete option, or rsync? – cokedude Jan 30 '20 at 15:51
  • @cokedude You should use find -delete if your find supports it. – cuonglm Jan 31 '20 at 02:58
1

If you just want to get rid of many files as soon as possible, ls -f1 /path/to/folder/with/many/files/ | xargs rm might work okay, but better don't run it on production systems, because your system might develop I/O issues and applications might get stuck during the delete operation.

This script works nicely for many files and should not affect the I/O load of the system.

#!/bin/bash

# Path to folder with many files
FOLDER="/path/to/folder/with/many/files"

# Temporary file to store file names
FILE_FILENAMES="/tmp/filenames"

if [ -z "$FOLDER" ]; then
    echo "Prevented you from deleting everything! Correct your FOLDER variable!"
    exit 1
fi

while true; do
    # Count the entries without sorting (ls -f skips sorting, which is much faster)
    FILES=$(ls -f1 "$FOLDER" | wc -l)
    if [ "$FILES" -gt 10000 ]; then
        printf "[%s] %s files found, going on with removing\n" "$(date)" "$FILES"
        # Create a new list of 5000 file names (head/tail skips the . and .. entries)
        ls -f1 "$FOLDER" | head -n 5002 | tail -n 5000 > "$FILE_FILENAMES"

        if [ -s "$FILE_FILENAMES" ]; then
            while read -r FILE; do
                rm "$FOLDER/$FILE"
                sleep 0.005   # brief pause to keep the I/O load low
            done < "$FILE_FILENAMES"
        fi
    else
        printf "[%s] script has finished, almost all files have been deleted\n" "$(date)"
        break
    fi
    sleep 5
done
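
If the system's I/O scheduler honours I/O priorities, another way to keep the deletion from starving other workloads is to wrap the find command from the other answers in ionice with the idle class (just a sketch; ionice -c 3 has no effect with schedulers that ignore priorities):

ionice -c 3 find ./cache -type f -mtime +0.5 -delete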
1

Although find is the best (simplest, idiomatic) approach,

find "$dir" -exec rm {} +

You could move the directory aside, create a new directory (for your program), and then delete...

mv "$dir" "old$dir" && mkdir "$dir" && rm -rf "old$dir"

but maybe your problem is creating too many files. Why not change your program to append to an existing file, rather than create a new file? You could then move this (logfile) aside, and your program could create/append to a new file, for example,

fp = fopen("logfile", "a+");
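
On the shell side, rotating that single file then replaces the mass deletion (the name logfile is just the hypothetical one from the snippet above):

mv logfile logfile.old     # the program creates/appends to a fresh logfile afterwards
rm -f logfile.old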
1

If the creation rate exceeds the deletion rate, you are best off making the cache completely empty and removing the old files without any mtime evaluation:

mv cache foobar
mkdir cache
# may require app restart
rm -rf foobar
kerolasa
  • 267
  • But this would cause the process accessing the cache directory to either fail, or continue to write into the foobar directory (depending on how it is written). Meanwhile, this would remove ALL entries that were originally in cache – Steven the Easily Amused Jul 25 '20 at 17:36
0

rm -rf directory/ also works faster for billions of files in one folder. I tried that.

  • But that would not prevent the program which generates the files from recreating the folder. Or it could make it fail, which is not a desirable effect. Apparently, the point here is to delete only old cached files, not all of them. – lgeorget Apr 17 '14 at 07:00
0

Another Linux-specific solution would be to use the inotify(7) facilities; you'll detect when files are added, and then you can immediately run something to remove the older ones.
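
A rough sketch using inotifywait from the inotify-tools package (assuming it is installed; -mmin +720 is 12 hours, matching the question):

inotifywait -m -e create --format '%w%f' ./cache |
while read -r newfile; do
    # a new cache file appeared; prune anything older than 12 hours
    find ./cache -type f -mmin +720 -delete
done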

OTOH, I guess that you might have some XY problem. Why do you have so many new files? Perhaps using sqlite, or GDBM indexed files, or some real database (e.g. PostgreSQL, MariaDB, MongoDB) might be better... Perhaps you need some version control system like git?
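
As a very rough sketch of the sqlite idea (the table and column names here are made up): cache entries become rows, and expiry becomes a single DELETE instead of millions of unlink() calls.

sqlite3 cache.db 'CREATE TABLE IF NOT EXISTS cache(key TEXT PRIMARY KEY, value BLOB, mtime INTEGER);'
sqlite3 cache.db "DELETE FROM cache WHERE mtime < strftime('%s','now') - 12*3600;"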

-2
find . -mtime +0.5 -print -delete

is another option to delete large number of files quickly.

Archemar
  • 31,554
Manoj
  • 1
  • 1
    This was likely downvoted because it's dangerous. Where the original question mentioned ./cache this answer assumes that ./cache is the current directory - which it would not be per the OP. Secondly, the addition of the -print would spew (tens of) thousands of lines of output.

    There is another subtle issue: if there are subdirectories in the current directory, it will attempt (and fail) to delete them - unless they are already empty... but that may also be undesired! Why: Because find does a depth-first search starting at each subdirectory.

    – Steven the Easily Amused Jul 25 '20 at 17:41