
I run this command to find the biggest files:

du -Sh | sort -rh | head -5

Then I run rm -rf someFile.

Is there a way to automatically delete the files found from the former command?

αғsнιη
Dan P.
  • Do the file sizes differ all the time? Is there a minimum/maximum limit? – ChristophS Jul 24 '17 at 10:57
  • It doesn't matter -- I'm deleting whatever are the 5 biggest files. Although it would be good to have the option to set a minimum file size of 25MB. – Dan P. Jul 24 '17 at 10:59
  • For automation, I would use cron jobs: place your commands inside a shell script and set up a cron job. – ChristophS Jul 24 '17 at 11:02
  • @AFSHIN that makes no attempt to remove the file sizes from the input to xargs. also no attempt to limit to regular files. – cas Jul 24 '17 at 12:29

4 Answers

6

If you're using GNU tools (which are standard on Linux), you could do something like this:

stat --printf '%s\t%n\0' ./* | 
  sort -z -rn | 
  head -z -n 5 | 
  cut  -z -f 2- |
  xargs -0 -r echo rm -f --

(remove the 'echo' once you've tested it).

The stat command prints out the file size and name of each file in the current directory, separated by a tab, with each record terminated by a NUL (\0) byte.

The sort command sorts those NUL-terminated records in reverse numeric order. The head command keeps only the first five such records, then cut removes the file size field from each record.

Finally xargs takes that (still NUL-terminated) input and uses it as arguments for echo rm -f.

Because this uses NUL as the record (filename) terminator, it copes with filenames that have any valid character in them.

If you want a minimum file size, then you could insert awk or something between the stat and the sort. e.g.

stat --printf '%s\t%n\0' ./* | 
  awk 'BEGIN {ORS = RS = "\0" } ; $1 > 25000000' |
  sort -z -rn | ...

NOTE: GNU awk doesn't have a -z option for NUL-terminated records, but does allow you to set the record separator to whatever you want. We have to set both the output record separator (ORS) and the input record separator (RS) to NUL.


Here's another version that uses find to explicitly limit itself to regular files (i.e. excluding directories, named pipes, sockets, etc.) in the specified directory only (-maxdepth 1, no subdirectories) that are larger than 25M, so there's no need for awk.

This version doesn't need stat because GNU find has a -printf action of its own. Note the difference in the format string: stat uses %n for the filename, while find uses %p.

find . -maxdepth 1 -type f -size +25M -printf '%s\t%p\0' | 
  sort -z -rn | 
  head -z -n 5 | 
  cut  -z -f 2- |
  xargs -0 -r echo rm -f --

To run it for a different directory, replace the . in the find command. e.g. find /home/web/ ....
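
For instance, using the /home/web directory mentioned in the comments, the full pipeline would look like this (a sketch only; keep the echo until you've verified the output):

find /home/web -maxdepth 1 -type f -size +25M -printf '%s\t%p\0' |
  sort -z -rn |
  head -z -n 5 |
  cut  -z -f 2- |
  xargs -0 -r echo rm -f --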


shell script version:

#!/bin/sh

for d in "$@" ; do
  find "$d" -maxdepth 1 -type f -size +25M -printf '%s\t%p\0' | 
    sort -z -rn | 
    head -z -n 5 | 
    cut  -z -f 2- |
    xargs -0 -r echo rm -f --
done

Save it as, e.g., delete-five-largest.sh somewhere in your PATH, make it executable (chmod +x), and run it as delete-five-largest.sh /home/web /another/directory /and/yet/another

This runs the find ... once for each directory specified on the command line. This is NOT the same as running find once with multiple path arguments (which would look like find "$@" ..., without any for loop in the script). It deletes the 5 largest files in each directory, whereas running it without the for loop would delete only the five largest files found across all of the directories combined, i.e. five per directory vs. five in total.
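
For comparison, here's a sketch of that five-in-total variant, assuming the same GNU tools and keeping the echo safety net: a single find call searches every directory given on the command line.

#!/bin/sh

find "$@" -maxdepth 1 -type f -size +25M -printf '%s\t%p\0' |
  sort -z -rn |
  head -z -n 5 |
  cut  -z -f 2- |
  xargs -0 -r echo rm -f --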

cas
  • How can a directory to search in be added? For example /home/web – Dan P. Jul 24 '17 at 11:24
  • the same as with any other command that has filenames as arguments. e.g. stat --printf '%s\t%n\0' /home/web/* | ... – cas Jul 24 '17 at 11:26
  • I've made a .sh script and tried to run the last command in your answer with sh r.sh, and nothing is printed out. I put the script in the directory that has all the folders for files, and the files directly in them (no more subfolders) – Dan P. Jul 24 '17 at 12:08
  • For example Folder1/file.jpg (and many other files) Folder2/blabla.mp3 (and many other files) r.sh on the same level as Folder1 and Folder2. – Dan P. Jul 24 '17 at 12:08
  • Did you run it from within that directory (i.e. cd to it first) or from somewhere else? . means the current directory, not the directory containing the script. I'll add an example of how to turn this into a script. – cas Jul 24 '17 at 12:10
  • Yes - I am in the same directory of r.sh – Dan P. Jul 24 '17 at 12:12
  • Note the use of find's -maxdepth 1 option. That explicitly limits find to the specified directory only, without any recursion into subdirectories; that's quite deliberate. If you save the shell script as in my updated answer above, you'd then run it like delete-five-largest.sh /path/to/Folder[12]/ – cas Jul 24 '17 at 12:22
  • Great thanks. I'll pick your answer as the correct one because there are more useful options. I also successfully used Stephane's answer so we should upvote his answer so he earns some points as well. – Dan P. Jul 24 '17 at 12:27
  • yeah, part of my purpose was to show not only how to do something but how to turn that into a re-usable tool. the tool-building culture of unix is, IMO, the best thing about it...you're not just a passive consumer of "apps", you build tools to suit your needs. – cas Jul 24 '17 at 12:33
5

With recent GNU tools (you're already using GNU-specific options):

du -S0 . | sort -zrn | sed -z 's@[^/]*@.@;5q' | xargs -r0 echo rm -rf

(remove echo if happy).

The -0/-z options are there so the pipeline can cope with file/directory names containing arbitrary characters.

Note that most rm implementations will refuse to remove . (the current working directory), so you may want to do it from one level up and do:

du -S0 dir | sort -zrn | sed -z 's@^[0-9]*\t@@;5q' | xargs -r0 echo rm -rf

That way it can remove dir itself if it's one of the biggest entries (note that it would also remove all of its subdirectories). It's not clear from your requirements whether that's really what you want.
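
As a hypothetical usage example (assuming, as in the comments, that the directory to clean is /home/web, and keeping the echo until the output looks right), the one-level-up variant could be run as:

cd /home &&
  du -S0 web | sort -zrn | sed -z 's@^[0-9]*\t@@;5q' | xargs -r0 echo rm -rf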

Now, if all you want is to remove the 5 biggest regular files (excluding other types of files like directories, devices, symlinks...), it's just a matter of using zsh and:

echo rm -f ./**/*(D.OL[1,5])

(OL reverse-sorts by length, i.e. file size rather than disk usage).
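
Since a 25 MB minimum was mentioned in the comments, here is a hedged sketch that combines the same glob with zsh's size qualifier (Lm+25 matches files larger than 25 megabytes); again, remove echo once the list looks right:

echo rm -f ./**/*(D.Lm+25OL[1,5])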

  • It seems to be deleting files, but not the biggest ones. When running my initial command du -Sh | sort -rh | head -5 following your command (without the echo), the 5 biggest files are still there. – Dan P. Jul 24 '17 at 11:35
  • try changing rm -f to rm -fv to get it to print out what it's deleting as it deletes them. – cas Jul 24 '17 at 11:36
  • rm: cannot remove './Wo72RXD6YLWdAICUFKT3uj7n4kB3': Is a directory – Dan P. Jul 24 '17 at 11:37
  • It printed that 5 times. The parent folder of each file. – Dan P. Jul 24 '17 at 11:39
  • My goal is to remove the file only, not the dir. – Dan P. Jul 24 '17 at 11:40
  • @DanP, you need the -r to remove directories (and their content). See also the edit. – Stéphane Chazelas Jul 24 '17 at 11:40
  • @DanP, do you want to remove the biggest files based on size or disk usage? – Stéphane Chazelas Jul 24 '17 at 11:41
  • Size. For example if there are 10 files, one 51MB, next 52MB, up to 60MB, it should delete the files that are 56MB to 60MB. – Dan P. Jul 24 '17 at 11:43
  • Note that du reports the disk usage, not the size (it also shows hard links to a given file only once) – Stéphane Chazelas Jul 24 '17 at 11:45
  • Got it. Actually disk usage would be fine too, I am not sure what the practical difference is. – Dan P. Jul 24 '17 at 11:48
  • Command ends up being: du -S0 . |sort -zrn | sed -z 's@[^/]*@.@;5q' | xargs -r0 echo rm -f ./**/*(D.OL[1,5]) ? – Dan P. Jul 24 '17 at 11:48
  • @DanP, for instance, after truncate -s100T file, you have a 100TiB file that takes no space on disk. – Stéphane Chazelas Jul 24 '17 at 11:49
  • @DanP, no. Only echo rm -f ./**/*(D.OL[1,5]), but in the zsh shell (and remove echo if happy) – Stéphane Chazelas Jul 24 '17 at 11:50
  • i assumed he meant file size, not disk usage...which is why i ignored his original use of du and went straight for stat (and later, find). i also assumed he wanted to delete files rather than directories, so didn't use rm -r and took steps to avoid sub-directories. – cas Jul 24 '17 at 11:52
  • Used zsh and rm -f ./**/*(D.OL[1,500]) with success. Cleared up some good space on the server. Thanks. – Dan P. Jul 24 '17 at 12:04
0

Here you've got a (subshell-intensive) loop over each file; replace the echo with your rm command:

du -Sh /your/search/path/ |   # per-directory disk usage under the search path
sort -rh |                    # biggest first
head -5 |                     # keep the top five
awk '{print $2}' |            # keep only the name (breaks on whitespace)
while read -r file ; do
  echo "$file"                # replace echo with your rm command
done

This works in actual bash, but it is anything but a nice script, and I am sure to earn some comments because of whitespace inside filenames. ;) They are welcome!

If you are familiar with cron jobs, execute this script periodically.
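
For example, a minimal crontab sketch, assuming the commands above were saved to a hypothetical script at /usr/local/bin/clean-big-files.sh and should run daily at 03:00 (adjust path and schedule as needed):

# m h dom mon dow  command
0 3 * * * /usr/local/bin/clean-big-files.sh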

ChristophS
  • This seems to be taking the parent folder of the file instead of the file itself – Dan P. Jul 24 '17 at 11:23
  • Hm, I simply used your du command and didn't verify that. Right now there are some competent answers you could take a look at ... – ChristophS Jul 24 '17 at 11:29
0

Here's a simple answer that hopefully helps you:

find / -type f -size +1G -exec rm {} \;

This finds anything under root that is a regular file (not a directory) and is over 1G in size, and removes it. You can add further tests if you need to select files by name, for example. The size unit can be changed to M (megabytes), k (kilobytes) or c (bytes). find has many options and is a powerful command; check out the man page! :)
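
A slightly safer sketch of the same idea (with a hypothetical path /some/path): preview the matches first, then delete them, using + so many files are batched into each rm invocation:

# preview what would be removed
find /some/path -type f -size +1G -print
# then actually remove the matches
find /some/path -type f -size +1G -exec rm -f {} +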

Kyle H