24

Is there a straightforward way to find all the sparse files on my system, or in a particular directory tree?

If it's relevant, I'm using zsh on Ubuntu 12.04, although a more generic Unix-y answer for bash/sh, for example, would be fine.

Edit: to clarify, I'm looking to search for sparse files, not check the sparseness status of a single one.

5 Answers

14

On systems (and file systems) supporting the SEEK_HOLE lseek flag (like your Ubuntu 12.04 on ext4 would) and assuming the value for SEEK_HOLE is 4 as it is on Linux:

if perl -le 'seek STDIN,0,4;$p=tell STDIN;
   seek STDIN,0,2; exit 1 if $p == tell STDIN'< the-file; then
  echo the-file is sparse
else
  echo the-file is not sparse
fi

That shell syntax is POSIX. The non-portable parts are perl and SEEK_HOLE.

lseek(SEEK_HOLE) seeks to the start of the first hole in the file, or the end of the file if no hole is found. Above we know the file is not sparse when the lseek(SEEK_HOLE) takes us to the end of the file (to the same place as lseek(SEEK_END)).
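For readers who prefer Python to perl, here is a minimal sketch of the same test. It assumes a platform where Python exposes os.SEEK_HOLE (Linux does); the function name has_hole is made up for illustration. On file systems that do not implement SEEK_HOLE, the kernel typically reports the only hole at end of file, so the function returns False there.

```python
import os

def has_hole(path):
    """Return True if lseek(SEEK_HOLE) finds a hole before the end of the file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.lseek(fd, 0, os.SEEK_END)
        if end == 0:
            # Empty file: neither data nor holes (SEEK_HOLE would fail with ENXIO).
            return False
        # Offset of the first hole at or after 0; equals `end` when there is none.
        first_hole = os.lseek(fd, 0, os.SEEK_HOLE)
        return first_hole < end
    finally:
        os.close(fd)
```

Calling has_hole("the-file") mirrors the perl one-liner: the file is treated as sparse only when the first hole starts before the end of the file.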

If you want to list the sparse files:

find . -type f ! -size 0 -exec perl -le 'for(@ARGV){open(A,"<",$_)or
  next;seek A,0,4;$p=tell A;seek A,0,2;print if$p!=tell A;close A}' {} +

GNU find (since version 4.3.3) has -printf %S to report the sparseness of a file. It takes the same approach as frostschutz's answer: it reports the ratio of disk usage to file size. It is therefore not guaranteed to report all sparse files (for instance when the space saved by the holes doesn't compensate for the filesystem infrastructure overhead or for large extended attributes), and it can be misled by compression at the filesystem level. On the other hand, it works on systems that don't have SEEK_HOLE and on file systems where SEEK_HOLE is not implemented. Here with GNU tools:

LC_ALL=C find . -type f ! -size 0 -printf '%S:%p\0' |
  LC_ALL=C awk -v RS='\0' -F : '$1 < 1 {sub(/^[^:]*:/, ""); print}'

(Note that an earlier version of this answer didn't work properly when find expressed the sparseness as, for instance, 3.2e-05. Thanks to @flashydave's answer for bringing it to my attention. LC_ALL=C is needed for the decimal radix to be . instead of the locale's one; not all awk implementations honour the locale's setting.)
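The ratio itself can also be computed without find. The sketch below assumes, as I understand the GNU find manual, that %S is st_blocks times 512 divided by st_size, and that st_blocks is counted in 512-byte units (true on Linux). The names sparseness and probably_sparse are made up for illustration, and the same caveats apply: a ratio of 1 or more does not prove the file has no holes.

```python
import os
import stat

def sparseness(st):
    """Approximate find's %S: allocated bytes divided by apparent size."""
    # Assumption: st_blocks is in 512-byte units, whatever the FS block size.
    return st.st_blocks * 512 / st.st_size

def probably_sparse(root):
    """Yield regular, non-empty files under root whose ratio is below 1."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # vanished or unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size > 0 and sparseness(st) < 1:
                yield path
```

Since the comparison is done on a float, values that find would print as 3.2e-05 need no special handling here, and no locale is involved.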

  • Same comment as above; I'm looking for a way to find all sparse files, not check a particular file. – Andrew Ferrier Aug 12 '13 at 19:12
  • 1
    Maybe find should also exclude 0-byte-files outright? – frostschutz Aug 12 '13 at 19:56
  • @frostschutz, good point, answer updated. – Stéphane Chazelas Aug 12 '13 at 20:04
  • Nice find with the find -printf '%S'! :-) – frostschutz Aug 14 '13 at 10:29
  • @StéphaneChazelas Thanks. I'd already deleted my comment. Time for me to update that machine!! – localhost Oct 26 '16 at 11:57
  • Thanks, this one (old gnu version)is actually popping them off much faster now, worked for me, on RHEL64 7.3. It spotted the less than hundred of them in about 5 minutes or less, across 3TB. Good deal. @frostschutz getting the job done, taking longer maybe. Nice work on both!! – Brian Thomas Jan 30 '17 at 00:07
  • So I have this wonderful command find . -type f ! -size 0 -printf '%S:%p\0' | sed -zn 's/^0[^:]*://p' | tr '\0' '\n' , and I'm having a hard time figuring out how to alter the exec, or add parens, or a -delete so that I can delete all found files. How would I alter that command for delete? – Brian Thomas Jan 30 '17 at 01:03
  • 1
    @Brian, replace the tr command with xargs -r0 rm -f – Stéphane Chazelas Jan 30 '17 at 07:29
  • I can't understand it, but for some reason this only worked that day. This is a RHEL 7.3. Shouldn't this also detect 0-byte files? I made a bash script called sparseoff https://defuse.ca/b/b5f4d3bsdlklNqGx1p9DJQ , can't seem to get it to work, although I had it working, probably something simple. – Brian Thomas Feb 15 '17 at 02:20
  • @BrianThomas, you probably have compression enabled on your file system. ZFS does support lseek(SEEK_HOLE) so you should be able to use the more reliable approach described here. Note that empty files have neither holes nor allocated data. They're neither sparse nor full, they're empty, that's why we exclude them explicitly as we can tell they're not sparse. $PATH is a special variable (list of paths to look up executables), don't use it as a variable name for anything else. Remember to quote your variables. – Stéphane Chazelas Feb 15 '17 at 08:20
  • @StéphaneChazelas find with %S still doesn't work for me with LANG=de_DE.UTF-8 because find will print 7,01365e-05 instead of 7.01365e-05, and awk won't parse that as a single number. LC_ALL=C find ... works for me. – Martin von Wittich Oct 26 '22 at 13:29
  • 1
    @MartinvonWittich, with gawk, you'd need to call it with the POSIXLY_CORRECT env var set for it to recognise the locale's decimal radix. Nevertheless, using LC_ALL=C for both find and awk would make it more portable, I'll edit that in. – Stéphane Chazelas Oct 26 '22 at 14:37
9

A file is usually sparse when the number of allocated blocks is smaller than the file size (here using GNU stat as found on Ubuntu, but beware other systems may have incompatible implementations of stat).

if [ "$((`stat -c '%b*%B-%s' -- "$file"`))" -lt 0 ]
then
    echo "$file" is sparse
else
    echo "$file" is not sparse
fi

Variant with find (borrowed from Stéphane's answer):

find . -type f ! -size 0 -exec bash -c '
    for f do
        [ "$((`stat -c "%b*%B-%s" -- "$f"`))" -lt 0 ] && printf "%s\n" "$f";
    done' {} +

You'd usually put this in a shell script instead, then exec the shell script.

find . -type f ! -size 0 -exec ./sparsetest.sh {} +
frostschutz
  • That may not work if the sparse blocks are not enough to cover for the overhead of indirect blocks in traditional file systems for instance, or if compression instead of sparseness is reducing the amount of allocated space. – Stéphane Chazelas Aug 12 '13 at 17:28
  • Sure; SEEK_HOLE is just as problematic though, as it's not supported by many platforms/filesystems. In Linux you could also use FIEMAP/FIBMAP, but FIBMAP in particular is horribly slow... there just doesn't seem to be a good way. – frostschutz Aug 12 '13 at 17:51
  • Also a lot of these methods require the file to be synced first. – frostschutz Aug 12 '13 at 17:57
  • Thanks. That doesn't really answer the question, though. I'm not looking to check if a particular file is sparse, but to find all sparse files on the system. – Andrew Ferrier Aug 12 '13 at 19:11
  • What do you mean by the file needs to be synced first? – Stéphane Chazelas Aug 12 '13 at 19:44
  • 2
    @AndrewFerrier sorry, I guess I thought it was trivial enough to wrap this in a for file in * or find. If you can test a single file, you can test all files... although you do have to exclude directories with this method. – frostschutz Aug 12 '13 at 19:57
  • This answer is great, thanks for adding the wrapper. I've marked the other as correct only because the answer is more compact. – Andrew Ferrier Aug 13 '13 at 18:58
  • Trying the other two options, they are all running, they haven't yielded anything yet after minutes already, and this one off the top found one. I need a quick way to find and kill them, since it's a ZFS backup that doesn't support sparse files, this is perfect. Thanks! Note, potential workaround to DeviceIoControl, FSCTL_SET_SPARSE error here. If I can just stop freefilesync from trying to write more I'll be good. – Brian Thomas Jan 30 '17 at 00:00
7

You can find sparse files with the %S format in find:

# find / -type f -printf "%S\t%p\n" | gawk '$1 < 1.0 {print}'
0.0139994       /var/log/lastlog
0.959592        /usr/lib/locale/locale-archive
...

Found it in this article: https://www.thegeekdiary.com/how-to-find-all-the-sparse-file-in-linux/

akostadinov
  • 1
    Changing this to the accepted answer, since it is more straightforward than the others shown so far. – Andrew Ferrier Mar 19 '20 at 11:59
  • 1
    Note that it can give false negatives. See for instance on a ext4 FS with 4KiB block size (the usual default on Ubuntu). fallocate -l 64M file1; fallocate -pl 4K file1 and truncate -s4K file2; setfattr -n user.foo -v "$(seq 1000)" file2. Both file1 and file2 are sparse files with a 4KiB hole, but %S returns 1 for both. IOW, if $1 is >= 1, you can't tell whether the file is sparse or not. If it's < 1, it should be sparse unless the FS supports compression. – Stéphane Chazelas Oct 05 '20 at 12:17
  • @StéphaneChazelas, it is a good point and even sparseness > 1 is described in man find. I guess it depends on why exactly one is looking for sparse files. e.g. if you're looking at your VM images to see if they can possibly grow or some other reason. – akostadinov Oct 05 '20 at 15:37
4

Stéphane Chazelas's answer above doesn't take into account the fact that, for some sparse files, find's %S directive reports the ratio as a floating point number in scientific notation, like

9.31323e-09:./somedir/sparsefile.bin

These can be found in addition with

find . -type f ! -size 0 -printf '%S:%p\0' |
   sed -zn '/^\(0[^:]*\|[0-9.]\+e-[^:]*\):/p' |
   tr '\0' '\n'
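An alternative to pattern matching on the text of the ratio is to parse it as a number, which handles both 0.0139994 and 9.31323e-09. Here is a minimal Python sketch; the function name sparse_paths is made up for illustration, and it assumes the input is the NUL-delimited '%S:%p' output shown above, produced under LC_ALL=C so that the decimal radix is a dot.

```python
def sparse_paths(find_output: bytes):
    """Return paths whose leading %S ratio is below 1.

    find_output holds NUL-delimited records of the form b"<ratio>:<path>".
    """
    paths = []
    for record in find_output.split(b"\0"):
        if not record:
            continue  # trailing empty piece after the last NUL
        # Split on the first colon only: the path itself may contain colons.
        ratio, _, path = record.partition(b":")
        try:
            value = float(ratio.decode("ascii"))
        except ValueError:
            continue  # not a number we can parse; skip the record
        if value < 1:
            paths.append(path)
    return paths
```

The paths are kept as bytes because file names need not be valid text in any encoding. The input could come, for instance, from subprocess.run(..., capture_output=True).stdout running the find command above.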
1

A short script I wrote while trying to find out what are the locations of holes in a file:

#!/usr/bin/python3
import os
import sys
import errno

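def report(fname):
    """Print the offset range of every hole in fname."""
    fd = os.open(fname, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            # Start of the next hole at or after offset (EOF if none remains).
            start = os.lseek(fd, offset, os.SEEK_HOLE)
            if start == size:
                break
            try:
                # Start of the next data region, i.e. the end of this hole.
                offset = os.lseek(fd, start, os.SEEK_DATA)
            except OSError as e:
                if e.errno != errno.ENXIO:
                    raise
                # ENXIO: no data after start, so the hole runs to EOF.
                offset = size
            print(f'found hole between 0x{start:08X} and 0x{offset:08X} ({offset - start} bytes)')
    finally:
        os.close(fd)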

if __name__ == '__main__':
    for name in sys.argv[1:]:
        report(name)

This prints stuff like:

$ echo -n 'a' >zeros; truncate -s $((4096*4)) zeros; test/report-holes.py zeros
found hole between 0x00001000 and 0x00004000 (12288 bytes)
zbyszek
  • Doesn't answer my question as I was looking for sparse files, not the holes in a specific file, but still a useful/relevant script. Thanks. Upvoted. – Andrew Ferrier Feb 13 '18 at 10:19