
According to the Open Group specs, POSIX du doesn't have the -b option to display the size in bytes. So what is the POSIX-compliant way to get the size of a file or folder in bytes?

finefoot

2 Answers


As an approximation of what GNU du does with -sb, you could do:

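cumulative_size() (
  # The body runs in a subshell, so export and exit do not affect the caller.
  export LC_ALL=C
  ret=0
  # With no arguments, default to the current directory.
  [ "$#" -gt 0 ] || set .
  for file do
    # Prefix relative paths with ./ so find cannot take them for predicates.
    case $file in
      (/*) sanitized=$file;;
      (*) sanitized=./$file;;
    esac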
    size=$(
      find "$sanitized" ! -type b ! -type c -exec ls -niqd {} + |
        awk '! seen[$1]++ {sum += $6}
             END {print sum}'
    )
    if [ -n "$size" ]; then
      printf '%s\t%s\n' "$size" "$file"
    else
      ret=1
    fi
  done
  exit "$ret"
)

Like GNU du, we try to count each file only once by looking at its inode number (as reported in the first field of ls -ni). But since ls cannot report the device number, this assumes the directory hierarchies don't span several filesystems.

Unlike du, we also only deduplicate within each file argument.

For instance in:

cumulative_size dir dir

The cumulative disk usage of dir and its contents is reported twice, with files within each counted only once, while GNU du -bs would only report dir disk usage once.
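The inode-based deduplication can be looked at in isolation. The following is a minimal sketch, not part of the function above: dedup_sum is a hypothetical name, and the three input lines are made-up sample data in the shape of ls -ni output.

```shell
# dedup_sum is a hypothetical name: it reads ls -ni style lines on stdin and
# sums field 6 (the size), counting each inode number (field 1) only once.
dedup_sum() {
  awk '! seen[$1]++ {sum += $6}
       END {print sum}'
}

# Made-up sample input: inode 42 appears twice, as it would for two hard links.
printf '%s\n' \
  '42 -rw-r--r-- 2 1000 1000 100 Jan  1 00:00 ./a' \
  '42 -rw-r--r-- 2 1000 1000 100 Jan  1 00:00 ./b' \
  '43 -rw-r--r-- 1 1000 1000 50 Jan  1 00:00 ./c' |
  dedup_sum    # prints 150
```

Each awk invocation starts with an empty seen table, and cumulative_size runs one awk per argument, which is why the same directory given twice is reported twice.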

We exclude device files because ls -n doesn't report their size. On Linux at least, that won't make a difference as their size is otherwise always reported as 0.

find can't be given file paths that start with - or whose names match one of its predicates (such as ! or (). Here we work around that by prefixing file paths with ./ when they don't start with /, so find ! or find -print becomes find ./! or find ./-print, for instance. That assumes no find implementation has a predicate that starts with /. It also means we don't need to pass -- to ls to mark the end of its options.

We use ls -n instead of ls -l to avoid decoding uids/gids into user or group names (which would be expensive and would also cause problems here for names containing spaces). POSIX specifies the -o and -g options to remove those fields altogether, but they are optional there.
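As a rough illustration of that workaround on its own, here is the same case statement wrapped in a helper. The name sanitize_path is hypothetical and does not appear in the function above.

```shell
# sanitize_path is a hypothetical helper: it prefixes relative paths with ./
# so that find never mistakes them for options or predicates.
sanitize_path() {
  case $1 in
    (/*) printf '%s\n' "$1";;
    (*)  printf '%s\n' "./$1";;
  esac
}

sanitize_path '-print'      # prints ./-print
sanitize_path '!'           # prints ./!
sanitize_path /etc/passwd   # prints /etc/passwd
```

Absolute paths are left alone because they already start with /, which is assumed never to begin a find predicate.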

The output of ls -n is only specified in the C/POSIX locale. Also, file paths being arbitrary sequences of non-null bytes, you can only process them as text in the C locale, hence the LC_ALL=C.

We also use -q to make sure newline characters in file names or symlink targets don't put a spanner in the works.

Also note that since the full paths are passed to ls, we can't process directory structures of arbitrary depth: it will stop working once path lengths exceed PATH_MAX.

The error reporting is rather crude. We only return a non-zero exit status if any of the computed sizes comes back empty, so a zero exit status is no guarantee that all file sizes have been counted.

  • Your answers are always very helpful and a great source to learn from. I am still trying to understand each line. "The output of ls -n is only specified in the C/POSIX locale." Where are you getting this information from? I had a look at the opengroup utilities/ls page – finefoot May 29 '23 at 23:34
  • @finefoot, the date field is locale-dependent, as is the set of blank characters that may be used to separate fields (though in practice, that date field appears after the field we're interested in, so unless it contains newline characters, it's unlikely to be a problem, and I've not seen any ls implementation that uses blank characters other than space to separate fields). The C locale in any case removes complex processing in both printing and parsing and reduces the risk of bad surprises. – Stéphane Chazelas May 30 '23 at 05:06
  • Ahh, great. That explains it, thank you. :) I've seen LC_ALL=C quite a few times in scripts and always wanted to read about the reason why it's used. And you opted for ls over wc -c (see below) because reading the size performs much better than counting the bytes, right? – finefoot May 30 '23 at 12:27
  • @finefoot, See also What does "LC_ALL=C" do?. wc -c only works for non-directory files you have permission to read and have possible side effects for non-regular files. See also How can I get the size of a file in a bash script? – Stéphane Chazelas May 30 '23 at 12:53

Unfortunately, the output format of ls is apparently not standardized. So it might not be the best idea to parse its output.

An alternative POSIX-compliant way to find out the size in bytes of a single file is to use wc -c:

-c Write to the standard output the number of bytes in each input file.

Source: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/wc.html

$ printf %s 0123456789ABCDEF >sixteenbytestestfile  # example file of 16 bytes length
$ wc -c sixteenbytestestfile
16 sixteenbytestestfile

If we don't pass the file as an argument, but via standard input, the filename will be omitted from the output:

$ wc -c <sixteenbytestestfile
16

Apparently, some systems add some whitespace around the number output. We can remove it by using Arithmetic Expansion without any arithmetic operations:

$ filesize="   123   "  # possible wc -c output
$ printf %s\\n "-$filesize-"
-   123   -
$ printf %s\\n "-$((filesize))-"
-123-

In conclusion, here is a definition of a simple function to get the size of a file:

$ filesize() { printf %s\\n "$(($(wc -c <"$1")))"; }
$ filesize sixteenbytestestfile
16
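As pointed out in the comments on the other answer, wc -c only works for non-directory files you have permission to read. A hedged variant that checks for this first could look like the sketch below. The name filesize_checked and the error message are assumptions for illustration, not part of the function above.

```shell
# filesize_checked is a hypothetical variant of filesize: it refuses anything
# that is not a readable regular file instead of handing it to wc.
filesize_checked() {
  if [ -f "$1" ] && [ -r "$1" ]; then
    # The arithmetic expansion strips any whitespace wc puts around the number.
    printf '%s\n' "$(($(wc -c <"$1")))"
  else
    printf '%s: not a readable regular file\n' "$1" >&2
    return 1
  fi
}

# Usage, with the test file created earlier:
#   filesize_checked sixteenbytestestfile
```

Note that [ -f ] follows symbolic links, so a symlink to a regular file is measured by the size of its target.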
finefoot