
I have a directory (e.g. /home/various/) with many subdirectories (e.g. /home/various/foo/, /home/various/ber/, /home/various/kol/ and /home/various/whatever/).

Is there a command I can run which will break down the contents per file extension, showing totals like

  • total size
  • number of files

Let's say I don't want to manually type each file extension in the terminal, in part because I don't know all the file extensions inside (recursively) /various/.

An output like this would be great:

*.txt 23 files, 10.2MB
*.pdf 8 files, 23.2MB
*.db 3 files, 2.3MB
*.cbz 24 files, 2.3GB
*.html 2,508 files, 43.9MB
*.readme 13 files, 4KB
    Are extensions case-sensitive, i.e. .pdf is different than .PDF? What about files that have no extensions? – fpmurphy Jan 14 '21 at 16:02

2 Answers


Basic code

duext() {
case "$1" in -* ) set "./$1"
esac
POSIXLY_CORRECT= find "${1-.}" -type f -exec du {} + | awk '
   {
      sz=$1
      $1=""
      sub("^ ","")
      sub("^\\./","")
      sub("^\\.","")
      w=split($0,a,".")
      e=tolower(w==1?"":"."a[w])
      s[e]+=sz
      n[e]+=1
   }
   END {
      for (e in s) print 512*s[e]"\t"n[e]"\t"e
   }'
}

Usage: duext path. The default path is . (the current directory). The function should work in sh and compatible shells.

The function generates lines in the following form:

s<tab>n<tab>e

where s is disk size used (in bytes), n is the number of files, e is the extension. This is different from your requested output because I decided to optimize for parsing. What you call "extension" is just a part of the filename in *nix. Filenames may contain spaces or tabs. Placing e (that may contain spaces or tabs) at the end of the line allows us to recognize other fields reliably. E.g. you can sort by size easily:

duext /home/various/ | sort -rn -k1,1       # optionally: … | column -t

Notes:

  • Newline characters in pathnames will make the results incorrect.
  • POSIXLY_CORRECT= du … is a portable way to get disk size used. It reports in units of 512 bytes, therefore 512*s[e] later in the awk code. GNU du provides some interesting options (e.g. --apparent-size); they may require adjusting the awk code. A sketch of such an adjustment follows these notes.
  • sub("^\\.","") is responsible for not treating the leading dot in the name as the extension separator. In effect .nfo is interpreted as a (hidden) file without extension rather than a file with nfo extension. If this is not what you want, remove the line.
  • The code tells apart an empty extension (e.g. foo.) from no extension (foo). The former is reported with the extension .; the latter is reported with an empty extension field.
  • The code is case-insensitive. Remove tolower to make it case-sensitive.
  • Hardlinks can distort the result. Your du may or may not omit a file if it's a hardlink to some already accounted file. Additionally find … -exec du {} + runs du as many times as it needs (to avoid argument list too long) and hardlinked files may or may not be passed to the same du. You can force counting every single hardlink by using du -l (non-portable option in GNU du) or portably by running one du per file: find … -exec du {} \;. To reliably count hardlinks just once, you need a different approach (single instance of GNU du and --files0-from=?). In general it's possible to have hardlinks with different extensions. This is not a problem when you want to count each hardlink separately, but if you want to count them as one file then it's indeterminate which extension to assign.
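
For illustration, here is a sketch of the adjustment mentioned above (my assumption, not part of the original function: it requires GNU du, and duext_apparent is just a made-up name): ask du for apparent sizes in bytes and drop the 512 multiplier.

duext_apparent() {
case "$1" in -* ) set "./$1"
esac
find "${1-.}" -type f -exec du --apparent-size --block-size=1 {} + | awk '
   {
      sz=$1                  # already in bytes (GNU du --apparent-size --block-size=1)
      $1=""
      sub("^ ","")
      sub("^\\./","")
      sub("^\\.","")
      w=split($0,a,".")
      e=tolower(w==1?"":"."a[w])
      s[e]+=sz
      n[e]+=1
   }
   END {
      for (e in s) print s[e]"\t"n[e]"\t"e   # no 512* factor needed here
   }'
}

Everything downstream (sort, the formatting functions below) can stay the same, since the output format is unchanged.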

Customizing format

I'm not sure whether by MB you mean mebibytes or megabytes; I assume the latter. The following code should translate to the format you want:

yourformat() { awk '
  function human(x) {
    if (x<1000) {return x} else {x/=1000}
    s="kMGTPEZY";
    while (x>=1000 && length(s)>1)
      {x/=1000; s=substr(s,2)}
    return int(10*x+0.5)/10 substr(s,1,1)
  }
  {
    s=$1; n=$2
    $1=""; $2=""
    sub("^  ","")
    print $0" "n" file"(n==1?"":"s")", "human(s)"B"
  }'
}

(Note: human(x) was taken from this answer and adjusted.)

Use it like this:

duext /home/various/ | yourformat

duext uses awk internally and now we're piping it to yourformat, which also uses awk. We could use a single awk in a single function instead. Still, separate awks allow us to put e.g. sort … in between (in a single shell function or in a pipe between functions). While some kind of sorting can be implemented in awk (or at least in GNU awk), there is no point in reinventing the wheel. IMO keeping the output from the first awk easily parsable is the right thing. This way you can apply any filter and formatting later.
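
For example, to sort by size before formatting, the sort invocation shown earlier can simply sit between the two awk stages:

duext /home/various/ | sort -rn -k1,1 | yourformat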

Let's improve your format, so column -t can be used. And how about a factor of 1024?

myformat() { awk '
  function human(x) {
    if (x<1000) {return x" "} else {x/=1024}
    s="kMGTPEZY";
    while (x>=1000 && length(s)>1)
      {x/=1024; s=substr(s,2)}
    return int(10*x+0.5)/10" "substr(s,1,1)"i"
  }
  {
    s=$1; n=$2
    $1=""; $2=""
    sub("^  ","")
    print $0"\t"n" file"(n==1?"":"s")"\t"human(s)"B"
  }'
}

And then:

duext /home/various/ | sort -nr -k1,1 | myformat | column -t -s "$(printf '\t')"

Notes:

  • "$(printf '\t')" is a portable way to get a tab character. In some shells (e.g. in Bash) $'\t' does the same.
  • column itself is not portable.
  • Extensions with tab characters will break the formatting. They are rather rare though.
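
To illustrate the Bash shortcut mentioned in the notes above (assuming your shell supports $'…' quoting, as Bash does), the same pipeline can be written as:

duext /home/various/ | sort -nr -k1,1 | myformat | column -t -s $'\t'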

Frankly I like this solution enough to keep it. I created a script named due for future use:

#!/bin/sh

duext() { … }

myformat() { … }

duext "${1-.}" | sort -nr -k1,1 | myformat | column -t -s "$(printf '\t')"
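
Assuming due is made executable (chmod +x due) and placed in a directory on PATH, it can then be invoked like:

due /home/various/
due                  # defaults to the current directory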


It is a very interesting question; the best I can build is this script:

set -e
# set -x

folder=$1
counter=$(tempfile)

# List file extensions
list_extensions() {
  find "$folder" -type f | while read filename
  do
    basename=${filename##*/}
    ext=${basename##*.}
    echo ${ext,,} # downcase extensions to prevent duplicates
  done | sort -u
}

list_extensions | while read extension
do
  size=$(find "$folder" -type f -iname "*.$extension" -fprintf $counter . -print0 |
    du -hc --files0-from=- | tail -n 1 | sed -E 's/\s+total//')
  count=$( wc -c < $counter )
  printf "*.%-10s\t%6s files\t%10s\n" "$extension" "$count" "$size"
done

rm $counter

It does not support complex filenames, there may be a lot of exceptions, and performance is not great, but it does work.
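
Assuming the script is saved under a hypothetical name such as ext-report.sh, it should be run with bash rather than sh, since it relies on bashisms like ${ext,,}:

bash ext-report.sh /home/various/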

Sample output:

*.wma              122 files          411M
*.wpl               16 files           64K
*.xls                2 files           24K
*.xlsx               1 files           28K
*.zip                5 files          333M