Basic code
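duext() {
    case "$1" in
        -* )
            set "./$1"
    esac
    POSIXLY_CORRECT= find "${1-.}" -type f -exec du {} + | awk '
    {
        sz=$1                       # size in 512-byte units
        $1=""
        sub("^ ","")                # $0 is now the pathname
        sub("^.*/","")              # strip directories, keep the basename
        sub("^\\.","")              # ignore the leading dot of hidden files
        w=split($0,a,".")
        e=tolower(w==1?"":"."a[w])  # "" means no extension
        s[e]+=sz
        n[e]+=1
    }
    END {
        for (e in s) print 512*s[e]"\t"n[e]"\t*"e
    }'
}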
Usage: duext path. The default path is . (the current directory). The function should work in sh and compatible shells.
The function generates lines in the following form:
s<tab>n<tab>e
where s is the disk space used (in bytes), n is the number of files, and e is the extension. This is different from your requested output because I decided to optimize for parsing. What you call "extension" is just a part of the filename in *nix. Filenames may contain spaces or tabs. Placing e (which may contain spaces or tabs) at the end of the line lets us recognize the other fields reliably. E.g. you can sort by size easily:
duext /home/various/ | sort -rn -k1,1 # optionally: … | column -t
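Because the extension comes last, the first two fields can be read without caring what the extension contains. As a minimal sketch (the helper name duext_total is hypothetical and the input is sample data, not real duext output), this sums the sizes and file counts over all lines:

```sh
# duext_total: hypothetical helper. Reads "size<TAB>count<TAB>extension"
# lines and prints the grand total as "size<TAB>count".
duext_total() {
    awk -F '\t' '
        { size += $1; count += $2 }      # the extension field is ignored
        END { printf "%d\t%d\n", size, count }
    '
}

# Sample input stands in for a real directory tree.
printf '2048\t3\t*.txt\n512\t1\t*\n' | duext_total
# prints: 2560<TAB>4
```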
Notes:
- Newline characters in pathnames will make the results incorrect.
- POSIXLY_CORRECT= du … is a portable way to get disk size used. It reports in units of 512 bytes, hence 512*s[e] later in the awk code. GNU du provides some interesting options (e.g. --apparent-size); they may require adjusting the awk code.
- sub("^\\.","") is responsible for not treating the leading dot in the name as the extension separator. In effect .nfo is interpreted as a (hidden) file without extension rather than a file with the nfo extension. If this is not what you want, remove the line.
- The code tells apart an empty extension (e.g. foo.) from no extension (foo). The former is reported as "*."; the latter is reported as "*".
- The code is case-insensitive. Remove tolower to make it case-sensitive.
- Hardlinks can distort the result. Your du may or may not omit a file if it's a hardlink to some already accounted file. Additionally, find … -exec du {} + runs du as many times as it needs (to avoid "argument list too long") and hardlinked files may or may not be passed to the same du. You can force counting every single hardlink by using du -l (a non-portable option in GNU du) or portably by running one du per file: find … -exec du {} \;. To reliably count hardlinks just once, you need a different approach (a single instance of GNU du and --files0-from=?). In general it's possible to have hardlinks with different extensions. This is not a problem when you want to count each hardlink separately, but if you want to count them as one file then it's indeterminate which extension to assign.
Customizing format
I'm not sure if by MB you mean mebibytes or megabytes; I assume the latter. The following code should translate to the format you want:
yourformat() { awk '
    function human(x,    u) {       # u is local: remaining unit prefixes
        if (x<1000) {return x} else {x/=1000}
        u="kMGTPEZY"
        while (x>=1000 && length(u)>1)
            {x/=1000; u=substr(u,2)}
        return int(10*x+0.5)/10 substr(u,1,1)
    }
    {
        s=$1; n=$2
        $1=""; $2=""
        sub("^ +","")               # drop the separators left by the emptied fields
        print $0" "n" file"(n==1?"":"s")", "human(s)"B"
    }'
}
(Note: human(x) was taken from this answer and adjusted.)
Use it like this:
duext /home/various/ | yourformat
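To see how the rounding and prefix selection behave, here is a stand-alone sketch of the same conversion with a factor of 1000. The wrapper name human_si is hypothetical, and the byte counts are made-up sample values.

```sh
# human_si: hypothetical stand-alone wrapper. Reads one byte count per line
# and prints it with a decimal (factor 1000) prefix, rounded to one decimal.
human_si() {
    awk '
    function human(x,    u) {            # u is a local variable
        if (x < 1000) return x
        x /= 1000
        u = "kMGTPEZY"
        while (x >= 1000 && length(u) > 1) { x /= 1000; u = substr(u, 2) }
        return int(10 * x + 0.5) / 10 substr(u, 1, 1)
    }
    { print human($1) "B" }'
}

printf '%s\n' 999 1500 1234567 | human_si
# prints:
# 999B
# 1.5kB
# 1.2MB
```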
duext uses awk internally and now we're piping it to yourformat, which also uses awk. Overall we could use a single awk in a single function instead. Still, separate awks allow us to put e.g. sort … in between (in a single shell function or in a pipe between functions). While some kind of sorting can be implemented in awk (or at least in GNU awk), there is no point in reinventing the wheel. IMO keeping the output from the first awk easily parsable is the right thing. This way you can apply any filter and formatting later.
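As an illustration of such a filter, the sketch below keeps only the N largest entries before any formatting happens. The name top_by_size is hypothetical, and the input is sample data in the size<tab>count<tab>extension form.

```sh
# top_by_size N: hypothetical filter for "size<TAB>count<TAB>extension"
# lines. Keeps the N entries with the largest size (default: 5).
top_by_size() {
    sort -rn -k1,1 | head -n "${1-5}"
}

# Sample data stands in for real output of the first stage.
printf '512\t1\t*\n4096\t2\t*.pdf\n1024\t7\t*.txt\n' | top_by_size 2
# prints:
# 4096<TAB>2<TAB>*.pdf
# 1024<TAB>7<TAB>*.txt
```

In a real pipeline it would sit between the two functions, e.g. duext /home/various/ | top_by_size 10 | yourformat.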
Let's improve your format, so column -t can be used. And how about a factor of 1024?
myformat() { awk '
    function human(x,    u) {       # u is local: remaining unit prefixes
        if (x<1000) {return x" "} else {x/=1024}
        u="KMGTPEZY"
        while (x>=1000 && length(u)>1)
            {x/=1024; u=substr(u,2)}
        return int(10*x+0.5)/10" "substr(u,1,1)"i"
    }
    {
        s=$1; n=$2
        $1=""; $2=""
        sub("^ +","")               # drop the separators left by the emptied fields
        print $0"\t"n" file"(n==1?"":"s")"\t"human(s)"B"
    }'
}
And then:
duext /home/various/ | sort -nr -k1,1 | myformat | column -t -s "$(printf '\t')"
Notes:
- "$(printf '\t')" is a portable way to get a tab character. In some shells (e.g. in Bash) $'\t' does the same.
- column itself is not portable.
- Extensions with tab characters will break the formatting. They are rather rare though.
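Since column is not guaranteed to exist, one possible workaround is to fall back to the raw tab-separated output when it is missing. This is only a sketch: tabulate is a hypothetical wrapper, and the two input lines are sample data.

```sh
# tabulate: hypothetical wrapper. Aligns tab-separated input with column
# when that utility is installed; otherwise passes the input through as is.
tabulate() {
    tab=$(printf '\t')                   # portable way to obtain a tab character
    if command -v column >/dev/null 2>&1; then
        column -t -s "$tab"
    else
        cat
    fi
}

printf '*.pdf\t2 files\t4 MiB\n*.txt\t17 files\t1.5 MiB\n' | tabulate
```

Either way the output has one line per extension; only the alignment differs.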
Frankly I like this solution enough to keep it. I created a script named due for future use:
#!/bin/sh
duext() {
…
}
myformat() {
…
}
duext "${1-.}" | sort -nr -k1,1 | myformat | column -t -s "$(printf '\t')"
.pdf is different than .PDF? What about files that have no extensions? – fpmurphy Jan 14 '21 at 16:02