find is great, and I use it all the time, for tasks much more complicated than this... but sometimes figuring out all of find's options, getting it to do what you want, and then piping the results through other programs like sort, grep, uniq, etc. is a bit of a PITA. It seems simpler to just write your own custom tool to do exactly what you want, in a language with a decent library for recursively searching directories, and to do it with a decent editor rather than the shell's command-line editor.
So you end up writing yet another minor variation of something like the following. Change the wanted subroutine, and you change what the find function discovers. This one prints out a list of directories containing at least one regular file <= 10 MiB in size, with a filename ending with .mp4:
$ cat find-mp4-1.pl
#!/usr/bin/perl
use strict;
use File::Find;

my %found;

sub wanted {
  # $_ is the basename of the current file: count its directory if it is a
  # regular file of at most 10 MiB whose name ends in ".mp4"
  -f $_ && -s $_ <= 10485760 && /\.mp4\Z/s &&
    $found{$File::Find::dir . "/"}++;
};

# Search all directories listed on command line.
# Default to current directory
find(\&wanted, @ARGV ? @ARGV : '.');

print join("\n", sort keys %found), "\n" if %found;
I've written so many little File::Find scripts like this that I've lost count.
Sample run:
$ mkdir videos
$ touch video1.mp4 videos/video2.mp4
$ ./find-mp4-1.pl
./
./videos/
And then you realise that it would sometimes be useful to have NUL-separated output, so it needs a -0 option. And once that's done, you think that being able to specify the required size on the command line would be nice, and ditto for the filename pattern to match, and an option for searching case-insensitively would be great, and so would being able to use "human-readable" sizes, and I could make it a little bit faster by pre-compiling the regex and matching only against the basename portion of the filename (who doesn't love a little bit of premature optimisation?) and... you get carried away and do this:
$ cat find-mp4-2.pl
#!/usr/bin/perl
use strict;
use File::Find;
use Number::Bytes::Human qw(parse_bytes);
use Getopt::Std;

my %found;
my %opts;

$Getopt::Std::STANDARD_HELP_VERSION=1;
our $VERSION='0.2';

# -0 and -i are flags, -s and -r take an argument
getopts('0s:r:i', \%opts) ||
  die "Usage: $0 [-0] [-s size] [-r regex] [-i] [directory...]\n";

my $sep   = $opts{0} ? "\0" : "\n";
my $size  = $opts{s} // '10MiB';
my $regex = $opts{r} // '\.mp4\Z';

$size = parse_bytes($size);

# pre-compile the regex: case insensitive or case sensitive?
$regex = $opts{i} ? qr/$regex/si : qr/$regex/s;

sub wanted {
  # $_ is the basename of the current file, so the regex is only
  # matched against that, not against the full path
  -f $_ && -s $_ <= $size && $_ =~ $regex &&
    $found{$File::Find::dir . "/"}++;
};

find(\&wanted, @ARGV ? @ARGV : '.');

print join($sep, sort keys %found), $sep if %found;
Note: File::Find and Getopt::Std are core perl modules and are included with perl. Number::Bytes::Human is not; it needs to be installed separately (on Debian and derivatives: sudo apt-get install libnumber-bytes-human-perl. Other distros may have it packaged too. Otherwise, install it with cpan).
Or just delete the use Number::Bytes::Human qw(parse_bytes); and $size = parse_bytes($size); lines and specify file sizes in bytes like some primitive cave-person.
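If installing the module isn't an option and raw byte counts feel too primitive, a rough stand-in for parse_bytes could be sketched in plain perl. This is not how Number::Bytes::Human does it: size_to_bytes is a made-up name, and treating every suffix (K, KB, KiB, ...) as a power of 1024 is an assumption of this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for parse_bytes(): converts strings such as
# "10MiB", "512K" or "1.5G" to a number of bytes.
# Assumption: every suffix is a power of 1024, SI-style or not.
sub size_to_bytes {
  my ($str) = @_;
  my %power = (K => 1, M => 2, G => 3, T => 4);

  # number, then an optional unit letter with optional "B" or "iB"
  $str =~ /\A\s*(\d+(?:\.\d+)?)\s*(?:([KMGT])(?:iB|B)?|B)?\s*\z/i
    or die "Can't parse size '$str'\n";

  my ($num, $unit) = ($1, uc($2 // ''));
  return int($num * 1024 ** ($power{$unit} // 0));
}

print size_to_bytes($_), "\n" for qw(10MiB 512K 4096);
```

With that in the script, the $size = parse_bytes($size); line would become $size = size_to_bytes($size); and the use Number::Bytes::Human line could go.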
And then you think "hmmm... maybe I should have used Getopt::Long instead of Getopt::Std to be able to handle --long options too, and having a -c option to output the number of matches in a directory might be useful, and it needs documentation, and ...". Maybe you even start modifying it to do that before you realise, "No! This is madness. Tool-making is fun, but enough is enough."
You know, just as some hypothetical example of what someone crazy enough might do, not naming any names or anything. I can stop any time I want to. Where's my sponsor's phone number? I think I need to call them.
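For the record, the option handling in that hypothetical Getopt::Long version might start out something like the following. It's only a sketch: the long option names (--null, --size, --regex, --ignore-case, --count) and the parse_options helper are invented for illustration, and nothing here actually implements -c yet.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical option names; each long option keeps its short form too.
sub parse_options {
  my @args = @_;
  my %opts = (size => '10MiB', regex => '\.mp4\Z');   # same defaults as before

  Getopt::Long::Configure(qw(bundling no_ignore_case));
  GetOptionsFromArray(\@args, \%opts,
    'null|0', 'size|s=s', 'regex|r=s', 'ignore-case|i', 'count|c',
  ) or die "Usage: $0 [-0] [-s size] [-r regex] [-i] [-c] [directory...]\n";

  # Whatever is left in @args is the list of directories to search.
  return (\%opts, @args ? @args : '.');
}

my ($opts, @dirs) = parse_options(@ARGV);
print "$_=$opts->{$_}\n" for sort keys %$opts;
print "dirs: @dirs\n";
```

The options end up in a hash keyed by the first name in each spec, so the rest of the script would test $opts->{null} and $opts->{'ignore-case'} rather than $opts{0} and $opts{i}.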
BTW, to print only the directories that contain exactly one matching video, you could change the print join ... line to:
foreach (sort keys %found) {
  print "$_\n" if $found{$_} == 1
};
(or print "$_$sep" ... for the second version)
Note that this would print directories that contained more than one .mp4 file where only one of them was <= 10 MiB. To exclude those, you'd have to modify the wanted subroutine so that they never made it into the %found hash (or were deleted from it before the find() function finishes). Maybe by using another hash to keep track of directories where more than one .mp4 file was found, something like:
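sub wanted {
  # leave the subroutine with return, not next: we're not inside a loop here
  return unless -f $_ && $File::Find::name =~ /\.mp4\Z/s;
  my $d = $File::Find::dir . '/';
  $seen{$d}++;
  if ($seen{$d} > 1) {
    # a second .mp4 file disqualifies the directory
    delete $found{$d};
  } else {
    $found{$d} = 1 if -s $_ <= 10485760;
  }
};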
and change the my %found; line to my (%found, %seen);
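To check that logic without hunting around for suitable videos, something like this self-contained sketch can be run as-is. It builds a throwaway tree with File::Temp and uses a 10-byte limit instead of 10485760 so the test files stay tiny; the directory and file names are made up for the test.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

my (%found, %seen);
my $limit = 10;   # bytes; assumed stand-in for 10485760

sub wanted {
  return unless -f $_ && /\.mp4\z/;
  my $d = $File::Find::dir . '/';
  if (++$seen{$d} > 1) {
    delete $found{$d};               # second .mp4: directory is out
  } else {
    $found{$d} = 1 if -s $_ <= $limit;
  }
}

# Hypothetical test tree: relative path => file size in bytes
my %files = (
  'one/a.mp4' => 5,                      # exactly one small .mp4
  'two/a.mp4' => 5, 'two/b.mp4' => 50,   # two .mp4 files, one small
  'big/a.mp4' => 50,                     # one .mp4, but too large
);

my $top = tempdir(CLEANUP => 1);
for my $path (sort keys %files) {
  my ($dir) = $path =~ m{\A([^/]+)/};
  mkdir "$top/$dir" unless -d "$top/$dir";
  open(my $fh, '>', "$top/$path") or die "$top/$path: $!\n";
  print {$fh} 'x' x $files{$path};
  close($fh);
}

find(\&wanted, $top);

# Print the surviving directories relative to the temporary top directory
for (sort keys %found) {
  (my $rel = $_) =~ s/\A\Q$top\E\///;
  print "$rel\n";
}
```

Only one/ should be printed: two/ is dropped as soon as its second .mp4 turns up, in whichever order find visits them, and big/ never gets into %found at all.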