Recursively search all directories that have one mp4 file whose size is less than 10MB

Question

I am trying to recursively find all directories that have one mp4 file whose size is less than 10MB.

The conditions are:

There must be only one mp4 file in the directory.
The mp4 file can not be more than 10MB.

The command I am using is

% find . -type f -name "*.mp4" -size -10M | cut -d/ -f2 | sort | uniq -c | grep "^      1"

I am not sure what is going on, but this command is not returning accurate results.

After further investigation I found that the following command works.

find . -type 'f' -name "*.mp4" -printf '%h\n' | sort | uniq -c | grep -E "\s+1\s"| cut -c 9-

But when I add -size -10000000c to the mix, it also finds directories that have one mp4 file smaller than 10MB alongside other mp4 files larger than 10MB. In other words, the command does not take the mp4 files larger than 10MB into consideration. I think the problem can be broken into two steps.

Find all directories that have one mp4 file, which is done by the command above.
Check if the files are less than 10MB.

I can get the file size of the single mp4 file in each such directory using:

find . -type 'f' -name "*.mp4" -printf '%h\n' | sort | uniq -c | grep -E "\s+1\s" | cut -c 9-| xargs -I {} -n 1 /usr/bin/du -a "{}" | grep -v ".mp4$"

What does that command return? And why isn't that accurate? – Henrik supports the community Dec 22 '21 at 19:00
@Henriksupportsthecommunity I have updated the question. – Ahmad Ismail Dec 23 '21 at 04:37

Stéphane Chazelas · Accepted Answer · 2021-12-23T06:21:49.113

With GNU find at least, -size -10M is true for files whose size rounded up to the next mebibyte is strictly less than 10, so 9 or less.

A file that is 9 x 1024 x 1024 + 1 = 9437185 bytes large is not selected because its size is rounded up to 10MiB, which is not < 10.

For files that are strictly smaller than 10MB (1 megabyte is 1,000,000 bytes, not to be confused with 1 mebibyte == 1,048,576 bytes), so sizes 0 to 9,999,999, use:

find . -size -10000000c

For files strictly smaller than 10MiB, so sizes from 0 to 10485759:

find . -size -10485760c
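
To see the rounding in action, here is a small sketch that can be run in a scratch directory. It assumes GNU find and GNU coreutils (mktemp -d, truncate); the file names a.mp4 and b.mp4 and the variables demo_dir, by_mib and by_bytes are made up for the illustration.

```shell
# Sketch only: assumes GNU find and GNU coreutils; all names are illustrative.
demo_dir=$(mktemp -d)
truncate -s 9437185 "$demo_dir/a.mp4"   # 9 MiB + 1 byte: rounded up to 10 MiB
truncate -s 9437184 "$demo_dir/b.mp4"   # exactly 9 MiB: stays at 9 MiB

# Unit-based test: sizes are rounded up to whole MiB before the comparison.
by_mib=$(find "$demo_dir" -type f -size -10M -printf '%f\n' | sort | paste -sd ' ' -)
# Byte-based test: no rounding surprises.
by_bytes=$(find "$demo_dir" -type f -size -10000000c -printf '%f\n' | sort | paste -sd ' ' -)

printf 'matched by -size -10M:       %s\n' "$by_mib"
printf 'matched by -size -10000000c: %s\n' "$by_bytes"
rm -rf "$demo_dir"
```

With GNU find, only b.mp4 should be matched by -size -10M, while the byte-based test should match both files.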

Now, to get the directories containing one and only one of those files, on a GNU system, you can do:

LC_ALL=C find . -name '*.mp4'  -type f -size -10000000c -printf '%h\0' |
  LC_ALL=C sort -z |
  LC_ALL=C uniq -zu |
  tr '\0' '\n'

Where

find prints the head (dirname) of those files, NUL delimited (note the LC_ALL=C to report all filenames ending in .mp4 even those whose name is otherwise not valid text in the current locale).
sort sorts them for uniq (again, with LC_ALL=C to avoid problems with filenames that are not valid text in the locale, and other problems with characters with not completely defined order).
uniq -zu reports only the unique ones.

The list of directories is passed between those commands NUL-delimited, as NUL is the only character that cannot occur in a file path. We only convert those NULs to newlines at the end, with tr, for human consumption.
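
As a rough illustration of why the NUL delimiters matter, the sketch below builds a throwaway tree in which one directory name contains a newline. It assumes GNU find, sort and uniq; the tree layout, the list_dirs helper and the variable names are invented for the example.

```shell
# Sketch only: assumes GNU find/sort/uniq; the tree and all names are made up.
tree=$(mktemp -d)
nl='
'
mkdir -p "$tree/one" "$tree/two" "$tree/odd${nl}name"
touch "$tree/one/a.mp4" "$tree/two/a.mp4" "$tree/two/b.mp4" \
      "$tree/odd${nl}name/c.mp4"

# Same pipeline as above, without the final tr, run from inside the tree.
list_dirs() {
  ( cd "$tree" &&
    LC_ALL=C find . -name '*.mp4' -type f -size -10000000c -printf '%h\0' |
      LC_ALL=C sort -z |
      LC_ALL=C uniq -zu )
}

# Counting NUL bytes gives the real number of directories reported...
nul_count=$(list_dirs | tr -cd '\0' | wc -c)
# ...while counting lines after tr miscounts the name containing a newline.
line_count=$(list_dirs | tr '\0' '\n' | wc -l)
echo "directories: $nul_count, lines: $line_count"
rm -rf "$tree"
```

Two directories qualify here (one and the oddly named one), yet the newline-converted output spans three lines, which is why the conversion is best left as the very last step.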

With zsh, you could also do:

print -rC1 -- **/*(NFe['()(( $# == 1 )) $REPLY/*.mp4(N.L-10000000Y2)'])

Where:

print -rC1 -- prints its arguments raw on 1 Column
**/ is any number of subdirectories.
*(NF...) is any filename (excluding hidden ones) but further qualified by those N, F, e... glob qualifiers.
N: enables nullglob for that glob so that it expands to nothing instead of returning an error if there's no match.
F: selects Full directories (directories with at least one entry other than . and ..).
e[code]: selects files for which the code is successful.
() {body} arguments is an anonymous function taking a number of arguments.
The {body} here is the (( $# == 1 )) arithmetic evaluation that returns true if the number of arguments to that anonymous function is 1.
$REPLY inside the code is the path to the file (here directory) being considered.
*.mp4(qualifiers): (non-hidden) mp4 files further qualified.
.: regular files only (like find's -type f).
L-10000000: files strictly smaller than 10MB.
Y2: stop after finding 2 files as an optimisation.

Note that it doesn't consider . (the current working directory itself). If you want it to be considered, replace **/* with {.,**/*}.

Now, as you've since clarified, if you want to find directories that contain only one mp4 file in total, with that file being regular (not a directory, symlink...) and smaller than 10MB (so, for instance, excluding a directory that contains both a 5MB and a 15MB mp4 file: it has only one mp4 file under 10MB, but more than one mp4 file in total regardless of size), still with zsh:

print -rC1 -- **/*(NFe['
    () {
      (( $# == 1 )) && ()(($#)) $1(N.L-10000000)
    } $REPLY/*.mp4(NY2)
  '])

With GNU find and GNU awk (or any awk that can deal with NUL-delimited records), that could be:

LC_ALL=C find . -name '*.mp4' -printf '%h\0%s\0%y\0' |
  awk -v RS='\0' '
   {
     getline size; getline type
     total[$0]++
     if (size < 10e6 && type == "f") found[$0]++
   }
   END {for (dir in found) if (total[dir] == 1) print dir}'
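
To check any of these approaches against a known tree, a slower but plainer sketch using only sh and find may help. It assumes GNU find and GNU coreutils; strict_single_mp4_dirs is a made-up helper name, the tree layout is invented, and the output is newline-delimited, so it is only meant for eyeballing results on names without newlines.

```shell
# Sketch only: assumes GNU find/coreutils; strict_single_mp4_dirs is hypothetical.
strict_single_mp4_dirs() {
  find "${1:-.}" -type d -exec sh -c '
    for dir do
      # All *.mp4 entries directly inside $dir, whatever their type or size.
      total=$(find "$dir" -mindepth 1 -maxdepth 1 -name "*.mp4" -printf x | wc -c)
      # Only the regular ones strictly below 10,000,000 bytes.
      small=$(find "$dir" -mindepth 1 -maxdepth 1 -name "*.mp4" -type f \
                -size -10000000c -printf x | wc -c)
      if [ "$total" -eq 1 ] && [ "$small" -eq 1 ]; then
        printf "%s\n" "$dir"
      fi
    done' sh {} +
}

# Throwaway tree: ok has one 5MB file, mixed has 5MB + 15MB, big has one 15MB.
tree=$(mktemp -d)
mkdir "$tree/ok" "$tree/mixed" "$tree/big"
truncate -s 5000000  "$tree/ok/a.mp4" "$tree/mixed/a.mp4"
truncate -s 15000000 "$tree/mixed/b.mp4" "$tree/big/b.mp4"

result=$(cd "$tree" && strict_single_mp4_dirs .)
echo "$result"
rm -rf "$tree"
```

Only ./ok should be listed: mixed is rejected because it holds two mp4 files in total, and big because its single mp4 file is too large. It runs two extra find invocations per directory, so the awk and zsh versions are the better choice on large trees.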

In both the zsh and bash solutions, I am getting directories that have more than one mp4 file. The conditions are: 1. There must be only one mp4 file in the directory. 2. The mp4 file can not be more than 10MB. – Ahmad Ismail Dec 23 '21 at 03:31
The command find . -type 'f' -name "*.mp4" -printf '%h\n' | sort | uniq -c | grep -E "\s+1\s" works. But when I add -size -10000000c to the mix, it also finds directories that have one mp4 file smaller than 10MB alongside other mp4 files larger than 10MB. In other words, the command does not take the mp4 files larger than 10MB into consideration. I think I should break the problem into two steps. 1. Find all directories that have one mp4 file, which is done by the command above. 2. Check if the files are less than 10MB. – Ahmad Ismail Dec 23 '21 at 03:53

cas · Answer 2 · 2021-12-31T10:19:24.277

find is great, and I use it all the time, for tasks much more complicated than this...but sometimes figuring out all of find's options and getting it to do what you want and then using other programs like sort, grep, uniq, etc is a bit of a PITA and it seems simpler to just write your own custom tool to do exactly what you want in a language with a decent library for recursively searching directories, and do it with a decent editor rather than the shell's command-line editor.

So you end up writing yet another minor variation of something like the following. Change the wanted subroutine, and you change what the find function discovers. This one prints out a list of directories containing at least one regular file <= 10MiB in size, with a filename ending with .mp4:

$ cat find-mp4-1.pl 
#!/usr/bin/perl
use strict;
use File::Find;
my %found;
sub wanted {
  -f $_ && -s $_ <= 10485760 && /\.mp4\Z/s &&
    $found{$File::Find::dir . "/"}++;
};
# Search all directories listed on command line.
# Default to current directory
find(\&wanted, @ARGV ? @ARGV : '.');
print join("\n", sort keys %found), "\n" if %found;

I've written so many little File::Find scripts like this that I've lost count.

Sample run:

$ mkdir videos
$ touch video1.mp4 videos/video2.mp4
$ ./find-mp4-1.pl 
./
./videos/

And then you realise that it would sometimes be useful to have NUL-separated output, so it needs a -0 option. And once that's done, think that being able to specify the required size on the command line would be nice, and ditto for the filename pattern to match, and an option for searching case-insensitively would be great, and so would being able to use "human-readable" sizes, and I could make it a little bit faster by pre-compiling the regex and matching only against the basename portion of the filename (who doesn't love a little bit of premature optimisation) and... you get carried away and do this:

$ cat find-mp4-2.pl
#!/usr/bin/perl
use strict;
use File::Find;
use Number::Bytes::Human qw(parse_bytes);
use Getopt::Std;
my %found;
my %opts;
$Getopt::Std::STANDARD_HELP_VERSION=1;
our $VERSION='0.2';
getopts('0s:r:i', \%opts) ||
  die "Usage: $0 [-0] [-s size] [-r regex] [-i] [directory...]\n";
my $sep   = $opts{0} ? "\0" : "\n";
my $size  = $opts{s} // '10MiB';
my $regex = $opts{r} // '\.mp4\Z';
$size  = parse_bytes($size);
# pre-compile the regex: case insensitive or case sensitive?
$regex = $opts{i} ? qr/$regex/si : qr/$regex/s;
sub wanted {
  -f $_ && -s $_ <= $size && $File::Find::name =~ /$regex/ &&
    $found{$File::Find::dir . "/"}++;
};
find(\&wanted, @ARGV ? @ARGV : '.');
print join($sep, sort keys %found), $sep if %found;

Note: File::Find and Getopt::Std are core perl modules and are included with perl. Number::Bytes::Human is not, it needs to be installed separately (on Debian and derivatives: sudo apt-get install libnumber-bytes-human-perl. Other distros may have it packaged too. Otherwise, install it with cpan).

Or just delete the use Number::Bytes::Human qw(parse_bytes); and $size = parse_bytes($size); lines and specify file sizes in bytes like some primitive cave-person.

And then you think "hmmm...maybe I should have used Getopt::Long instead of Getopt::Std to be able to handle --long options too, and having a -c option to output the number of matches in a directory might be useful and it needs documentation and ...". Maybe you even start modifying it to do that before you realise, "No! This is madness. Tool-making is fun, but enough is enough.".

You know, just as some hypothetical example of what someone crazy enough might do, not naming any names or anything. I can stop any time I want to. Where's my sponsor's phone number? I think I need to call them.

BTW, to print only the directories that contain exactly one matching video, you could change the print join ... line to:

  foreach (sort keys %found) {
    print "$_\n" if $found{$_} == 1
  };

(or print "$_$sep" ... for the second version)

Note that this would print directories that contained more than one .mp4 file where only one of them was <= 10MiB. To exclude those, you'd have to modify the wanted subroutine so that they never made it into the %found hash (or were deleted from it before the find() function finishes). Maybe by using another hash to keep track of directories where more than one .mp4 file was found, something like:

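sub wanted {
  # skip anything that is not a regular file named *.mp4
  return unless -f $_ && $File::Find::name =~ /\.mp4\Z/s;
  my $d = $File::Find::dir . '/';
  $seen{$d}++;
  if ($seen{$d} > 1) {
    # more than one .mp4 in this directory: disqualify it
    delete $found{$d};
  } else {
    $found{$d} = 1 if -s $_ <= 10485760;
  }
};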

and change the my %found; line to my (%found, %seen);
