You can't do this with grep alone. You're going to need to use at least find, and several other programs besides.
Here's one way to do it using GNU versions of find, stat, sort, tail, cut, xargs, grep, and sed:
find . -type f -iname '*.csv' -execdir sh -c '
stat --printf "%Y\t$(pwd)/%n\0" "$@" |
sort -z -n |
tail -z -n 1 |
cut -z -f2- |
xargs -0r grep -l foo' sh {} + | sed 's=/\./=/='
For each directory containing one or more .csv files, the find -execdir option changes to that directory and runs a shell command that outputs a NUL-separated list of the full paths to each matching filename, with each filename prefixed by its modification timestamp and a tab.
That list is then sorted numerically, all but the most recently modified filename are dropped (by tail), the timestamp is cut from the output, and the filenames are piped into xargs, which runs grep.
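To see what the middle of that pipeline does, here's a minimal standalone sketch using made-up timestamps and filenames (the trailing tr is only there to make the NUL-terminated result visible in a terminal):

printf '%s\t%s\0' 1651363200 ./old.csv 1653955200 ./new.csv |
sort -z -n | tail -z -n 1 | cut -z -f2- | tr '\0' '\n'

This prints ./new.csv: the decorated record with the largest timestamp survives, and the timestamp is stripped back off.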
Finally, sed is used to clean up the output, removing the /./ artifact that is embedded by $(pwd)/%n in the stat --printf string and replacing it with just /. This isn't strictly necessary, as the pathname(s) work exactly the same with or without the /./ (Unix doesn't care at all about extra /s or ./s in the path portion of a pathname), but it looks nicer.
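For example, with a hypothetical path:

printf '%s\n' '/data/reports/./newest.csv' | sed 's=/\./=/='

prints /data/reports/newest.csv.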
Notes:
If required, you can use find's -mindepth and -maxdepth predicates to control how find recursively searches the subdirectories.
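For example, this lists only the .csv files exactly one directory level below the starting point; the same two predicates can be dropped straight into the full command above:

find . -mindepth 2 -maxdepth 2 -type f -iname '*.csv'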
Neither grep nor sed uses or produces NUL-separated output here, so the output isn't "safe" to use in a pipeline if any filenames contain newlines, but it's OK-ish if all you want to do is display the filenames in your terminal. To make it safe to pipe into other programs, add the -Z option to grep and -z to sed. With those two changes, the filename list will be NUL-separated from beginning to end.
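For example, the fully NUL-safe variant of the command above would be:

find . -type f -iname '*.csv' -execdir sh -c '
stat --printf "%Y\t$(pwd)/%n\0" "$@" |
sort -z -n |
tail -z -n 1 |
cut -z -f2- |
xargs -0r grep -lZ foo' sh {} + | sed -z 's=/\./=/='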
This won't work correctly if the matching filenames in any individual directory exceed the command-line length limit (ARG_MAX, approx 2MB on Linux), as find will then have to run the sh -c '...' command more than once for that directory, spoiling the desired result of sorting and tailing the filename list. This is worth noting, but is very unlikely to be a problem in practice.
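You can check the actual limit on your system with getconf:

getconf ARG_MAX    # prints e.g. 2097152 (2MB) on a typical Linux system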
Similarly, the stat --printf format expands each filename to include its full path, which may prevent stat from running successfully. This is more likely to be a problem, but still unlikely in practice: you'd need a lot of filenames with a very long path prefix to exceed the 2MB ARG_MAX.
This is an example of a very commonly used technique that is often called "Decorate-Sort-Undecorate" or similar. It has been used by programmers in all sorts of languages for a very long time, since at least when lisp was shiny and new. In this case, find can't sort by timestamp, so if we want to do that, we need to add the timestamp to find's output (decorate), then sort it, then remove the timestamp (undecorate).
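As a separate minimal sketch of the decorate-sort-undecorate idea (not part of the solution above), this lists every file under the current directory from oldest to newest modification time, using GNU find's %T@ format (seconds since the epoch):

# decorate each pathname with its mtime, sort numerically, then strip the mtime off again
find . -type f -printf '%T@\t%p\0' | sort -z -n | cut -z -f2- | tr '\0' '\n'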
As mentioned in one of my comments below, this could also be done with perl's File::Find and IO::Uncompress::AnyUncompress modules:
#!/usr/bin/perl

use File::Find;
use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError);
use Getopt::Std;
use strict;

my %files;   # hash-of-arrays to contain the filename with newest timestamp for each dir
my @matches; # array to contain filenames that contain the desired search pattern
my %opts;    # hash to contain command-line options

sub usage {
  print <<EOF;
$0 [-p 'search pattern'] [-f 'filename pattern'] [directory...]

-p and -f are required, and must have arguments.
directory defaults to current directory.

Example:
$0 -p ABCD-713379 -f 'WB.*.xml.gz\$' /data/inventory/
EOF
  exit 1;
};

# Extremely primitive option processing and error checking.
usage unless getopts('p:f:', \%opts) && $opts{p} && $opts{f};

# default to current directory if not supplied.
@ARGV = qw(./) unless @ARGV;

# Find the newest filename in each subdirectory.
find(\&wanted, @ARGV);

# OK, we should now have a %files hash where the keys are the
# directory names, and the values are an array containing a
# timestamp and the newest filename in that directory.
#
# Now "grep" each of those files by reading in each line and
# seeing if it contains the search pattern.
# IO::Uncompress::AnyUncompress ensures this works with
# compressed and uncompressed files. Works with most common
# compression formats.
#
# The map ... extracts only the filenames from %files - see "perldoc -f map".
foreach my $f (map { $files{$_}[1] } keys %files) {
  my $z = IO::Uncompress::AnyUncompress->new($f) or do {
    warn "anyuncompress failed for '$f': $AnyUncompressError\n";
    next;
  };

  while (my $line = $z->getline()) {
    if ($line =~ m/$opts{p}/i) { push @matches, $f; last };
  };
};

# Output the list of matching filenames, separated by newlines.
print join("\n", @matches), "\n";
#print join("\0", @matches), "\0"; # alternatively, NUL-separated filenames

# "wanted()" subroutine used by File::Find to match files.
sub wanted {
  # ignore directories, symlinks, etc and files that don't
  # match the filename pattern.
  return unless (-f && /$opts{f}/i);

  # Is this the first file we've seen in this dir? Is the current
  # file newer than the one we've already seen?
  # If either is true, store it in %files.
  # Note: File::Find has chdir'd into the file's directory, so stat
  # the basename in $_ rather than $File::Find::name.
  my $t = (stat($_))[9];
  if (!defined $files{$File::Find::dir} || $t > $files{$File::Find::dir}[0]) {
    $files{$File::Find::dir} = [ $t, $File::Find::name ];
  };
};
Ignoring comments, this is about 35 lines of code. Most of it is boilerplate. It took longer to write the comments than the code, as most of the code is just copy-pasted and edited from the man pages for the modules or from similar scripts I've written before.
Run it as, e.g., ./find-and-grep.pl -f '\.csv$' -p foo ./
or ./find-and-grep.pl -p ABCD-713379 -f 'WB.*\.xml\.gz$' /data/inventory/