2

I have a somewhat unusual requirement. Inside a folder there are many subfolders, and inside each subfolder there are many CSV files. It looks like this:

SubfolderWB
>File1.csv
>File2.csv

SubfolderMUM
>File3.csv
>File4.csv
>file5.csv

SubfolderKEL
>File6.csv
>File7.csv

Now, in each subfolder I need to pick the last file (or latest file created) and match it against a keyword using grep. If the keyword matches, I need the file name.

Example: I need to find foo in the CSV files in all subfolders.
So I need to pick those files and run something like cat SubfolderWB/File2.csv SubfolderMUM/file5.csv SubfolderKEL/File7.csv | grep foo.
If foo is present in file5.csv, it should give me the final output file5.csv.

  • 1
    Regarding pick the last file(or latest file created) - Unix doesn't store file creation times, is most recently modified good enough or highest number at the end of the file name? Are there other directories ("folder" is a Windows term, in Unix it's "directory") under SubfolderMUM, etc. or is it just 1 layer of subdirectories with only files under them? – Ed Morton May 31 '22 at 14:33
  • @amit here the answers use the modification time. see also this post for modify/change/access time. – thanasisp May 31 '22 at 21:28
  • 1
    @EdMorton your point about folders/directories is something I was missing, I have remained silent for a long time about this. – thanasisp May 31 '22 at 21:30
  • @thanasisp yeah, I usually bit my tongue too but once in a while I mention it. – Ed Morton May 31 '22 at 21:41

4 Answers

3

You can't do this with grep alone. You're going to need to use at least find, and several other programs besides.

Here's one way to do it using GNU versions of find, stat, sort, tail, cut, xargs, grep, and sed:

find . -type f -iname '*.csv' -execdir sh -c '
    stat --printf "%Y\t$(pwd)/%n\0" "$@" |
      sort -z -n |
      tail -z -n 1 |
      cut -z -f2- |
      xargs -0r grep -l foo' sh {} + | sed 's=/\./=/='

For each directory containing one or more .csv files, the find -execdir option changes to that dir and runs a shell command that outputs a NUL-separated list of the full paths to each matching filename, with each filename prefixed by its modification timestamp and a tab.

That list is then sorted numerically; all but the most recently modified filename are dropped (by tail), the timestamp is cut from the output, and the remaining filenames are piped into xargs, which runs grep.

Finally, sed is used to clean up the output to remove the /./ artifact that is embedded by $(pwd)/%n in the stat --printf string and replace it with just /. This isn't strictly necessary as the pathname(s) work exactly the same with or without the /./ (Unix doesn't care at all about extra /s or ./s in the path portion of a pathname), but it looks nicer.
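
For example (with a made-up absolute path based on the question's layout), a file found in SubfolderMUM comes out of the pipeline as /data/SubfolderMUM/./file5.csv, because -execdir passes the name to stat as ./file5.csv. The final sed tidies it up:

echo '/data/SubfolderMUM/./file5.csv' | sed 's=/\./=/='

which prints /data/SubfolderMUM/file5.csv.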


Notes:

  1. If required, you can use find's -mindepth and -maxdepth predicates to control how find recursively searches the subdirectories.

  2. Neither grep nor sed uses or produces NUL-separated output here, so the pipeline isn't "safe" to feed into other programs if any filenames contain newlines, but it's OK-ish if all you want to do is display the filenames in your terminal. To make it safe to pipe into other programs, add the -Z option to grep and -z to sed; with those two changes, the filename list will be NUL-separated from beginning to end (see the sketch after these notes).

  3. This won't work correctly if the matching filenames in any individual directory exceed the command-line length limit (ARG_MAX, approx 2MB on Linux), as it will have to run the sh -c '...' more than once for that directory, spoiling the desired result of sorting and tailing the filename list. This is worth noting but is very unlikely to be a problem in practice.

    Similarly, the stat --printf expands each filename to include its full path, which may prevent stat from running successfully. This is more likely to be a problem, but still not likely in practice: it would take lots of filenames with a very long path prefix to exceed the roughly 2MB ARG_MAX.

  4. This is an example of a very commonly used technique that is often called "Decorate-Sort-Undecorate" or similar. It has been used by programmers in all sorts of languages for a very long time, since at least when lisp was shiny and new. In this case, find can't sort by timestamp so if we want to do that, we need to add the timestamp to find's output (decorate), then sort it, then remove the timestamp (undecorate).
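
For reference, here is the pipeline again with the NUL-safe changes from note 2 applied (grep -Z and sed -z); this is a sketch of that tweak rather than a separately tested variant:

find . -type f -iname '*.csv' -execdir sh -c '
    stat --printf "%Y\t$(pwd)/%n\0" "$@" |
      sort -z -n |
      tail -z -n 1 |
      cut -z -f2- |
      xargs -0r grep -lZ foo' sh {} + | sed -z 's=/\./=/='

With those two options the filename list stays NUL-separated from beginning to end, so it can be piped safely into other NUL-aware tools such as xargs -0.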


As mentioned in one of my comments below, this could also be done with perl's File::Find and IO::Uncompress::AnyUncompress modules:

#!/usr/bin/perl

use File::Find;
use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError);
use Getopt::Std;
use strict;

my %files;   # hash-of-arrays to contain the filename with newest timestamp for each dir
my @matches; # array to contain filenames that contain the desired search pattern
my %opts;    # hash to contain command-line options

sub usage {
  print <<EOF;
$0 [-p 'search pattern'] [-f 'filename pattern'] [directory...]
-p and -f are required, and must have arguments.
directory defaults to current directory.

Example: $0 -p ABCD-713379 -f 'WB.*.xml.gz$' /data/inventory/
EOF
  exit 1
};

# Extremely primitive option processing and error checking.
usage unless getopts('p:f:', \%opts) && $opts{p} && $opts{f};

# default to current directory if not supplied.
@ARGV = qw(./) unless @ARGV;

# Find the newest filename in each subdirectory
find(\&wanted, @ARGV);

# OK, we should now have a %files hash where the keys are the
# directory names, and the values are an array containing a
# timestamp and the newest filename in that directory.
#
# Now "grep" each of those files by reading in each
# line and seeing if it contains the search pattern.
#
# IO::Uncompress::AnyUncompress ensures this works with
# compressed and uncompressed files. Works with most common
# compression formats.
#
# The map ... extracts only the filenames from %files - see "perldoc -f map"
foreach my $f (map { $files{$_}[1] } keys %files) {
  my $z = IO::Uncompress::AnyUncompress->new($f) or
    warn "anyuncompress failed for '$f': $AnyUncompressError\n";

  while (my $line = $z->getline()) {
    if ($line =~ m/$opts{p}/i) { push @matches, $f ; last };
  };
};

# Output the list of matching filenames, separated by newlines.
print join("\n",@matches), "\n";
#print join("\0",@matches), "\0"; # alternatively, NUL-separated filenames

# "wanted()" subroutine used by File::Find to match files
sub wanted {
  # ignore directories, symlinks, etc and files that don't
  # match the filename pattern.
  return unless (-f && /$opts{f}/i);

  # Is this the first file we've seen in this dir? Is the current
  # file newer than the one we've already seen?
  # If either is true, store it in %files.
  my $t = (stat($File::Find::name))[9];
  if (!defined $files{$File::Find::dir} || $t > $files{$File::Find::dir}[0]) {
    $files{$File::Find::dir} = [ $t, $File::Find::name ];
  };
};

Ignoring comments, this is about 35 lines of code. Most of it is boilerplate. It took longer to write the comments than the code, as most of it was just copy-pasted and edited from the man pages for the modules or from similar scripts I've written before.

Run it as, e.g., ./find-and-grep.pl -f '\.csv$' -p foo ./.

or ./find-and-grep.pl -p ABCD-713379 -f 'WB.*\.xml\.gz$' /data/inventory/
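
If you swap the two print lines at the end of the script so it emits the NUL-separated list instead, the output can be fed straight into NUL-aware tools; a minimal sketch of that usage:

./find-and-grep.pl -f '\.csv$' -p foo ./ | xargs -0r ls -l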

cas
  • 78,579
  • Thanks for the help!!........ In my case file size is 50 MB I think that is why no result is displayed – amit10504 May 31 '22 at 09:25
  • Nope, grep doesn't have arbitrary limits on file size so this will work with files of any size. if a .csv file contains foo (or whatever it is you're searching for), then this will find it. There must be something you haven't mentioned about your files and/or directory structure. – cas May 31 '22 at 09:31
  • I have changed it to this – amit10504 May 31 '22 at 09:37
  • find /data/inventory/ -type f -iname '*WB*.xml.gz' -execdir sh -c ' stat --printf "%Y\t$(pwd)/%n\0" "$@" | sort -z -n | tail -z -n 2 | cut -z -f2- | xargs -0r grep -l ABCD-713379' sh {} + | sed 's=/\./=/=' – amit10504 May 31 '22 at 09:41
  • you'll need to use zgrep if your files are gzip compressed. note that zgrep does not support either the -z or -Z options so NUL-separated output won't be possible. If you need that, write a perl script instead (perl has a File::Find core module that does recursive filename searches like find, as well as modules like PerlIO::Gzip for reading and/or writing gzipped files). – cas May 31 '22 at 09:41
  • Thanks Cas!! It worked perfectly. Appreciate the effort for explanation. – amit10504 May 31 '22 at 09:48
  • Actually, IO::Uncompress::AnyUncompress would be better than PerlIO::Gzip. It's a core perl module (so is included with perl) and supports a lot more compression formats: gzip, zip, bzip2, zstd, xz, lzma, lzip, lzf and lzop. BTW, python would also be a good choice for writing a script to do this, it also has good library modules for doing these kinds of tasks. – cas May 31 '22 at 09:50
  • You can omit stat, using -printf '%Ts %p\0' (or similar) with the super-duper find. – thanasisp May 31 '22 at 15:07
  • @thanasisp no, I can't. It has to find the most recent file in each directory, not the most recent file in any of the subdirs. Using -execdir and stat rather than just -exec was the simplest/easiest way of achieving this in a shell one-liner. – cas May 31 '22 at 16:01
  • Yes, right, you 'd have to modify the following commands also (sort by parent and time, get first by parent), or, for every dir do find (it is about depth == 2). – thanasisp May 31 '22 at 16:41
2

Given a set of subdirectories with files

 % tree -tD --timefmt='%H:%M:%S'
.
├── [07:46:40]  SubfolderKEL
│   ├── [07:46:20]  File1
│   ├── [07:46:24]  File3
│   ├── [07:46:26]  File4
│   ├── [07:46:30]  File6
│   ├── [07:46:32]  File7
│   ├── [07:46:34]  File8
│   ├── [07:46:36]  File9
│   └── [08:05:32]  File11
├── [07:46:54]  SubfolderWB
│   ├── [07:46:38]  File10
│   ├── [07:46:48]  File15
│   ├── [07:46:52]  File17
│   └── [07:46:54]  File18
└── [07:46:58]  SubfolderMUM
    ├── [07:46:22]  File2
    ├── [07:46:28]  File5
    ├── [07:46:42]  File12
    ├── [07:46:44]  File13
    ├── [07:46:46]  File14
    ├── [07:46:50]  File16
    ├── [07:46:56]  File19
    └── [07:46:58]  File20

3 directories, 20 files

then using zsh, you could pick the newest file (by modification time) from each subdirectory using glob qualifiers in an anonymous function:

 % for d (Subfolder*(/)) (){ print -rC1 $@ } $d/*(om[1])
SubfolderKEL/File11
SubfolderMUM/File20
SubfolderWB/File18

Using the same structure, you can grep for content and return the name(s) of files that contain matches:

 % for d (Subfolder*(/)) (){ grep -l foo -- $@ } $d/*(om[1])
SubfolderKEL/File11
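
To restrict each directory's candidates to CSV files as in the question, the glob inside the anonymous function's argument can be narrowed (a small, untested variation on the example above):

 % for d (Subfolder*(/)) (){ grep -l foo -- $@ } $d/*.csv(.om[1])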
steeldriver
  • 81,074
1

Assuming you're OK with file modification times rather than creation times (which Unix doesn't keep), then using GNU find, sort, and awk:

#!/usr/bin/env bash

find . -type f -name '*.csv' -printf '%T@ %p\0' |
sort -srnz |
awk -v RS='\0' '
    ARGIND == 1 {
        match($0,"[^ ]+ ((.*)/[^/]+$)",a)
        if ( !seen[a[2]]++ ) {
            ARGV[ARGC++] = a[1]
        }
    }
    /foo/ {
        print FILENAME
        nextfile
    }
' -
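
To make the decorate step concrete, this is the kind of record find's -printf '%T@ %p\0' produces (the timestamp is made up, and the trailing NUL is shown here as a newline):

1653988000.1234567890 ./SubfolderMUM/file5.csv

sort -srnz orders these records newest first. While awk reads them from stdin (ARGIND == 1), match() splits each record into the full pathname (a[1]) and its directory (a[2]), and only the first record seen per directory, i.e. the newest file, is appended to ARGV. awk then opens those files, and /foo/ { print FILENAME; nextfile } prints the name of each one that contains foo.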

Ed Morton
  • 31,617
1

The following uses the zsh shell and assumes that the variable pattern contains a basic regular expression that you'd like to match.

for dirpath in Subfolder*(/); do
    grep -l -e $pattern $dirpath/*.csv(.om[1])
done

The for loop iterates over all directories in the current directory whose names start with Subfolder. For each such directory, the most recently modified regular file whose name matches the pattern *.csv is given to grep. The grep utility will try to match the given regular expression, and if it matches, it will print the file's name (including the subdirectory name).

The special zsh features used here are the two glob qualifiers (/) and (.om[1]). The first one makes the preceding pattern match only directories, while the second makes the pattern match only regular files, orders the matches by modification timestamp, and returns only the first of the ordered items (i.e., the most recently modified regular file).

The -l option to grep makes it output only the pathname of the file with a match.
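
As a usage sketch (the search string foo is taken from the question's example), set the variable and run the loop:

pattern=foo
for dirpath in Subfolder*(/); do
    grep -l -e $pattern $dirpath/*.csv(.om[1])
done

With the question's example layout, where foo appears only in the newest CSV file of SubfolderMUM, this would print SubfolderMUM/file5.csv.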

Kusalananda
  • 333,661