198

I am using a script to regularly download my gmail messages that compresses the raw .eml into .gz files. The script creates a folder for each day, and then compresses every message into its own file.

I would like a way to search through this archive for a "string."

Grep alone doesn't appear to do it. I also tried SearchMonkey.

Kendor
  • 2,081
  • None of the answers have worked so far for me! Did anything work for you? – achhainsan Sep 18 '23 at 08:42
  • @achhainsan if they don't work for you, please ask a new question, explain exactly what you are trying to do, and exactly how they fail. You can link back to this question for reference. These are very standard approaches, so if they don't work, you are probably requiring something slightly different to what is being requested here. – terdon Sep 18 '23 at 12:34

6 Answers

213

If you want to grep recursively in all .eml.gz files in the current directory, you can use:

find . -name \*.eml.gz -print0 | xargs -0 zgrep "STRING"

You have to escape the * so that the shell does not expand it before find sees it. -print0 tells find to print a null character after each file name it finds; xargs -0 reads those null-separated names from standard input and passes them, in batches, as arguments to the command that follows it; zgrep works like grep, but decompresses each file first.
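As a rough sketch of why the null separators matter, the following builds a throwaway archive containing a path with spaces and searches it with the same pipeline. The directory names, file names, and message contents are made up for illustration, and -l is added so that only the names of matching files are printed:

```shell
# Throwaway demo archive; every name below is invented for illustration.
demo=$(mktemp -d)
mkdir -p "$demo/2015-03-01" "$demo/2015 03 02"
printf 'Subject: hello\nSTRING here\n' | gzip > "$demo/2015-03-01/a.eml.gz"
printf 'Subject: other\nSTRING too\n'  | gzip > "$demo/2015 03 02/b c.eml.gz"
printf 'nothing to see\n'              | gzip > "$demo/2015-03-01/d.eml.gz"

# Same pipeline as above; -l lists the matching files instead of the lines.
matches=$(find "$demo" -name '*.eml.gz' -print0 | xargs -0 zgrep -l "STRING" | sort)
printf '%s\n' "$matches"
```

Both matching files should be listed, including the one whose path contains spaces; without -print0 and -0, xargs would split that path at the blanks.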

  • 5
    '-print0' and '-0' are not mandatory. xargs uses '\n' by default. – Jaime M. Jul 07 '15 at 08:50
  • 5
    They're necessary if there might be space characters in the paths; there's no reason other than complexity not to use them. – Daniel Griscom Sep 23 '15 at 14:38
  • 4
    zgrep actually seems faster than grep run on uncompressed files. It must be because compressed files can be read off the HD and decompressed faster than reading an uncompressed file from the HD. – Geremia Aug 19 '16 at 17:54
  • 6
    @JaimeM. xargs uses blanks (whitespace) by default. Sure, files almost never have newlines in them, but spaces are not unheard of (even if most UNIXy types frown on them). That said, you can simplify without worrying about whitespace even more easily: find . -name '*.eml.gz' -exec zgrep "STRING" {} + That gets the same many arguments per-launch of xargs, the safety of -print0/-0, and all without the overhead of an extra process launch and piping, and fairly concisely. -exec with + is POSIX specified, so it should be on most semi-recent UNIX-like systems to my knowledge. – ShadowRanger Dec 09 '16 at 18:38
  • @Jared Is there a way to do a wildcard search only knowing the beginning of the file pattern? For example, I have .gz files that have date/time stamps at the end of them. ABCLog04_18_18_2_21.gz

    Is there a way to recursively look for files beginning with ABC*. I tried replacing \*.eml.gz in your example above with ABCLog* and get an error about file format.: find: paths must precede expression: ABCLog-2018-03-12-10-16-1.log.gz Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression]

    – DevelopingDeveloper Apr 18 '18 at 19:21
  • This just dumps all content from all the files... – Cerin Nov 08 '21 at 03:37
90

There's a lot of confusion here because there isn't just one zgrep. I have two versions on my system, zgrep from gzip and zgrep from zutils. The former is just a wrapper script that calls gzip -cdfq. It doesn't support the -r, --recursive switch (see note 1 below).
The latter is a C++ program and it supports the -r, --recursive option.
Running zgrep --version | head -n 1 will reveal which one (if any) of them is the default:

zgrep (gzip) 1.6

is the wrapper script,

zgrep (zutils) 1.3

is the C++ executable.
If you have the zutils version, you could run:

zgrep 'pattern' -r --format=gz /path/to/dir
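To decide between the two approaches in a script, one option is to classify the first line of the version output. This is only a sketch: zgrep_flavor is a hypothetical helper name, and it assumes the version line contains "(gzip)" or "(zutils)" exactly as in the samples above:

```shell
# Hypothetical helper: classify a zgrep by the first line of its --version output.
zgrep_flavor() {
  case "$1" in
    *"(zutils)"*) echo "zutils" ;;   # supports -r
    *"(gzip)"*)   echo "gzip" ;;     # wrapper script, combine it with find
    *)            echo "unknown" ;;  # missing or unrecognised
  esac
}

flavor=$(zgrep_flavor "$(zgrep --version 2>/dev/null | head -n 1)")
echo "zgrep flavor: $flavor"
```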

Anyway, as suggested, find + zgrep will work equally well with either version of zgrep:

find /path/to/dir -name '*.gz' -exec zgrep -- 'pattern' {} +

If zgrep is missing from your system (highly unlikely) you could try with:

find /path/to/dir -name '*.gz' -exec sh -c 'gzip -cd "$0" | grep -- "pattern"' {} \;

but there's a major downside: you won't know where the matches are, as no file name is prepended to the matching lines.
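One possible workaround for the missing file names, sketched here on the assumption that a POSIX sh is available, is to prefix each matching line with its path inside the -exec script. The demo directory and file contents are invented for illustration:

```shell
# Throwaway demo data; the names and contents are made up.
demo=$(mktemp -d)
mkdir -p "$demo/2015-03-01"
printf 'first line\nfind the pattern here\n' | gzip > "$demo/2015-03-01/a.gz"
printf 'no match in this one\n'              | gzip > "$demo/2015-03-01/b.gz"

# Decompress each file, grep it, and prepend "path:" to every matching line.
labelled=$(find "$demo" -name '*.gz' -exec sh -c '
  for f in "$@"; do
    gzip -cd "$f" | grep -- "pattern" |
      while IFS= read -r line; do printf "%s:%s\n" "$f" "$line"; done
  done
' sh {} +)
printf '%s\n' "$labelled"
```

Using {} + instead of \; also starts one sh for many files rather than one per file.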


1: because it would be problematic

don_crissti
  • 82,805
11

ag is a variant of grep, with some nice extra features.

  • has -z option for compressed files,
  • has many of ack's features,
  • it is fast

So:

ag -r -z your-pattern-goes-here   folder

If not installed,

apt-get install silversearcher-ag   (debian and friends)
yum install the_silver_searcher     (fedora)
brew install the_silver_searcher    (mac)

(Edit, Sep 2021, thanks to x-yuri)

Also consider rg (ripgrep), which has a -z option as well:

rg -z your-pattern-goes-here   folder

rg also has a large set of useful options. To install it if necessary:

apt install ripgrep 
JJoao
  • 12,170
5

Recursion alone is easy:

   -r, --recursive
          Read all files  under  each  directory,  recursively,  following
          symbolic  links  only  if they are on the command line.  This is
          equivalent to the -d recurse option.

   -R, --dereference-recursive
          Read all files under each directory,  recursively.   Follow  all
          symbolic links, unlike -r.

However, for compressed files you need something like:

shopt -s globstar
for file in /path/to/directory/**/*gz; do zcat "$file" | grep pattern; done

path/to/directory should be the parent directory that contains the subdirectories for each day.


zgrep is the obvious answer but, unfortunately, it does not support the -r flag. From man zgrep:

These grep options will cause zgrep to terminate with an error code: (-[drRzZ]|--di*|--exc*|--inc*|--rec*|--nu*).

terdon
  • 242,166
5

If your system has zgrep, you can simply

zgrep -irs your-pattern-goes-here the-folder-to-search-goes-here/

If your system does not have zgrep, you can use the find command to run zcat and grep against each file like so:

find the-folder-to-search-goes-here/ -name '*.gz' \ -exec sh -c 'echo "Searching {}" ; zcat "{}" | grep your-pattern-goes-here ' \;

  • Forgive my greenness on this...

    The files to be searched through are a couple of layers deep. ~/gmvault-db/db/2015-02 contains a folder for each month archived, and then underneath that the .gz files for that month are stored. If I'm searching for .mil within that whole tree, is that what I would do?

    find ~/gmvault-db/db/ -name '*.gz'
    -exec sh -c 'echo "Searching {}" ; zcat "{}" | grep .mil ' ;

    – Kendor Mar 02 '15 at 16:28
  • 1
    That's fine - the "r" in -irs will cause zgrep to search recursively. The find command operates recursively by default, so any file which ends in .gz will be zcatted and passed into grep. (and the {} will be expanded to the relative path of the file which is about to be searched). So when you get a hit, it will be preceded by

    Searching ~/gmvault-db/db/2015-02/03/whatever.gz

    – Nate from Kalamazoo Mar 02 '15 at 16:29
  • Here's what I get back: find: "paths must precede expression: -exec"

    Here's the command I used: find ~/gmvault-db/db/ -name '*.gz' \ -exec sh -c 'echo "Searching {}" ; zcat "{}" | grep .mil ' ;

    – Kendor Mar 02 '15 at 16:36
  • take out the backslash between the '*.gz' and the -exec. – Nate from Kalamazoo Mar 02 '15 at 16:37
  • 4
    zgrep won't take the -r flag for some reason. That's mentioned in man zgrep (also see my answer). – terdon Mar 02 '15 at 17:12
  • @terdon - depends on the zgrep flavor: zgrep from gzip won't take the -r flag but zgrep from zutils will, see my answer (it's actually a comment but too long to fit in a comment block). – don_crissti Mar 03 '15 at 10:20
  • @don_crissti I see, thanks, I didn't know of zutils' zgrep. – terdon Mar 03 '15 at 12:04
1

xzgrep -l "string" ./*/*.eml.gz

xzgrep is a derivative of the zgrep utils (less /bin/xzgrep)

From the Man page:

xzgrep invokes grep(1) on files which may be either uncompressed or compressed with xz(1), lzma(1), gzip(1), bzip2(1), or lzop(1). All options specified are passed directly to grep(1).

-l prints the names of the files that contain a match

-R for recursion will not work, as it's specifically prohibited in the script; however, simple shell globbing should get us there:

./*/*.eml.gz

Given a relative path such as ./today/sample.eml.gz, this matches every file that sits exactly one directory level below the shell's current directory and whose name ends with ".eml.gz".
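To see what that glob expands to before handing it to xzgrep, a small sketch like the following may help. The directory layout is invented for illustration and the files are empty placeholders:

```shell
# Invented layout: one file at the top, one a level down, one two levels down.
demo=$(mktemp -d)
mkdir -p "$demo/today/nested"
: > "$demo/top.eml.gz"
: > "$demo/today/sample.eml.gz"
: > "$demo/today/notes.txt"
: > "$demo/today/nested/deep.eml.gz"

# Expand the glob from inside the demo directory and print each match.
globbed=$(cd "$demo" && printf '%s\n' ./*/*.eml.gz)
printf '%s\n' "$globbed"
```

Only ./today/sample.eml.gz is matched here, so an archive nested more deeply than one directory needs a longer glob or find.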

John
  • 1,210