
locate (or rather, updatedb) is fairly simple: it takes the output of find for the required paths (usually /), sorts it, and then compresses it with a front-compression tool (frcode), in which the prefix each entry shares with the previous one is replaced by a count of shared characters.
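
To see what that front-compression buys you, here is a rough shell illustration (the path is just an example, and the real frcode output is a binary format that stores prefix-length differences, but the idea is the same): each sorted path is reduced to the number of leading characters it shares with the previous entry, plus the remaining suffix.

    # Rough illustration only; not frcode's actual on-disk format.
    find /usr/bin | LC_ALL=C sort | awk '
        {
            # count how many leading characters this path shares with the previous one
            n = 0
            while (n < length(prev) && substr($0, n + 1, 1) == substr(prev, n + 1, 1))
                n++
            printf "%d %s\n", n, substr($0, n + 1)
            prev = $0
        }' | head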

So I'm wondering: what's stopping anyone from creating something similar for full-text search? Say, how about concatenating every file on the system, sorting every line in the format line:filename:linenumber, and applying front-compression? I guess you would end up with a faster grep, with the tradeoff of being outdated until the daily/weekly cron job runs, just like locate.

Maybe locategrep would be overkill for the entire system, but I can see it being useful to speed up searching a large project that won't change much for the rest of the day.
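
A rough sketch of what that could look like with standard tools (everything here is hypothetical: the project path and index location are placeholders, gzip stands in for real front-compression, and it assumes file names contain no colons):

    # Build the index, e.g. from a daily cron job.
    # grep -rIn '' prints every line of every text file as file:line:content.
    grep -rIn '' ~/src/myproject | awk '
        {
            split($0, a, ":")
            file = a[1]; line = a[2]
            content = substr($0, length(file) + length(line) + 3)
            # reorder to content:file:line so sort groups similar lines together
            print content ":" file ":" line
        }' | LC_ALL=C sort | gzip > ~/.cache/locategrep.db.gz

    # Query: still a linear scan (zgrep cannot binary-search), but over one
    # pre-sorted, compressed stream instead of walking the whole tree.
    zgrep -F 'some search term' ~/.cache/locategrep.db.gz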

Does something like this exist already, or is it trivial to implement with some known tools?

Note: I would rather avoid enterprise-like solutions that include features beyond plain-text searching (but I appreciate regex support).

1 Answer


Often, GNU grep and its BSD competition are just pretty slow.

People like ag (aka the_silver_searcher), rg (aka ripgrep), or ack; they don't try to build an index of the text, they just search it anew for every query, but in a more efficient manner than grep. I mostly use rg these days, and it really makes searching the complete Linux source tree quite manageable (a "search every file, even if it's not a C header" rg FOOBAR takes ~3 s when I've warmed the filesystem caches; GNU grep takes >10 s).
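
For reference, the kind of side-by-side comparison meant here (the kernel path and FOOBAR are placeholders; the ~3 s vs. >10 s numbers are from a warmed cache and will vary per machine):

    cd ~/src/linux            # assumed checkout of the Linux source tree
    time rg FOOBAR            # recursive by default, honours .gitignore
    time grep -rn FOOBAR .    # GNU grep, recursive, no ignore rules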

There are also full-text search engines (mostly xapian), which I use as plugins on my IMAP server to speed up full-text searching. That's the only use case where indexing has actually made a difference for me.
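
If you want to try xapian outside of a mail setup, the xapian-omega package ships a simple command-line indexer; a minimal sketch (the paths are placeholders, and the exact option names may differ between versions, so check the man pages):

    # Index a directory tree into a xapian database (omindex comes with xapian-omega).
    omindex --db ~/.cache/xapian-docs --url / ~/Documents

    # Query it with the quest example tool shipped with xapian.
    quest -d ~/.cache/xapian-docs 'error AND timeout'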

(Furthermore, I maintain that mandb must be destroyed; our search tools are so fast that taking 30 s to rebuild a friggin' index of 190 MB of man pages is simply not acceptable; and the idea that gzip is a good compressor for data as uniform as man pages, where a single shared compression dictionary would make these things incredibly small, is another annoyance of mine. But things are intertwined enough that I can't be moved to get rid of mandb.)

  • Well, I didn't believe it because I already exclude unwanted directories with grep, but I just tried it and even then rg is ridiculously faster. It makes no sense. I am now converted. Thanks for the xapian suggestion too. – Sebastian Carlos Jul 26 '23 at 14:57
  • Interesting, I hadn't heard of those tools. Just out of curiosity: you compared rg and GNU grep. How do you run the grep command? Is it just recursive with the -r flag, or something like find + xargs + grep (which is what I usually do, limiting the number of arguments for xargs and increasing max-procs)? Of course a dedicated tool would be more convenient than my method, but I was just wondering if the time difference is still that high. – aviro Jul 26 '23 at 14:58
  • This is how I do it, combining find and grep. Using xargs instead of -exec would also work. Sadly -r is not enough if you want to do fancy exclusion at the level of rg – Sebastian Carlos Jul 26 '23 at 15:16
  • I typically have something like rg foobar vs grep foobar **/*.{cc,cpp,c,h,hpp,hxx} (and I use zsh, so that ** is recursive by default) – Marcus Müller Jul 26 '23 at 15:27
  • @SebastianCarlos by the way, rg, ag and ack all by default exclude a couple of directories, like .git (rg for example also excludes anything in your .gitignore by default, which I find very handy most of the time; you can of course disable all that to your heart's delight) – Marcus Müller Jul 26 '23 at 15:35
  • I often use grep on recently created files (logs, downloaded things, etc). How would last night's index help? – waltinator Jul 26 '23 at 19:24
  • @waltinator use the index on the older files/datapoints, and use plaintext search on the newer ones. That'd still not be bad, and it's how databases etc with lazy indexing work, essentially. – Marcus Müller Jul 26 '23 at 19:28
  • @waltinator another common use case of such searches is looking through versioned files, in which case the version control system would inherently be aware of what is already indexed and what needs to be, when you check out a specific version – Marcus Müller Jul 26 '23 at 19:29
  • Have you analyzed the amount of actual data a full-text index of "the system" would take? How would you index binary files? I've grepped them. That's a lot of data to frcode-decode every time I want to search for "error" in a bunch of files. – waltinator Jul 26 '23 at 20:37
  • To search within structured, binary, or version-controlled files, one must have some understanding of the particular object. – waltinator Jul 26 '23 at 20:43