3

It turns out I was accidentally using grep wrong yesterday. I just checked my bash history and saw what I was executing:

grep search-string-here -f large-file-with-long-lines.txt

And that was what was causing the memory exhaustion.

Executing:

grep search-string-here large-file-with-long-lines.txt

...has the desired behavior.
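
For the record, -f is meant for the opposite arrangement: the file named after -f holds one pattern per line, and the file to search is given as a positional argument. A hypothetical correct use (patterns.txt is made up for illustration) would be:

grep -f patterns.txt large-file-with-long-lines.txt

With the arguments the other way around, as in my bad command, grep tries to load every huge line of large-file-with-long-lines.txt into memory as patterns, which is what exhausted the memory.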

Thanks to @αғsнιη for pointing to the question with a similar mistake, and to @EdMorton and @ilkkachu for correcting my assumptions about the length of the lines and how grep and awk use memory.

Below is the original question (though it seems I was wrong about the long lines not being able to fit in 8 GB of RAM) and @EdMorton's accepted answer.

I have a very large file (over 100 GB) with very long lines (can't even fit in 8 GB RAM) and I want to search it for a string. I know grep can't do it because grep tries to put entire lines into memory.

So far the best solution I've come up with is:

awk '/search-string-here/{print "Found."}' large-file-with-long-lines.txt

I'm actually happy with this solution, but I'm just wondering if there is some more intuitive way to do it. Maybe some other implementation of grep?

  • 2
    Is it related to grep: memory exhausted? What do you need: just to print "found" for matched lines? For all of them, or is one enough, so you can exit without processing the rest of the file? What is the line break, a newline? Is there any other unique delimiter besides that which we could use as the record separator, which awk lets us specify? – αғsнιη Nov 16 '21 at 10:59
  • but there's no regular expression that you need to match, it's just a string, right? What do you want to have as the result? The line containing the string? Or just whether the string is in there? – Marcus Müller Nov 16 '21 at 11:02
  • Or split the large file into smaller files. – eblock Nov 16 '21 at 11:03
  • 1
    @eblock if the large line length is the problem, splitting won't help – Marcus Müller Nov 16 '21 at 11:12
  • 2
    Change the action from print "Found" to print "Found"; exit so you don't need to continue reading the file for no further benefit – Chris Davies Nov 16 '21 at 12:17
  • 1
    The awk script in your question also reads each line into memory, just like you say grep does, so it shouldn't be any more likely to work than grep if that's the problem, and if it's working for you now it's probably about one character of longer input away from failing. If your lines have some kind of structure, e.g. they're space-separated or CSV or tag=value pairs or something, then tell us about that and show a minimal example using such lines in your question as sample input/output. Obviously don't make it 8G lines - just something small that shows the structure of a few lines. – Ed Morton Nov 16 '21 at 15:56
  • do you need to just check if the string exists in the file (like your awk program does), or do you also need to print the super-long line it appears on (like grep would do)? – ilkkachu Nov 16 '21 at 19:24
  • 1
    but anyway, if you have a version of awk that works for you (and you tested the largest size file you're likely to meet), you're hardly likely to find anything much more "intuitive", given it's just the pattern and a print command to implement grep in awk. – ilkkachu Nov 16 '21 at 19:30
  • @αғsнιη @Marcus Müller @ilkkachu My needs are already met with the awk line I have. I just asked because I was wondering if there were a more general solution that would be easier to remember and use, is all. Like an llgrep (ll for long lines) or something like that.

    @ilkkachu Yeah, I thought that might be the case.

    – hilltothesouth Nov 17 '21 at 14:37
  • 1
    @hilltothesouth, the manual for GNU grep already says it can handle lines of arbitrary length, as long as there's enough memory. (But I now noticed you didn't say which grep you had.) I doubt there's any other tool that would be made to go over that, since it would be really hard to do anything else with a line that doesn't fit in memory. (Which is why I wonder why awk would work better: it needs to be able to do more than just print the line, depending on what the program does. Grep doesn't.) – ilkkachu Nov 17 '21 at 14:47
  • @αғsнιη It's definitely related to grep exhausting memory, yes. – hilltothesouth Nov 17 '21 at 14:51
  • 1
    @hilltothesouth, yeah, grep foo -f largefile would use the lines of largefile as patterns, it'd probably try to load them into memory all at the same time. – ilkkachu Nov 17 '21 at 15:31

2 Answers

3

Here's a simple partial solution which only works if there's a character that does not appear in the string (or regexp) to search for, but appears often enough on the line that the interval between occurrences of this character always fits in memory. For example, suppose each line is a very long list of relatively short semicolon-separated fields.

<large-file-with-long-lines.txt tr ';' '\n' | grep 'search-string-here'
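
For instance, on a hypothetical single long line of semicolon-separated fields:

printf 'field1;field2;search-string-here;field3\n' | tr ';' '\n' | grep 'search-string-here'

Each field becomes its own short line, so grep never has to hold the original long line in memory.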

Here's a different partial solution that works if the occurrences always start at a multiple of N characters after the beginning of the line, for some fixed N. It uses fold to wrap lines and ag to do a multiline search. In this example, the occurrences are known to always start 3*x characters after the beginning of the line.

<large-file-with-long-lines.txt fold -w3 | ag $'cat\ntag\ngag\nact'

This can be generalized to arbitrary string searches by repeating the search for each offset.

<large-file-with-long-lines.txt fold -w3 | ag $'fee\n-fi\n-fo\n-fu\nm|fe\ne-f\ni-f\no-f\num|f\nee-\nfi-\nfo-\nfum'

Note that this can have false positives if the string happened to be almost present, but with a line break in the middle.
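
To make that concrete, here is a hypothetical input where the string cattaggagact (which the example above is effectively searching for) never appears on a single line, yet the same fold | ag pipeline still matches, because the original line break falls exactly where fold would have broken anyway:

printf 'cattag\ngagact\n' | fold -w3 | ag $'cat\ntag\ngag\nact'

Neither input line contains the full string, but the folded stream does, so ag reports a match.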

1

This would be slow as it reads the input file several times (once per character in the search string, each time in chunks the size of that search string), but it should work on files of any size with lines of any size, using GNU awk for multi-char RS and RT (untested):

awk -v str='search-string-here' '
    BEGIN {
        lgth = length(str)
        RS = sprintf("%*s",lgth,"")
        gsub(/ /,".",RS)
        for (i=1; i<lgth; i++) {
            ARGV[ARGC++] = ARGV[1]
        }
    }
    RT == str {
        print "found"
        exit
    }
' file

It sets RS to N dots, where N is the length of your search string, then reads every chain of N characters from the input starting at char #1, compares each to the search string, and exits if the current N chars from the input match the search string. If there's no match on that pass it does the same again but starting from char #2, and so on until it's done that N times and there are no more strings of length N in the input file to compare to the search string.
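
If the sprintf/gsub idiom for building that record separator looks opaque, it can be checked on its own; a small sketch with GNU awk and a hypothetical 6-character search string:

awk -v str='needle' 'BEGIN { lgth = length(str); RS = sprintf("%*s", lgth, ""); gsub(/ /, ".", RS); print RS }'

This prints ......, i.e. a regexp of six dots that matches any run of 6 characters, which the main script then uses to read the input in 6-character chunks (each held in RT).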

Note that the above is doing a string comparison, not a regexp comparison. To do a regexp comparison you'd have to determine the max length of the matching string some other way and then use the regexp comparison operator ~ instead of ==, e.g. if you know the matching string in the input can't be longer than 20 chars then you could do something like:

awk -v regexp='search-regexp-here' '
    BEGIN {
        lgth = 20
        RS = sprintf("%*s",lgth,"")
        gsub(/ /,".",RS)
        for (i=1; i<lgth; i++) {
            ARGV[ARGC++] = ARGV[1]
        }
    }
    RT ~ regexp {
        print "found"
        exit
    }
' file

but that regexp search has some flaws, which the string search does not, that you'd have to think about; e.g. if your search regexp includes ^ or $ as boundaries then you could get a false match above, since string start/end boundaries are created around every N-chars-long chunk as it is read.

Ed Morton