4

I'd like to grep a file for a string, but ignore any matches on lines that do not end with a trailing newline character. In other words, if the file does not end with a newline character, I'd like to ignore the last line of the file.

What is the best way to do this?

I encountered this issue in a python script that calls grep via the subprocess module to filter a large text log file before processing. The last line of the file might be mid-write, in which case I don't want to process that line.
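The setup can be reproduced in a self-contained sketch (the sample log contents and the pattern here are made up for illustration); plain grep called via subprocess happily returns the truncated final line:

```python
import os
import subprocess
import tempfile

# Build a sample log whose last line is mid-write (no trailing newline).
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
    f.write("ok match 1\nok match 2\nok mat")  # truncated final line
    logfile = f.name

# Filter with grep via subprocess, as the script does. grep also prints
# the incomplete last line, so it ends up in the results.
out = subprocess.run(["grep", "ok", logfile],
                     capture_output=True, text=True).stdout
lines = out.splitlines()  # includes the partial "ok mat"
os.unlink(logfile)
```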


4 Answers

3

With gawk (using EREs similar to grep -E):

gawk '/pattern/ && RT' file

RT in gawk contains what was matched by RS, the record separator. With the default value of RS (\n), that will be \n, except for a non-delimited last record, where RT is empty.

With perl (perl REs similar to grep -P where available):

perl -ne 'print if /pattern/ && /\n\z/'

Note that, contrary to gawk or grep, perl by default works on bytes, not characters. For instance, its . regexp operator would match each of the two bytes of a UTF-8-encoded £. For it to work on characters as per the locale's definition, like awk/grep do, you'd use:

perl -Mopen=locale -ne 'print if /pattern/ && /\n\z/'
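Since the surrounding script is Python anyway, the same filter can also be done without a subprocess. A minimal sketch (the function name is made up), mirroring the gawk RT check — keep a line only if it still carries its trailing newline:

```python
import re

def grep_complete_lines(path, pattern):
    """Yield lines matching pattern, skipping a non-terminated last line."""
    regex = re.compile(pattern)
    with open(path, "r") as f:
        for line in f:
            # Python's file iterator leaves the trailing newline on each
            # line, so a partial (mid-write) last line simply lacks it --
            # the same signal gawk exposes through RT.
            if line.endswith("\n") and regex.search(line):
                yield line
```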
1

Something like this could do the job:

#!/usr/bin/env sh

if [ "$(tail -c 1 FILE)" = "" ]
then
    printf "Trailing newline found\n"
    # grep whole file
    # grep ....
else
    printf "No trailing newline found\n"
    # ignore last line
    # head -n -1 FILE | grep ...
fi

We rely on the following characteristic of command substitution described in man bash:

Bash performs the expansion by executing command and replacing the command substitution with the standard output of the command, with any trailing newlines deleted.

  • 1
    Unfortunately if the file is being written to, there is a potential race condition. The else-case race condition doesn't worry me too much, but the then-case race condition can lead to processing of a partial log line, which is the problem I'm trying to avoid. – dshin Jun 13 '18 at 18:02
  • I've solved this sort of race condition in similar contexts by first doing a du -b to get an exact byte size, and then doing a tail -c to only fetch that many bytes. I could do that here. – dshin Jun 13 '18 at 18:09
  • On the other hand, in practice skipping the last line unconditionally is probably going to be ok for my purposes. So if the simple trick I was hoping for doesn't exist, I might just do that. – dshin Jun 13 '18 at 18:12
  • @dshin: OK, I'm glad you found the right solution that works for you. – Arkadiusz Drabczyk Jun 13 '18 at 18:13
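The size-snapshot idea from the comments can be sketched in Python (function name made up): stat the file once to get a fixed byte count, read only that many bytes, and drop anything after the last newline in the snapshot, so a concurrent append can never leak a partial line in:

```python
import os

def snapshot_complete_lines(path):
    """Return the complete lines present when the size was sampled."""
    size = os.stat(path).st_size   # like `du -b`: a fixed byte count
    with open(path, "rb") as f:
        data = f.read(size)        # like `tail -c`: never read past it
    # Keep only bytes up to and including the last newline in the snapshot.
    end = data.rfind(b"\n")
    if end == -1:
        return []
    return data[: end + 1].decode().splitlines()
```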
1

grep is explicitly defined to ignore newlines, so you can't really use that. sed knows internally whether the current line (fragment) ends in a newline, but I can't see how it could be coerced into revealing that information. awk separates records by newlines (RS), but doesn't really care whether the last one was there: the default action of print appends a newline (ORS) at the end in any case.

So the usual tools don't seem too helpful here.

However, sed does know when it's working on the last line, so if you don't mind losing the last intact line in cases where a partial one isn't seen, you could just have sed delete what it thinks is the last one. E.g.

sed -n -e '$d' -e '/pattern/p'  < somefile                   # or
< somefile sed '$d' | grep ...

If that's not an option, then there's always Perl. This should print only the lines that match /pattern/, and have a newline at the end:

perl -ne 'print if /pattern/ && /\n$/'
ilkkachu
  • 138,973
  • Thanks, I like the perl solution. It appears to be within a factor of 2 of the speed of grep, which is faster than sed/python/gawk solutions. Is there a simple way to extend it to do the equivalent of egrep "pattern1|pattern2|pattern3"? – dshin Jun 13 '18 at 18:35
  • @dshin, Perl regexes are pretty much an extension of extended regexes like what grep -E uses, so this|that works out of the box. (see this question and Perl's docs for details) – ilkkachu Jun 13 '18 at 18:39
  • @dshin, also, since you're aiming for speed, you could try if replacing the second regex with ... && substr($_, -1, 1) eq "\n"' would be faster. – ilkkachu Jun 13 '18 at 18:42
  • Hm, your original actually is clocking in slightly faster for me, but they are very close. – dshin Jun 13 '18 at 18:44
  • @dshin, all right, just goes to show that Perl's regex processing is pretty well optimized – ilkkachu Jun 13 '18 at 18:49
1

If you need speed, then using PCRE (or some other, possibly faster, regex library) from C would allow the use of both a regular expression and a check for the trailing newline. Downsides: new code to maintain and debug, and time spent re-implementing portions of grep or perl, depending on the complexity of the expression or whether features such as --only-matching are needed.

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <pcre.h>
#define MAX_OFFSET 3

int main(int argc, char *argv[])
{
    // getline
    char *line = NULL;
    size_t linebuflen = 0;
    ssize_t numchars;
    // PCRE
    const char *error;
    int erroffset, rc;
    int offsets[MAX_OFFSET];
    pcre *re;

    if (argc < 2) errx(1, "need regex");
    argv++;
    if ((re = pcre_compile(*argv, 0, &error, &erroffset, NULL)) == NULL)
        errx(1, "pcre_compile failed at offset %d: %s", erroffset, error);

    while ((numchars = getline(&line, &linebuflen, stdin)) > 0) {
        if (line[numchars-1] != '\n') break;
        rc = pcre_exec(re, NULL, line, numchars, 0, 0, offsets, MAX_OFFSET);
        if (rc > 0) fwrite(line, numchars, 1, stdout);
    }
    exit(EXIT_SUCCESS);
}

This is about 49% faster than perl -ne 'print if /.../ && /\n\z/'.

thrig
  • 34,938