0

Question:

How could I find matches of a multi-line regular expression in files, without pcregrep?

I need to find/print the position of each occurrence.

Unfortunately, pcregrep is not installed and I have no rights to install it. The available alternatives are grep, perl, sed, python, etc.

An example of regular expression to search is:

Text\nLine

Context:

A script produces hundreds of MB of structured text spread over a few tens of files, but unfortunately some lines are missing (for various reasons). I need to find where those lines are missing, so I am searching for places where the preceding and following lines occur back to back.

Text
Missing //this line is sometimes missing.
Line

EDITED:

Possible input

example.txt

Text
Missing
Line

Text
Missing
Line

Text
Line

Text
Missing
Line

Possible output:

example.txt, line 10

Some of the tries with no success:

pcregrep 
    # command not found
apt-get install pcregrep 
    # no permission, no su credentials, distro doesn't provide pcregrep, outdated sources, customer does not want changes on the server, etc.
sed -r 's#(Text\nLine)#\1#' ./*
    # prints all lines, not only matches; no indication of file or line, etc.
grep 'Text\nLine' ./*
    # does not work across multiple lines
sed -n '/Text/,/Line/{p}' ./* 
    # Not the same regex, does not indicate result lines, etc.
Kusalananda
Adrian Maire

3 Answers

2

Unix tools are most often line-oriented, so most of the standard toolbox cannot directly apply a regular expression across several lines of input.

sed can be made to process the file in such a way that it's able to detect the lines you are looking for, but we do this strictly using operations on individual lines:

$ sed -n '/^Text/{N;/^Text\nLine/=;D;}' file
10

This sed script looks for the string Text at the start of a line. When found, it appends the next line to its buffer with a \n in-between.

If the buffer now matches ^Text\nLine, the current line number is printed using sed's = command. The number printed is that of the Line line in the file.

Note that while the second regular expression appears to match across a newline in the file, it does not. It matches across a newline in its internal buffer, which we put there using the N command when we read the next line from the file.

You would probably use this in a loop if you want to apply it to multiple files:

for name in pattern; do
    printf 'Processing %s...\n' "$name"
    sed -n '/^Text/{N;/^Text\nLine/=;D;}' "$name"
done

where pattern would be an ordinary filename globbing pattern that matches the files that you are interested in.
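Since the question lists python among the available tools, the same two-line window can be sketched there as well. This is only an illustration of the idea above, not part of the sed solution: the function names missing_line_positions and scan_files are hypothetical, and the sketch assumes the first line must be exactly Text and the next line must start with Line, as in the sed expression.

```python
def missing_line_positions(lines, first="Text", second="Line"):
    """Yield 1-based numbers of lines that start with `second` and
    directly follow a line that is exactly `first`."""
    previous = None
    for number, line in enumerate(lines, start=1):
        current = line.rstrip("\n")
        # Only two lines are ever compared, like sed's buffer after N.
        if previous == first and current.startswith(second):
            yield number
        previous = current


def scan_files(paths):
    """Return (path, line_number) pairs for every match in every file."""
    hits = []
    for path in paths:
        # errors="replace" is an assumption: it keeps odd bytes from
        # aborting the scan of a large generated file.
        with open(path, encoding="utf-8", errors="replace") as handle:
            hits.extend((path, n) for n in missing_line_positions(handle))
    return hits
```

Calling scan_files with a list of file names and printing each pair as "name, line N" would give output in the format asked for in the question.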

Kusalananda
  • It seems N has difficulties with multiple files, but I can iterate over the directory with a bash loop. – Adrian Maire Jun 11 '18 at 08:57
  • @AdrianMaire What difficulties? The fact that the line numbers are not reset between files? If you're using GNU sed, try sed -s -n .... But using a loop would be better, as you would then be able to tell which lines refer to which files. – Kusalananda Jun 11 '18 at 08:59
  • @Kiwy The answers to your proposed dupe are less than helpful in this instance. – Kusalananda Jun 11 '18 at 09:19
  • @StéphaneChazelas Thanks! I forgot that the user supplied dummy text. – Kusalananda Jun 11 '18 at 15:41
1

If vim is installed, you could use it in ex mode as:

vim -e -s -c 'argdo g/^Text\nLine/#' -c q ./*.txt

See also the z command to give context.

vim -e -s -c 'argdo g/^Text\nLine/z#.5' -c q ./*.txt

Neither command prints the file names, though. A perl approach that does, although not a very efficient one, could be:

perl -l -0777 -ne 'while (/Text\nLine/g) {
   print "$ARGV, line " . ++(() = $` =~ /\n/g)}' ./*.txt
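The slurp-and-count idea (read the whole file as one string, then count the newlines that precede each match) can also be sketched in python, which the question lists as available. The function name multiline_matches and the anchored default pattern are assumptions made for this illustration. The sketch reports the line on which each match starts, so add 1 to get the number of the Line line.

```python
import re


def multiline_matches(text, pattern=r"^Text\nLine"):
    """Return the 1-based line number on which each match of
    `pattern` starts in `text` (the whole file as one string)."""
    regex = re.compile(pattern, re.MULTILINE)
    # Counting newlines from the start for every match is simple but
    # rescans the text each time, so it is not efficient on huge files.
    return [text.count("\n", 0, m.start()) + 1 for m in regex.finditer(text)]
```

Like the slurping approach it mirrors, this holds the entire file in memory, which may matter for files of hundreds of MB.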
0
 perl -ne 'eof and $. = 0 or /^Text/ && ($_ .= <>) =~ /^Line/m && print "$ARGV: $.\n"' ./*

This will print the file name along with the line number where the match occurred.

Also, the line counter ($.) is reset upon reaching the end (eof) of each file.
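For comparison, python's standard fileinput module offers a similar pairing of current file name and per-file line counter. The following is a rough sketch under assumptions, not a translation of the one-liner: the function name report is hypothetical, and it assumes a match means a line that is exactly Text followed directly by a line starting with Line.

```python
import fileinput


def report(paths, first="Text", second="Line"):
    """Return (file_name, line_number) pairs for each line starting with
    `second` that directly follows a line that is exactly `first`."""
    hits = []
    previous = None
    # Note: an empty `paths` makes fileinput fall back to sys.argv/stdin.
    with fileinput.input(files=paths) as stream:
        for line in stream:
            if fileinput.isfirstline():
                previous = None  # never pair lines across two files
            current = line.rstrip("\n")
            if previous == first and current.startswith(second):
                # filelineno() restarts at 1 for every file.
                hits.append((fileinput.filename(), fileinput.filelineno()))
            previous = current
    return hits
```

Resetting the remembered line at the start of each file prevents a Text at the end of one file from being paired with a Line at the start of the next.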