
I was trying to search for text that starts with create and ends with ;, where a match may span multiple lines. I wanted to use grep for this, and after searching the internet I found out how to do it.

The following command does it:

grep -zioE 'create (\w|\W|\n)*?;' Day1.sql  | less

Output

create schema sigmoid_db; create table instructor( ID char(5), name varchar(20), dept_name varchar(20), salary numeric(8,2));

What I want to ask is: why doesn't the same command work without \n? I would expect the following command to produce the same output:

grep -zioE 'create (\w|\W)*?;' Day1.sql  | less

Output

create schema sigmoid_db;

My reasoning is that \w|\W should match any character, yet the second command doesn't print the matches that span multiple lines.

Can anyone explain why?

Dhruv

1 Answer


The \n symbol is a newline (line feed) character. It is a special character that separates one line from the next.

Any text file is actually one long string like:

first\nsecond\nthird\n

which is printed on the screen as

first
second
third
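As a small illustration of the point above, the bytes of such a text can be dumped with od. This is only a sketch; the variable name dump is made up for the example.

```shell
# Dump the bytes of a three-line text; od -c shows each newline as \n.
dump=$(printf 'first\nsecond\nthird\n' | od -c)
printf '%s\n' "$dump"
```

Each line break shows up in the dump as a single \n character between the words.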

By default, grep splits the input file into lines and processes them one by one. If you want a multi-line pattern to be found, you must use \n in the appropriate place in the regular expression.

That is why the pattern create (\w|\W)*?; found only a single-line match.

And no, control characters (and \n is one of them) are not considered members of the groups "letter" (\w) or "non-letter" (\W). They are in a group of their own and have to be used by themselves.

White Owl
  • I tried my expression in an online regex tester, and it works fine. \W means any character other than [a-zA-Z0-9_], which includes \n as well. – Dhruv Feb 10 '23 at 15:41
  • @Dhruv Yes, that is possible. There are several regexp variations. With grep you can choose one of four major regexp dialects (options -E, -F, -G, -P). And even then, different versions of grep can differ in regexp processing. Your grep in "extended" mode (you have the -E option in the command) does not include \n in \W. The online regexp checker is most likely using the JavaScript or Perl dialect, which do include it. Run your command with -P instead of -E and see the difference. – White Owl Feb 11 '23 at 02:39
  • I think it's a bug related to \n, as pointed out by @Stéphane Chazelas, because I tried a similar thing with \t and it worked. For example, echo $'a\tb\nc\td' | grep -zEo 'c[^;]*d' outputs c[8 spaces]d, but echo $'a\tb\nc\td' | grep -zEo 'a[^;]*d' outputs nothing. – Dhruv Feb 11 '23 at 12:06
  • @Dhruv No bugs. Your sample string is treated by grep as two separate lines: "a\tb" and "c\td". So it can find a pattern that includes "c" and "d", but the letters "a" and "d" are on different lines, so nothing is found. – White Owl Feb 11 '23 at 13:02
  • No, I used the -z flag, so it doesn't treat the input as two separate lines. For example, for echo $'a\nb' | grep -z '\n' | less the output is a b ^@ (each on a new line), where ^@ represents the null character. The input is treated as a null-terminated string, so \n matches part of the line and grep prints the matching line (the whole line itself). – Dhruv Feb 11 '23 at 13:13
  • Even echo $'a\tb\nc\td' | grep -z 'c[^;]*d' | less prints all the four characters, with tabs, newline and ^@(at the end). – Dhruv Feb 11 '23 at 13:17
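
The suggestion in the comments to try -P instead of -E could be checked with a sketch like the one below. It assumes GNU grep built with PCRE support; the sample file contents and the variable names sample and matches are hypothetical and not taken from Day1.sql.

```shell
# Hypothetical sample input: one single-line and one multi-line statement.
sample=$(mktemp)
printf 'create schema demo_db;\ncreate table t(\n  id int);\n' > "$sample"

# -z reads the whole file as one NUL-terminated record.
# With -P (Perl dialect), \W also matches the newline character.
# -o ends each match with a NUL byte under -z, so convert those to newlines.
matches=$(grep -zoiP 'create (\w|\W)*?;' "$sample" | tr '\0' '\n')
printf '%s\n' "$matches"
rm -f "$sample"
```

Under those assumptions, both statements should be printed, including the one that spans two lines.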