Regular expression to replace an instance of two consecutive strings that might be separated by whitespace

Question

I want to write a Perl one-liner that replaces every instance of two specific consecutive strings, which may or may not be separated by whitespace.

For instance, say my two strings are "john paul" and "george", and I want to replace consecutive instances of these strings (in this order) with "pete". Running the one-liner on

$ cat ~/foo

john paulgeorge
john paul george
john paul

    george

george john paul

should result in

$ cat ~/foo

pete
pete
pete

george john paul

The only thing I've thought of is

$ perl -p -i -e 's/john paul\s*george/pete/g' ~/foo

but this results in

$ cat ~/foo

pete
pete
john paul

    george

george john paul

Is there a way to alter my one-liner?

i dunno perl, but with bsd or gnu sed: sed -E 's/((george|john|paul) *){2}/pete/g' should work... — mikeserv, Jan 15 '16 at 20:16
you're running line-based, like awk or sed. unset $/; see http://perldoc.perl.org/perlvar.html — Jeff Schaller, Jan 15 '16 at 20:26
@mikeserv, I think you misread the question. john paul is a single string, not two strings. And that sed command wouldn't handle matches broken across multiple lines. — Wildcard, Jan 15 '16 at 22:31
@Wildcard - it's quite possible i did - i didn't look very hard at it, honestly. it was deemed favorable when ringo was ousted, though. i never liked that guy. notice i didn't actually answer, though. — mikeserv, Jan 15 '16 at 22:35

score 5 · Accepted Answer · answered Jan 15 '16 at 22:20


The only thing you need to add to your one-liner is the option to slurp the file as a single string:

perl -0777 -p -i -e 's/john paul\s*george/pete/g' ~/foo
#    ^^^^^

See http://perldoc.perl.org/perlrun.html#Command-Switches

glenn jackman

Accepted this one as it most easily corrects my code. Thank you! – Brian Fitzpatrick Jan 15 '16 at 22:32

score 4 · Answer 2 · answered Jan 15 '16 at 20:43

Perl's -n and -p options put variants of while (<>) { ... } around your program, which makes them process input line by line. To replace across multiple lines, you need to read the whole input into a single string, and one way is to do that yourself:

perl -e 'local $/;$_=<>;s/john paul\s*george/pete/g;print'

This undefines $/, the input record separator, so that <> no longer splits the input into lines. It then reads the entire input into $_ at once and does the replacement on that one long string. You have to do your own printing, too.

There's not much magic here any more - it's just writing a complete Perl program in a slightly uncomfortable way. -i will still work for in-place replacement, though.

If you have a large file this is going to be fairly inefficient (or exhaust your memory), but that seems more or less unavoidable without building a better parser. You can also see perldoc -q 'entire file' for other alternatives and a lot of telling you you don't really mean it.

There must be a way in Perl to do what I did in sed, right? I slurped multiple lines only when there is a possibility of a match. — Wildcard, Jan 15 '16 at 22:23

Wildcard · Answer 3 · 2016-01-15T22:41:32.337

With sed you can do this without slurping the entire file:

sed -e ':top' -e 's/john paul[[:space:]]*george/pete/g;$b' -e '/john paul[[:space:]]*$/!b' -e 'N;btop' input

This is much lighter on memory usage; it only slurps multiple lines when there is a possibility of a multi-line match starting from the current line. And then it only slurps until either the match is found, or until there is no further possibility of a match.

As a bonus, it's POSIX-compliant. (Perl isn't part of POSIX.) Thanks to mikeserv for pointing this out in the comments.

Explanation:

:top sets a label named top.

s/john paul[[:space:]]*george/pete/g does the substitution you want for whatever is in the pattern space. (Default is line by line.)

$b skips to the end and prints if the current line is the last line of the file.

/john paul[[:space:]]*$/!b:

The pattern /john paul[[:space:]]*$/ will match john paul at the end of the pattern space followed by any amount of whitespace (but nothing other than whitespace), then ! inverts the pattern. So the effect here is to execute the b command (skip to the end of the script, thus printing the pattern space, reading the next line from the file, and starting from the top of the script) only if there is no possibility of a multi-line match starting with the current pattern space.

N appends the next line from the file to the pattern space (after appending a newline).

btop branches to the :top label without clearing out the pattern space.
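To see the steps above at work, here is a small sketch that runs the same sed script on the sample data from the question. It assumes the data is piped in on standard input rather than read from a file named input; the variable names are illustrative.

```shell
# Sample data from the question, supplied on standard input.
sample=$(printf '%s\n' 'john paulgeorge' 'john paul george' 'john paul' '' '    george' '' 'george john paul')

# Same script as above: extra lines are appended only while a match could still complete.
sed_out=$(printf '%s\n' "$sample" |
    sed -e ':top' -e 's/john paul[[:space:]]*george/pete/g;$b' -e '/john paul[[:space:]]*$/!b' -e 'N;btop')

printf '%s\n' "$sed_out"
```

On the third input line the pattern space grows to three lines (john paul, an empty line, and the indented george) before the substitution succeeds; every other line is handled on its own.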

there's nothing specific to GNU in this that I see... i would recommend <input in the general case, though. sed -e:top -e'$!N; s/....stuff....; ttop' -e'P;D' is more simple and usually more efficient as well, though. — mikeserv, Jan 15 '16 at 22:36
@mikeserv, thanks, fixed. I got [[:space:]] and \s mixed up; the latter is GNU-specific, the former is not. — Wildcard, Jan 15 '16 at 22:44

mikeserv · Answer 4 · 2016-01-16T00:16:38.363

Another sed:

s=[:space:]
sed -e:t -e$\!"N;s/john paul[$s]*george/pete/g;/\n/"\!tt -e"P;D" <in >out

That will handle any/all occurrences of your string in a single substitution, and buffers only as much as is absolutely necessary. It works via a sliding window on input, and only branches back to pull in another line if the previous substitution successfully replaced your string and, as a result, removed a newline character in the process.

The weird ! quoting is only necessary in a default (read: insane) interactive (ba|z|t?c)sh shell, but is generally not a problem in a scripted shell (unless you've got a csh variant).

Samir Sadek · Answer 5 · 2016-08-12T05:17:32.393

You need to slurp the file with the option -0777. If you also anchor the pattern with ^, as below, add the m modifier so that ^ matches at the start of every line rather than only at the start of the slurped string. (\s matches \n with or without m.)

When Perl sees -0, it sets the input record separator ($/) from the octal digits that follow. For instance, -00 puts $/ into paragraph mode, and -0777 undefines it so that the whole file is read at once. So

perl -0777 -pe 's/^john paul\s*george/pete/gm' george.txt

is equivalent to:

perl -pe 'BEGIN { undef $/ ; } s/^john paul\s*george/pete/gm' george.txt
