I need to parse large amounts of text on the command line and replace all (possibly nested) text quotes with spaces. Quotes are marked with a specific syntax: [quote=username]quoted text[/quote].
Example input with nested quotes could be something like:
text part 1 [quote=foo] outer quote 1 [quote=bar] inner quote [/quote] outer quote 2 [/quote] text part 2 [quote=foo-bar] next quote [/quote] text part 3
And expected output would be:
text part 1 text part 2 text part 3
With the help of this question I got it to work to some extent (it produces the output above) with sed ':b; s/\[quote=[^]]*\][^[\/]*\[\/quote\]/ /g; t b', but the middle part ([^[\/]*) is problematic, since quoted text can contain characters like [ or ].
That being said, my sed command doesn't work if the input is, e.g.:
text part 1 [quote=foo] outer quote 1 [quote=bar] inner quote [foo] [/quote] outer quote 2 [/quote] text part 2 [quote=foo-bar] next quote [/quote] text part 3
One problem is that sed doesn't seem to support a non-greedy quantifier and thus always catches the longest possible match in the input. That makes it hard to deal with a) usernames and b) quoted text in general.
I also guess that sed is not the best tool for this, and it might not even be capable of doing things like that. Maybe e.g. perl or awk could work better?
Now the final question: what would be the best and most efficient way to solve this?
… [^\]]* with .*? and since perl's non-greedy quantifier solves the issue I was trying to tackle with the original version. So I ended up with perl -pe 's@(\[quote=.*?\](?:(?1)|.)*?\[/quote\])@@g' – pipo Mar 01 '19 at 12:45

… perl one would also have problems with [quote=foo] [quote= [/quote] and would struggle with mismatched quotes. – Stéphane Chazelas Mar 01 '19 at 13:01

… [foo] I can see no reason why [quote should be invalid input. – Freddy Mar 01 '19 at 14:06

… [quote=x] [quot= [/quote] valid for instance? Is [quote=some [quote] user] valid? Does the format have a way to escape [s or [quote?... Anyway, I've added the = in the sed regexp so [quote=foo] [quote [/quote] would no longer be a problem. [quote=foo] [quote= [/quote] would still be. – Stéphane Chazelas Mar 01 '19 at 14:50

… [quote=x] [quote= [/quote] is a possible input and should be removed as a quote. [quote=some [quote] user] is also possible (since someone could write that kind of message) but should not be removed as a quote, since the quote start tag is always in the form [quote=username]. And even though these are edge cases, the perl script with .*? seems to handle them both correctly. – pipo Mar 01 '19 at 21:00

… username can not contain ] which would be something that the perl script wouldn't be able to handle. – pipo Mar 01 '19 at 21:04

… . matches the tags themselves too, allowing [quote …] AAA [/quote] XXX [quote …] BBB [/quote] to be matched as if it was just [quote …] … [/quote], resulting in XXX being removed, even though it shouldn’t. … This posed a problem in my version of this, where I wanted to add a $ at the end to remove only the last one. (I had [ and ] as start and end markers instead of tags, so I haven’t checked if YMMV.) – Sep 18 '22 at 17:26

… perl -pe 's@(\[(?:(?1)|.)*?])$@@g' by perl -pe 's@(\[(?:(?1)|[^][]*)*?])$@@g'. Note the . being replaced by [^][] to not match [ and ], aside from the added $. I don’t know how to translate this for OP’s [quote …]/[/quote] case. – Sep 18 '22 at 17:30