Removing (possibly nested) text quotes in command line

Question

I need to parse large amounts of text on the command line and replace all (possibly nested) text quotes with spaces. Quotes are marked with a specific syntax: [quote=username]quoted text[/quote].

Example input with nested quotes could be something like:

text part 1 [quote=foo] outer quote 1 [quote=bar] inner quote [/quote] outer quote 2 [/quote] text part 2 [quote=foo-bar] next quote [/quote] text part 3

And the expected output would be:

text part 1   text part 2   text part 3

With the help of this question I got it to work to some extent (it produces the output above) with sed ':b; s/\[quote=[^]]*\][^[\/]*\[\/quote\]/ /g; t b', but the middle part ([^[\/]*) is problematic, since quotes can contain characters like [ or ].

As a result, my sed command doesn't work if the input is e.g.

text part 1 [quote=foo] outer quote 1 [quote=bar] inner quote [foo] [/quote] outer quote 2 [/quote] text part 2 [quote=foo-bar] next quote [/quote] text part 3

One problem is that sed doesn't seem to support non-greedy quantifiers and thus always catches the longest possible match in the input. That makes it hard to deal with a) usernames and b) quoted text in general.

I also suspect that sed is not the best tool for this and might not even be capable of it. Maybe perl or awk, for example, would work better?

So the final question is: what would be the best and most efficient way to solve this?

Stéphane Chazelas · Accepted Answer · 2019-03-01T14:53:06.070

4

If you know the input doesn't contain < or > characters, you could do:

sed '
  # replace opening quote with <
  s|\[quote=[^]]*\]|<|g
  # and closing quotes with >
  s|\[/quote\]|>|g
  :1
    # work our way from the inner quotes
    s|<[^<>]*>||g
  t1'
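
As a quick check, the script above can be wrapped in a shell function and run against the second sample from the question (the one with [foo] inside a quote). The function name strip_quotes is only an illustrative choice and does not appear in the answer.

```shell
# strip_quotes is a hypothetical wrapper name; the sed script is the one above.
strip_quotes() {
  sed '
    # replace opening quote tags with <
    s|\[quote=[^]]*\]|<|g
    # and closing quote tags with >
    s|\[/quote\]|>|g
    :1
      # remove innermost <...> pairs until none are left
      s|<[^<>]*>||g
    t1'
}

printf '%s\n' 'text part 1 [quote=foo] outer quote 1 [quote=bar] inner quote [foo] [/quote] outer quote 2 [/quote] text part 2 [quote=foo-bar] next quote [/quote] text part 3' |
  strip_quotes
# prints: text part 1  text part 2  text part 3
```

Each quote is deleted rather than replaced by a space, so the two spaces that surrounded it remain. If single spaces are preferred, the result can be piped through tr -s ' '.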

If it may contain < or > characters, you can escape them using a scheme like:

sed '
  # escape < and > (and the escaping character _ itself)
  s/_/_u/g; s/</_l/g; s/>/_r/g

  <code-above>

  # undo escaping after the work has been done
  s/_r/>/g; s/_l/</g; s/_u/_/g'
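
Spelled out in full, with the earlier script substituted for the <code-above> placeholder, the escaping variant might look like the sketch below. The function name strip_quotes_escaped is made up for illustration; the sed commands are the ones from this answer.

```shell
# strip_quotes_escaped is a hypothetical name for the combined script.
strip_quotes_escaped() {
  sed '
    # escape _ first, then < and >, so literal angle brackets survive
    s/_/_u/g; s/</_l/g; s/>/_r/g

    # mark opening and closing quote tags with < and >
    s|\[quote=[^]]*\]|<|g
    s|\[/quote\]|>|g
    :1
      # remove innermost <...> pairs until none are left
      s|<[^<>]*>||g
    t1

    # undo the escaping in reverse order
    s/_r/>/g; s/_l/</g; s/_u/_/g'
}

printf '%s\n' 'a_b <c> [quote=x] q <y> [quote=z] in [/quote] [/quote] d' |
  strip_quotes_escaped
# prints: a_b <c>  d
```

The order matters: _ is escaped before < and > are turned into _l and _r, and restored last, so a literal _r or _l in the input is not mistaken for an escaped bracket.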

With perl, using recursive regexps:

perl -pe 's@(\[quote=[^\]]*\](?:(?1)|.)*?\[/quote\])@@g'

Or even, as you mention:

perl -pe 's@(\[quote=.*?\](?:(?1)|.)*?\[/quote\])@@g'

With perl, you can handle multiline input by adding the -0777 option. With sed, you'd need to prefix the code with:

:0
$!{
  N;b0
}

So as to load the whole input into the pattern space.

edited Mar 01 '19 at 14:53

answered Mar 01 '19 at 12:27

Thanks, your perl solution here looks clean and simple and seems to work nicely. I replaced [^\]]* with .*?, since perl's non-greedy quantifier solves the issue I was trying to tackle with my original version. So I ended up with perl -pe 's@(\[quote=.*?\](?:(?1)|.)*?\[/quote\])@@g' – pipo Mar 01 '19 at 12:45
The sed script outputs "< <" with input "[quote=foo] [quote [/quote]". – Freddy Mar 01 '19 at 12:56
@Freddy, that doesn't appear to be valid input as per the OP's description of its format. The perl one would also have problems with [quote=foo] [quote= [/quote] and would struggle for mismatched quotes. – Stéphane Chazelas Mar 01 '19 at 13:01
@StéphaneChazelas OP said "... quotes can contain characters like [ or ]" and since the example text contains [foo] I can see no reason why [quote should be invalid input. – Freddy Mar 01 '19 at 14:06
@Freddy, but then at some point we need to decide where we stop. Is [quote=x] [quot= [/quote] valid for instance? Is [quote=some [quote] user] valid? Does the format have a way to escape [s or [quote?... Anyway, I've added the = in the sed regexp so [quote=foo] [quote [/quote] would no longer be a problem. [quote=foo] [quote= [/quote] would still be. – Stéphane Chazelas Mar 01 '19 at 14:50
@StéphaneChazelas @Freddy, in this particular case [quote=x] [quote= [/quote] is a possible input and should be removed as a quote. [quote=some [quote] user] is also possible (since someone could write that kind of message) but should not be removed as a quote, since the quote start tag is always of the form [quote=username]. And even though these are edge cases, the perl script with .*? seems to handle them both correctly. – pipo Mar 01 '19 at 21:00
Also, a username cannot contain ], which is something the perl script wouldn't be able to handle. – pipo Mar 01 '19 at 21:04
I think this perl script has a bug though, as . matches the tags themselves too, allowing [quote …] AAA [/quote] XXX [quote …] BBB [/quote] to be matched as if it was just [quote …] … [/quote], resulting in XXX being removed, even though it shouldn’t. … This posed a problem in my version of this, where I wanted to add a $ at the end to remove only the last one. (I had [ and ] as start and end markers instead of tags. so I haven’t checked if YMMV.) – Sep 18 '22 at 17:26
I solved it for my use case, by replacing perl -pe 's@(\[(?:(?1)|.)*?])$@@g' by perl -pe 's@(\[(?:(?1)|[^][]*)*?])$@@g'. Note the . being replaced by [^][] to not match [ and ], aside from the added $. I don’t know how to translate this for OP’s [quote …]/[/quote] case. – Sep 18 '22 at 17:30

Freddy · Answer 2 · 2019-03-01T13:31:55.983

A little script that increments a counter variable on each start-quote and decrements it on each end-quote. While the counter variable is greater than 0, text snippets are skipped.

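#!/bin/bash

# disable pathname expansion so words like [quote=foo] or * are not globbed
set -f
cnt=0
# the unquoted $(<"$1") is split into whitespace-separated words
for i in $(<"$1"); do
        # start quote: a word of the form [quote=...]
        if [ "${i##[quote=}" != "$i" ] && [ "${i: -1}" = "]" ]; then
                ((++cnt))
        # end quote
        elif [ "$i" = "[/quote]" ]; then
                ((--cnt))
        # print words only while outside all quotes
        elif [ "$cnt" -eq 0 ]; then
                printf '%s ' "$i"
        fi
done
echo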

Output:

$ cat q1
text part 1 [quote=foo] outer quote 1 [quote=bar] inner quote [/quote] outer quote 2 [/quote] text part 2 [quote=foo-bar] next quote [/quote] text part 3
$ ./parse.sh q1
text part 1 text part 2 text part 3
$ cat q2
text part 1 [quote=foo] outer quote 1 [quote=bar] inner quote [foo] [/quote] outer quote 2 [/quote] text part 2 [quote=foo-bar] next quote [/quote] text part 3
$ ./parse.sh q2
text part 1 text part 2 text part 3

Leaving that $(<$1) unquoted invokes the split+glob operator in bash. [quote=foo] happens to be a glob (it expands to the filenames in the current directory that are one of q, u, o, t, e, = or f). So, for instance, if there were files named f and o in the current directory, [quote=foo] would be expanded to the two words f and o. It would be worse if there were * words in the input, for instance. – Stéphane Chazelas Mar 01 '19 at 13:09
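
The effect described in this comment can be reproduced in a throwaway directory. The sketch below is an illustration only (glob_demo is a made-up name, and it assumes mktemp -d is available); it shows why the script's set -f matters.

```shell
# glob_demo is a hypothetical name; it runs in a subshell and a temporary directory.
glob_demo() (
  dir=$(mktemp -d) || exit 1
  cd "$dir" || exit 1
  touch f o
  word='[quote=foo]'
  echo $word    # split+glob: the bracket expression matches the files f and o
  set -f
  echo $word    # globbing disabled: the word is printed as typed
  cd / && rm -rf "$dir"
)

glob_demo
# prints:
# f o
# [quote=foo]
```

Even with globbing disabled, the unquoted expansion is still split on whitespace, so the script treats a tag such as [quote=some user] as two separate words.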

Igor Voltaic · Answer 3 · 2019-03-01T12:32:33.767

0

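I checked this one and it worked for me. You might want to choose another temporary pattern instead of foobar. Without the temporary pattern, sed deleted everything between the first and last tags, leaving just text part 1 text part 3. Note that the 3 flag renames only the third /quote], so the command is tied to the layout of the sample input.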

sed -e 's/\/quote\]/foobar\]/3' -e 's/\[.*\/quote\]//' -e 's/\[.*foobar]//' testfile

Instead of passing testfile as an argument, you can pipe the input into sed.

edited Mar 01 '19 at 12:32

answered Mar 01 '19 at 12:20

score 0 · Answer 4 · answered Mar 03 '19 at 04:24

You can do this with POSIX sed as detailed here. Note that this solution works for both kinds of input you showed. The limitation is that the input must not be multiline, as we use newlines as markers to effect the required transformation.

$ sed -e '
      :top
      /\[\/quote]/!b
      s//\
&/
      s/\[quote=/\
\
&/

     :loop
        s/\(\n\n\)\(\[quote=.*\)\(\[quote=.*\n\)/\2\1\3/
     tloop

     s/\n\n.*\n\[\/quote]//
     btop
 '  input.txt
