How do I display all the characters between two specific strings?

Question

I want to display all the characters in a file between the strings "xxx" and "yyy" (the quotes are not part of the delimiters). How can I do that? For example, if I have the input "Hello world xxx this is a file yyy", the output should be " this is a file ".

It's very similar to How to find/grep what is between string1 and string2? so if you want to match multi-line you could run: pcregrep -Mo '(?<=STRING1)(\n|.)*?(?=STRING2)' infile — don_crissti, Mar 31 '16 at 21:33
What yyy would be your xxx expected output xxx if this line yyy were used yyy as input? In general, the answer to this question is "regular expressions" but there are some details you haven't thought about. — Wildcard, Mar 31 '16 at 21:33
So the strings are not on the same line but you accept an answer that only works if they are on the same line (and even then...) ? — don_crissti, Apr 02 '16 at 19:19

score 7 · Accepted Answer · edited Apr 02 '16 at 20:55


You can use a capture group in a sed substitution as follows:

echo "Hello world xxx this is a file yyy" | sed 's/.*xxx \(.*\)yyy/\1/'

So .*xxx matches from the beginning of the line up to and including xxx. You can see this with grep -o '.*xxx'.

\1 is a back-reference to the 'remembered pattern': it recalls everything matched within \(.*\), that is, the text after xxx up to but not including yyy.

Finally the remembered string is printed.
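
For comparison, here is a minimal Python sketch of the same capture-group idea, restricted to a single line. The function name between_same_line is hypothetical, and unlike the sed command above this pattern has no space after xxx, so the leading space asked for in the question is kept.

```python
import re

def between_same_line(line, start="xxx", end="yyy"):
    """Return the text captured between start and end on one line, or None."""
    # Hypothetical helper, not from the answer above.
    # The leading .* mirrors sed's .*xxx, and the parenthesised group
    # plays the role of \( ... \) together with \1.
    match = re.search(".*" + re.escape(start) + "(.*)" + re.escape(end), line)
    return match.group(1) if match else None

print(repr(between_same_line("Hello world xxx this is a file yyy")))
# prints ' this is a file '
```

Like the sed version, this only looks at one line at a time.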

edited Apr 02 '16 at 20:55

Scott - Слава Україні

10,519

answered Mar 31 '16 at 21:14

Valentin Bajrami

9,344


It would be better to use grep -o '.*xxx' than a screenshot. – cas Apr 01 '16 at 23:11

MelBurslan · Answer 2 · 2016-04-02T22:12:36.170

4

This should do what you are trying to do:

sed -e 's/xxx\(.*\)yyy/\1/'

This assumes both delimiter strings are on the same line. Note that any text before xxx or after yyy on that line is left in the output.

edited Apr 02 '16 at 22:12

answered Mar 31 '16 at 21:05

MelBurslan

6,966

What if the delimiters are not on the same line? – Out Of Bounds Mar 31 '16 at 21:12

Also, can you explain to me the line you wrote? – Out Of Bounds Mar 31 '16 at 21:15

It uses substitution. It finds the string inclusive of xxx and yyy. The \(.*\) construct matches anything in between the delimiters (called a remembered pattern), and the whole match is replaced with the first remembered pattern, denoted by \1 (sed supports up to 9 remembered patterns). – MelBurslan Mar 31 '16 at 21:43

The two sed solutions do not handle the case where the tags are on separate lines (because sed works on one line at a time). That could be done with a (fairly) complicated sed script, but other tools are more suited to this. – Thomas Dickey Apr 01 '16 at 01:22

Thomas Dickey · Answer 3 · 2016-04-01T08:16:22.727

The question is only interesting if the delimiters are not necessarily on the same line. It can be done several ways (even with sed), but awk is more flexible:

    #!/bin/sh
    awk '
    BEGIN { found = 0; }
    /xxx/ {
        if (!found) {
            found = 1;
            $0 = substr($0, index($0, "xxx") + 3);
        }
    }
    /yyy/ {
        if (found) {
            found = 2;
            $0 = substr($0, 1, index($0, "yyy") - 1);
        }
    }   
        { if (found) {
            print;
            if (found == 2)
                found = 0;
        }
    }
    '

This is tested lightly for the cases where at most one substring is on a line, using this data:

    this is xxx yy
    first
    second yyy

    xxx.x
    yyy

    xxx#yyy

and this output (script is "foo", data is "foo.in"):

    $ cat foo.in|./foo
     yy
    first
    second 
    .x

    #

The way it works is that the current input line is in $0, and awk tests the patterns xxx and yyy in sequence, so more than one rule can change $0 on its way to the last step, where it is printed.
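
The same line-by-line state machine can be sketched in Python, which may make the control flow easier to follow. This is only an illustration under the same assumption as the awk script (at most one delimiter pair per line); the function name extract_between is hypothetical.

```python
def extract_between(lines, start="xxx", end="yyy"):
    """Collect the text between start and end, one output entry per input line."""
    # Hypothetical helper that mirrors the awk rules above.
    out = []
    found = False
    for line in lines:
        line = line.rstrip("\n")
        if not found and start in line:
            # Like the /xxx/ rule: drop everything up to and including start.
            found = True
            line = line[line.index(start) + len(start):]
        if found:
            if end in line:
                # Like the /yyy/ rule: drop end and everything after it.
                line = line[:line.index(end)]
                found = False
            out.append(line)
    return out

data = ["this is xxx yy", "first", "second yyy", "", "xxx.x", "yyy", "", "xxx#yyy"]
print(extract_between(data))
# prints [' yy', 'first', 'second ', '.x', '', '#']
```

The list in the usage lines is the test data from this answer, and the result corresponds line for line to the output shown above.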

By the way, this example would not work for

xxxxHelloyyyxxxWorldyyy

since it checks only the first match. The Perl script will give different results, since it uses a greedy match rather than the index/substr which I used in the awk example. Perl, of course, can do the same -- with a script.
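
To make the greedy versus first-match difference concrete, here is a small Python sketch run on that same string. It is an illustration only, not the Perl script referred to above.

```python
import re

text = "xxxxHelloyyyxxxWorldyyy"

# Greedy: .* runs on to the last yyy, so both regions end up in one match.
greedy = re.findall(r"xxx(.*)yyy", text)

# Non-greedy: .*? stops at the first yyy after each xxx, one match per region.
lazy = re.findall(r"xxx(.*?)yyy", text)

print(greedy)  # prints ['xHelloyyyxxxWorld']
print(lazy)    # prints ['xHello', 'World']
```

The stray x in both results comes from the four x characters at the start: the leftmost xxx is taken as the delimiter, and the fourth x is treated as content.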

Awk (like Perl) is free-format, so one could express the command as something like

awk 'BEGIN{found=0;}/xxx/{if(!found){found=1;$0=substr($0,index($0, "xxx")+3);}}/yyy/{if(found){found=2;$0=substr($0,1,index($0,"yyy")-1);}}{ if(found){print;if(found==2)found=0;}}'

but that is rarely done, except for the sake of example. Likewise, sed scripts (which are line-oriented) can be combined into a single line, with some restrictions. Again, complex scripts in sed are rarely dealt with in that manner. Rather, they are treated like real programs (see example).


Since I'm the only one who used perl, I guess you're referring to my solution. And yes, that too fails horribly on your last example. — Henrik supports the community, Mar 31 '16 at 22:12

manu190466 · Answer 4 · 2016-04-04T18:32:04.793

2

Here is a solution with Python:

import sys
import re
# Read the whole file so that the pattern can span line breaks.
with open(sys.argv[1]) as f:
    text = f.read()
reg = re.compile("xxx((?:.|\n)*)yyy")
for match in reg.finditer(text):
    print(match.group(1))

Save this script as a file "post.py" and launch it with:

python post.py your_file_to_search_in.txt

The script compiles a regular expression and prints all occurrences found in the text of the file.

(?:.|\n) is a non-capturing group that matches any character, including a newline.

Edit: solution improved thanks to 1_CR's tips:

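import sys
import re
# Read the whole file so that the pattern can span line breaks.
with open(sys.argv[1]) as f:
    text = f.read()
# re.DOTALL lets . match newlines too.
reg = re.compile(r'xxx(.*)yyy', re.DOTALL)
for match in reg.finditer(text):
    print(match.group(1))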

edited Apr 04 '16 at 18:32

answered Mar 31 '16 at 22:46

manu190466

206

Just pass re.DOTALL to avoid having to specify \n separately. Place xxx in look behind assertion and yyy in look ahead assertion since OP only needs the text in between these two. Lastly, as good practice always raw-string your regexes – iruvar Apr 02 '16 at 22:41
Thank you for the advice. I've edited my answer to take it into account, except for the lookbehind and lookahead part. My grouping parentheses don't include xxx or yyy, so OP will only get the text between these two. – manu190466 Apr 04 '16 at 18:45
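
For reference, the lookbehind/lookahead variant suggested in the comment could look roughly like the sketch below. The function name find_between is hypothetical, and the non-greedy .*? is an added assumption so that each xxx ... yyy region is reported separately.

```python
import re

def find_between(text, start="xxx", end="yyy"):
    """Return every stretch of text that sits between start and end."""
    # Hypothetical helper. The lookbehind and lookahead assert the
    # delimiters without consuming them, so each match is already
    # just the text in between. re.DOTALL lets . cross line breaks.
    pattern = re.compile(
        "(?<=" + re.escape(start) + ").*?(?=" + re.escape(end) + ")",
        re.DOTALL,
    )
    return [m.group(0) for m in pattern.finditer(text)]

sample = "this is xxx yy\nfirst\nsecond yyy\nxxx#yyy\n"
print(find_between(sample))
# prints [' yy\nfirst\nsecond ', '#']
```

Python's re module only accepts fixed-width lookbehinds, which is fine here because the delimiters are literal strings.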

Henrik supports the community · Answer 5 · 2016-03-31T21:30:06.377

A solution that also works when xxx and yyy are not on the same line:

cat /tmp/xxx-to-yyy | perl -ne '(/xxx/../yyy/) && print' | perl -pe 's/.*(xxx.*)/$1/' | perl -pe 's/(.*yyy).*/$1/'

Not exactly pretty...

The -e switch to perl just gives the script on the command line. The -n and -p switches make it loop over the input lines; with -p the lines are printed after the script runs, with -n they aren't. So basically this just sends the file through three perl loops.

.. is the range operator: it returns false until the left condition returns true, and goes back to false after the right condition returns true, so the first loop cuts the file down to the lines between the two strings (both included). The last two perl commands remove the text before xxx and after yyy.
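
If the range operator is unfamiliar, its behaviour in the first loop can be approximated with a small Python generator. This is a rough sketch only; the function name lines_in_range is hypothetical.

```python
def lines_in_range(lines, start="xxx", end="yyy"):
    """Yield lines from one containing start through the next one containing end."""
    # Hypothetical helper imitating perl -ne '(/xxx/../yyy/) && print'.
    inside = False
    for line in lines:
        if not inside and start in line:
            inside = True
        if inside:
            yield line
            # As with Perl's two-dot form, the end test is also tried
            # on the very line that opened the range.
            if end in line:
                inside = False

data = ["before", "a xxx b", "middle", "c yyy d", "after"]
print(list(lines_in_range(data)))
# prints ['a xxx b', 'middle', 'c yyy d']
```

As in the Perl pipeline, this keeps whole lines; trimming the text before xxx and after yyy is a separate step.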
