8

I want to display all the characters in a file between the strings "xxx" and "yyy" (the quotes are not part of the delimiters). How can I do that? For example, if I have the input "Hello world xxx this is a file yyy", the output should be " this is a file ".

5 Answers

7

You can use a capture group in a sed substitution as follows:

echo "Hello world xxx this is a file yyy" | sed 's/.*xxx\(.*\)yyy/\1/'

So .*xxx matches everything from the beginning of the line up to and including xxx.

\1 is a back-reference (a 'remembered pattern'): it stands for whatever \(.*\) matched, that is, everything after xxx up to, but not including, yyy.

Finally, the matched text is replaced by the remembered string, and sed prints the resulting line.
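For readers who find Python easier to read than sed, the same capture-and-replace idea can be sketched with re.sub. This is only an illustration and not part of the sed answer: the helper name between is made up here, and the pattern has no space after xxx, so the leading space of the captured text is preserved.

```python
import re

def between(line, start="xxx", end="yyy"):
    """Return the text between start and end on one line (hypothetical helper)."""
    # re.escape guards against delimiters that contain regex metacharacters.
    pattern = ".*" + re.escape(start) + "(.*)" + re.escape(end) + ".*"
    # \1 refers to the text captured by the parenthesised group, as in sed.
    return re.sub(pattern, r"\1", line, count=1)

print(repr(between("Hello world xxx this is a file yyy")))
```

As with the sed version, a line that does not contain both delimiters is returned unchanged.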

4

This should do what you are trying to do:

sed -e 's/.*xxx\(.*\)yyy.*/\1/'

This assumes both delimiter strings are on the same line.

MelBurslan
  • 6,966
  • What if the delimiters are not on the same line? – Out Of Bounds Mar 31 '16 at 21:12
  • Also, can you explain the line you wrote? – Out Of Bounds Mar 31 '16 at 21:15
  • It uses substitution. The pattern matches the string from xxx through yyy inclusive. The \(.*\) construct captures whatever lies between the delimiters (called a remembered pattern), and the whole match is replaced with the first remembered pattern, denoted by \1 (sed supports up to 9 remembered patterns). – MelBurslan Mar 31 '16 at 21:43
  • The two sed solutions do not handle the case where the tags are on separate lines (because sed works on one line at a time). That could be done with a (fairly) complicated sed script, but other tools are more suited to this. – Thomas Dickey Apr 01 '16 at 01:22
2

The question is only interesting if the delimiters are not necessarily on the same line. It can be done several ways (even with sed), but awk is more flexible:

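    #!/bin/sh
    awk '
    # found: 0 = outside a region, 1 = inside, 2 = region ends on this line
    BEGIN { found = 0; }
    /xxx/ {
        if (!found) {
            found = 1;
            # keep only the text after the first xxx
            $0 = substr($0, index($0, "xxx") + 3);
        }
    }
    /yyy/ {
        if (found) {
            found = 2;
            # keep only the text before the first yyy (awk strings start at 1)
            $0 = substr($0, 1, index($0, "yyy") - 1);
        }
    }
        { if (found) {
            print;
            if (found == 2)
                found = 0;
        }
    }
    '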

This is tested lightly for the cases where at most one substring is on a line, using this data:

    this is xxx yy
    first
    second yyy

    xxx.x
    yyy

    xxx#yyy

and this output (script is "foo", data is "foo.in"):

    $ cat foo.in|./foo
     yy
    first
    second 
    .x

    #

It works because each input line is held in $0: awk tests the patterns xxx and yyy in sequence, so more than one rule can change $0 on its way to the last rule, where it is printed.
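The same line-by-line state machine can be sketched in Python, purely as an illustration of the logic. The function name extract and the sample list are assumptions made for this sketch; the sample mirrors the test data shown above.

```python
def extract(lines, start="xxx", end="yyy"):
    """Yield the parts of each line that lie between start and end."""
    found = False  # True while we are inside a start..end region
    for line in lines:
        line = line.rstrip("\n")
        if not found and start in line:
            found = True
            # Keep only what follows the first start delimiter.
            line = line[line.index(start) + len(start):]
        if found:
            if end in line:
                # Keep only what precedes the first end delimiter, then leave the region.
                line = line[:line.index(end)]
                found = False
            yield line

sample = ["this is xxx yy", "first", "second yyy", "", "xxx.x", "yyy", "", "xxx#yyy"]
for piece in extract(sample):
    print(piece)
```

Like the awk script, this only looks at the first occurrence of each delimiter on a line.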

By the way, the awk script would not extract both words from

xxxxHelloyyyxxxWorldyyy

since it checks only the first match on a line. The Perl script will give different results, since it uses a greedy match rather than the index/substr approach used in the awk example. Perl, of course, can do the same, with a script.
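To make the difference concrete, here is a small Python sketch comparing a greedy and a non-greedy regular expression on that input. It is only an illustration: Python's re module stands in for the Perl regex behaviour, and the variable names are made up for this example.

```python
import re

text = "xxxxHelloyyyxxxWorldyyy"

# Greedy: .* runs to the LAST yyy, so everything in between is one match.
greedy = re.findall(r"xxx(.*)yyy", text)

# Non-greedy: .*? stops at the FIRST yyy after each xxx.
lazy = re.findall(r"xxx(.*?)yyy", text)

print(greedy)  # ['xHelloyyyxxxWorld']
print(lazy)    # ['xHello', 'World']
```

The first-match logic of the awk script corresponds to the first element of the non-greedy result only.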

Awk (like Perl) is free-format, so one could express the command as something like

awk 'BEGIN{found=0;}/xxx/{if(!found){found=1;$0=substr($0,index($0, "xxx")+3);}}/yyy/{if(found){found=2;$0=substr($0,1,index($0,"yyy")-1);}}{ if(found){print;if(found==2)found=0;}}'

but that is rarely done, except for the sake of example. Likewise, sed scripts (which are line-oriented) can be combined into a single line, with some restrictions. Again, complex sed scripts are rarely handled that way. Rather, they are treated like real programs (see example).


Thomas Dickey
  • 76,765
2

Here is a solution with Python:

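import re
import sys

# Read the whole file so a match can span several lines.
with open(sys.argv[1]) as f:
    text = f.read()

# (?:.|\n)*? matches any character, newline included, up to the next yyy.
reg = re.compile(r"xxx((?:.|\n)*?)yyy")
for match in reg.finditer(text):
    print(match.group(1))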

Save this script as a file "post.py" and launch it with:

python post.py your_file_to_search_in.txt

The script compiles a regular expression and prints every occurrence found in the text of the file.

(?:.|\n) is a non-capturing group matching any character, including newline.

Edit: solution improved thanks to 1_CR's tips:

import re
import sys

with open(sys.argv[1]) as f:
    text = f.read()

# re.DOTALL lets . match newlines; .*? stops at the first yyy after each xxx.
reg = re.compile(r'xxx(.*?)yyy', re.DOTALL)
for match in reg.finditer(text):
    print(match.group(1))
  • Just pass re.DOTALL to avoid having to specify \n separately. Place xxx in a lookbehind assertion and yyy in a lookahead assertion, since the OP only needs the text in between these two. Lastly, as good practice, always raw-string your regexes. – iruvar Apr 02 '16 at 22:41
  • Thank you for the advice. I've edited my answer to take it into account, except for the lookbehind and lookahead part. My grouping parentheses don't include xxx or yyy, so the OP will only get the text between these two. – manu190466 Apr 04 '16 at 18:45
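For completeness, the lookbehind/lookahead variant suggested in the comment could look roughly like the sketch below. It is not the answer's code: the function name between_all is hypothetical, and the non-greedy .*? is a choice made here so that each xxx...yyy pair is reported separately. Python's lookbehind requires a fixed-width pattern, which a literal delimiter satisfies.

```python
import re

def between_all(text, start="xxx", end="yyy"):
    """Return every stretch of text lying between start and end (hypothetical helper)."""
    # (?<=...) and (?=...) assert the delimiters without consuming them,
    # so each match is exactly the text in between.
    pattern = re.compile(
        "(?<=" + re.escape(start) + ")(.*?)(?=" + re.escape(end) + ")",
        re.DOTALL,  # let . match newlines so a match can span lines
    )
    return [m.group(0) for m in pattern.finditer(text)]

print(between_all("Hello world xxx this is a file yyy"))
```

With the assertions in place, the whole match already excludes the delimiters, so no capture group is strictly needed.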
1

A solution that also works when xxx and yyy are not on the same line:

cat /tmp/xxx-to-yyy | perl -ne '(/xxx/../yyy/) && print' | perl -pe 's/.*(xxx.*)/$1/' | perl -pe 's/(.*yyy).*/$1/'

Not exactly pretty...

The -e switch just tells perl to take the script from the command line. The -n and -p switches make it loop over the input lines; with -p each line is printed after the script runs, with -n it is not. So basically this just sends the file through three perl loops.

.. is the range operator: it returns false until the left condition returns true, then true until the right condition returns true, and false again after that. So the first loop cuts the file down to the lines between the two strings (both included). The last two perl commands remove the text before xxx and after yyy.
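The behaviour of the three perl loops can be imitated in Python to show what each stage does. This is a rough sketch under stated assumptions, not a translation the answer provides: the names flip_flop and trim are invented here, and the range test is written so that, like the two-dot form of the operator, the end condition may be met on the same line that opened the range.

```python
import re

def flip_flop(lines, start="xxx", end="yyy"):
    """Keep lines from one containing start through the next one containing end."""
    inside = False
    kept = []
    for line in lines:
        if not inside and start in line:
            inside = True
        if inside:
            kept.append(line)
            # The end test may succeed on the same line that opened the range.
            if end in line:
                inside = False
    return kept

def trim(lines):
    """Drop text before xxx and after yyy, keeping the delimiters themselves."""
    out = []
    for line in lines:
        line = re.sub(r".*(xxx.*)", r"\1", line, count=1)
        line = re.sub(r"(.*yyy).*", r"\1", line, count=1)
        out.append(line)
    return out

data = ["before", "a xxx b", "middle", "c yyy d", "after"]
print(trim(flip_flop(data)))
```

As in the perl pipeline, the delimiters themselves remain in the output; removing them would need one more substitution.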