I want to display all the characters in a file between the strings "xxx" and "yyy" (the quotes are not part of the delimiters). How can I do that? For example, if I have the input "Hello world xxx this is a file yyy", the output should be " this is a file ".
5 Answers
You can use a capture group (a "remembered pattern") in a sed substitution, as follows:
echo "Hello world xxx this is a file yyy" | sed 's/.*xxx\(.*\)yyy/\1/'
Here .*xxx matches from the beginning of the line up to and including xxx. \(.*\) is a "remembered pattern": it remembers everything that follows xxx up to, but not including, yyy, and \1 refers back to what it remembered. The substitution replaces the whole match with the remembered string, which is then printed.
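For readers who find the sed syntax dense, the same capture-group idea can be sketched in Python. This is only an illustration: the function name `between` and its default arguments are made up for this example, and the pattern has no space after xxx, so the leading space is kept as the question asks.

```python
import re

def between(line, start="xxx", end="yyy"):
    """Return the text between start and end on a single line, or None."""
    # re.escape keeps the delimiters literal; (.*) is the capture group,
    # the counterpart of \(.*\) and \1 in the sed command.
    match = re.search(re.escape(start) + r"(.*)" + re.escape(end), line)
    return match.group(1) if match else None

if __name__ == "__main__":
    # prints ' this is a file '
    print(repr(between("Hello world xxx this is a file yyy")))
```

Like the sed command, this only looks at one line at a time.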

This should do what you are trying to do:
sed -e 's/xxx\(.*\)yyy/\1/'
This assumes both delimiter strings are on the same line.

It uses substitution. It finds the string inclusive of xxx and yyy. The \(.*\) construct matches anything in between the delimiters (called a remembered pattern), and the whole match is replaced with the first remembered pattern, denoted by \1 (sed supports up to 9 remembered patterns). – MelBurslan Mar 31 '16 at 21:43
The two sed solutions do not handle the case where the tags are on separate lines (because sed works on one line at a time). That could be done with a (fairly) complicated sed script, but other tools are more suited to this. – Thomas Dickey Apr 01 '16 at 01:22
The question is only interesting if the delimiters are not necessarily on the same line. It can be done several ways (even with sed), but awk is more flexible:
#!/bin/sh
awk '
BEGIN { found = 0; }
/xxx/ {
    if (!found) {
        found = 1;
        # keep only the text after the first xxx
        $0 = substr($0, index($0, "xxx") + 3);
    }
}
/yyy/ {
    if (found) {
        found = 2;
        # keep only the text before yyy (awk strings start at 1, not 0)
        $0 = substr($0, 1, index($0, "yyy") - 1);
    }
}
{
    if (found) {
        print;
        if (found == 2)
            found = 0;
    }
}
'
This is tested lightly for the cases where at most one substring is on a line, using this data:
this is xxx yy first second yyy xxx.x yyy xxx#yyy
and this output (script is "foo", data is "foo.in"):
$ cat foo.in|./foo yy first second .x #
The way it works is that the current input line is in $0, and awk matches the patterns xxx and yyy in sequence, allowing more than one rule to change $0 on its way to the last step, where it is printed.
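The same three-state logic can be sketched in Python, which may make the flow easier to follow. This is an illustrative translation rather than the answer's own code: the function name `extract_between` is made up here, and the sample lines in the usage note are a hypothetical line layout, not the original test data.

```python
def extract_between(lines, start="xxx", end="yyy"):
    """Return the text between start and end, possibly spanning lines."""
    out = []
    found = 0  # 0 = outside, 1 = inside, 2 = on the closing line
    for line in lines:
        if found == 0 and start in line:
            found = 1
            # keep only the text after the first start delimiter
            line = line[line.index(start) + len(start):]
        if found and end in line:
            found = 2
            # keep only the text before the first end delimiter
            line = line[:line.index(end)]
        if found:
            out.append(line)
            if found == 2:
                found = 0
    return out
```

For example, `extract_between(["this is", "xxx yy", "first", "second yyy", "xxx#yyy"])` returns `[" yy", "first", "second ", "#"]`. As in the awk script, each rule works on the line as modified by the previous rule, and only the first start and end delimiter on a line are considered.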
By the way, this example would not work for
xxxxHelloyyyxxxWorldyyy
since it checks only the first match. The Perl script will give different results, since it uses a greedy match rather than the index/substr which I used in the awk example. Perl, of course, can do the same -- with a script.
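To make the greedy versus non-greedy difference concrete, here is a small Python sketch run on that same string. It is only an illustration of regex behaviour, not a translation of either script; the names `greedy` and `lazy` are made up for this example.

```python
import re

text = "xxxxHelloyyyxxxWorldyyy"

# Greedy: (.*) grabs as much as possible, so the match runs from the
# first xxx to the last yyy.
greedy = re.findall(r"xxx(.*)yyy", text)

# Non-greedy: (.*?) stops at the nearest yyy, so each pair is found.
lazy = re.findall(r"xxx(.*?)yyy", text)

print(greedy)  # ['xHelloyyyxxxWorld']
print(lazy)    # ['xHello', 'World']
```

Note that even the non-greedy form returns "xHello" for the first pair, because the leftmost xxx starts at the very first character.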
Awk (like Perl) is free-format, so one could express the command as something like
awk 'BEGIN{found=0;}/xxx/{if(!found){found=1;$0=substr($0,index($0,"xxx")+3);}}/yyy/{if(found){found=2;$0=substr($0,1,index($0,"yyy")-1);}}{if(found){print;if(found==2)found=0;}}'
but that is rarely done, except for the sake of example. Likewise, sed scripts (line-oriented) can be combined into a single line with some restrictions. Again, complex scripts in sed are rarely dealt with in that manner. Rather, they are treated like real programs (see example).
Since I'm the only one who used perl, I guess you're referring to my solution. And yes, that too fails horribly on your last example. – Henrik supports the community Mar 31 '16 at 22:12
Here is a solution in Python:
import re
import sys

# Read the whole file so that a match can span several lines.
with open(sys.argv[1]) as f:
    text = f.read()

reg = re.compile("xxx((?:.|\n)*)yyy")
for match in reg.finditer(text):
    print(match.group(1))
Save this script as a file "post.py" and launch it with:
python post.py your_file_to_search_in.txt
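The script compiles a regular expression and prints all occurrences found in the text of the file. (?:.|\n) is a non-capturing group matching any character, including a newline. Because * is greedy, a file with several xxx ... yyy pairs yields a single match running from the first xxx to the last yyy.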
Edit: solution improved thanks to 1_CR's tips:
import re
import sys

with open(sys.argv[1]) as f:
    text = f.read()

# re.DOTALL lets . match newlines too, so a match can span lines.
reg = re.compile(r'xxx(.*)yyy', re.DOTALL)
for match in reg.finditer(text):
    print(match.group(1))

Just pass re.DOTALL to avoid having to specify \n separately. Place xxx in a lookbehind assertion and yyy in a lookahead assertion, since OP only needs the text in between these two. Lastly, as good practice, always raw-string your regexes. – iruvar Apr 02 '16 at 22:41
Thank you for the advice. I've edited my answer to take it into account, except for the lookbehind and lookahead part. My grouping parentheses don't include xxx or yyy, so OP will only get the text between these two. – manu190466 Apr 04 '16 at 18:45
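For completeness, the lookbehind/lookahead variant suggested in the comment could look roughly like the sketch below. It is an illustration only: the function name `find_between` is made up here, and the non-greedy `.*?` is an extra assumption (it is not in the answer above) so that several xxx ... yyy pairs are reported separately.

```python
import re

# (?<=xxx) and (?=yyy) assert the delimiters without consuming them, so the
# match itself is only the text in between. re.DOTALL lets . match newlines.
PATTERN = re.compile(r"(?<=xxx).*?(?=yyy)", re.DOTALL)

def find_between(text):
    """Return every stretch of text found between xxx and yyy."""
    return PATTERN.findall(text)

if __name__ == "__main__":
    sample = "a xxx one\ntwo yyy b xxx three yyy"
    print(find_between(sample))  # [' one\ntwo ', ' three ']
```

With the assertions there is no capture group to unpack; the whole match is the wanted text.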
A solution that also works when xxx and yyy are not on the same line:
cat /tmp/xxx-to-yyy| perl -ne '(/xxx/../yyy/) && print' | perl -pe 's/.*(xxx.*)/$1/' | perl -pe 's/(.*yyy).*/$1/'
Not exactly pretty...
The -e switch to perl just gives the script on the command line. The -n and -p switches make it loop over the input lines; with -p the lines are printed after the script has run, with -n they aren't. So basically this just sends the file through three perl loops.
.. is a range operator that returns false until the left condition returns true, and false again after the right condition has returned true, so the first loop cuts the file down to the lines between the two strings (both included). The last two perl commands remove the text before xxx and after yyy.
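The behaviour of the range operator in the first loop can be sketched in Python as a small generator. This is only an approximation for illustration: the name `flip_flop` is made up here, and it uses plain substring tests where the perl command uses the regexes /xxx/ and /yyy/.

```python
def flip_flop(lines, start="xxx", end="yyy"):
    """Yield the lines from one containing start through one containing end."""
    inside = False
    for line in lines:
        if not inside and start in line:
            inside = True  # left condition became true: the range opens
        if inside:
            yield line  # both boundary lines are included
            if end in line:
                inside = False  # right condition became true: the range closes

if __name__ == "__main__":
    sample = ["a", "b xxx c", "d", "e yyy f", "g"]
    print(list(flip_flop(sample)))  # ['b xxx c', 'd', 'e yyy f']
```

As with perl's two-dot form, the end condition is checked on the same line that opened the range, so a line containing both strings forms a complete range on its own.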
pcregrep -Mo '(?<=STRING1)(\n|.)*?(?=STRING2)' infile – don_crissti Mar 31 '16 at 21:33