Extract a substring with sed that stops at the first occurrence of the end

Question

I have a string from which I need to extract a substring, but the end of my regex is repeated. I would like sed to stop at the first instance of the end of my regex, much like instr() functions in many languages return the first instance. Example:

echo "This is a test some stuff I want string junk string end" | sed -n 's/.*\(.te.*ng\).*/\1/p' 
returns: test some stuff I want string junk string
I want to return: test some stuff I want string

See Non-greedy match with SED regex (emulate perl's .*?) – steeldriver May 27 '17 at 20:36

RomanPerekhrest · Accepted Answer · 2017-05-27T20:56:32.627

2

grep approach (requires PCRE support):

s="This is a test some stuff I want string junk string end"
grep -Po 'te.*?ng' <<< "$s"

Alternative perl approach:

perl -ne 'print "$&\n" if /te.*?ng/' <<< "$s"

The output (for both approaches):

test some stuff I want string

.*? - the ? here is the non-greedy modifier; it tells the regex engine to match as few characters as possible.

edited May 27 '17 at 20:56

answered May 27 '17 at 20:44


s="This is a test some stuff I want string junk string end" grep -Po 'te.*?ng' <<< $s – Ethan May 27 '17 at 21:15

score 1 · Answer 2 · answered May 27 '17 at 23:10

Do it in two steps: first remove the prefix (in case the terminator was present in the prefix), then remove everything after the first terminator. Use the T command (a GNU sed extension) to skip a line if it doesn't match:

echo "This is a test some stuff I want string junk string end" |
sed -n 's/.*\(.te.*ng\)/\1/; T; s/\(ng\).*/\1/p'

Alternatively, delete the non-matching lines first, then perform the replacement at your leisure.

echo "This is a test some stuff I want string junk string end" |
sed '/.*\(.te.*ng\)/!d; s/.*\(.te.*ng\)/\1/; s/\(ng\).*/\1/'

Alternatively, perform the replacements and final printing only on matching lines.

echo "This is a test some stuff I want string junk string end" |
sed -n '/.*\(.te.*ng\)/ { s/.*\(.te.*ng\)/\1/; s/\(ng\).*/\1/p; }'
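The same two-step idea (drop the prefix, then cut at the first terminator) can also be sketched without sed, using POSIX shell parameter expansion. This is only an illustration: extract_first is a hypothetical helper name, and it treats the start and end markers as literal strings rather than regexes.

```shell
# extract_first STRING START END
# Prints the text from the first START through the first END after it.
# Note: plain POSIX sh has no local variables, so these names are global.
extract_first() {
    s=$1 start=$2 end=$3
    # Give up if the start marker is absent.
    case $s in *"$start"*) ;; *) return 1 ;; esac
    # Step 1: remove the shortest prefix ending in START.
    rest=${s#*"$start"}
    # Give up if no end marker follows the start marker.
    case $rest in *"$end"*) ;; *) return 1 ;; esac
    # Step 2: remove the longest suffix beginning with END,
    # which cuts at the FIRST occurrence of END.
    body=${rest%%"$end"*}
    printf '%s\n' "$start$body$end"
}

extract_first "This is a test some stuff I want string junk string end" te ng
```

Because the prefix is removed before the terminator is searched for, an "ng" that appears before "te" does not confuse the result.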

ADDB · Answer 3 · 2017-05-27T20:40:44.280

I would suggest splitting the line on the terminator, in the style of the cut command. Note that cut -d itself only accepts a single-character delimiter, so cut -d "string" does not work; awk -F, which accepts a multi-character separator, is used below instead:

echo "I am a useful and I am a string. Did I mention that I'm a string?" | awk -F 'string' '{print $1}'

This splits the line into three parts (before the first 'string', between the two, and after the second). With -F you choose the pattern to split on, and with $N you choose which part to take. Problem: the 'string' itself is removed. Solution:

String=$(echo "I am a useful and I am a string. Did I mention that I'm a string?" | awk -F 'string' '{print $1}')
String="${String}string"
echo "$String"

This appends the delimiter "string" that was removed to the end of the String variable that was defined with the output.

score 0 · Answer 4 · answered May 27 '17 at 23:54

steeldriver has rightly pointed to Non-greedy match with SED regex (emulate perl's .*?), where John1024 clearly states:

Sed regexes match the longest match. Sed has no equivalent of non-greedy.

Thus, there are two ways to get around the issue. One is to use a tool that actually has non-greedy matching, like perl:

$ str="This is a test some stuff I want string junk string end"
$ perl -pe 's/^.*(te.*?ng).*/$1/' <<< "$str"
test some stuff I want string

Alternatively, you could give sed more context for grouping the match, i.e. add what is going to follow the first "string" word. No ? is needed here, since sed would not treat it as a non-greedy modifier anyway:

$ sed -r 's/^.*(te.*ng) junk.*/\1/' <<< "$str"
test some stuff I want string
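A related workaround, applicable only when the terminator is a single character, is a negated bracket expression: [^;]* cannot run past a ";", so the match necessarily ends at the first one. The sample line below is made up for illustration and is not from the question.

```shell
# Hypothetical sample line; the terminator is the single character ";".
line='a=1; b=2; c=3;'

# [^;]* cannot cross a ";", so the group ends at the first ";" after "b=".
result=$(printf '%s\n' "$line" | sed -n 's/.*\(b=[^;]*;\).*/\1/p')

printf '%s\n' "$result"
```

This does not help directly with a multi-character terminator such as "string", which is why the approaches above add context or switch tools.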

score 0 · Answer 5 · 2017-05-28T11:05:32.030

# How to perform the non-greedy match: "test .*? string" using POSIX sed

sed -e '
   /test.*string/!d;      # non-interesting line
   /^test/s/string/&\
/;                        # append marker after the first substring "string"
   /\n/{P;d;}             # initial portion of pattern space is our result
   s/test/\
&/;D;                     # remove portion before the substring "test"
' yourfile

Another POSIX-compliant method is to strip the substring "string" from the end of the pattern space, one occurrence at a time, until just one is left after the substring "test". What then remains is to bring the substring "test" to the front:

sed -e '
   :loop
      s/\(test.*string\).*string.*/\1/
   tloop
   /^test/!s/test/\
&/;/\n/D
' yourfile
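For comparison, the instr()-style search mentioned in the question maps fairly directly onto awk's index() and substr() functions. The following is a minimal sketch, not a drop-in replacement for the sed scripts above: first_span_awk is a hypothetical helper name, both markers are treated as literal strings, and lines lacking either marker are skipped.

```shell
# first_span_awk START END
# Reads lines on stdin; prints the text from the first START through
# the first END that follows it.
# Caveat: awk -v interprets backslash escapes in the values passed in.
first_span_awk() {
    awk -v start="$1" -v end="$2" '{
        i = index($0, start)                  # position of first START
        if (i == 0) next
        rest = substr($0, i + length(start))  # text after START
        j = index(rest, end)                  # first END after START
        if (j == 0) next
        print substr($0, i, length(start) + j - 1 + length(end))
    }'
}

echo "This is a test some stuff I want string junk string end" |
first_span_awk test string
```

Since index() always returns the leftmost occurrence, no looping or marker insertion is needed.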
