4

I have a text file and want to use sed to extract only the text that lies between a certain start string and a certain end string.

For example, in the line:

string>![TEST[Extract this string]>/string>

I want to get

Extract this string 

How would you implement this with sed? Basically, I want to get the text that starts with the expression "string>![TEST[" and ends with the expression "]>/string>".

рüффп
  • 1,707
Vivek
  • 65

6 Answers

8
sed -e 's/string>!\[TEST\[\(.*\)]>\/string>/\1/' file

or

sed -e 's|string>!\[TEST\[\(.*\)]>/string>|\1|' file

Output:

Extract this string
Cyrus
  • 12,309
5

You need to tell sed not only what to match, but what to save as well:

sed -ne 's@string>!\[TEST\[\([^]]*\)\]>/string>@\1@gp'

The s command in sed takes two arguments: a regular expression and a replacement string. Typically, the / delimiter is used to separate the two, but you can use any character, in this case @. Some characters are special in regular expressions, such as [ and ]. These need to be escaped with \ if you want the literal character, e.g. string>!\[. The \([^]]*\) captures everything between the square brackets, and \1 replaces the whole match with that captured text. At the end is @gp, which tells sed to match multiple times on the line (g) and to print the replaced line (p), since the -n option tells sed not to print lines by default.
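As a rough illustration of why the capture group uses [^]]* rather than .*, here is a small sketch. The sample line with two delimited strings is made up for this example, and the closing ] is written unescaped, which is an assumption about what your sed accepts (it matches the form used in the other answer).

```shell
# Made-up sample line containing two delimited strings.
line='string>![TEST[first]>/string> x string>![TEST[second]>/string>'

# Greedy \(.*\) runs from the first opening delimiter to the LAST closing one.
greedy=$(printf '%s\n' "$line" | sed -n 's@string>!\[TEST\[\(.*\)]>/string>@\1@p')

# \([^]]*\) cannot cross a ], so with g each occurrence is replaced separately.
bounded=$(printf '%s\n' "$line" | sed -n 's@string>!\[TEST\[\([^]]*\)]>/string>@\1@gp')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The bounded form keeps whatever sits between the matches (here " x "), so it is only a clean extraction when the delimited strings are all you expect on the line.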

John
  • 298
Arcege
  • 22,536
4

A simple approach with Awk:

awk -F'[][]' '{print $3}' file
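To see what that field separator does, here is a small sketch that prints every field for the sample line from the question. The bracket expression [][] matches either ] or [, so the line is split at each square bracket. Note this assumes the wanted text contains no square brackets itself.

```shell
# Sample line from the question.
line='string>![TEST[Extract this string]>/string>'

# Split on [ or ] and print each field with its index.
fields=$(printf '%s\n' "$line" |
  awk -F'[][]' '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }')

printf '%s\n' "$fields"
```

The third field is the text between the second [ and the first ], which is why the one-liner prints $3.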
jasonwryan
  • 73,126
3
sed '/\n/P;//D;y|]|\n|
    s|\n>/string>|]|
    y|[]\n|\n[]|
    s|string>!\nTEST\n\(.*\[\)|[\1|
    y|\n[|[\n|;D' <<\IN
    string>![TEST[][]Extract[ ]this[ ]string[][]>/string>
IN

It may be that you can specify that the square brackets are acceptable delimiters here, but, if so, it seems strange that the end delimiters would be so elaborate. In any case, the question only states that you need to get the text from between string>![TEST[ and ]>/string>, so that's what this tries to do, though it does fail if the text spans newline boundaries.

Anyway, it works by:

  1. y|]|\n| - It first translates every occurrence of ] on a line to a \newline.
  2. s|\n>/string>|]| - It next replaces the first occurring \newline which is followed immediately by your right end delimiter with ] (which makes it the only possible ] on a line at that time).
  3. y|[]\n|\n[]| - If the last substitution was successful that one ] is translated to a [ while all \newlines are translated back to ] and all [ are simultaneously translated to \newlines - the three character types are shifted, basically.
  4. s|string>!\nTEST\n\(.*\[\)|[\1| - If the left end delimiter is found preceding a [ at that time then it must be that both ends of the first occurrence of text have been found. That match is substituted for [.
  5. y|\n[|[\n| - And so in the last translation if there are any [ on a line at all they will become newlines and all newlines will become [.

At this point everything up to the first occurring newline (or the entire line if there are none at all) is Deleted. If anything remains it is sent to the top of the script. If the previous iteration resulted in two \newlines in pattern space - both ends of your delimited text then it is Printed up to the first occurring \newline. Else the pattern space already tested is cleared and the cycle continues.

And so the above example prints:

][]Extract[ ]this[ ]string[][

...and it will print each on a separate line as many similarly delimited strings as can be fully left and right delimited per line or nothing at all.
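For comparison, here is a small sketch of how the simpler substitutions from the other answers behave on the same bracket-heavy input. This is not part of the script above, only a check of the two one-liner patterns; the variable names are made up for the example.

```shell
# The input used in the here-document above.
line='string>![TEST[][]Extract[ ]this[ ]string[][]>/string>'

# A capture that excludes ] cannot match, because the text itself contains ].
bounded=$(printf '%s\n' "$line" | sed -n 's@string>!\[TEST\[\([^]]*\)]>/string>@\1@gp')

# A greedy capture does match, but only handles one occurrence per line.
greedy=$(printf '%s\n' "$line" | sed -n 's@string>!\[TEST\[\(.*\)]>/string>@\1@p')

printf 'bounded: [%s]\n' "$bounded"
printf 'greedy:  [%s]\n' "$greedy"
```

The bounded pattern prints nothing for this line, while the greedy one returns the same text as the script, as long as there is only one delimited string on the line.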

mikeserv
  • 58,310
  • In this case, can it be simpler: sed -e 's#[^!]*string>!\[TEST\[\|\]>/string>[^!]*##g'? – Costas Mar 07 '15 at 20:34
  • @Costas - no. That doesn't work if there is more than one per line - or if the left-end delimiter occurs within text. – mikeserv Mar 07 '15 at 20:38
  • Agreed. What do you say about 's#string>!\[TEST\[\|\]>/string>$#^#g;s/[^^]*^\([^^]*\)^[^^]*/\1 /g'? If you are not sure about ^, feel free to change it to something more exotic. – Costas Mar 07 '15 at 20:45
  • @Costas - exotic doesn't work for unknown strings - it doesn't matter how rare it is - if it is a possibility then it is a bug. You have to sanitize it first, or use delimiters that cannot otherwise occur. – mikeserv Mar 07 '15 at 20:58
1

Through GNU grep,

$ echo 'string>![TEST[Extract this string]>/string> foo bar string>![TEST[Extract this string]>/string>' | grep -oP 'string>!\[TEST\[\K.*?(?=]>/string>)'
Extract this string
Extract this string
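A small sketch of why the pattern uses the lazy .*? rather than .*, assuming a GNU grep built with -P (PCRE) support. The sample line is made up. \K drops the left delimiter from the reported match, and the lookahead (?=...) requires the right delimiter without including it.

```shell
# Made-up sample line with two delimited strings.
line='string>![TEST[one]>/string> foo string>![TEST[two]>/string>'

# Lazy .*? stops at the nearest closing delimiter: one match per occurrence.
lazy=$(printf '%s\n' "$line" | grep -oP 'string>!\[TEST\[\K.*?(?=]>/string>)')

# Greedy .* runs on to the last closing delimiter: a single, overlong match.
greedy=$(printf '%s\n' "$line" | grep -oP 'string>!\[TEST\[\K.*(?=]>/string>)')

printf 'lazy:\n%s\n' "$lazy"
printf 'greedy:\n%s\n' "$greedy"
```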
mikeserv
  • 58,310
Avinash Raj
  • 3,703
0

Also answered and tested here, including an example that extracts a key's value from JSON using only grep:

https://unix.stackexchange.com/a/694241/518235

No jq, awk, sed:

#!/bin/bash
json='{"access_token":"kjdshfsd", "key2":"value"}'

echo "$json" | grep -o '"access_token":"[^"]*' | grep -o '[^"]*$'

Tested & working here: https://ideone.com/Fw4How

source: https://brianchildress.co/parse-json-using-grep

social
  • 299
  • I'm not sure if this might have served better as a "Related:" comment under the question, unless you [edit] and include the details from your previous answer in this answer. – Greenonline Mar 14 '22 at 03:31