How to extract text using sed

Question

I have a text file and want to extract only the text that begins and ends with certain strings, using sed.

For example, in the line:

string>![TEST[Extract this string]>/string>

I want to get

Extract this string

How would you implement this with sed? Basically, I want to get the text that starts with the expression "string>![TEST[" and ends with the expression "]>/string>".

It's possible with sed, awk, perl and so on. It's related to your job! — PersianGulf, Mar 07 '15 at 19:27
This looks like an XML CDATA section. Are you sure you don't want to use XML/XSL tools? — Archemar, Mar 07 '15 at 21:59

score 8 · Answer 1 · answered Mar 07 '15 at 19:18

sed -e 's/string>!\[TEST\[\(.*\)]>\/string>/\1/' file

or

sed -e 's|string>!\[TEST\[\(.*\)]>/string>|\1|' file

Output:

Extract this string

Cyrus

12,309

score 5 · Answer 2 · edited Dec 08 '21 at 23:56

You need to tell sed not only what to match, but what to save as well:

sed -ne 's@string>!\[TEST\[\([^]]*\)\]>/string>@\1@gp'

The s command in sed takes two arguments: a regular expression and a replacement string. Typically, the / delimiter is used to separate the two, but you can use any character, in this case @. Some characters are special in regular expressions, like [ and ]. These need to be quoted with \ if you want the literal character, e.g. string>!\[. The \([^]]*\) captures everything between the square brackets, and the \1 replaces the matched text with what the capture group matched. At the end is @gp, which tells sed to match multiple times on the line (g) and to print the replaced line (p), since the -n option tells sed not to print lines by default.
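As a small illustration of how -n, g and p interact, here is a sketch run against made-up sample input (the two-line input and the variable names are assumptions, not part of the question). Note that text outside the delimiters on a matching line is kept, and lines without a match are not printed at all.

```shell
# Hypothetical sample input: one line with two delimited strings,
# and one line with no delimiters at all.
input='string>![TEST[one]>/string> foo string>![TEST[two]>/string>
no delimiters here'

# -n suppresses automatic printing; g replaces every match on the line;
# p prints the line only if at least one substitution happened.
result=$(printf '%s\n' "$input" |
    sed -ne 's@string>!\[TEST\[\([^]]*\)\]>/string>@\1@gp')

printf '%s\n' "$result"
# prints: one foo two
```

If the surrounding text (here " foo ") must also be dropped, the substitution alone is not enough; the other answers on this page deal with that case.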

To make the pattern shorter: sed -ne 's@\(string>\)!\[TEST\[\([^]]*\)\]>/\1@\2@gp' — Costas, Mar 07 '15 at 19:31
It captures everything which is not a right-side square bracket. — mikeserv, Mar 07 '15 at 19:31

score 4 · Answer 3 · answered Mar 07 '15 at 19:26

A simple approach with Awk:

awk -F'[][]' '{print $3}' file
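To see what that field separator does, the following sketch prints every field of the sample line from the question (the loop and the variable names are only for illustration). The separator '[][]' is a bracket expression containing the two characters ] and [; a ] placed first inside a bracket expression is taken literally, so each single bracket splits the line. By contrast, '[[]]' is read as the bracket expression [[] followed by a literal ], so it only matches the two-character sequence [].

```shell
line='string>![TEST[Extract this string]>/string>'

# Split on any single [ or ] and list the resulting fields with their index.
fields=$(printf '%s\n' "$line" |
    awk -F'[][]' '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }')

printf '%s\n' "$fields"
# prints:
# 1=string>!
# 2=TEST
# 3=Extract this string
# 4=>/string>
```

The wanted text is the third field, which is why the one-liner prints $3.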

jasonwryan

73,126

Correct only if there is a single occurrence of the string in the line. – Costas Mar 07 '15 at 19:32
@Costas Like I said, a simple approach (and based on the total amount of information provided by the OP)... – jasonwryan Mar 07 '15 at 19:34
So, to be fully correct for the OP, it should be awk -F'string>\\!\\[TEST\\[|\\]' '$0 ~ FS{print $2}' – Costas Mar 07 '15 at 20:14
@Costas Sure, but that no longer qualifies as "simple". :) – jasonwryan Mar 07 '15 at 20:16
Simple? See mikeserv answer – Costas Mar 07 '15 at 20:25
@Costas Well, pretty much anything is simple by that comparison... – jasonwryan Mar 07 '15 at 20:32
@mikeserv I agree that the question could be read that way; but equally, given the lack of specificity, it could be read exactly the way I interpreted it. Your approach is undoubtedly more comprehensive, but there is nothing wrong with taking the question at face value either, IMO. – jasonwryan Mar 07 '15 at 21:21
@mikeserv There was no need to defend your own: your familiarity with sed is truly impressive and my comment was a mixture of jest and admiration... – jasonwryan Mar 07 '15 at 22:03
@jasonwryan What's -F'[][]'? Could you please explain that in more detail for me? – John Dec 07 '21 at 06:53
@jasonwryan Why does awk -F'[[]]' not work? – John Dec 07 '21 at 07:15

mikeserv · Answer 4 · 2015-03-07T21:37:58.463

sed '/\n/P;//D;y|]|\n|
    s|\n>/string>|]|
    y|[]\n|\n[]|
    s|string>!\nTEST\n\(.*\[\)|[\1|
    y|\n[|[\n|;D' <<\IN
    string>![TEST[][]Extract[ ]this[ ]string[][]>/string>
IN

It may be that you can specify that the square brackets are acceptable delimiters here, but, if so, it seems strange that the end delimiters would be so elaborate. In any case, the question only states that you need to get the text between string>![TEST[ and ]>/string>, so that's what this tries to do, though it does fail if the text spans newline boundaries.

Anyway, it works by:

y|]|\n| - It first translates all occurrences of ] on a line to a \newline.
s|\n>/string>|]| - It next replaces the first occurring \newline which is followed immediately by your right end delimiter with ] (which makes it the only possible ] on a line at that time).
y|[]\n|\n[]| - If the last substitution was successful that one ] is translated to a [ while all \newlines are translated back to ] and all [ are simultaneously translated to \newlines - the three character types are shifted, basically.
s|string>!\nTEST\n\(.*\[\)|[\1| - If the left end delimiter is found preceding a [ at that time then it must be that both ends of the first occurrence of text have been found. That match is substituted for [.
y|\n[|[\n| - And so in the last translation if there are any [ on a line at all they will become newlines and all newlines will become [.

At this point everything up to the first occurring newline (or the entire line if there are none at all) is Deleted. If anything remains it is sent to the top of the script. If the previous iteration resulted in two \newlines in pattern space (both ends of your delimited text), then it is Printed up to the first occurring \newline. Else the pattern space already tested is cleared and the cycle continues.

And so the above example prints:

][]Extract[ ]this[ ]string[][

...and it will print each on a separate line as many similarly delimited strings as can be fully left and right delimited per line or nothing at all.
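As a sketch of that last point, the same script can be run on a line that holds two delimited strings separated by other text. The input line and the variable names here are made up for illustration, and the trace assumes a sed that accepts \n inside y and in patterns, as GNU sed does.

```shell
# Hypothetical input: two delimited strings on one line, with " x " between them.
input='string>![TEST[first]>/string> x string>![TEST[second]>/string>'

# The script is the one from this answer, unchanged.
result=$(printf '%s\n' "$input" | sed '/\n/P;//D;y|]|\n|
    s|\n>/string>|]|
    y|[]\n|\n[]|
    s|string>!\nTEST\n\(.*\[\)|[\1|
    y|\n[|[\n|;D')

printf '%s\n' "$result"
# prints:
# first
# second
```

Each pass through the script isolates one delimited string between two newlines, prints it, deletes it, and restarts on whatever is left of the line, so the text in between (" x ") never reaches the output.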

In this case, can it be simpler: sed -e 's#[^!]*string>!\[TEST\[\|\]>/string>[^!]*##g'? — Costas, Mar 07 '15 at 20:34
@Costas - no. That doesn't work if there is more than one per line - or if the left-end delimiter occurs within text. — mikeserv, Mar 07 '15 at 20:38
Agreed. What do you say to 's#string>!\[TEST\[\|\]>/string>$#^#g;s/[^^]*^\([^^]*\)^[^^]*/\1 /g'? If you are not sure about ^, feel free to change it to something more exotic. — Costas, Mar 07 '15 at 20:45
@Costas - exotic doesn't work for unknown strings - it doesn't matter how rare it is - if it is a possibility then it is a bug. You have to sanitize it first, or use delimiters that cannot otherwise occur. — mikeserv, Mar 07 '15 at 20:58

score 1 · Answer 5 · edited Mar 08 '15 at 05:18

Through GNU grep,

$ echo 'string>![TEST[Extract this string]>/string> foo bar string>![TEST[Extract this string]>/string>' | grep -oP 'string>!\[TEST\[\K.*?(?=]>/string>)'
Extract this string
Extract this string

mikeserv

58,310

answered Mar 08 '15 at 04:45

Avinash Raj

3,703

social · Answer 6 · 2022-03-16T00:30:51.087

0

Also answered and tested here, including an example that extracts a key's value from JSON using only grep:

https://unix.stackexchange.com/a/694241/518235

No jq, awk, sed:

#!/bin/bash
json='{"access_token":"kjdshfsd", "key2":"value"}'
echo "$json" | grep -o '"access_token":"[^"]*' | grep -o '[^"]*$'

Tested & working here: https://ideone.com/Fw4How

source: https://brianchildress.co/parse-json-using-grep

edited Mar 16 '22 at 00:30

answered Mar 13 '22 at 19:04

social

299

I'm not sure if this might have served better as a "Related:" comment under the question, unless you [edit] and include the details from your previous answer in this answer. – Greenonline Mar 14 '22 at 03:31
