How to print only occurrences of pattern in sed?

Question

I have some data separated by semicolons on my Linux machine. I need to find the Nth (for example, the 3rd) word and print it instead of the entire line. I have the following script that finds the wanted pattern and wraps it in underscores, so I can see that it works correctly:

sed 's/\;[^;]*\;/_&_/3'

For example for this input:

A1a 77l;a3sSs 2 smm;AS 3N123N8j a5njs;M3Xa 4 4a 3n1J  S2a;sm i;A9S;dd d3

it outputs:

A1a 77l;a3sSs 2 smm;AS 3N123N8j a5njs;M3Xa 4 4a 3n1J  S2a;sm i_;A9S;_dd d3

Now that I have found the pattern, I want to print just the match instead of the whole line, so that the output will be:

A9S

Why does that look like a screenscrape of color codes? I wonder if https://unix.stackexchange.com/questions/287744/convert-output-of-script1-to-pdf/287766#287766 or https://unix.stackexchange.com/questions/276202/how-to-properly-log-the-output-of-a-console-program-that-frequently-updates-par/276288#276288 or https://stackoverflow.com/questions/28269278/can-i-programmatically-burn-in-ansi-control-codes-to-a-file-using-unix-utils would help? — Jeff Schaller, Apr 21 '20 at 17:16
I suggest to take a look at awk: awk -v c="6" -F ';' '{print $c}' file — Cyrus, Apr 21 '20 at 17:36
Hmm ... So sed is too weak and the only option is to use an ugly script like this - sed -e 's/[^;]*;[^;]*;//;s/[^;]*;[^;]*;[^;]*;//' -e 's/\([^;]*\).*/\1/' or awk/perl/anything? — yomol777, Apr 21 '20 at 17:49

score 3 · Accepted Answer · edited Apr 21 '20 at 19:45

sed -E 's/(([^;]*);){6}.*/\2/'

will do it, where 6 is the field number that you want to capture.

(If you specify a field number greater than the number of fields in your input, it just echoes the input without doing any substitution.)

I've used the -E option, which enables extended regular expressions. Depending on the version of sed you have, you might need to use -r instead. Alternatively, skip the option, so that you're using basic regular expressions, and escape the parentheses and curly braces:

sed 's/\(\([^;]*\);\)\{6\}.*/\2/'

How it works:

sed will find a match at the earliest possible position, and in this case, there's a match starting at the first character (assuming that there are at least 6 fields in your input). The outer parenthetical expression matches a field followed by a ; delimiter. The command will match 6 of these successively (or whatever number you specify). The .* at the end matches the rest of the line. As a result, the entire line gets replaced.

What does it get replaced with? \2 refers to the inner parenthesized expression (the one that starts with the 2nd left parenthesis). That inner parenthetical expression actually gets matched 6 times, but sed will use the very last match, which is what you want.
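As a quick check of the explanation above, here is a minimal sketch that wraps the command in a shell function so the field number can be varied. The function name nth_field is made up for illustration, and the sketch assumes a sed that accepts -E:

```shell
#!/bin/sh
# Hypothetical helper: print field N of each ';'-separated line read from stdin.
nth_field() {
    # $1 is the 1-based field number; the shell expands it into the interval {N}.
    sed -E "s/(([^;]*);){$1}.*/\2/"
}

line='A1a 77l;a3sSs 2 smm;AS 3N123N8j a5njs;M3Xa 4 4a 3n1J  S2a;sm i;A9S;dd d3'

printf '%s\n' "$line" | nth_field 3   # AS 3N123N8j a5njs
printf '%s\n' "$line" | nth_field 6   # A9S
printf '%s\n' "$line" | nth_field 9   # no 9th field: the line is echoed unchanged
```

Note that the pattern requires a ; after the requested field, so asking for the last field of a line that has no trailing delimiter (field 7 here) also leaves the line unchanged.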

A version with better functionality:

This version deletes the line entirely, printing nothing for it, if the indicated field doesn't exist (in the example, if there are fewer than 6 fields in your input):

sed -E 's/(([^;]*);){6}.*/\2/;t;d'

On OS X's versions of sed (and maybe BSD in general?), this seems to need to be written on two lines:

sed -E 's/(([^;]*);){6}.*/\2/;t
d'

With no label, the command t branches to the end of the script if a substitution was made, which ends sed's processing of this input line (the result is still printed).

So if the 6th field exists, the substitution is made as before, and the t command ends processing of this input line. But if the 6th field doesn't exist, no substitution is made by the s command, so the t doesn't branch; sed just goes on to the d command, which deletes the input line (that's what we want to do if there are fewer than 6 fields in the input line).
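To see the t/d behavior in action, here is a small sketch run on three made-up input lines. The function name nth_field_or_drop is hypothetical. It passes the script as separate -e fragments, which sed joins with newlines, so the t command ends its own line; this is an assumption about portability that I have not checked on every sed:

```shell
#!/bin/sh
# Hypothetical helper: print field N, and print nothing for lines that lack it.
nth_field_or_drop() {
    # s: replace the whole line with field N when it exists.
    # t: if that substitution happened, jump to the end of the script (line is printed).
    # d: otherwise delete the line, so nothing is printed for it.
    sed -E -e "s/(([^;]*);){$1}.*/\2/" -e t -e d
}

printf '%s\n' 'a;b;c;d;e;f;g' 'too;short' 'p;q;r;s;t;u;v;w' | nth_field_or_drop 6
# prints "f" and then "u"; the 'too;short' line produces no output
```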

This is rather neat. Well done, and thanks for turning my brain inside out with the back references. — Kusalananda, Apr 21 '20 at 18:08
Thanks. I just added in an improved version with better functionality in case that's useful to you. — Mitchell Spector, Apr 21 '20 at 18:34
Is there any sed out there that supports -r but not -E? My impression was that while -r is a GNU thing, -E is more common and although neither is POSIX, if a sed flavor supports one it will be -E. Is that wrong? — terdon, Apr 21 '20 at 19:47
@terdon My recollection is that GNU sed may have once just supported -r, while BSD sed just supported -E. Then GNU added support for -E for compatibility with BSD, but it was undocumented for a long time (it's in the GNU manual now). — Mitchell Spector, Apr 21 '20 at 20:25
@terdon According to the GNU sed manual at https://www.gnu.org/software/sed/manual/sed.html : -E "was a GNU extension, but the -E extension has since been added to the POSIX standard (http://austingroupbugs.net/view.php?id=528), so use -E for portability. GNU sed has accepted -E as an undocumented option for years, and *BSD seds have accepted -E for years as well, but scripts that use -E might not port to other older systems." — Mitchell Spector, Apr 21 '20 at 20:26

Kusalananda · Answer 2 · 2020-04-21T18:10:19.260

To get the 3rd ;-delimited field from your file, use cut:

$ cut -d ';' -f 3 file
AS 3N123N8j a5njs

To get the field you're showing, cut out the 6th field:

$ cut -d ';' -f 6 file
A9S

You may also use awk to do this with awk -F ';' '{ print $6 }' file.

With sed, you can't use the /n flag for the s command (with n being a digit), because you need to replace the whole line. This involves actually matching the whole line and not just a particular field.
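A small illustration of that point, using a made-up four-field line and a hypothetical function name mark_nth: the numeric flag picks out the Nth match, but everything outside that match stays in the output.

```shell
#!/bin/sh
# Hypothetical helper: wrap the Nth "field;" match in <> and leave the rest alone.
mark_nth() {
    # $1 becomes the numeric flag of the s command (replace only the Nth match).
    sed "s/[^;]*;/<&>/$1"
}

printf '%s\n' 'a;b;c;d' | mark_nth 3   # a;b;<c;>d
```

Only the third match is touched; the first two fields and the trailing d are still printed, which is why the flag alone cannot reduce the line to a single field.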

One way to get the 6th ;-delimited field would thus be to use

$ sed 's/^\([^;]*;\)\{5\}\([^;]*\);.*/\2/' file
A9S

or, if your sed supports extended regular expressions with -E,

$ sed -E 's/^([^;]*;){5}([^;]*);.*/\2/' file
A9S

That is, match five fields, where each field matches [^;]*; (this includes the terminating ; for each field), and then the field we're after, followed by a ; and the rest of the line. Substitute all that with the field we're after.
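As a sanity check, the three approaches can be run side by side on the sample line from the question. This is only a sketch; the variable names are made up for illustration:

```shell
#!/bin/sh
line='A1a 77l;a3sSs 2 smm;AS 3N123N8j a5njs;M3Xa 4 4a 3n1J  S2a;sm i;A9S;dd d3'

# Extract the 6th ';'-delimited field with each tool in turn.
with_cut=$(printf '%s\n' "$line" | cut -d ';' -f 6)
with_awk=$(printf '%s\n' "$line" | awk -F ';' '{ print $6 }')
with_sed=$(printf '%s\n' "$line" | sed 's/^\([^;]*;\)\{5\}\([^;]*\);.*/\2/')

printf '%s\n' "$with_cut" "$with_awk" "$with_sed"   # A9S, three times
```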

In short, you are better off using cut or awk for this task.
