Grep Match and extract

Question

I have a file that contains lines like these:

proto=tcp/http  sent=144        rcvd=52 spkt=3 
proto=tcp/https  sent=145        rcvd=52 spkt=3
proto=udp/dns  sent=144        rcvd=52 spkt=3

I need to extract the value of proto, which here is tcp/http, tcp/https, or udp/dns.

So far I have tried grep -o 'proto=[^/]*/', but I am only able to extract proto=tcp/.

Possible duplicate of Can grep output only specified groupings that match? — Julien Lopez, Jun 08 '19 at 08:15

Kusalananda · Answer 1 · 2019-06-08T22:06:55.453

With grep -o, you will have to match exactly what you want to extract. Since you don't want to extract the proto= string, you should not match it.

An extended regular expression that would match either tcp or udp followed by a slash and some non-empty alphanumeric string is

(tcp|udp)/[[:alnum:]]+

Applying this on your data:

$ grep -E -o '(tcp|udp)/[[:alnum:]]+' file
tcp/http
tcp/https
udp/dns

To make sure that we only do this on lines that start with the string proto=:

grep '^proto=' file | grep -E -o '(tcp|udp)/[[:alnum:]]+'
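
As a rough illustration of the point about grep -o printing exactly what the pattern matched, the sketch below runs the attempt from the question and the two-step pipeline side by side. The sample variable is only a stand-in for the file in the question.

```shell
# Stand-in for the file in the question (an assumption for this sketch).
sample='proto=tcp/http  sent=144        rcvd=52 spkt=3
proto=tcp/https  sent=145        rcvd=52 spkt=3
proto=udp/dns  sent=144        rcvd=52 spkt=3'

# The attempt from the question: the match ends at the slash, so the output does too.
attempt=$(printf '%s\n' "$sample" | grep -o 'proto=[^/]*/')

# Keep only lines starting with proto=, then match just the text we want to see.
wanted=$(printf '%s\n' "$sample" | grep '^proto=' | grep -E -o '(tcp|udp)/[[:alnum:]]+')

printf 'attempt:\n%s\nwanted:\n%s\n' "$attempt" "$wanted"
```

The first command can never print more than proto=tcp/ or proto=udp/, because nothing after the slash is part of its pattern.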

With sed, removing everything before the first = and after the first blank character:

$ sed 's/^[^=]*=//; s/[[:blank:]].*//' file
tcp/http
tcp/https
udp/dns

To make sure that we only do this on lines that start with the string proto=, you could insert the same pre-processing step with grep as above, or you could use

sed -n '/^proto=/{ s/^[^=]*=//; s/[[:blank:]].*//; p; }' file

Here, we suppress the default output with the -n option, and then we trigger the substitutions and an explicit print of the line only if the line matches ^proto=.
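
To see what the ^proto= guard buys you, here is a small sketch with one stray line mixed in. The stray line and the variable names are made up for the example.

```shell
# Two proto= lines with a made-up stray line between them.
sample='proto=tcp/http  sent=144        rcvd=52 spkt=3
comment sent=99 rcvd=1
proto=udp/dns  sent=144        rcvd=52 spkt=3'

# Without the guard, the stray line is edited and printed as well.
unguarded=$(printf '%s\n' "$sample" | sed 's/^[^=]*=//; s/[[:blank:]].*//')

# With -n and the /^proto=/ address, only matching lines are edited and printed.
guarded=$(printf '%s\n' "$sample" | sed -n '/^proto=/{ s/^[^=]*=//; s/[[:blank:]].*//; p; }')

printf 'unguarded:\n%s\nguarded:\n%s\n' "$unguarded" "$guarded"
```

The unguarded command turns the stray line into 99, since that is what sits between its first = and the next blank.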

With awk, using the default field separator, and then splitting the first field on = and printing the second bit of it:

$ awk '{ split($1, a, "="); print a[2] }' file
tcp/http
tcp/https
udp/dns

To make sure that we only do this on lines that start with the string proto=, you could insert the same pre-processing step with grep as above, or you could use

awk '/^proto=/ { split($1, a, "="); print a[2] }' file

user000001 · Answer 2 · 2019-06-09T07:24:49.887

11

If you are on GNU grep (for the -P option), you could use:

$ grep -oP 'proto=\K[^ ]*' file
tcp/http
tcp/https
udp/dns

Here we match the proto= string, to make sure that we are extracting the correct column, but then we discard it from the output with \K, which resets the start of the reported match.

The above assumes that the columns are space-separated. If tabs are also a valid separator, you would use \S to match the non-whitespace characters, so the command would be:

grep -oP 'proto=\K\S*' file

If you also want to protect against matching fields where proto= is only a substring, such as thisisnotaproto=tcp/https, you can add a word boundary with \b, like so:

grep -oP '\bproto=\K\S*' file

answered Jun 08 '19 at 06:58 · edited Jun 09 '19 at 07:24 · user000001

You can improve that by writing just grep -oP 'proto=\K\S+'. The proto=tcp/http may be followed by a tab instead of spaces, and \S, unlike [^ ], will match any non-whitespace character. – mosvy Jun 08 '19 at 22:19
@mosvy: That's a good suggestion, thanks. – user000001 Jun 09 '19 at 07:25
Anyway, -o is a GNUism as well. -P is only supported by GNU grep if built with PCRE support (optional at build time). – Stéphane Chazelas Jun 09 '19 at 09:21
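
If neither -o nor -P is available, one possible fallback is to let awk walk the fields. This is only a sketch: it assumes whitespace-separated fields, and the sample line is invented to include a look-alike field.

```shell
# Invented sample line with a look-alike field before the real one.
sample='sent=144 thisisnotaproto=tcp/https proto=udp/dns rcvd=52'

# Print the value of each field that starts with proto=.
# "proto=" is 6 characters long, so the value begins at position 7.
value=$(printf '%s\n' "$sample" |
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^proto=/) print substr($i, 7) }')

printf '%s\n' "$value"
```

Anchoring the test with ^ on each field plays the same role as \b in the grep version: thisisnotaproto= is skipped.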

score 6 · Answer 3 · answered Jun 07 '19 at 22:50

Using awk:

awk '$1 ~ "proto" { sub(/proto=/, ""); print $1 }' input

$1 ~ "proto" will ensure we only take action on lines with proto in the first column

sub(/proto=/, "") will remove proto= from the input

print $1 prints the remaining column

$ awk '$1 ~ "proto" { sub(/proto=/, ""); print $1 }' input
tcp/http
tcp/https
udp/dns
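
One thing to be aware of: $1 ~ "proto" is an unanchored match, so it also fires when proto appears in the middle of the first field. The sketch below compares it with an anchored variant, using a made-up second line.

```shell
# The second line is made up: its first field only contains "proto=".
sample='proto=tcp/http  sent=144
notaproto=udp/dns  sent=144'

# Unanchored test, as in the answer.
loose=$(printf '%s\n' "$sample" | awk '$1 ~ "proto" { sub(/proto=/, ""); print $1 }')

# Anchored variant: the first field must start with proto=.
strict=$(printf '%s\n' "$sample" | awk '$1 ~ /^proto=/ { sub(/^proto=/, ""); print $1 }')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

With the unanchored test the second line is printed as notaudp/dns; the anchored variant skips it.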

bu5hman · Answer 4 · 2019-06-08T14:55:51.823

3

Code golfing on the grep solutions

grep -Po "..p/[^ ]+" file

or even

grep -Po "..p/\S+" file

answered Jun 08 '19 at 14:49 · edited Jun 08 '19 at 14:55 · bu5hman

score 2 · Answer 5 · edited Jun 08 '19 at 21:10

Using the cut command:

cut -b 7-15 foo.txt

answered Jun 08 '19 at 12:03 by Capeya · edited Jun 08 '19 at 21:10 by user000001

This will include trailing spaces on the http and dns lines. – G-Man Says 'Reinstate Monica' Jun 08 '19 at 18:06
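
The trailing spaces are easy to miss on a terminal, so this sketch wraps the results in brackets. It also shows cutting on delimiters instead of byte positions as one way around the problem; the variable names are only for illustration.

```shell
line='proto=udp/dns  sent=144        rcvd=52 spkt=3'

# Fixed byte positions: udp/dns is shorter than 9 bytes, so spaces are included.
fixed=$(printf '%s\n' "$line" | cut -b 7-15)

# Delimiter-based: first space-separated field, then the part after the "=".
bydelim=$(printf '%s\n' "$line" | cut -d' ' -f1 | cut -d'=' -f2)

printf '[%s]\n[%s]\n' "$fixed" "$bydelim"
```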

score 2 · Answer 6 · answered Jun 08 '19 at 17:04

Just another grep solution:

grep -o '[^=/]\+/[^ ]\+' file

And a similar one with sed printing only the matched captured group:

sed -n 's/.*=\([^/]\+\/[^ ]\+\).*/\1/p' file

answered Jun 08 '19 at 17:04 · Freddy

score 1 · Answer 7 · answered Jun 08 '19 at 16:39

Another awk approach:

$ awk -F'[= ]' '/=(tc|ud)p/{print $2}' file
tcp/http
tcp/https
udp/dns

That sets awk's field separator to either = or a space. Then, if the line contains a = followed by tc or ud and then a p, it prints the second field.

Another sed approach (not portable to all versions of sed, but works with GNU sed):

$ sed -En 's/^proto=(\S+).*/\1/p' file 
tcp/http
tcp/https
udp/dns

The -n means "don't print by default", and -E enables extended regular expressions, which give us + for "one or more" and unescaped parentheses for capturing. The \S for "non-whitespace" is a GNU extension, which is why this is not portable. Finally, the p flag at the end makes sed print a line only if the substitution succeeded, that is, if the pattern matched.

And, a perl one:

$ perl -nle '/^proto=(\S+)/ && print $1' file 
tcp/http
tcp/https
udp/dns

The -n means "read the input file line by line and apply the script given by -e to each line". The -l adds a newline to each print call (and removes the trailing newline from each input line). The script itself prints the longest stretch of non-whitespace characters that follows a proto= at the start of a line.

-E is getting more and more portable, but \S isn't. [^[:space:]] is a more portable equivalent. — Stéphane Chazelas, Jun 09 '19 at 09:24
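
Following that suggestion, a sed variant restricted to POSIX basic regular expressions might look like the sketch below. The sample lines are assumptions; the first one uses a tab after the value to show that [[:space:]] covers it.

```shell
# printf expands \t to a tab and \n to a newline in the format string.
sample=$(printf 'proto=tcp/http\tsent=144\nproto=udp/dns  sent=144\n')

# \{1,\} is the basic-regex spelling of "one or more";
# [^[:space:]] stands in for the GNU-only \S.
values=$(printf '%s\n' "$sample" | sed -n 's/^proto=\([^[:space:]]\{1,\}\).*/\1/p')

printf '%s\n' "$values"
```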

score 1 · Accepted Answer · answered Jun 09 '19 at 14:02

Assuming this is related to your previous question, you're going down the wrong track. Rather than piecing together bits of scripts that will kinda/sorta do what you want most of the time, and needing a completely different script every time you want to do anything slightly different, just create one script that parses your input file into an array (f[] below) mapping your field names (tags) to their values. Then you can do whatever you want with the result. For example, given this input file from your previous question:

$ cat file
Feb             3       0:18:51 17.1.1.1                      id=firewall     sn=qasasdasd "time=""2018-02-03"     22:47:55        "UTC""" fw=111.111.111.111       pri=6    c=2644        m=88    "msg=""Connection"      "Opened"""      app=2   n=2437       src=12.1.1.11:49894:X0       dst=4.2.2.2:53:X1       dstMac=42:16:1b:af:8e:e1        proto=udp/dns   sent=83 "rule=""5"      "(LAN->WAN)"""

we can write an awk script that creates an array of the values indexed by their names/tags:

$ cat tst.awk
{
    f["hdDate"] = $1 " " $2
    f["hdTime"] = $3
    f["hdIp"]   = $4
    sub(/^([^[:space:]]+[[:space:]]+){4}/,"")

    while ( match($0,/[^[:space:]]+="?/) ) {
        if ( tag != "" ) {
            val = substr($0,1,RSTART-1)
            gsub(/^[[:space:]]+|("")?[[:space:]]*$/,"",val)
            f[tag] = val
        }

        tag = substr($0,RSTART,RLENGTH-1)
        gsub(/^"|="?$/,"",tag)

        $0 = substr($0,RSTART+RLENGTH)
    }

    val = $0
    gsub(/^[[:space:]]+|("")?[[:space:]]*$/,"",val)
    f[tag] = val
}

Given that, you can do whatever you like with your data just by referencing it by field name, e.g. using GNU awk's -e option for ease of mixing a script in a file with a command-line script:

$ awk -f tst.awk -e '{for (tag in f) printf "f[%s]=%s\n", tag, f[tag]}' file
f[fw]=111.111.111.111
f[dst]=4.2.2.2:53:X1
f[sn]=qasasdasd
f[hdTime]=0:18:51
f[sent]=83
f[m]=88
f[hdDate]=Feb 3
f[n]=2437
f[app]=2
f[hdIp]=17.1.1.1
f[src]=12.1.1.11:49894:X0
f[c]=2644
f[dstMac]=42:16:1b:af:8e:e1
f[msg]="Connection"      "Opened"
f[rule]="5"      "(LAN->WAN)"
f[proto]=udp/dns
f[id]=firewall
f[time]="2018-02-03"     22:47:55        "UTC"
f[pri]=6

$ awk -f tst.awk -e '{print f["proto"]}' file
udp/dns

$ awk -f tst.awk -e 'f["proto"] ~ /udp/ {print f["sent"], f["src"]}' file
83 12.1.1.11:49894:X0
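
For the much simpler lines in this question, the same name-to-value idea can be sketched in a few lines. This is a deliberately simplified version that assumes every field is an unquoted name=value pair; it does not handle the quoted, multi-word values that tst.awk deals with.

```shell
line='proto=udp/dns  sent=144        rcvd=52 spkt=3'

# Simplified sketch: build f[name] = value from unquoted name=value fields.
result=$(printf '%s\n' "$line" | awk '{
    split("", f)                        # clear the array for each input line
    for (i = 1; i <= NF; i++) {
        eq = index($i, "=")             # position of the first "=" in the field
        if (eq > 1)
            f[substr($i, 1, eq - 1)] = substr($i, eq + 1)
    }
    if (f["proto"] ~ /^udp\//)          # query the data by field name
        print f["proto"], f["sent"]
}')

printf '%s\n' "$result"
```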

@OrangeDog why do you think that? I'd actually like to see the equivalent in perl if you wouldn't mind posting such an answer. Perl definitely won't be easier to use if I don't have it on my box and can't install it, though, which is something I've frequently had to deal with over the years. Awk on the other hand is a mandatory utility and so is always present on UNIX installations, just like sed, grep, sort, etc. — Ed Morton, Jun 10 '19 at 15:17
@EdMorton true, though I have never personally encountered a distribution where perl was not included by default. Complex awk and sed scripts are usually simpler in perl because it's essentially a superset of them, with additional features for common tasks. — OrangeDog, Jun 10 '19 at 15:23
@OrangeDog no-one should ever write a sed script that's more complicated than s/old/new/g and sed is not awk so lets set that aside. I utterly disagree that complex awk scripts are simpler in perl. They can be briefer of course but brevity isn't a desirable attribute of software, conciseness is, and it's extremely rare for them to have any real benefit plus they are usually far more difficult to read which is why people post things like https://www.zoitz.com/archives/13 about perl and refer to it as a write-only language, unlike awk. I would still like to see a perl equivalent to this though — Ed Morton, Jun 10 '19 at 15:50

mkzia · Answer 9 · 2019-06-08T17:09:24.917

0

Here is another, quite easy solution:

grep -o "[tc,ud]*p\\/.*  "   INPUTFile.txt  |   awk '{print $1}'

answered Jun 08 '19 at 13:05 · edited Jun 08 '19 at 17:09 · mkzia

Your grep doesn't match anything. [tc,ud]\*\\/.* looks for one occurrence of either t, or c, or , or u or d, followed by a literal * character, then a p and a backslash. You probably meant grep -Eo '(tc|ud)p/.* ' file | awk '{print $1}'. But then, if you're using awk, you may as well do the whole thing in awk: awk -F'[= ]' '/(tc|ud)p/{print $2}' file. – terdon Jun 08 '19 at 16:37
Someone modified my original, there was an extra Backslash before star, which I just removed Sir. – mkzia Jun 08 '19 at 17:10
Thanks for editing, but I'm afraid that only works by chance. As I explained before, [tc,ud]p means "one of t, c, ,, u or d followed by a p. So it matches here only because tcp has cp and udp has dp. But it would also match ,p or tp etc. Also, now that you have the *, it will match ppp as well (the * means "0 or more" so it will match even when it doesn't match). You don't want a character class ([ ]), what you want is a group: (tc|ud) (use with the -E flag of grep). Also, the .* makes it match the entire line. – terdon Jun 08 '19 at 17:35
@terdon: (1) No, actually it won’t match ppp. Of course you’re right that it will match ,p or tp — or uucp, ttp, cutp, ductp or d,up. – G-Man Says 'Reinstate Monica' Jun 08 '19 at 20:34
@G-Man It will also match ppp, try echo ppp | grep -o "[tc,ud]*p". And no, pointing out errors in my (or anyone else's) comments is absolutely fine! But the general issue you raised should be discussed on meta. Or, at least, by leaving comments under my answer, so we don't bombard poor mkzia with notifications about stuff not related to this answer. I moved the comments to chat, by the way. – terdon Jun 08 '19 at 22:21
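
The difference between the bracket expression and the group discussed in these comments can be checked by counting matching lines. A small sketch, with two test lines chosen for the purpose:

```shell
# A bracket expression is a set of single characters, and * allows zero of them,
# so this pattern matches any line that contains a p.
class_hits=$(printf 'ppp\ntcp\n' | grep -c '[tc,ud]*p')

# A group with alternation requires the literal prefix tc or ud before the p.
group_hits=$(printf 'ppp\ntcp\n' | grep -E -c '(tc|ud)p')

printf 'bracket expression: %s matching lines\ngroup: %s matching lines\n' "$class_hits" "$group_hits"
```

Both lines match the bracket-expression pattern, while only tcp matches the group.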

score 0 · Answer 10 · edited Jun 12 '19 at 07:27

awk '{print $1}' filename|awk -F "=" '{print $NF}'

answered Jun 09 '19 at 12:51 by Praveen Kumar BS · edited Jun 12 '19 at 07:27 by Philippos

score 0 · Answer 11 · edited Jun 12 '19 at 07:28

$ cut -f1 -d' ' file | cut -f2 -d'='
tcp/http
tcp/https
udp/dns

cut options:

-f - the field(s) to print
-d - the field delimiter

answered Jun 12 '19 at 06:45 by Dr. Alexander · edited Jun 12 '19 at 07:28 by Philippos
