I am trying to extract a value from a long string that may change over time. For example, the string could look something like this:

....../filename-1.9.0.3.tar.gz"<....

What I want to extract is the value between filename- and .tar.gz, that is, the file version (1.9.0.3 in this case). I need to do it this way because when I run the command later, the value may be 1.9.0.6, 2.0.0.2, or something entirely different.

How can I do this? I'm currently only using grep, but I wouldn't mind using other utilities such as sed, awk, or cut. To be perfectly clear, I need to extract only the file version: the string is very long on both sides, so everything else needs to be cut out.

Cestarian

2 Answers

With grep -P/pcregrep, using a positive look-behind and a positive look-ahead:

grep -P -o '(?<=STRING1).*?(?=STRING2)' infile

In your case, replace STRING1 with filename-, STRING2 with \.tar\.gz, and infile with the name of the file you are searching.
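
For instance, here is a minimal sketch of that substitution. The sample line is made up (only the filename-1.9.0.3.tar.gz part comes from the question), and it assumes a grep built with PCRE support, such as GNU grep:

```shell
# Hypothetical sample line standing in for the long string from the question.
line='<a href="http://example.com/pub/filename-1.9.0.3.tar.gz">download</a>'

# -P enables Perl-compatible regexes; -o prints only the matched text.
# The look-behind and look-ahead must be present but are not part of the match.
version=$(printf '%s\n' "$line" | grep -P -o '(?<=filename-).*?(?=\.tar\.gz)')

printf '%s\n' "$version"
```

When reading from a file instead of a pipe, the file name goes where infile is in the command above.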


If you don't have access to pcregrep and/or if your grep doesn't support -P you can do this with your favourite text processing tool. Here's a portable way with ed that gives you the same output:

ed -s infile <<\IN
g/STRING1/s//\
&/g
v/STRING1.*STRING2/d
,s/STRING1//
,s/STRING2.*//
,p
IN

How it works: a newline is prepended to each STRING1 occurrence, so there is now at most one occurrence per line. Then all lines not matching STRING1.*STRING2 are deleted. On the remaining lines we keep only what is between STRING1 and STRING2, and print the result.

don_crissti
  • If I am grepping a file, where do I put the name of that file? – Cestarian Mar 01 '16 at 22:55
  • Or grep -P -o 'filename-\K.*?(?=\.tar\.gz)' (with recent enough versions of PCRE). .*? would be better than .* if there may be more than one .tar.gz per line. – Stéphane Chazelas Mar 01 '16 at 22:57
  • This works perfectly for me, marking as answer. One final thing: how do I make it display only one of the results? Right now I am getting two results because there are two instances of this string. Can I do that with just grep, or will I need another program (like head -1)? – Cestarian Mar 01 '16 at 23:01
  • @StéphaneChazelas - thanks! @Cestarian - use grep -P -om1 ... to stop at first match – don_crissti Mar 01 '16 at 23:03
  • Note that grep -Pom1 prints all the matches on the first matching line: echo abc | grep -Pom1 . prints a, b and c on separate lines. pcregrep doesn't support -m, but supports pcregrep -o1 'filename-(.*?)\.tar\.gz' and pcregrep -Mo1 '(?s)filename-(.*?)\.tar\.gz.*' – Stéphane Chazelas Mar 01 '16 at 23:21
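
To illustrate the point made in these comments: -m1 stops after the first matching line, not the first match, so piping through head is one way to keep a single result. This is only a sketch on a made-up line containing the string twice, and it assumes GNU grep with -P support:

```shell
# Hypothetical input: the same file name occurs twice on one line.
line='<a href="/pub/filename-1.9.0.3.tar.gz">filename-1.9.0.3.tar.gz</a>'

# -m1 limits the number of matching lines, so both matches are still printed.
all=$(printf '%s\n' "$line" | grep -P -o -m1 'filename-\K.*?(?=\.tar\.gz)')

# head -n 1 keeps only the first printed match.
first=$(printf '%s\n' "$line" | grep -P -o 'filename-\K.*?(?=\.tar\.gz)' | head -n 1)

printf 'with -m1:\n%s\n' "$all"
printf 'with head:\n%s\n' "$first"
```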

For the benefit of people without grep -P, you can do this with sed or awk on any POSIX system.

sed -n -e '/\/filename-[^\/]*\.tar\.gz/{' -e 's/^.*\/filename-\([^\/]*\)\.tar\.gz.*$/\1/p' -e q -e '}'

Explanation: turn off default printing, find a line containing the desired pattern and substitute everything away except the part you want to keep, print the result of the substitution, and exit if there was a match. Note that if there are multiple matches on the first matching line, this picks up the last one.
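
The last-match behaviour can be checked on a made-up line holding two versions. The input below is purely illustrative, and the sketch uses only an address, s and q, grouped in braces:

```shell
# Hypothetical input: two filename-*.tar.gz occurrences on the same line.
line='<a href="/old/filename-1.9.0.3.tar.gz">old</a> <a href="/new/filename-2.0.0.2.tar.gz">new</a>'

# The leading ^.* is greedy, so it swallows everything up to the LAST
# "/filename-" that is still followed by ".tar.gz".
version=$(printf '%s\n' "$line" |
  sed -n '/\/filename-[^\/]*\.tar\.gz/{
    s/^.*\/filename-\([^\/]*\)\.tar\.gz.*$/\1/p
    q
  }')

printf '%s\n' "$version"
```

Here the printed version is 2.0.0.2, not 1.9.0.3.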

With awk (picking the first match on the line):

awk 'match($0, /filename-[^\/]*\.tar\.gz/) {
    # substr(string, start, length): skip "filename-" (9 chars), drop ".tar.gz" (7 chars)
    print substr($0, RSTART + 9, RLENGTH - 9 - 7);
    exit;
}'
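
As a quick check of the offsets: filename- is 9 characters and .tar.gz is 7, so the version starts at RSTART + 9 and is RLENGTH - 16 characters long. A sketch on a made-up input line (only the file name itself comes from the question):

```shell
# Hypothetical sample line standing in for the long string.
line='<a href="/pub/filename-1.9.0.3.tar.gz">download</a>'

version=$(printf '%s\n' "$line" | awk 'match($0, /filename-[^\/]*\.tar\.gz/) {
    # match() sets RSTART (start of match) and RLENGTH (length of match).
    print substr($0, RSTART + 9, RLENGTH - 9 - 7)
    exit
}')

printf '%s\n' "$version"
```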