Non-greedy matching in sed

Question

In a bash script, I have the following variable:

file_name='this_is_the_hart_part.csv'

Using

var2=$(echo $file_name | sed -e 's/_{2}\(.*\)_{3}/\1/')

I want to extract the substring "the" (between underscores number 2 and 3 in variable $file_name).

But I get back $var2 equal to $file_name. How do I have to change my sed command?

Kusalananda · Answer 1 · 2019-01-24T11:13:17.910

The types of regular expressions supported by sed do not allow for non-greedy matching with *.
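To make the greedy behaviour concrete, here is a small sketch using the file name from the question. The square brackets in the replacement are only an illustration to make the captured text visible. Note also that in the basic regular expressions sed uses by default, an interval is written \{2\} and means "two consecutive underscores", not "the second underscore", which is why the command in the question matches nothing and returns the input unchanged.

```shell
file_name='this_is_the_hart_part.csv'

# .* is greedy: the match runs from the first underscore to the LAST one,
# so the captured group spans several fields instead of one.
greedy=$(printf '%s\n' "$file_name" | sed -e 's/_\(.*\)_/[\1]/')
printf '%s\n' "$greedy"
# prints: this[is_the_hart]part.csv
```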

You want to get the 3rd _-delimited field. This is most easily done with cut:

cut -d '_' -f 3

Or, with awk:

awk -F '_' '{ print $3 }'
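As a sketch of how either command might be wired up to the variable from the question (the names with_cut and with_awk are made up for this example), feeding the value through printf with a quoted expansion:

```shell
file_name='this_is_the_hart_part.csv'

# Pass the value on standard input, then pick the third _-delimited field.
with_cut=$(printf '%s\n' "$file_name" | cut -d '_' -f 3)
with_awk=$(printf '%s\n' "$file_name" | awk -F '_' '{ print $3 }')

printf 'cut: %s\nawk: %s\n' "$with_cut" "$with_awk"
```

Both variables should end up holding the word the for this input.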

Or, in the shell, by removing the first two such fields in succession, and then trimming the end:

str=${file_name#*_}
str=${str#*_}
str=${str%%_*}

"$str" would be the word the at the end. Using this last variation would likely be the fastest and most robust way out of these three.

The variable substitution ${variable#*_} would result in a string that is $variable with the leading bit up to and including the first underscore removed. The ${variable%%_*} would remove everything from the first underscore to the end of $variable. These are standard variable substitutions.
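The three substitutions can be traced step by step. This is just the code above with the intermediate values kept in the illustrative variables step1 and step2:

```shell
file_name='this_is_the_hart_part.csv'

str=${file_name#*_}   # drop the shortest prefix ending in "_"  -> is_the_hart_part.csv
step1=$str
str=${str#*_}         # drop the next such prefix               -> the_hart_part.csv
step2=$str
str=${str%%_*}        # drop from the first remaining "_" on    -> the

printf '%s\n' "$step1" "$step2" "$str"
```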

The benefit of using the variable substitution on a filename is that it would cope with filenames containing newlines, which none of awk, sed or cut would do. In general, don't use line-oriented text editing tools on filenames.

In addition, you are using echo $file_name. Since $file_name is unquoted, it would undergo word-splitting (on every character that is also part of $IFS; a space, tab and newline by default) and the generated words, if they contain filename globbing characters, would be matched against filenames in the current directory by the shell. Backslashes in the filename may also disappear or have unwanted effects (even if you quote the expansion). The ksh shell would also do brace expansion on the value of $file_name when it's unquoted.
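A minimal sketch of the word-splitting effect, assuming a Bourne-like shell such as bash or dash with the default IFS (zsh does not split unquoted expansions by default). The file name is hypothetical and contains two consecutive spaces:

```shell
file_name='two  spaces_here.csv'   # hypothetical name with two consecutive spaces

# Unquoted: the value is split on the spaces and echo rejoins the words
# with a single space, so one of the spaces is lost.
unquoted=$(echo $file_name)

# Quoted and passed through printf: the value arrives intact.
quoted=$(printf '%s\n' "$file_name")

printf '<%s>\n<%s>\n' "$unquoted" "$quoted"
```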

Note that it's not every whitespace or characters of $IFS, it's characters of $IFS only (and IFS by default contains space, TAB and NL which is not every whitespace). (Also note that ksh also does brace expansion upon variable expansion). — Stéphane Chazelas, Jan 24 '19 at 11:09
@StéphaneChazelas Thanks for being pedantic :-) I'll fix it up. — Kusalananda, Jan 24 '19 at 11:11

Stéphane Chazelas · Answer 2 · 2019-01-24T10:51:19.487

First note that sed is a text utility that works by default on one line at a time, while filenames can contain any character (including newline) and even non-characters (they can be non-text).

Also, leaving a variable unquoted has a very special meaning; you almost never want to do that, and it is potentially very dangerous.

Also, you can't use echo to output arbitrary data, use printf instead.

Also, variable assignment syntax in Bourne-like shells is: var=value, not $var=value.
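To illustrate the point about echo, here is a sketch with a hypothetical file name that happens to look like an option. What echo prints for it differs between shells, so only the printf result can be relied on:

```shell
file_name='-n'   # hypothetical, but a perfectly legal file name

# printf with a %s format prints the value verbatim, whatever it contains.
safe=$(printf '%s\n' "$file_name")

# echo may treat the value as an option instead; the result varies by shell.
risky=$(echo "$file_name")

printf 'printf gave <%s>, echo gave <%s>\n' "$safe" "$risky"
```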

You can load the whole output of echo (or better, printf) into sed's pattern space with:

printf '%s\n' "$filename" | sed -e :1 -e '$!{N;b1' -e '}'

Then, you can add the code to extract the part between the second and third _:

var2=$(
  printf '%s\n' "$filename" |
   sed -ne :1 -e '$!{N;b1' -e '}' -e 's/^\([^_]*_\)\{2\}\([^_]*\)_.*/\2/p'
)

The non-greedy part is addressed by using [^_]* (a sequence of non-_ characters) which, contrary to .* guarantees we don't match past _ boundaries (though it would still choke on non-characters in many implementations).
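As a sketch of what loading everything into the pattern space buys you, the same command can be run on a hypothetical file name whose third field contains a newline. This assumes a sed in which [^_] also matches a newline inside the pattern space, as GNU sed does:

```shell
# Hypothetical file name whose third field contains a newline.
filename='one_two_thr
ee_four'

var2=$(
  printf '%s\n' "$filename" |
    sed -ne :1 -e '$!{N;b1' -e '}' -e 's/^\([^_]*_\)\{2\}\([^_]*\)_.*/\2/p'
)
printf '<%s>\n' "$var2"
```

Here var2 should hold the third field with its embedded newline preserved. A line-by-line sed would instead see two separate inputs.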

In this case here, you could use shell parameter expansion operators instead:

case $filename in
  (*_*_*_*) var2=${filename#*_*_}; var2=${var2%%_*};;
  (*)       var2=;;
esac

Which would work better if the filename is not text or if the part you want to extract ends in a newline character (and would also be more efficient).
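The case statement can be wrapped in a function to see both branches at work. The name third_field and the sample inputs are made up for this sketch:

```shell
# third_field is a hypothetical helper name, not part of the answer above.
third_field() {
  case $1 in
    (*_*_*_*) var2=${1#*_*_}; var2=${var2%%_*};;   # at least three underscores
    (*)       var2=;;                              # otherwise: empty result
  esac
  printf '%s\n' "$var2"
}

found=$(third_field 'this_is_the_hart_part.csv')
missing=$(third_field 'only_two_fields')

printf 'found=<%s> missing=<%s>\n' "$found" "$missing"
```

Note that the guard requires a third underscore, so a name with only two underscores yields an empty result.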

Some shells like zsh or ksh93 have more advanced operators:

zsh:

split on _ and get third field:
```
var2=${"${(@s:_:)filename}"[3]}
```
Using the ${var/pattern/replacement} operator and back-references (in that case, you want to verify first that the variable contains at least 3 underscores or there won't be any substitution):
```
set -o extendedglob
var2=${filename/(#b)*_*_(*)_*/$match[1]}
```
ksh93:
```
var2=${filename/*_*_@(*)_*/\1}
```

Why not use a here string to load the filename into sed? Like sed -e ... <<<"$filename" — L. Levrel, Jun 07 '21 at 10:33
You can as well, but that doesn't bring any advantage over printf ... | sed and is less portable (<<< is a zsh feature, which has not been copied by all other shells yet; whether bash will use a tempfile or a pipe for it depends on the version and the size of the data being fed) — Stéphane Chazelas, Jun 07 '21 at 11:52

pLumo · Accepted Answer · 2019-01-24T11:53:59.503

@Kusalananda is right that sed is the wrong tool and that it cannot do non-greedy matching. But there is a workaround: [^_]* matches any sequence of characters that are not _, so it cannot run past the next underscore.

So in your case you could do something like this:

printf '%s\n' "$file_name" | sed -e 's/^[^_]*_[^_]*_\([^_]*\).*$/\1/g'
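One thing to keep in mind with this form: since it uses neither -n nor the p flag, input that does not match the pattern is printed unchanged. A small sketch, where extract is a hypothetical wrapper around the command above:

```shell
# extract is a hypothetical wrapper name used only for this illustration.
extract() {
  printf '%s\n' "$1" | sed -e 's/^[^_]*_[^_]*_\([^_]*\).*$/\1/g'
}

ok=$(extract 'this_is_the_hart_part.csv')   # two or more underscores: field 3
unchanged=$(extract 'no_match')             # one underscore: no substitution

printf 'ok=<%s> unchanged=<%s>\n' "$ok" "$unchanged"
```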

But ... for your use case, you would be better off using other tools ...

edited Jan 24 '19 at 11:53

answered Jan 24 '19 at 10:32

You still need to use printf '%s\n' "$file_name" to be sure that the filename is not split by the shell or filename globbed, or get backslashes messed up. At least double quote the expansion... See e.g. Stéphane's answer – Kusalananda Jan 24 '19 at 10:40
