Looking for a guide to optimising regexp matches in bash.

I have a script that loops over a very long list of URLs looking for patterns. Currently it looks a little like the fragment below. Is there a guide to optimising these kinds of matches?

if [[ ${url} == */oai/request ]]
then
    echo first option
elif [[ ${url} =~ .*/index.php/[^/]+/journal=.* ]]
then
    echo second option
elif [[ ${url} =~ .*/[Ee][Tt][dD]-[Dd][Bb]/.* ]]
then
    echo third option
elif [[ ${url} =~ .*/handle/[0-9]+/[0-9].* || ${url} =~ .*/browse.* ]]
then
    echo fourth option
else
    echo no-match option
fi
  • I assume you'd like to speed up this pattern matching. Bash isn't really suitable for processing large amounts of text; a better (and faster) tool for this is awk, which has much better regex capabilities. BTW, "glob" and "regex" aren't synonyms and it can be confusing to talk as if they were. Your first test uses a glob, the remainder use regexes. – PM 2Ring Jan 05 '15 at 09:47
  • You can remove .* from the beginning and end of each regex. – choroba Jan 05 '15 at 10:28
  • Well spotted, @choroba. I guess I should've mentioned that, but I don't like to encourage people to use bash for stuff that it's poorly suited to do. That's my excuse, anyway. :) – PM 2Ring Jan 05 '15 at 10:34
  • Perhaps use case statements, if you don't mind using extended globs instead. Globs might be faster than regexes: http://stackoverflow.com/a/4555979/2072269 – muru Jan 05 '15 at 11:16
  • Use the right tool for the task. A shell is not a text-processing utility; it's a utility for invoking commands. See "Why is using a shell loop to process text considered bad practice?". Here, use perl/python/ruby/awk. – Stéphane Chazelas Oct 27 '15 at 15:40
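
Following up on the comments from choroba and muru: since `=~` is unanchored, the leading and trailing `.*` add nothing, and the whole chain can also be written as one `case` statement with extended globs. The sketch below is an illustration, not a benchmarked answer. `classify_url` is a made-up helper name, the extglob patterns are a hand translation of the regexes in the question, and whether globs actually beat `=~` on a given list of URLs is something to measure rather than assume.

```shell
#!/usr/bin/env bash
# extglob must be switched on before bash parses the patterns below.
shopt -s extglob

# classify_url is a hypothetical helper, not part of the original script.
classify_url() {
    local url=$1
    case $url in
        */oai/request)
            echo "first option" ;;
        # +([!/]) is the extglob counterpart of the regex [^/]+
        */index.php/+([!/])/journal=*)
            echo "second option" ;;
        */[Ee][Tt][dD]-[Dd][Bb]/*)
            echo "third option" ;;
        # +([0-9]) is the extglob counterpart of the regex [0-9]+
        */handle/+([0-9])/[0-9]*|*/browse*)
            echo "fourth option" ;;
        *)
            echo "no-match option" ;;
    esac
}

classify_url "http://example.org/oai/request"   # prints: first option
```

One difference from the regex version: the `.` in `index.php` is a literal dot in a glob, whereas in the original regex it matches any character.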

1 Answer

As pointed out in comments, something like awk may be better suited for this than trying to do it in the shell:

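# The $ anchors the first rule at the end of the line, like the glob */oai/request.
# [^\/] rather than [^/]: some older awks end the regex at any unescaped slash.
/\/oai\/request$/                       { print "first option" ; next   }
/\/index\.php\/[^\/]+\/journal=/        { print "second option"; next   }
/\/[Ee][Tt][dD]-[Dd][Bb]\//             { print "third option" ; next   }
/\/handle\/[0-9]+\/[0-9]/ || /\/browse/ { print "fourth option"; next   }
                                        { print "no-match option"       }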

Then:

$ awk -f script.awk inputfile

where inputfile is a file containing URLs, one per line (for example).
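
If a separate script file is inconvenient, the rules can also be passed inline and fed from a pipe. This is a minimal sketch, assuming a POSIX awk is installed; the function name `classify_urls` and the example.org URLs are invented for illustration.

```shell
#!/bin/sh
# classify_urls is a hypothetical wrapper: it reads URLs (one per line)
# from the files given as arguments, or from standard input if there are none.
classify_urls() {
    awk '
        # $ anchors the match at end of line, mirroring the glob */oai/request
        /\/oai\/request$/                       { print "first option";  next }
        /\/index\.php\/[^\/]+\/journal=/        { print "second option"; next }
        /\/[Ee][Tt][dD]-[Dd][Bb]\//             { print "third option";  next }
        /\/handle\/[0-9]+\/[0-9]/ || /\/browse/ { print "fourth option"; next }
                                                { print "no-match option" }
    ' "$@"
}

# Two made-up URLs, one per line, piped through the classifier.
printf '%s\n' \
    'http://example.org/oai/request' \
    'http://example.org/about' |
    classify_urls
# prints:
#   first option
#   no-match option
```

The whole list is handled by a single awk process, so the shell never loops over the URLs itself.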

Related: Why is using a shell loop to process text considered bad practice?

Kusalananda