Regular expression that ignores certain characters

Question

I need a regular expression that ignores certain characters, for use with the bib2bib tool. For example, I need to find every occurrence of the word "muller", but strings like ''Hello, my name is Michael M\"uller, how are you?'' or ''There is M\"{u}ller'' should be found as well.

Edit: I need this to work not only for "muller" but dynamically for every word.

Can you give an example of how you want to use the bib2bib tool? — daniel kullmann, Mar 16 '15 at 21:46
If none of the answers presented so far meet your needs, it might help if you gave a more thorough explanation and more comprehensive examples of what you want. For example, would you want to match "muLLer"? (If you want case-insensitive matching across the board, say so.) How about "Mull"er"? If the search string is "rene", would you want to match "Ren'e"? — Scott - Слава Україні, Mar 16 '15 at 23:51

daniel kullmann · Accepted Answer · 2024-02-02T08:52:51.027

3

If you want to remove things like \" and \"{ and }, you will have to preprocess your input file with a tool like sed before feeding it into bib2bib.

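Example (uses the GNU sed \| alternation; note that it deletes every }, including the braces that delimit BibTeX fields):

 sed -e 's/\\"{\|\\"\|}//g' input.bib > input.bib.preprocessed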

Or to specifically convert things like \"{u} into u:

 sed -e 's/\\"{\(.\)}/\1/g' -e 's/\\"//g' input.bib > input.bib.preprocessed
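A quick way to see what this preprocessing does is to run it on the two strings from the question and then search the result. This is only a sketch: the sample text is taken from the question, grep stands in for bib2bib's own matching, and the g flag is used so that every occurrence on a line is rewritten.

```shell
# Sample lines taken from the question.
sample='Hello, my name is Michael M\"uller, how are you?
There is M\"{u}ller'

# Turn \"{u} into u, then drop any remaining \" sequences.
cleaned=$(printf '%s\n' "$sample" |
    sed -e 's/\\"{\(.\)}/\1/g' -e 's/\\"//g')
printf '%s\n' "$cleaned"
# prints:
#   Hello, my name is Michael Muller, how are you?
#   There is Muller

# Count the lines that now match a plain, case-insensitive search.
matches=$(printf '%s\n' "$cleaned" | grep -ci 'muller')
printf 'matching lines: %s\n' "$matches"
```

Note that the match is made against the preprocessed text, so what gets printed is Muller, not the original M\"uller.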

edited Feb 02 '24 at 08:52

answered Mar 16 '15 at 21:55


But doesn't the preprocessing actually prohibit the match - as the characters which the asker wishes to match have actually been removed? – mikeserv Mar 16 '15 at 23:21
@mikeserv I thought he wanted to ignore the \" and so on to be able to match e.g. muller with M\"uller. – daniel kullmann Mar 17 '15 at 07:39
But how does it match it, though? I mean if you do : printf %s\\n 'M\"uller"' | sed 's/[\"]//g' | grep -i muller you don't match M\"uller" w/ grep but Muller - and that's what it prints. It doesn't print M\"uller". – mikeserv Mar 17 '15 at 07:42

mikeserv · Answer 2 · 2015-03-16T23:20:00.473

A fully portable solution could look like:

n='
';printf %s\\n muller wright DuMmY >/tmp/patterns
tr '[:lower:][:upper:]' '[:upper:][:lower:]' </tmp/patterns |
paste '-d\n\n' /tmp/patterns - |
sed "N;s/./\\$n&/;:ul$n s/\(\n\)\(.\)\(.*\n\)\(.\)/\2\4\1\3/;tul"'
       s/\n//g;s/../[{}\\"]*[&]/g'

The output from that last sed looks like:

[{}\"]*[mM][{}\"]*[uU][{}\"]*[lL][{}\"]*[lL][{}\"]*[eE][{}\"]*[rR]
[{}\"]*[wW][{}\"]*[rR][{}\"]*[iI][{}\"]*[gG][{}\"]*[hH][{}\"]*[tT]
[{}\"]*[Dd][{}\"]*[uU][{}\"]*[Mm][{}\"]*[mM][{}\"]*[Yy]

This depends on /tmp/patterns containing only alphanumeric characters. If it contained, for example, either of [ or ], further testing would be needed to ensure that the square brackets were placed correctly within their respective bracket expressions.

In any case, based on the example in question:

[{}\"]*[mM][{}\"]*[uU][{}\"]*[lL][{}\"]*[lL][{}\"]*[eE][{}\"]*[rR]

...is a regexp that will match a line containing any of muller or Muller or M"ulL\\\{"er.
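As a sanity check, the pattern can be tried against a few sample lines. This is a sketch that uses grep as a stand-in; whether bib2bib's own regexp dialect treats bracket expressions the same way is an assumption to verify against its documentation.

```shell
# The pattern from above; inside a bracket expression the backslash is literal.
re='[{}\"]*[mM][{}\"]*[uU][{}\"]*[lL][{}\"]*[lL][{}\"]*[eE][{}\"]*[rR]'

# The first three sample lines should match; "miller" should not.
hits=$(printf '%s\n' 'muller' 'Michael M\"uller' 'There is M\"{u}ller' 'miller' |
    grep -c -e "$re")
printf 'matching lines: %s\n' "$hits"
# prints: matching lines: 3
```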

With GNU sed you can handle the case conversions within sed itself, so:

sed -E 's/([[:upper:]]?)([[:lower:]]?)/\1\L\1\2\U\2/g' /tmp/patterns

...prints...

mMuUlLlLeErR
wWrRiIgGhHtT
DduUMmmMYy

...fully fleshed out, you can get the same behavior as the previous tr|paste|sed combination (except that, this way, the aforementioned square-bracket problem is handled correctly) with just GNU sed like:

sed -E '
    s/([[:lower:]]?)([[:upper:]]?)/\1\U\1\2\L\2/g
    s/[[:alpha:]]{2}|./[{}\\"]*[&]/g
' </tmp/patterns
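To build a pattern for an arbitrary word on demand, the same two substitutions can be wrapped in a small shell function. This is a sketch only: the function name ignore_markup_re is made up for illustration, and it needs GNU sed because of \U and \L.

```shell
# Hypothetical helper: print a markup-tolerant pattern for one word (GNU sed).
ignore_markup_re() {
    printf '%s\n' "$1" | sed -E '
        s/([[:lower:]]?)([[:upper:]]?)/\1\U\1\2\L\2/g
        s/[[:alpha:]]{2}|./[{}\\"]*[&]/g
    '
}

re=$(ignore_markup_re rene)
printf '%s\n' "$re"
# prints: [{}\"]*[rR][{}\"]*[eE][{}\"]*[nN][{}\"]*[eE]
```

The result can then be handed to whatever tool does the search, for example grep -e "$re".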

Janis · Answer 3 · 2015-03-16T21:51:03.530

0

You haven't said what form your data is in. To remove lines containing the posted patterns you can use grep:

grep -v -E '(muller|M\\"uller|M\\"\{u}ller)'

(Note that the \ must itself be escaped with another \, and the { is escaped so that it is not read as the start of an interval expression.) To match the inverse, that is, lines containing the given patterns, omit the -v.

To define the regular expressions in a file use grep's option -f, as in:

grep -v -E -f file-with-regexps

It expects one regexp per line in that file.
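A minimal sketch of the -f variant, using a temporary file in place of file-with-regexps and three made-up input lines:

```shell
# Temporary stand-in for file-with-regexps: one extended regexp per line.
regexps=$(mktemp)
printf '%s\n' 'muller' 'M\\"uller' 'M\\"\{u}ller' >"$regexps"

input='Michael M\"uller
There is M\"{u}ller
John Smith'

# Lines that match any of the regexps.
kept=$(printf '%s\n' "$input" | grep -E -f "$regexps")
# Lines that match none of them (the -v form).
rest=$(printf '%s\n' "$input" | grep -v -E -f "$regexps")

printf 'matched:\n%s\n' "$kept"
printf 'not matched:\n%s\n' "$rest"
rm -f "$regexps"
```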

edited Mar 16 '15 at 21:51

answered Mar 16 '15 at 21:29


Thanks for your reply. But this only works for muller. I need it to work dynamically.
The data is in a file which is parsed with another tool (bib2bib), which accepts regular expressions.
– User133713 Mar 16 '15 at 21:38
If you can't generalise from the suggestion please provide an example data file – Chris Davies Mar 16 '15 at 21:44
@user133713 No, it works for the three words that you asked for. – Janis Mar 16 '15 at 21:52
@user133713 If that "bib2bib" tool has its own regexp parser, you should consult the documentation (or man page) of that tool. The solution provided uses standard Unix extended regexps. – Janis Mar 16 '15 at 21:55
@User133713 Could you clarify what the process is you are working on? Please edit your question... – daniel kullmann Mar 16 '15 at 21:58
