5

I want to have an expression that targets all lines that begin with http, end with icon.ico and do not contain config.privoxy.org. In the sample list below, I would like to catch all except the third and fourth entries (from the top).

http://cdn.sstatic.net/askubuntu/img/favicon.ico
http://cdn.sstatic.net/unix/img/favicon.ico
http://config.privoxy.org/error-favicon.ico
http://config.privoxy.org/favicon.ico
http://economictimes.indiatimes.com/icons/etfavicon.ico
http://forums.linuxmint.com/images/favicon.ico
http://forums.mozillazine.org/static/common/images/favicon.ico
http://gmane.org/favicon.ico
http://mail.yimg.com/ok/u/assets/img/favicon-yhoo.ico
http://portableapps.com/favicon.ico
https://help.ubuntu.com/favicon.ico
https://www.axisbank.co.in/favicon.ico
http://user.services.openoffice.org/favicon.ico
http://www.gardnermuseum.org/favicon.ico
http://www.theregister.co.uk/favicon.ico
http://www.webupd8.org/favicon.ico
http://www.wilderssecurity.com/favicon.ico

The best I could come up with is '^.{19}[^x].*icon\.ico$' which is a cheap workaround since x is relatively rare. Is there a fool-proof, proper way to do what I want?

  • 3
    Your problem is discussed in http://stackoverflow.com/questions/406230/regular-expression-to-match-string-not-containing-a-word or http://stackoverflow.com/questions/1395177/regex-to-exclude-a-specific-string-constant – jofel Feb 21 '12 at 13:50
  • 1
    @jofel, I spent quite some time trying to figure out solutions based on lookaround / lookahead / lookbehind, but I couldn't figure out how to apply them to my situation. The closest would be grep -v, but I need an expression for a content-blocking add-on (SimpleBlock) in Firefox. The answer may be in the links you've provided, but I tried quite a few and failed. I'd be grateful for a solution specific to my need. –  Feb 21 '12 at 14:05

2 Answers

2

Mathematically speaking, if a regular expression recognizes a certain set of inputs, then there is a regular expression that recognizes the complement set. If you know that regular expressions are equivalent to finite automata, it's obvious: swap the accepting and non-accepting states in the (deterministic) automaton. However, the size of the regular expression for the complement can grow exponentially with the size of the original regular expression, so it's often impractically large.
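
To make the state-swapping argument concrete, here is a minimal Python sketch. The toy automaton (strings over the letters a and b that contain "ab") and all names in it are made up for illustration; it is not the automaton for the favicon problem. The swap only works on a deterministic automaton with a transition defined for every state and symbol, which is what the table below assumes.

```python
# Toy DFA over the alphabet {"a", "b"} accepting strings that contain "ab".
# States: 0 = nothing useful seen, 1 = last char was "a", 2 = "ab" seen (sink).
STATES = {0, 1, 2}
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 2,
}
ACCEPTING = {2}

# Complement automaton: same states and transitions, accepting set swapped.
COMPLEMENT_ACCEPTING = STATES - ACCEPTING


def accepts(word, accepting):
    """Run the DFA from state 0 and report whether it ends in `accepting`."""
    state = 0
    for ch in word:
        state = TRANSITIONS[(state, ch)]
    return state in accepting


for word in ["", "ab", "ba", "bbaab", "bbaa"]:
    print(repr(word), accepts(word, ACCEPTING), accepts(word, COMPLEMENT_ACCEPTING))
```

Every word is accepted by exactly one of the two automata. Turning the complemented automaton back into a regular expression is the step that can blow up in size.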

A regular expression for “begin with http, end with icon.ico and do not contain config.privoxy.org” is:

^http([^c]|c[^o]|co[^n]|…|config\.privoxy\.or[^g])*(c(o(n(f(…o(rg?)?)?)?)?)?)?icon\.ico$

(I hope I got it right. Note that there's quite a bit of … to fill in.)

Fortunately, Privoxy accepts more than mathematical regular expressions: it understands Perl extensions, including (?!foo) to match the empty string when it's followed by anything but foo. This is a zero-width negative lookahead assertion (zero-width: matches the empty string; lookahead assertion: restricts what may come immediately afterwards; negative: the restriction is expressed in terms of what may not appear), not a regex negation.

^http(?!.*config\.privoxy\.org).*icon\.ico$
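
A quick way to sanity-check this pattern outside Privoxy is Python's re module, which also supports (?!…). This is only a sketch: Python's regex engine is not the one Privoxy uses, so it is an assumption that the two agree on this pattern, and the URL list is a subset of the question's sample.

```python
import re

# Same pattern as above; the lookahead scans the rest of the line once,
# right after the "http" prefix.
pattern = re.compile(r"^http(?!.*config\.privoxy\.org).*icon\.ico$")

urls = [
    "http://cdn.sstatic.net/askubuntu/img/favicon.ico",
    "http://config.privoxy.org/error-favicon.ico",
    "http://config.privoxy.org/favicon.ico",
    "http://economictimes.indiatimes.com/icons/etfavicon.ico",
    "https://help.ubuntu.com/favicon.ico",
]

matched = [u for u in urls if pattern.search(u)]
for u in matched:
    print(u)
```

Only the two config.privoxy.org entries are filtered out of this list.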

Note that (?!…) must be used with care: if you don't pay attention, it might not mean what you think it means. For example:

  • ^http(?!config\.privoxy\.org).*icon\.ico$ matches http://config.privoxy.org/icon.ico, because config\.privoxy\.org does not appear immediately after the http prefix.
  • ^http(?!.*config\.privoxy\.org)icon\.ico$ does not match http://foo/icon.ico, because icon.ico must come immediately after the http prefix (what's between them can match only the empty string).
  • ^http.*(?!config\.privoxy\.org).*icon\.ico$ matches http://config.privoxy.org/icon.ico, because (?!config\.privoxy\.org) matches at the : (as well as at the first /, at the o in config, etc).
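
The three pitfalls above can be reproduced with Python's re module. This is a sketch under the assumption that Python's lookahead behaves like Privoxy's for these patterns; the variable names are made up for illustration.

```python
import re

blocked = "http://config.privoxy.org/icon.ico"  # should be rejected
other = "http://foo/icon.ico"                   # should be accepted

# Lookahead only checks the text directly after "http" (which is "://...").
too_narrow = re.compile(r"^http(?!config\.privoxy\.org).*icon\.ico$")
# Nothing between the lookahead and icon.ico, so the URL body cannot be consumed.
no_gap = re.compile(r"^http(?!.*config\.privoxy\.org)icon\.ico$")
# Lookahead may be tried at any position, and most positions satisfy it.
floating = re.compile(r"^http.*(?!config\.privoxy\.org).*icon\.ico$")
# The pattern that behaves as intended.
intended = re.compile(r"^http(?!.*config\.privoxy\.org).*icon\.ico$")

print("too_narrow matches blocked:", bool(too_narrow.search(blocked)))
print("no_gap matches other:", bool(no_gap.search(other)))
print("floating matches blocked:", bool(floating.search(blocked)))
print("intended matches blocked:", bool(intended.search(blocked)))
print("intended matches other:", bool(intended.search(other)))
```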

I think what you're after is in fact

^https?://(?!config\.privoxy\.org/).*/favicon\.ico$
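
Note that this last pattern is stricter than the wording of the question: it requires the file to be named exactly favicon.ico and only excludes config.privoxy.org as the host. A small Python sketch (again assuming Python's re agrees with the target engine on this pattern) shows the difference on a few of the sample URLs:

```python
import re

final = re.compile(r"^https?://(?!config\.privoxy\.org/).*/favicon\.ico$")

samples = [
    "http://gmane.org/favicon.ico",
    "https://help.ubuntu.com/favicon.ico",
    "http://config.privoxy.org/favicon.ico",
    "http://economictimes.indiatimes.com/icons/etfavicon.ico",
]

for url in samples:
    print(url, bool(final.search(url)))
```

The first two match, the config.privoxy.org entry is rejected by the lookahead, and the etfavicon.ico entry is rejected because its last path component is not exactly favicon.ico.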
  • Thank you! The last bit of code is the solution. I'm also thankful that you've provided explanations for the possibilities that do not work as intended. (While I do use Privoxy, I have to use something else for https sites. For browsing with Firefox, I like the SimpleBlock add-on but the directions on the box could be less cryptic.) –  Feb 22 '12 at 03:15
  • This is absolute nonsense. Try that last filter in Privoxy and paste one of links mentioned in the first post in your browser. Guess what... you can retrieve each and every one of them. The filter is absolutely doing nothing. –  Jul 31 '12 at 01:38
1
sed -n '/config\.privoxy\.org/d; /^http.*icon\.ico$/p'    

It isn't a single regex, but it certainly is simple.
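
For comparison, the same two-step idea (drop the excluded lines first, then keep the lines that fit the shape) can be sketched in Python. The function name favicon_lines is hypothetical, and this is an illustration of the approach rather than a translation of every sed detail.

```python
import re

EXCLUDED = re.compile(r"config\.privoxy\.org")
WANTED = re.compile(r"^http.*icon\.ico$")


def favicon_lines(lines):
    """Keep lines that match WANTED and do not contain EXCLUDED."""
    return [
        line for line in lines
        if not EXCLUDED.search(line) and WANTED.search(line)
    ]


sample = [
    "http://cdn.sstatic.net/unix/img/favicon.ico",
    "http://config.privoxy.org/favicon.ico",
    "http://portableapps.com/favicon.ico",
    "ftp://example.invalid/favicon.ico",  # made-up entry: wrong prefix
]
print(favicon_lines(sample))
```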

Peter.O