
I am trying to filter URLs, and Emacs Lisp's (thing-at-point 'url) doesn't find the bounds of a URL properly when the URL contains backslashes (\). I believe that backslashes are unsafe in URLs, but that's a separate topic; the fact is that I am receiving emails with URLs that contain them. Is there any way to make thing-at-point allow backslashes, so that such URLs are bounded properly? The relevant functions should be in thingatpt.el. Thank you in advance.

Update

This is phils' solution

(defun thing-at-point-bounds-of-url-at-point (&optional lax)
  "Return a cons cell containing the start and end of the URI at point.
Try to find a URI using `thing-at-point-markedup-url-regexp'.
If that fails, try with `thing-at-point-beginning-of-url-regexp'.
If that also fails, and optional argument LAX is non-nil, return
the bounds of a possible ill-formed URI (one lacking a scheme)."
  ;; Look for the old <URL:foo> markup.  If found, use it.
  (or (thing-at-point--bounds-of-markedup-url)
      ;; Otherwise, find the bounds within which a URI may exist.  The
      ;; method is similar to `ffap-string-at-point'.  Note that URIs
      ;; may contain parentheses but may not contain spaces (RFC3986).
      (let* ((allowed-chars "--:=&?$+@-Z_[:alpha:]~#,%;*()!'\\\\")
             (skip-before "^[0-9a-zA-Z]")
             (skip-after  ":;.,!?")
             (pt (point))
             (beg (save-excursion
                    (skip-chars-backward allowed-chars)
                    (skip-chars-forward skip-before pt)
                    (point)))
             (end (save-excursion
                    (skip-chars-forward allowed-chars)
                    (skip-chars-backward skip-after pt)
                    (point))))
        (or (thing-at-point--bounds-of-well-formed-url beg end pt)
            (if lax (cons beg end))))))

and here are the original URL and the URL as bounded by thing-at-point-bounds-of-url-at-point (both edited):

original:

https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.disney.com%2F&amp\;data=02%7C01%7CDonald.Duck%40disney.com%7Cfd8a40dd8138470286be08d8020da460%7C07ef1208413c4b5e9cdd64ef305754f0%7C0%7C0%7C637261604914185764&amp\;sdata=UdFaQR32cGXKu7ZrBU%2BGg8TJ%2BOeTZUMBV9N%2BE8Vi22s%3D&amp\;reserved=0

bounded:

https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.disney.com%2F&amp
Ajned
  • RFC1738 is clear on that topic: "Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`". *All unsafe characters must always be encoded within a URL.*" – phils May 28 '20 at 00:46
  • All of which I believe you've already seen; however you still may wish to contact the sender and get them to fix their bug. – phils May 28 '20 at 00:51
  • Yes, I was aware of the fact that those were unsafe. By the example in my last update you can probably guess who the sender is... – Ajned May 28 '20 at 05:54
  • `&amp\;` looks like an attempt at `&` which has gone awry. – phils May 28 '20 at 06:52

1 Answer


In Emacs 26.3 at least, the allowed characters are hard-coded in thing-at-point-bounds-of-url-at-point, so you would need to modify that function to add a backslash to its allowed-chars binding. The notes below on escaping apply here too: use "\\\\". (Likewise, I have no idea whether or not this will have unwanted side-effects.)

Edit: Initial answer follows, before I'd realised that thing-at-point doesn't use url-get-url-at-point at all...


url-get-url-filename-chars is a variable defined in `url-util.el'.
Its value is "-%.?@a-zA-Z0-9()_/:~=&"

Documentation:
Valid characters in a URL.

It's used (by url-get-url-at-point) both in a regexp character alternative, and also with skip-chars-*, which means you'd need to use the latter's escaping for a backslash: "\\\\" (which is merely redundant for the former).
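
To illustrate the escaping rule, here is a minimal sketch using a throwaway buffer. The sample text "ab\cd ef" is an assumption made up for the demonstration and has nothing to do with url-util.el itself.

```elisp
;; Buffer text is: ab\cd ef   (one literal backslash after "ab").

(with-temp-buffer
  (insert "ab\\cd ef")
  (goto-char (point-min))
  (skip-chars-forward "a-z")        ; backslash not in the set: stops there
  (point))                          ; → 3

(with-temp-buffer
  (insert "ab\\cd ef")
  (goto-char (point-min))
  (skip-chars-forward "a-z\\\\")    ; "\\\\" adds a literal backslash
  (point))                          ; → 6

;; Inside a regexp bracket expression the backslash is not special, so
;; the same doubled form also matches it; the doubling is just redundant.
(progn
  (string-match "[a-z\\\\]+" "ab\\cd ef")
  (match-end 0))                    ; → 5
```

The Lisp reader turns "\\\\" into two backslash characters, and skip-chars-forward then reads that pair as one quoted backslash.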

I don't know what all of the practical consequences of modifying this value might be.

phils
  • Thanks phils. Unfortunately it didn't work. See how I've modified `thing-at-point-bounds-of-url-at-point` in my update above. I've also added an example for illustrative purposes. And yes, it is working since I'm already experiencing, as you mentioned, some side-effects (even though not concerning at this point). – Ajned May 28 '20 at 05:51
  • Which is the "unfortunately didn't work" part? If I put point at the start of your example URL and use `M-: (insert (format "%s\n" (thing-at-point 'url)))` I end up with an exact duplicate line. And I'm getting the same value back with point anywhere within that URL. – phils May 28 '20 at 06:49
  • To be clear, when you say "your example" you mean the original? And have you changed `thing-at-point-bounds-of-url-at-point` as I've described above or not? I cannot get the original example bounded with or without \\\\ in the allowed-chars and I am using emacs 26.3 :-( – Ajned May 28 '20 at 07:28
  • Yes, this is Emacs 26.3, and I copied the code and example URL that you've added to the question. Without evaluating the modified function I see the original problem. After evaluating the function, I see the behaviour you wanted. Try with `emacs -Q`, perhaps. – phils May 28 '20 at 10:12
  • If you're still having issues, tell me step by step exactly how you're testing this, where you've put the function, how and when you're loading/evaluating it, etc. – phils May 28 '20 at 10:40
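
One way to reproduce the check discussed in these comments, sketched under a few assumptions: `my/url-at-start-of' is a hypothetical helper (not part of thingatpt.el), and the example.com URL is a shortened stand-in for the real one. Note that thingatpt.el has to be loaded before the modified function is evaluated; if the library loads afterwards, its stock definition replaces yours.

```elisp
;; Sketch only.  Load the library first, then evaluate the modified
;; `thing-at-point-bounds-of-url-at-point' from the question, so that
;; a later load of thingatpt.el cannot overwrite it.
(require 'thingatpt)

(defun my/url-at-start-of (text)
  "Insert TEXT into a temporary buffer and return the URL at its start.
Hypothetical helper for testing; not part of thingatpt.el."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (thing-at-point 'url)))

;; In the Lisp string below, "\\;" is a single literal backslash
;; followed by a semicolon, as in the URLs from the question.
(my/url-at-start-of "https://example.com/?a=1&amp\\;b=2")
```

Going by the behaviour reported in this thread, the stock function should return a string that stops before the backslash, while the modified one should return the whole URL. Running the same form under `emacs -Q` rules out interference from the init file.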