4

I'm using a simple (re-search-forward "\"") to find the end of string constants in my buffer, in order to copy the string in a definition like (defconst name1 "some string"). The problem is that a string constant may contain the ?\" character if it is escaped, and one then has to check that the escaping character ?\\ is not itself escaped. I also want to find the end of empty strings. So I need something more robust.

Since Emacs has to do this kind of search all the time, e.g. for highlighting string constants, I thought there must be some internal function that does just that, faster and more reliably than I could manage myself. However, I failed to find any such function in the manuals or with apropos.

How is it done internally? Is there a forward-string command that would work like forward-sexp but for string constants? (I understand that string syntax is mode-dependent but I am happy with something that works for elisp or C).

phs
  • You could iterate through the buffer using `(goto-char (next-property-change (point)))`, then examine the properties at point using `(text-properties-at (point))` to determine whether point is currently in a string. See this for an extended example: https://www.gnu.org/software/emacs/manual/html_node/elisp/Property-Search.html . Also, once at the beginning of a string, `forward-sexp` will jump to the end of the string (so if you have any special decorations that alter text properties, this would still work). – wvxvw Nov 09 '17 at 13:37
  • Thanks for pointing out these features that I did not know about. It seems this solution is very dependent on how the text was fontified. For example, it does not work in BibTeX mode. – phs Nov 09 '17 at 13:45
  • Yes, it assumes that Emacs knew how to fontify the buffer. If it didn't, then there isn't much it can do for you. If BibTeX mode didn't specify any font-lock rules, then you are basically looking at a plain text file, and Emacs knows nothing about its structure. Even though strings follow similar rules in many programming languages, there are differences, so there isn't a universal way to identify a string without knowing what language it is defined in. – wvxvw Nov 09 '17 at 14:00
  • At the moment I am using `(re-search-forward "\"\\(\\\\\\(.\\|\n\\)\\|[^\\\"]\\)*\"")` which works fine (and faster than I thought) in C and BibTeX and Lisp modes, but it has its limitations. For example it has to be started outside a string constant :-) – phs Nov 09 '17 at 17:42

5 Answers

5

Some remarks first:

  1. This question should not only be about finding the end of a string. Finding the beginning of a string is also quite complicated if you want to do it right. There may be double-quote characters ?\" within the program text that must be ignored. Furthermore, there may be unpaired double-quote characters within comments that must also be ignored.
  2. Don't rely on text properties, since they depend on the buffer's fontification. A lazily fontified buffer is not fully fontified, so you would first need to call font-lock-ensure, which is often too much work for the job at hand.

If the major mode uses the syntax table to identify comments and strings it is better to use parse-partial-sexp.
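
As a minimal sketch of that idea, assuming the major mode's syntax table describes its strings correctly, the command below jumps to the end of the string containing point. The name my-forward-string is hypothetical; it relies on syntax-ppss, which returns the same kind of state as parse-partial-sexp and caches its results.

```elisp
;; Hypothetical helper, not part of Emacs: a "forward-string" for the
;; string constant that contains point.
(defun my-forward-string ()
  "Move point just past the end of the string constant containing point.
Return the new position, or nil (without moving) if point is not
inside a string or the string is unterminated."
  (interactive)
  (let* ((ppss (syntax-ppss))
         ;; (nth 3 ppss) is non-nil inside a string;
         ;; (nth 8 ppss) is then the position of the opening delimiter.
         (start (and (nth 3 ppss) (nth 8 ppss)))
         ;; `scan-sexps' signals an error on an unterminated string.
         (end (and start (ignore-errors (scan-sexps start 1)))))
    (when end
      (goto-char end))))
```

Because the syntax table does the work, escaped quotes and empty strings need no special handling here. Unlike a plain regexp search, the command can be started from inside a string.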

I demonstrate this with the following elisp snippet. You can use the macro for-the-strings for your purpose: just implement the functionality you want as the body of this macro. The first two args are symbols that are bound to the beginning and the end of each string. The third arg is the end of the region to search for strings. If you set the third arg to nil, the buffer is searched up to (point-max).

(defmacro for-the-strings (b e end &rest body)
  "Run BODY for each string from `point' up to END in the current buffer.
B and E are symbols that are `let'-bound to the beginning and end,
respectively, of each string (including the string delimiters).
If END is nil, search up to `point-max'.
BODY may modify the string, but it must not remove the string
delimiters or move the start of the string.
BODY need not preserve point, but it must not change the value of B."
  (declare (debug (symbolp symbolp form body)) (indent 3))
  ;; Use an uninterned symbol so we cannot clash with a variable
  ;; named `end' that may already be in scope.
  (let ((sym-end (make-symbol "end")))
    `(let (ppss
           (,sym-end (make-marker)))
       (unwind-protect
           (progn
             (set-marker ,sym-end (or ,end (point-max)))
             (while (progn
                      ;; SEARCH FOR THE NEXT STRING OR COMMENT:
                      (setq ppss (parse-partial-sexp (point) ,sym-end nil nil ppss 'syntax-table))
                      (and (< (point) ,sym-end)
                           (null (eobp))))
               (when (nth 3 ppss) ;; NOT A COMMENT BUT A STRING
                 (backward-char) ;; NOW WE ARE AT THE STRING STARTER
                 (let ((,b (point))
                       (,e (scan-sexps (point) 1))) ;; FIND STRING END
                   ,@body
                   (goto-char (1+ ,b)) ;; Maybe, the string is modified by side-effects.
                   ;; The safest action is to start again from the beginning of the string position.
                   ;; We skip to the start of the next sexp.
                   ;; That may also be a string or a comment.
                   (setq ppss (parse-partial-sexp (point) ,sym-end nil t ppss))
                   ))))
         (set-marker ,sym-end nil)
         ))))

;; Demo:
(let* ((end (save-excursion (search-forward "ENDMARK" nil t))))
  (search-forward "TESTDATA" nil t) ;; Start the search at the TESTDATA marker below (point is after this form when it is evaluated with C-x C-e).
  (for-the-strings
      b
      e
      end
    (message "String: %s" (buffer-substring-no-properties b e))
    ;; Next we demonstrate how we can search and modify the found string:
    (goto-char b)
    (when (search-forward "FOOBAR" e t)
      (replace-match "FOO")))) ;;< Evaluate the demo form at that closing paren with C-x C-e and observe the emitted messages.

;; TESTDATA:

"This string is shown.
Escaped \"double-quotes\" work fine." ; some comment

"FOOBAR is replaced by FOO"

(setq some-form ; another comment
      '(1
        2
        ?\" ;;< This escaped double-quote does not start a string.
        "String after ?\\\"."))


; ENDMARK

(message "This string is ignored")
Tobias
4

I use this regexp string: "\"\\(?:[^\\\"]\\|\\\\\\(?:.\\|[\n]\\)\\)*\"". It matches a pair of " characters enclosing zero or more occurrences of either a character that is neither " nor a backslash, or a backslash followed by any character (including " and newline).

For example:

(re-search-forward "\"\\(?:[^\\\"]\\|\\\\\\(?:.\\|[\n]\\)\\)*\"" nil 'NO-ERROR)
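
A small sketch of how this regexp might be wrapped for reuse. The names my-string-constant-regexp and my-next-string-constant are hypothetical. As noted in the comments on the question, a regexp search like this has to start outside a string constant, and it knows nothing about comments or character literals such as ?\".

```elisp
;; Hypothetical names; the regexp itself is the one given above.
(defconst my-string-constant-regexp
  "\"\\(?:[^\\\"]\\|\\\\\\(?:.\\|[\n]\\)\\)*\""
  "Regexp matching a double-quoted string constant, delimiters included.")

(defun my-next-string-constant ()
  "Return the next string constant after point, including its delimiters.
Point is left just after the closing delimiter.  Return nil, without
moving point, if no string constant is found.
Assumes that point starts outside any string constant."
  (when (re-search-forward my-string-constant-regexp nil t)
    (match-string-no-properties 0)))
```

Calling my-next-string-constant repeatedly walks through the strings in the buffer one at a time. Empty strings match as well, because the group between the delimiters may repeat zero times.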
Drew
  • n.b. The following is more useful (readable) when formatted across multiple lines. If anyone wants an `rx` equivalent to the above regexp to cut down on the backslashes: `(rx "\"" (zero-or-more (or (not (any "\\" "\"")) (seq "\\" anything))) "\"")` – phils Nov 22 '17 at 23:40
  • @phils: I didn't know about this `rx` macro. Thanks for pointing it out, it is a great way of writing regexps. Now I will have to comb through all my code using obscure regexps and make them readable... – phs Feb 01 '18 at 08:59
0
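(defun my-end-of-string ()
  "Move point past the next double-quote that leaves point outside any
string or comment; this is normally the closing quote of a string."
  (interactive)
  ;; Keep searching while the position after the quote we found is still
  ;; inside a string or comment, i.e. the quote did not close a string.
  (while (and (search-forward "\"" nil t)
              (nth 8 (syntax-ppss)))))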
Andreas Röhler
0

I've finally opted for a more general approach, with a bit more work but yielding more returns.

The package thingatpt.el allows code like (thing-at-point 'symbol) and can be extended easily to also allow (thing-at-point 'string-constant). Extending for string constants is explicitly considered in https://www.emacswiki.org/emacs/StringAtPoint.
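
A minimal sketch of such an extension, assuming the major mode's syntax table describes its strings. The function name my-bounds-of-string-constant-at-point is hypothetical, and the EmacsWiki page may well implement this differently. thingatpt.el looks up the bounds-of-thing-at-point property of the thing's symbol, so registering one function is enough.

```elisp
(require 'thingatpt)

;; Hypothetical helper: compute the bounds of the string around point
;; from the syntactic state rather than from faces.
(defun my-bounds-of-string-constant-at-point ()
  "Return (START . END) of the string constant containing point, or nil.
The bounds include the string delimiters."
  (let* ((ppss (syntax-ppss))
         ;; Inside a string, (nth 8 ppss) is the opening delimiter.
         (start (and (nth 3 ppss) (nth 8 ppss)))
         ;; `scan-sexps' signals an error on an unterminated string.
         (end (and start (ignore-errors (scan-sexps start 1)))))
    (when end
      (cons start end))))

;; Tell thingatpt.el how to find the bounds of a `string-constant'.
(put 'string-constant 'bounds-of-thing-at-point
     #'my-bounds-of-string-constant-at-point)
```

With this in place, (thing-at-point 'string-constant t) returns the text of the string around point without text properties, and (bounds-of-thing-at-point 'string-constant) returns its start and end.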

phs
0

See my answer https://emacs.stackexchange.com/a/14240/202

It uses the face text property to find the end of the string. I learned this technique from flyspell-mode (see the code of flyspell-prog-mode).
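
A rough sketch of the face-based idea. It is not the code from the linked answer, and it assumes the major mode fontifies strings with font-lock-string-face. The names my-string-face-p and my-end-of-string-by-face are hypothetical. Note that it fontifies the whole buffer first, which can be slow on large files.

```elisp
;; Hypothetical helpers illustrating a face-based search.
(defun my-string-face-p (pos)
  "Return non-nil if the text at POS carries `font-lock-string-face'."
  (let ((face (get-text-property pos 'face)))
    (if (listp face)
        (memq 'font-lock-string-face face)
      (eq face 'font-lock-string-face))))

(defun my-end-of-string-by-face ()
  "Move point to the end of the string-faced text containing point.
Return the new position, or nil if point is not on string-faced text."
  (interactive)
  ;; A lazily fontified buffer may not have faces here yet.
  (font-lock-ensure)
  (when (my-string-face-p (point))
    (let ((pos (point)))
      ;; Follow face changes until the string face ends.
      (while (and (< pos (point-max))
                  (my-string-face-p pos))
        (setq pos (or (next-single-property-change pos 'face)
                      (point-max))))
      (goto-char pos))))
```

This works only as well as the mode's font-lock rules do, which is the dependency on fontification raised in the comments on the question.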

chen bin