
I'm looking for a way to use query-replace-regexp to match two consecutive words that are identical except for the case of their initial letter:

The the
the The

I can't figure out how to match both typos with a single regular expression, because I don't know how to use downcase or capitalize in the search part the way I usually do in the replacement part (e.g. with a lambda). I mean something like this:

;; THIS CODE IS WRONG!!!
(query-replace-regexp (concat "\\(\\(\\<[A-Z][[:alpha:]]+\\>\\)\\)\\([~\s\n]+\\)"
                              `((lambda (data count))
                                (downcase (match-string 1))))
                      nil (point-min) (point-max))

Of course I know this is wrong and so I tried this code for the first kind of typo:

;; "The the" case
(let (cap-word space typo)
  (goto-char (point-min))
  (while (re-search-forward "\\(\\([A-Z]\\)\\([[:alpha:]]+\\)\\)\\([ ~\n]+\\)" nil t)
    (save-excursion
      (setq cap-word (match-string 1)
            space (match-string 2)
            typo (concat cap-word
                         space
                         (downcase cap-word))

            (when (looking-at typo)

              ;; do things

              ))))

It works, but it's slow and, above all, it does not match both kinds of typo in a single pass. Is there a way to do that?

Update

I said the code above works, but that's not true. Sorry: I posted the wrong version of the code, so here is the right one for completeness:

;; "The the" case
(let ((case-fold-search nil)
      cap-word space down-word typo)
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward "\\<\\([A-Z]\\)\\([[:alpha:]]+\\)\\>\\([ ~\n]+\\)" nil t)
      (save-excursion
        (setq cap-word (concat (match-string 1)
                               (match-string 2))
              space (match-string 3)
              down-word (downcase cap-word)
              typo (concat cap-word
                           space
                           down-word))

        (when (looking-at down-word)

          ;; do things

          )))))
Onner Irotsab
  • For clarity, is it *actually* an important requirement that the two consecutive identical words have different first-letter case? I'm asking because the majority of instances of repeated words *without* case changes are going to be mistakes as well, and so I'd be more inclined to have a query-replace process over *all* such potential errors. – phils Feb 25 '22 at 01:59
  • What @phils asked. Would a case-insensitive match not be sufficient? – Drew Feb 25 '22 at 02:04
  • @phils Yes, it is: I already have a script that deals with two consecutive identical words, but recently I came across the case "The the" that was pointed out to me by the authors (I'm a professional LaTeX typesetter). – Onner Irotsab Feb 25 '22 at 08:38
  • @Drew Unfortunately not: The code I posted above is just part of a very long function that takes care of various things and I need to be case sensitive. – Onner Irotsab Feb 25 '22 at 08:41

1 Answer

There are numerous bugs in that code, but perhaps something like this is what you want?

(let ((case-fold-search nil)
      (word "\\b\\([[:word:]]\\)\\([[:word:]]*\\)\\b[[:space:]]*")
      initial suffix typo)
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward word nil t)
      (setq initial (match-string 1)
            suffix (match-string 2)
            typo (concat (if (string-match-p "[[:upper:]]" initial)
                             (downcase initial)
                           (upcase initial))
                         suffix "\\b"))
      (when (looking-at typo)
        (message "%s" (match-string 0))))))

Edit: On review, I'd actually go with this approach:

(let ((case-fold-search t)
      (regexp "\\b\\([[:word:]]+\\)[[:space:]]+\\(\\1\\)\\b"))
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward regexp nil t)
      (unless (string= (match-string 1) (match-string 2))
        (message "%s" (match-string 0))))))

This is comparing the case of the whole word, not just the first letter, but I assumed that was fine.

You can use (aref STRING 0) to obtain the first character if need be (and compare those with eql rather than string=). The initial match is case-insensitive though, so you'd potentially want to compare the remainder of each word as well. I'm guessing you don't need to do any of that, though.
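If you do need the stricter check, here is a minimal sketch of what that could look like. The function names `my-initial-case-typo-p` and `my-find-initial-case-typos` are made up for illustration, and it assumes that "typo" means exactly: same word, first letters differing only in case, and the rest of the word identical.

```elisp
;; Sketch only: both function names below are hypothetical.
(defun my-initial-case-typo-p (first second)
  "Return non-nil if FIRST and SECOND differ only in the case of their first character."
  (and (> (length first) 0)
       (= (length first) (length second))
       ;; The initials must be different characters (e.g. ?T vs ?t) ...
       (not (eql (aref first 0) (aref second 0)))
       ;; ... that are the same letter once case is ignored ...
       (eql (downcase (aref first 0)) (downcase (aref second 0)))
       ;; ... and the remainder of each word must match exactly.
       (string= (substring first 1) (substring second 1))))

(defun my-find-initial-case-typos ()
  "Return the repeated-word matches in the buffer that differ only in initial case."
  (let ((case-fold-search t)  ; the search itself is case-insensitive
        (regexp "\\b\\([[:word:]]+\\)[[:space:]]+\\(\\1\\)\\b")
        matches)
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward regexp nil t)
        ;; Filter the case-insensitive hits down to the real typos.
        (when (my-initial-case-typo-p (match-string-no-properties 1)
                                      (match-string-no-properties 2))
          (push (match-string-no-properties 0) matches))))
    (nreverse matches)))
```

In a buffer containing "The the cat sat on the The mat mat and THE the end." this collects "The the" and "the The", while skipping "mat mat" (same case) and "THE the" (the rest of the word differs too). Replace the `push` with whatever you need to do at each match.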

phils
  • Thank you very much. Could you please explain which bugs you are talking about? – Onner Irotsab Feb 25 '22 at 08:47
  • You weren't disabling `case-fold-search` and so you were very likely matching case-insensitively. Your group numbers and associated bindings are wrong (your group 1 is the word, group 2 is the first letter, not `space`) which then messes up your `typo` value. And as `re-search-forward` leaves point at the end of the match, you're only going to be `(looking-at typo)` if the text was "The The the" (assuming `typo` was as you had intended). – phils Feb 25 '22 at 09:20
  • Thank you very much! You are definitely right! I posted the wrong version of the code: I had changed the regular expression to search for `The ` and didn't update the groups below. As for `case-fold-search`, as this code is part of a very long function, where `case-fold-search` is active, I haven't repeated it in my code above, but I should have specified it. Sorry. – Onner Irotsab Feb 25 '22 at 10:22