How to match two consecutive identical words except for the case of the initial characters?

Question

I'm looking for a solution to match two consecutive identical words except for the case of the initial characters with query-replace-regexp:

The the
the The

I can't figure out how to match these typos with a regular expression at once because I don't know how to use downcase or capitalize in the searching part as I usually do in the replacement part (e.g. with lambda). I mean something like this:

;; THIS CODE IS WRONG!!!
(query-replace-regexp (concat "\\(\\(\\<[A-Z][[:alpha:]]+\\>\\)\\)\\([~\s\n]+\\)"
                              `((lambda (data count))
                                (downcase (match-string 1))))
                      nil (point-min) (point-max))

Of course I know this is wrong and so I tried this code for the first kind of typo:

;; "The the" case
(let (cap-word space typo)
  (goto-char (point-min))
  (while (re-search-forward "\\(\\([A-Z]\\)\\([[:alpha:]]+\\)\\)\\([ ~\n]+\\)" nil t)
    (save-excursion
      (setq cap-word (match-string 1)
            space (match-string 2)
            typo (concat cap-word
                         space
                         (downcase cap-word))

            (when (looking-at typo)

              ;; do things

              ))))

It works, but it's slow and, above all, it does not match both kinds of typo at once. Is there a way to do that?

Review

I said the code above works, but that's not true. Sorry: I posted the wrong version of the code, so here is the right one for completeness:

;; "The the" case
(let ((case-fold-search nil)
      cap-word space down-word typo)
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward "\\<\\([A-Z]\\)\\([[:alpha:]]+\\)\\>\\([ ~\n]+\\)" nil t)
      (save-excursion
        (setq cap-word (concat (match-string 1)
                               (match-string 2))
              space (match-string 3)
              down-word (downcase cap-word)
              typo (concat cap-word
                           space
                           down-word))

        (when (looking-at down-word)

          ;; do things

          )))))

For clarity, is it *actually* an important requirement that the two consecutive identical words have different first-letter case? I'm asking because the majority of instances of repeated words *without* case changes are going to be mistakes as well, and so I'd be more inclined to have a query-replace process over *all* such potential errors. — phils, Feb 25 '22 at 01:59
What @phils asked. Would a case-insensitive match not be sufficient? — Drew, Feb 25 '22 at 02:04
@phils Yes, it is: I already have a script that deals with two consecutive identical words, but recently I came across the case "The the" that was pointed out to me by the authors (I'm a professional LaTeX typesetter). — Onner Irotsab, Feb 25 '22 at 08:38
@Drew Unfortunately not: The code I posted above is just part of a very long function that takes care of various things and I need to be case sensitive. — Onner Irotsab, Feb 25 '22 at 08:41

phils · Accepted Answer · 2022-02-25T09:44:16.023

There are numerous bugs in that code, but perhaps something like this is what you want?

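(let ((case-fold-search nil)            ; match case-sensitively
      ;; Group 1 is the first character, group 2 the rest of the word.
      (word "\\b\\([[:word:]]\\)\\([[:word:]]*\\)\\b[[:space:]]*")
      initial suffix typo)
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward word nil t)
      ;; Build a regexp for the same word with the case of its initial flipped.
      (setq initial (match-string 1)
            suffix (match-string 2)
            typo (concat (if (string-match-p "[[:upper:]]" initial)
                             (downcase initial)
                           (upcase initial))
                         suffix "\\b"))
      ;; Point is now just after the first word and its trailing whitespace.
      (when (looking-at typo)
        (message "%s" (match-string 0))))))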

Edit: On review, I'd actually go with this approach:

(let ((case-fold-search t)              ; let \1 match regardless of case
      (regexp "\\b\\([[:word:]]+\\)[[:space:]]+\\(\\1\\)\\b"))
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward regexp nil t)
      ;; string= is case-sensitive, so this skips exact duplicates.
      (unless (string= (match-string 1) (match-string 2))
        (message "%s" (match-string 0))))))

This is comparing the case of the whole word, not just the first letter, but I assumed that was fine.

You can use (aref STRING 0) to obtain the first character if need be (and compare those with eql rather than string=). The initial match is case-insensitive though, so you'd potentially want to compare the remainder of each word as well. I'm guessing you don't need to do any of that, though.
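
If the stricter check is needed, the idea in the previous paragraph could be sketched roughly as follows. This is only an illustration, not part of the original answer: the function name `my-initial-case-doubles` is hypothetical, and it collects the matches in a list instead of calling `message`.

```elisp
(defun my-initial-case-doubles ()
  "Return doubled words in the current buffer that differ only in
the case of their first character (hypothetical helper)."
  (let ((case-fold-search t)   ; the backreference matches in any case
        (regexp "\\b\\([[:word:]]+\\)[[:space:]]+\\(\\1\\)\\b")
        hits)
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward regexp nil t)
        (let ((first (match-string 1))
              (second (match-string 2)))
          ;; First characters must differ (which, after a case-folded
          ;; match, can only be a case difference) ...
          (when (and (not (eql (aref first 0) (aref second 0)))
                     ;; ... while the rest of each word is identical.
                     (string= (substring first 1) (substring second 1)))
            (push (match-string 0) hits)))))
    (nreverse hits)))
```

In a buffer containing `The the cat. THE the dog. the The end. the the same.`, this should return `("The the" "the The")`: "THE the" is skipped because the remainders differ in case, and "the the" because the initials are the same.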

Thank you very much. Could you please explain which bugs you are talking about? — Onner Irotsab, Feb 25 '22 at 08:47
You weren't disabling `case-fold-search` and so you were very likely matching case-insensitively. Your group numbers and associated bindings are wrong (your group 1 is the word, group 2 is the first letter, not `space`) which then messes up your `typo` value. And as `re-search-forward` leaves point at the end of the match, you're only going to be `(looking-at typo)` if the text was "The The the" (assuming `typo` was as you had intended). — phils, Feb 25 '22 at 09:20
Thank you very much! You are definitely right! I posted the wrong version of the code: I had changed the regular expression to search for `The ` and didn't update the groups below. As for `case-fold-search`, as this code is part of a very long function, where `case-fold-search` is active, I haven't repeated it in my code above, but I should have specified it. Sorry. — Onner Irotsab, Feb 25 '22 at 10:22
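
The group-numbering problem described in the comments can be checked directly. The following is a small sketch, assuming the regexp from the first version of the question is applied to the string "The the"; the function name `my-group-demo` is hypothetical and exists only for this illustration.

```elisp
(defun my-group-demo ()
  "Show what each group of the question's first regexp captures
for the input \"The the\" (hypothetical helper)."
  (let ((case-fold-search nil)
        (regexp "\\(\\([A-Z]\\)\\([[:alpha:]]+\\)\\)\\([ ~\n]+\\)")
        (text "The the"))
    (when (string-match regexp text)
      (list (match-string 1 text)    ; the whole capitalized word
            (match-string 2 text)    ; its first letter, not the space
            (match-string 3 text)    ; the rest of the word
            (match-string 4 text)    ; the separator
            (match-end 0)))))        ; index where the match ends
```

Evaluating `(my-group-demo)` should give `("The" "T" "he" " " 4)`. Group 2 is the first letter, so binding it to `space` produces a wrong `typo`. The match also ends at index 4, after the separator. In a buffer, `re-search-forward` leaves point at the corresponding position, so a following `looking-at` is tested against the second word only.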
