Search for and correct capitalized words that are not at the beginning of a sentence

Question

I have a LaTeX document in which there are many incorrectly Capitalized words. Could anybody suggest a clever regex to track them?

Even better would be a function that cycles through such words and corrects each one after asking yes or no.

Drew · Answer 1 · 2015-04-11T08:03:06.653

One suggestion might be to do this:

1. Lower-case all capitalized words. E.g., C-x h M-x downcase-region.
2. Capitalize all words that begin a sentence. Function sentence-end returns a regexp that matches sentence ends - you can adapt that.

Command query-replace-regexp is your friend. You can even replace subgroup matches with text produced or transformed by a Lisp sexp. See C-h f query-replace-regexp, argument TO-STRING. (There is also the similar command replace-regexp.)

For #1, you could use a regexp such as \([A-Z]\)\([a-z]*\) and a TO-STRING such as \,(downcase \1)\2. (But if you don't need to check each candidate you can just use M-x downcase-region.)
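The same downcasing step can be sketched in Lisp rather than typed interactively. This is only an illustration: the function name my-downcase-capitalized-words is hypothetical, and the sketch binds case-fold-search to nil on the assumption that [A-Z] should match uppercase letters only.

```elisp
(defun my-downcase-capitalized-words (beg end)
  "Downcase the initial letter of each capitalized word between BEG and END.
Hypothetical helper: a non-querying equivalent of the regexp replacement."
  (interactive "r")
  ;; Without this binding, [A-Z] would also match lowercase letters.
  (let ((case-fold-search nil))
    (save-excursion
      (goto-char beg)
      ;; The replacement has the same length as the match, so END stays valid.
      (while (re-search-forward "\\([A-Z]\\)\\([a-z]*\\)" end t)
        ;; Replace only subgroup 1, literally and without case conversion.
        (replace-match (downcase (match-string 1)) t t nil 1)))))
```

Called as (my-downcase-capitalized-words (point-min) (point-max)), it treats the whole buffer; like downcase-region, it does not ask before changing anything.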

This command will pretty much do it (without querying):

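(defun capitalize-only-sentence-starts ()
  "Downcase the buffer, then capitalize the first letter after each sentence end."
  (interactive)
  (downcase-region (point-min) (point-max))
  (let* ((sent-end    (sentence-end))
         (sent-start  (format "%s\\([a-z]\\)" sent-end)))
    (save-excursion
      ;; Search from the top of the buffer, not from wherever point happens to be.
      (goto-char (point-min))
      (while (re-search-forward sent-start nil t)
        ;; Replace 3rd subgroup match by its uppercase equivalent.
        (replace-match (upcase (match-string 3)) nil nil nil 3)))))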

That will capitalize only sentence starts that appear after sentence ends, so it won't pick up the very first sentence start.

Note too that function sentence-end matches sentence endings that are defined by these user options:

sentence-end
sentence-end-base      
sentence-end-double-space
sentence-end-without-period
sentence-end-without-space

In my case, (sentence-end) returns this, which is why I used (match-string 3) above, to pick up the sentence start char:

"\\([.?!][]\"'”)}]*\\($\\|[  ]$\\|  \\|[  ][  ]\\)\\|[。．？！]+\\)[    \n]*"

That has 2 subgroups (0 is for the whole match, 1 for the first subgroup match, 2 is for the second). The third subgroup for the full regexp is from sent-start.
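Because the number of subgroups depends on how those options are set, the group number could be computed instead of hard-coded. The following is a minimal sketch, assuming the standard function regexp-opt-depth (which counts the non-shy groups in a regexp); the name my-sentence-start-regexp is hypothetical.

```elisp
(require 'regexp-opt)

(defun my-sentence-start-regexp ()
  "Return a cons (REGEXP . GROUP) for a lowercase letter after a sentence end.
REGEXP is the value of (sentence-end) followed by one more subgroup, and
GROUP is the number of that added subgroup."
  (let ((sent-end (sentence-end)))
    (cons (concat sent-end "\\([a-z]\\)")
          ;; The added subgroup comes right after those in SENT-END.
          (1+ (regexp-opt-depth sent-end)))))
```

With the value of (sentence-end) shown above, the cdr of the result is 3, which is the number passed to match-string in the command.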

But here is a simpler command. ;-) The sentence-end regexp is handled for you, by backward-sentence:

(defun capitalize-only-sentence-starts ()
  (interactive)
  (downcase-region (point-min) (point-max))
  (save-excursion
    (goto-char (point-max))
    (while (not (bobp))
      (backward-sentence)
      (save-excursion (capitalize-word 1)))))
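Both commands above change the text without asking. A querying variant, closer to the yes-or-no cycling requested in the question, might look like the sketch below. The names my-mid-sentence-capital-p and my-query-downcase-mid-sentence are hypothetical, and the sketch relies on backward-sentence, so it recognizes sentence starts only as the sentence-end options define them (by default, two spaces after the period).

```elisp
(defun my-mid-sentence-capital-p (word-beg word-end)
  "Return non-nil if the word from WORD-BEG to WORD-END does not start a sentence."
  (save-excursion
    (goto-char word-end)
    ;; From inside a sentence, this moves to the start of that sentence.
    (backward-sentence)
    (/= (point) word-beg)))

(defun my-query-downcase-mid-sentence ()
  "Ask, for each capitalized word that does not start a sentence, whether to downcase it."
  (interactive)
  (let ((case-fold-search nil))          ; make [A-Z] match uppercase only
    (save-excursion
      (goto-char (point-min))
      ;; Match whole words of the form Xxxx; mixed-case words like LaTeX are skipped.
      (while (re-search-forward "\\<[A-Z][a-z]*\\>" nil t)
        (let* ((beg (match-beginning 0))
               (end (match-end 0))
               ;; Save the text now: backward-sentence clobbers the match data.
               (word (buffer-substring-no-properties beg end)))
          (when (and (my-mid-sentence-capital-p beg end)
                     (y-or-n-p (format "Downcase `%s'? " word)))
            (downcase-region beg end)))))))
```

Answering n leaves a word untouched, which is how proper names inside a sentence can be kept. In a LaTeX file the regexp also stops at capitalized command names such as \Large, so those need an n as well.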

Thanks for answering, but I think this procedure has some negative effects, since the document contains a lot of names that occur within sentences and should stay capitalized. Nevertheless, since the document is in a git repo, I will use this with some manual corrections. — kindahero, Apr 11 '15 at 13:28

score 4 · Answer 2 · answered Apr 12 '15 at 00:37

If you use Icicles then you can do what you want pretty much directly.

You can search only for uppercase letters that do not start a sentence. And you can replace them by their lowercase equivalents -- either selectively or all at once. You do this using Icicles search.

Here are the building blocks:

You define a set of search contexts - using a regexp, for example. In this case, you define the search-context regexp so that it matches an uppercase letter that follows a sentence end.
But those are not the contexts that you want to search - you want the opposite. So you tell Icicles that from now on (until you toggle it again) you are searching the contexts that are defined not by matches of that regexp but by its non-matches. That is, you search the complement of the zones defined by the regexp matches.

You can do this using C-c ` C-M-~ C-g. The C-c ` invokes Icicles search. The C-M-~ tells it to toggle complementing, and the C-g exits. (All this accomplishes is to toggle the default value of variable icicle-search-complement-domain-p, turning it on - you can do that using setq, if you prefer. This is a one-time thing - it stays on until you toggle it again, with C-M-~.)
You tell Icicles that when you ask for an on-demand replacement you want to replace only the part that matches your current minibuffer input, not the whole search context (actually its complement, in this case).

You do this explicitly once, because replacing the whole search candidate is the default behavior. You can do it during Icicles search, using M-_, or you can just set option icicle-search-replace-whole-candidate-flag to nil.
You tell Icicles that, when you ask to replace a match, the replacement is computed using function downcase. That is, you will replace a given match by its lowercase version.

Just as for vanilla Emacs query-replace, replacement can use a literal string or it can use special constructs such as \&, \1, \,(something). But unlike query-replace you can also specify a function that computes the replacement.

You can do this by just requesting to replace, using C-S-RET. When you first try to replace, Icicles prompts you for the replacement to use. Usually the replacement is a string, just as for query-replace-regexp (the same string constructs are supported).

And in this less common case you want to instead tell Icicles to compute the replacement from the match using function downcase. To be prompted for a replacement function and not for a replacement string, you need to use C-u: C-u C-S-RET. You can change the replacement anytime, using M-, (or C-u M-,, for a function) - you are prompted similarly. (You can also use a lambda form as the replacement function.)

All of the above is essentially just setup! You've told Icicles what kind of search-and-replace you want to perform. (You could encapsulate all of that in a specialized icicle-search command - just bind icicle-search-complement-domain-p to t, icicle-search-replace-whole-candidate-flag to nil etc.)

Now you are ready to search and replace.

C-c ` invokes command icicle-search. It prompts you for a context-defining regexp and which subgroup to use to define the context. In this case, you enter (a) a regexp that matches the end of a sentence followed by an uppercase letter, and (b) the number of the subgroup that matches only the uppercase-letter part.

The regexp you want is the value of (sentence-end) followed by \([A-Z]\). In my Emacs this is the resulting giant regexp, and it is the 3rd subgroup that matches the letter:
```
\([.?!][]"'”)}]*\($\|[  ]$\|    \|[  ][  ]\)\|[。．？！]+\)[
]*\([A-Z]\)
```
A simpler version of such a regexp would be just this: ., ?, or !, followed by two spaces, followed by an uppercase letter. (In this case you would use subgroup 2, not 3, to match the uppercase letter.)
```
\([.?!][ ][ ]\)\([A-Z]\)
```
You can set a Lisp variable to the regexp string, and then use the variable to insert the regexp in the minibuffer when prompted for the regexp (notice the doubling of backslashes for a Lisp string):
```
(setq foo "\\([.?!][ ][ ]\\)\\([A-Z]\\)") ; With this, use subgroup 2.
```
In Icicles, whenever you use C-u C-= in the minibuffer you are prompted for a string-valued variable whose value you want to insert. In this case you enter foo and the regexp is inserted as if you had typed it. (If you put the regexp in variable icicle-input-string then you do not need to use C-u, and you are not prompted. You can also insert a string from a register, using C-x r i.)
After you enter the regexp and the subgroup that define the search contexts, you type text to search for within the contexts. In this case, you type [A-Z], meaning that you want to match uppercase letters within the search contexts. (You don't hit RET - that would exit completion and Icicles search. You can dynamically change the text you want to match.)

The search contexts act as completion candidates - all search hits. In this case, they are the zones of text in between the zones of uppercase letters that follow a sentence end. The current set of candidates is filtered by whatever input pattern you have typed in the minibuffer. Use S-TAB to show the candidates in buffer *Completions*.

The parts of each search candidate that you are interested in (because you will replace them) are the pieces that match your minibuffer input - the uppercase letters that do not begin a sentence.
You can cycle among all search hits, visiting them in turn, or you can navigate directly to any of them. (You can even sort them.) Use C-down to cycle.
You can replace individual matches of your minibuffer input ([A-Z]), or you can replace all of them, i.e., downcase them all at once.

To replace them all, just hit M-|. (You can hit C-g to exit searching.)
To replace an individual match, or several in sequence:

When you are at any given search hit (again: a zone of text that does not include an uppercase letter at the beginning of a sentence), you can tell Icicles to replace the parts of the hit that match your minibuffer input, which is [A-Z].

You do this using C-S-RET. Just repeat it to change each match within the search candidate - and on to the next candidate. That is, you can keep hitting C-S-RET to continue replacing matches within the next candidate, and the next. (IOW, you do not need to hit C-down to go to the next candidate - C-S-RET does that after you replace the last match in the previous candidate.)

Whew! That's a lot to write and read. Here is probably a better explanation. Hope it helps.

Drew · Answer 3 · 2015-04-12T19:24:43.863

Here is another way to search for uppercase letters that do not start a sentence, and replace them by lowercase on demand (selective occurrences or all).

To use this method you need libraries isearch-prop.el and isearch+.el - see Isearch+:

Library isearch-prop.el lets you add properties to zones of your buffer. And it lets you search only such propertied zones. Or it lets you search the complement: the zones that do NOT have a given property.

In this case, you will search the zones that do not have uppercase letters at the beginning of a sentence. You will search the zones for an uppercase letter.
Library isearch+.el lets you replace the current search hit (or all subsequent search hits) on demand.

Here we go:

1. M-x isearchp-regexp-context-regexp-search

   You are prompted for a regexp that defines the zones to search.

   You enter: \([.?!][ ][ ]\)\([A-Z]\), which means end of sentence followed by an uppercase letter.

   You are prompted for the subgroup that corresponds to the part of the regexp you want to match. You enter: 2, because \([A-Z]\) is the second subgroup, and its matches are what you want.

   You are prompted for a predicate, to further filter the set of zones. Just hit RET without typing any predicate name - no predicate needed.
2. C-M-~, which means search not the zones defined by that regexp, but the rest of the text (i.e., the complement).
3. Type . to search the zones (for any char besides newline). (You could type [A-Z], to search for an uppercase letter, but the zones as defined contain only uppercase letters.)
4. C-u C-M-RET (isearchp-replace-on-demand), to replace the current search hit. The plain prefix arg (C-u) means replace (only) the current hit. You can also use:
   - A negative prefix arg (e.g. C--), to make search keys (e.g. C-s) replace search hits you visit, from now on. They thus act the same as C-M-RET.
   - A positive prefix arg N, to replace N search hits.
   - A zero prefix arg (e.g. C-0), to replace all remaining search hits.
   - No prefix arg, to just delete the current hit. You are not prompted for a replacement.

   You are prompted for the replacement text to use. Just like query-replace, you can use \, to define the replacement. In this case, you enter \,(downcase \0). The \0 means that you want to act on the whole search hit, which in this case is a single uppercase character. The replacement will be the downcased character.
5. C-s to continue regexp searching, i.e., move to the next hit. If you want to replace it, hit C-M-RET; if not, use C-s to move on to the next hit.
6. Repeat #5. You can also use C-0 C-M-RET to just replace all remaining search hits at once.
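The idea of searching the complement of the regexp-defined zones can also be illustrated in plain Emacs Lisp, without either library. This is only a sketch of the idea and not how isearch-prop.el implements it; the function name my-non-sentence-start-capitals is hypothetical.

```elisp
(defun my-non-sentence-start-capitals (context-regexp group)
  "Return positions of uppercase letters outside the CONTEXT-REGEXP zones.
GROUP is the subgroup of CONTEXT-REGEXP that matches the uppercase letter
that starts a sentence, i.e. the letter to leave alone."
  (let ((case-fold-search nil)           ; make [A-Z] match uppercase only
        (excluded '())
        (hits '()))
    (save-excursion
      ;; Pass 1: record the zones defined by the regexp.
      (goto-char (point-min))
      (while (re-search-forward context-regexp nil t)
        (push (match-beginning group) excluded))
      ;; Pass 2: search the complement of those zones for uppercase letters.
      (goto-char (point-min))
      (while (re-search-forward "[A-Z]" nil t)
        (unless (member (match-beginning 0) excluded)
          (push (match-beginning 0) hits))))
    (nreverse hits)))
```

Evaluating (my-non-sentence-start-capitals "\\([.?!][ ][ ]\\)\\([A-Z]\\)" 2) returns the positions that the steps above would visit. The first capital in the buffer is reported too, because this simple regexp recognizes a sentence start only when it follows a sentence end.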

In case you are wondering what this has to do with properties:

isearch-prop.el lets you work with any text or overlay properties. But for the particular command used here, isearchp-regexp-context-regexp-search, the text property used is the Lisp symbol whose name is the regexp. The property value is a cons whose car is the regexp (a string) and whose cdr is the predicate, which together define the zones. This way, you can have various sets of zones corresponding to different regexps and predicates.
