How can I normalize sentence-endings?

Question

Say I have the following text:

A. U. Thor writes:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam est
elit, pulvinar sit amet erat nec, finibus sagittis ex. Mauris
dictum elementum est in fringilla? Duis vel dui elit! Praesent
tincidunt velit vel ipsum facilisis fringilla! Donec tincidunt
pulvinar nulla eu convallis. Sed venenatis non risus nec
tincidunt: nam volutpat mauris elementum tortor posuere
tempus. Nam varius aliquam rhoncus. Praesent commodo odio sed
tristique volutpat. Fusce eu mauris consequat, porta nulla ut,
vestibulum neque? Mauris tincidunt elit quis convallis imperdiet.

Nulla ac porta lectus. Nullam faucibus eu orci in porttitor? Sed
imperdiet cursus venenatis. Nullam eu est nisi: cras accumsan
rutrum sapien, sed maximus nisl pulvinar id. Nulla sollicitudin
sollicitudin neque, in rutrum tortor malesuada quis. In interdum
nec lacus id cursus. Pellentesque condimentum tortor at nisi
fermentum, ac aliquam nunc laoreet. Aenean faucibus orci nec
vehicula facilisis. Donec at metus enim? Nam vel sapien
tortor? Aenean non nibh congue, lacinia tortor posuere,
consectetur mi: integer quis eleifend lectus. Suspendisse vitae
ligula vel ante rhoncus ullamcorper non at urna. Praesent
sollicitudin ut purus non pellentesque.

Sed convallis metus ut ipsum ultricies bibendum? Phasellus vitae
mattis magna. Ut vel ex purus. Phasellus pellentesque diam vel
erat dapibus hendrerit. Ut pretium erat sed sapien tempus, eu
auctor lorem facilisis. Sed eu malesuada turpis. Quisque ac leo
mollis, molestie purus convallis, convallis sem? Sed tortor diam,
efficitur sit amet magna nec, volutpat mattis tortor? Suspendisse
varius neque non mattis pretium? Vivamus ac rhoncus ipsum? Nunc
pellentesque lobortis purus. Proin interdum venenatis augue at
venenatis. Integer in hendrerit urna, vitae commodo
massa. Phasellus nec feugiat ex, sit amet posuere elit.

In hac habitasse platea dictumst. Morbi id velit neque. Nulla
facilisi: sed et enim ut velit accumsan facilisis. Praesent
rhoncus vel sem vitae ullamcorper? Sed tincidunt tincidunt
nunc. Cras egestas urna turpis, at malesuada risus elementum eu.

Curabitur ultricies, erat ut consectetur malesuada, massa ante
iaculis lectus, vitae convallis nibh elit eget dui: sed malesuada
dui eu ipsum luctus tincidunt at ac sapien. Donec lorem ligula,
porttitor non sem vitae, auctor suscipit elit. Sed iaculis ex id
fringilla auctor? Aliquam quis dignissim leo. Sed tincidunt massa
vel justo vehicula vehicula gravida nec metus. Integer euismod est
vitae hendrerit vestibulum? Vivamus condimentum ornare
felis. Aliquam fermentum a diam non pulvinar?

Is there an easy way to double all the sentence-ending spaces? I can't think of a way to detect the length of the 'sentence' it terminates before doubling the space, let alone ignore common prefixes like Mr, Ms, and Mrs.

How about using something like the regexp in the variable `sentence-end` to search and add spaces? -- e.g., `[^.].[.?!]+\\([]\"')}]*\\|<[^>]+>\\)\\($\\| $\\|\t\\| \\)[ \t\n]*` I'd need to do some further testing to see what modes set that variable, and some further testing to try searching and replacing spaces -- i.e., one space with two spaces . . . But that is the general idea -- using something like `replace-regexp . . .` — lawlist, Jan 09 '15 at 05:00
@lawlist I've tried approaches like this before, but I've never been able to overcome the Mr/Mrs issue with a pure regexp; I think some elisp is required. — Sean Allred, Jan 09 '15 at 05:02
Here is a rough *rough* outline to get you started: `(save-excursion (goto-char (point-min)) (while (re-search-forward "[^.].[.?!]+\\([]\"')}]*\\|<[^>]+>\\)\\($\\| $\\|\t\\| \\)[ \t\n]*" nil t) (when (not (looking-back "mr. \\|mrs. ")) (replace-regexp "[^.].[.?!]+\\([]\"')}]*\\|<[^>]+>\\)\\($\\| $\\|\t\\| \\)[ \t\n]*" " " nil (match-beginning 0) (match-end 0)))))` Once you get the bugs ironed out, you can set the exceptions to a variable -- e.g., Mr., Mrs., etc. You might need a `save-match-data` or something like that, or perhaps not? I'm just brainstorming . . . — lawlist, Jan 09 '15 at 05:25
Here is a slightly revised *rough* draft (removing the visual highlights and adding a `.`) -- you'll need to deal with a period v. an exclamation point v. a question mark [perhaps just changing the regexp or beginning replacement point]: `(save-excursion (goto-char (point-min)) (while (re-search-forward "[^.].[.?!]+\\([]\"')}]*\\|<[^>]+>\\)\\($\\| $\\|\t\\| \\)[ \t\n]*" nil t) (when (not (looking-back "mr. \\|Mrs. ")) (let ((query-replace-lazy-highlight nil)) (replace-regexp "[^.].[.?!]+\\([]\"')}]*\\|<[^>]+>\\)\\($\\| $\\|\t\\| \\)[ \t\n]*" ". " nil (match-beginning 0) (match-end 0))))))` — lawlist, Jan 09 '15 at 05:38
Even state of the art natural language parsers can't do it with 100% accuracy, so whatever the system you come up with, there will be some errors you will need to fix after you process the text automatically. — wvxvw, Jan 09 '15 at 07:08
@rekado Oh. My. Goodness. *But* it does not recognize titles (e.g. `Dr.`, `Mrs.`, `Gov.`, …). — Sean Allred, Feb 24 '15 at 15:28
Well, it's interactive so you could skip over these instances easily. — , Feb 24 '15 at 16:09

score 3 · Answer 1 · answered Jan 11 '15 at 08:07

You didn't specify exactly what you want to do. My understanding is:

- search for one of `.`, `!` or `?` that is followed by a space;
  - if it is followed by at least two spaces, do nothing;
  - if it is preceded by "mrs", "mr", etc. or a word of length 1, do nothing;
  - otherwise, add a single space.

The following should do what you need:

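(defun normalise-sentence-endings ()
  "Put two spaces after each sentence end from point to end of buffer.
Skip punctuation that follows a one-letter word or a known abbreviation."
  (interactive)
  (save-excursion
    (while (search-forward-regexp "[.!?] " nil t)
      ;; Point is now just after the space; leave doubled spaces alone.
      (unless (looking-at " ")
        (let* ((previous-word-beginning
                (save-excursion (backward-word) (point)))
               ;; Text between the word start and the punctuation mark.
               (previous-word
                (downcase (buffer-substring
                           previous-word-beginning
                           (- (point) 2)))))
          (unless (or (= (length previous-word) 1)
                      (member previous-word '("m" "mr" "mrs" "dr" "etc")))
            (insert " ")))))))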

score 2 · Answer 2 · answered Jan 26 '15 at 10:42

There is a built-in command for repunctuating sentences: `repunctuate-sentences`. It lets you step through each occurrence of a punctuation character or change them all at once.

If you would rather not use this command, you could temporarily set `sentence-end-double-space` to nil and then use `forward-sentence` to move by sentences and insert a space. This may be better than relying on your own regular expressions, but it won't help you with common abbreviations or prefixes.
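To see why the variable matters here, the following sketch compares where `forward-sentence` stops under each setting. `my/first-sentence` is a hypothetical helper written only for this illustration, and it assumes the default sentence rules (no customized `sentence-end`).

```elisp
;; Hypothetical helper, for illustration only.
(defun my/first-sentence (text double-space)
  "Return the first sentence of TEXT according to `forward-sentence'.
DOUBLE-SPACE is the value bound to `sentence-end-double-space'."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((sentence-end-double-space double-space)
          ;; Ignore any customized `sentence-end' so the default rule applies.
          (sentence-end nil))
      (forward-sentence 1)
      (buffer-substring (point-min) (point)))))

;; With double spacing required, a period plus one space is not a sentence end.
(my/first-sentence "One. Two.  Three." t)   ; => "One. Two."
;; With the variable set to nil, every period plus space ends a sentence.
(my/first-sentence "One. Two.  Three." nil) ; => "One."
```

With the variable left at t, single-spaced sentence ends would be skipped entirely, which is why it has to be nil while fixing them.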

Here's the general idea:

(defun my/fix-sentence (arg)
  "Add a space at the end of the next ARG sentences."
  (interactive "p")
  (let ((sentence-end-double-space nil))
    (dotimes (_ arg)
      (forward-sentence 1)
      ;; TODO: check if this is some special case and not really a
      ;; true sentence ending.  Otherwise just add a space.
      (insert " "))))

You can run this with a prefix argument to repeat this as often as you want.
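One way the TODO could be filled in is with a small predicate that looks at the text just before point. This is only a sketch: `my/after-abbreviation-p` is a hypothetical name, and the abbreviations in the regexp are examples rather than a complete list.

```elisp
;; Hypothetical predicate; the abbreviation list is only an example.
(defun my/after-abbreviation-p ()
  "Return non-nil if point follows an initial or a known abbreviation.
Point is expected to be just after a period."
  (let ((case-fold-search t))
    ;; \\b anchors at a word boundary, so "items." does not match "ms.".
    (looking-back "\\b\\(?:[[:alpha:]]\\|mr\\|mrs\\|ms\\|dr\\|gov\\)\\."
                  (line-beginning-position))))
```

The insertion would then be wrapped as `(unless (my/after-abbreviation-p) (insert " "))`. Sentences that already end in two spaces would still need a separate check, for example with `looking-at`.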

Just saw this today, sorry. — Sean Allred, Feb 24 '15 at 15:29

score 0 · Answer 3 · answered Jan 26 '15 at 14:32

Below is the code I use. It is simple and covers 99% of the cases. The only thing it does not handle is abbreviations, such as "A. U. Thor" or "Mrs.".

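(defun my-two-spaces-per-sentence ()
  "Double the space after sentence-ending punctuation.
Operate on the active region, or on the whole buffer if there is none."
  (interactive)
  (let ((beg (if (region-active-p) (region-beginning) (point-min)))
        (end (if (region-active-p) (copy-marker (region-end)) (point-max-marker)))
        (case-fold-search nil))
    (save-excursion
      (goto-char beg)
      ;; Match punctuation, exactly one space, then a capitalized word.
      (while (re-search-forward "\\([.?!]\\) \\([A-Z][a-z]*\\)" end t)
        (replace-match "\\1  \\2" t)))))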
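One possible way to cover the abbreviation cases this answer leaves out is to capture the word before the punctuation and skip the match when that word is a single letter or appears in an exception list. The sketch below assumes that approach; `my/sentence-abbreviations` and `my/double-sentence-spaces` are hypothetical names, and the list contents are only an illustration.

```elisp
;; Hypothetical exception list; extend it to suit your text.
(defvar my/sentence-abbreviations '("mr" "mrs" "ms" "dr" "gov" "etc")
  "Lowercase words that do not end a sentence when followed by a period.")

(defun my/double-sentence-spaces (beg end)
  "Ensure two spaces after sentence-ending punctuation between BEG and END.
Skip a match when the word before the punctuation is a single letter
or a member of `my/sentence-abbreviations'."
  (interactive
   (if (use-region-p)
       (list (region-beginning) (region-end))
     (list (point-min) (point-max))))
  (let ((end (copy-marker end))   ; a marker, because the text grows
        (case-fold-search nil))   ; so [[:upper:]] matches capitals only
    (save-excursion
      (goto-char beg)
      ;; Group 1 is the word before the punctuation; group 2 is the
      ;; capital letter after exactly one space.
      (while (re-search-forward
              "\\([[:alpha:]]+\\)[.?!] \\([[:upper:]]\\)" end t)
        ;; Resume from the capital so the next word is seen whole.
        (goto-char (match-beginning 2))
        (let ((word (downcase (match-string 1))))
          (unless (or (= (length word) 1)
                      (member word my/sentence-abbreviations))
            (insert " ")))))))

;; Example: apply it to a string through a temporary buffer.
(with-temp-buffer
  (insert "A. U. Thor met Mr. Smith. He left. Why?")
  (my/double-sentence-spaces (point-min) (point-max))
  (buffer-string))
;; => "A. U. Thor met Mr. Smith.  He left.  Why?"
```

A sentence that really does end in an abbreviation (for example "etc.") is left untouched by this sketch, so some manual cleanup may still be needed.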
