
Are there any Emacs packages that can be used to analyse the frequency of the characters and words in the current buffer?

As a particular example, it would be interesting to have a mechanism for sorting the characters of a given buffer by frequency. For example, if the content of a buffer is `This is a test.`, the sorted frequencies (ignoring case and spaces) would be as below.

t 3
s 3
i 2
a 1
e 1
h 1
. 1
Malabarba
Name

1 Answer


I don't know of a built-in function for this, but here's something I put together while waiting for a compilation to finish:

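(require 'cl-lib) ; provides `cl-labels', `cl-case' and `cl-loop'
(require 'org)    ; provides `org-mode' and the `org-table-*' commands

(defun char-stats (&optional case-sensitive)
  "Tabulate how often each character occurs in the current buffer.
Letter case is ignored unless CASE-SENSITIVE (the prefix argument) is non-nil."
  (interactive "P")
  (let ((chars (make-char-table 'counting 0)))
    (cl-labels ((%collect-statistics
                 ()
                 ;; Walk the current buffer, incrementing each character's count.
                 (goto-char (point-min))
                 (while (not (eobp))
                   (let ((current (following-char)))
                     (set-char-table-range
                      chars current
                      (1+ (char-table-range chars current))))
                   (forward-char 1))))
      (if case-sensitive
          (save-excursion (%collect-statistics))
        ;; Fold case by counting an upcased copy, leaving the buffer untouched.
        (let ((contents (buffer-substring-no-properties
                         (point-min) (point-max))))
          (with-temp-buffer
            (insert contents)
            (upcase-region (point-min) (point-max))
            (%collect-statistics)))))
    (with-current-buffer (get-buffer-create "*character-statistics*")
      (erase-buffer)
      (insert "| character | occurrences |\n"
              "|-----------+-------------|\n")
      (map-char-table
       (lambda (key value)
         ;; KEY is either one character or a range (FROM . TO) of characters
         ;; that share VALUE, so expand ranges instead of dropping them.
         (unless (zerop value)
           (cl-loop for char
                    from (if (consp key) (car key) key)
                    to (if (consp key) (cdr key) key)
                    do (cl-case char
                         (?\n)          ; newlines are not reported
                         (?\| (insert (format "| \\vert | %d |\n" value)))
                         (otherwise
                          (insert (format "| '%c' | %d |\n" char value)))))))
       chars)
      (org-mode)
      ;; Align the table, then sort on the count column, highest first.
      (goto-char (point-min))
      (org-table-align)
      (forward-line 2)
      (when (org-at-table-p)
        (org-table-goto-column 2)
        (org-table-sort-lines nil ?N)))
    (pop-to-buffer "*character-statistics*")))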
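
The question also asks about word frequencies, which `char-stats` does not cover. The following is only a minimal sketch of one way to approach that, not a tested package: `my-word-frequencies` is a hypothetical name, and treating every match of the regexp `\w+` as a word is an assumption, so what counts as a word depends on the buffer's syntax table.

```elisp
(defun my-word-frequencies (&optional case-sensitive)
  "Return an alist of (WORD . COUNT) for the current buffer, most frequent first.
Words are folded to lower case unless CASE-SENSITIVE is non-nil.
Hypothetical helper: a word is assumed to be any match of \"\\\\w+\"."
  (let ((counts (make-hash-table :test 'equal))
        result)
    (save-excursion
      (goto-char (point-min))
      ;; Collect every run of word-constituent characters.
      (while (re-search-forward "\\w+" nil t)
        (let ((word (match-string-no-properties 0)))
          (unless case-sensitive
            (setq word (downcase word)))
          (puthash word (1+ (gethash word counts 0)) counts))))
    ;; Turn the hash table into an alist.
    (maphash (lambda (word count) (push (cons word count) result)) counts)
    ;; Sort by descending count; break ties alphabetically.
    (sort result (lambda (a b)
                   (or (> (cdr a) (cdr b))
                       (and (= (cdr a) (cdr b))
                            (string< (car a) (car b))))))))

;; Usage in a scratch buffer:
(with-temp-buffer
  (insert "This is a test. This is only a test.")
  (my-word-frequencies))
;; => (("a" . 2) ("is" . 2) ("test" . 2) ("this" . 2) ("only" . 1))
```

Returning an alist rather than filling a buffer keeps the sketch easy to reuse; the result could be inserted into an Org table in the same way `char-stats` does.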
wvxvw
  • Perfect. Very nice. Is it possible to modify your code so that the statistics do not distinguish between uppercase and lowercase? – Name Feb 08 '15 at 17:39
  • @Name That would be relatively simple for Latin script, but in general... I'll have to search for that--I wouldn't know of an easy way of doing that. – wvxvw Feb 08 '15 at 17:59
  • @Name actually, try the updated version, I think Emacs knows how to upcase national alphabets properly, so it should do the job. – wvxvw Feb 08 '15 at 18:14
  • `char-stats` is a very handy tool for text analysis and a nice piece of Lisp programming. – Name Feb 08 '15 at 19:51
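
The case folding discussed in the comments can be checked quickly with `upcase`. This is only an illustrative sketch: the sample strings are made up, and it covers just a few scripts, not every alphabet.

```elisp
;; Evaluate each form with C-x C-e (or in M-x ielm).
(upcase "test")   ; => "TEST"  (Latin)
(upcase "тест")   ; => "ТЕСТ"  (Cyrillic)
(upcase "τεστ")   ; => "ΤΕΣΤ"  (Greek)
(upcase "1.")     ; => "1."    (characters without case are left unchanged)
```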