3

SortWords in the EmacsWiki suggests this function as a way to sort words:

(defun sort-words (reverse beg end)
  "Sort words in region alphabetically, in REVERSE if negative.
  When prefixed with negative \\[universal-argument], sorts in
  reverse.

  The variable `sort-fold-case' determines whether alphabetic
  case affects the sort order.

  See also `sort-regexp-fields'."
  (interactive "*P\nr")
  (sort-regexp-fields reverse "\\w+" "\\&" beg end))

However, it does not behave sensibly on words with hyphens. Here is an example. Before:

a-magician pulled the-rabbit out-of the-hat

After using sort-words (above):

a-hat magician of-out pulled-rabbit the-the

This is certainly interesting (in the Chinese proverb kind of way). I suspect it has to do with Destructive Sorting. (I get a lot of mileage from pointing out how strange that is!)

What is a reasonable way to improve sort-words above?

Update: Would someone care to explain (perhaps in a comment or an answer) why the implementation above behaves strangely (at least to my eye) with the hyphens?

Drew
David J.

4 Answers

4

Solution

You can temporarily treat - and _ as word constituents by sorting with a modified copy of the syntax table:

(defun sort-words (reverse beg end)
  "Sort words in region alphabetically, in REVERSE if negative.
Prefixed with negative \\[universal-argument], sorts in reverse.

The variable `sort-fold-case' determines whether alphabetic case
affects the sort order. See `sort-regexp-fields'.

Temporarily consider - and _ characters as part of the word when sorting."
  (interactive "*P\nr")
  (let ((temp-table (copy-syntax-table text-mode-syntax-table)))
    (with-syntax-table temp-table
      (modify-syntax-entry ?- "w" temp-table)
      (modify-syntax-entry ?_ "w" temp-table)
      (sort-regexp-fields reverse "\\w+" "\\&" beg end))))
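
As a quick check that this approach leaves the shared table alone, here is a minimal sketch that compares the syntax class of - in the modified copy and in text-mode-syntax-table itself. The helper name my/hyphen-syntax-classes is made up for illustration and is not part of the answer's code.

```elisp
(defun my/hyphen-syntax-classes ()
  "Return the syntax class of `-' in a modified copy and in the original table.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (let ((temp-table (copy-syntax-table text-mode-syntax-table)))
      ;; Only the copy is modified; `text-mode-syntax-table' is not touched.
      (modify-syntax-entry ?- "w" temp-table)
      (list (with-syntax-table temp-table
              (string (char-syntax ?-)))
            (with-syntax-table text-mode-syntax-table
              (string (char-syntax ?-)))))))

(my/hyphen-syntax-classes)
;; => ("w" "_")
```

The first element is the word class from the copy; the second is the symbol-constituent class that text-mode-syntax-table still reports, so other commands and modes should see no change.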

Result

Before

a-magician pulled the-rabbit out-of the-hat even when Peter shouted don't

After

Peter a-magician don't even out-of pulled shouted the-hat the-rabbit when
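
Peter comes first because sort-fold-case is nil by default, so uppercase letters sort before lowercase ones. A rough sketch of the difference, using a hypothetical helper (my/sort-words-in-string is an assumed name, and it repeats the syntax-table setup from above so it can run on a plain string):

```elisp
(require 'sort)

(defun my/sort-words-in-string (text fold-case)
  "Sort the words of TEXT and return the result as a string.
FOLD-CASE is used as the value of `sort-fold-case'.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert text)
    (let ((sort-fold-case fold-case)
          (temp-table (copy-syntax-table text-mode-syntax-table)))
      ;; Same temporary word definition as in the answer.
      (modify-syntax-entry ?- "w" temp-table)
      (modify-syntax-entry ?_ "w" temp-table)
      (with-syntax-table temp-table
        (sort-regexp-fields nil "\\w+" "\\&" (point-min) (point-max))))
    (buffer-string)))

(my/sort-words-in-string
 "a-magician pulled the-rabbit out-of the-hat even when Peter shouted don't" t)
;; => "a-magician don't even out-of Peter pulled shouted the-hat the-rabbit when"
```

Passing nil as the second argument gives the same ordering as the After line above.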

Thanks @Jordon for the suggestion to use the text-mode-syntax-table as the base table.

Kaushal Modi
  • Does this mutate the syntax table? Will other functions or modes be affected? – David J. Jan 19 '15 at 19:51
  • @DavidJames It did mutate the table earlier; it's fixed now. – Kaushal Modi Jan 19 '15 at 20:05
  • Depending on the use case, it may be wise to use a copy of text-mode-syntax-table instead of a brand new one. If you are sorting English words, this will cover other edge cases such as apostrophes in contractions like "don't". – Jordon Biondo Jan 19 '15 at 20:29
  • @JordonBiondo That's an awesome point, fixing it. I still needed to define `-` and `_` as words. – Kaushal Modi Jan 19 '15 at 20:34
3

Sort using symbol boundaries

Here's a solution that came to me after reading OP's answer about what defines a "word":

(defun sort-symbols (reverse beg end)
  (interactive "*P\nr")
  (sort-regexp-fields reverse "\\_<.*?\\_>" "\\&" beg end))

This version uses the symbol-boundary atoms (\_< and \_>). I think this solution makes the most sense, since you don't really want to sort words, you want to sort Lisp symbols. Note I changed the name of the function.
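
To try the symbol regexp without touching a real buffer, something like the following sketch should work. The helper my/sort-symbols-in-string is an assumed name, and the result depends on the syntax table in effect: a temporary buffer uses the standard table, where - is a symbol constituent.

```elisp
(require 'sort)

(defun my/sort-symbols-in-string (text)
  "Sort the symbols in TEXT and return the result as a string.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert text)
    ;; Same regexp as `sort-symbols' above, applied to the whole buffer.
    (sort-regexp-fields nil "\\_<.*?\\_>" "\\&" (point-min) (point-max))
    (buffer-string)))

(my/sort-symbols-in-string "a-magician pulled the-rabbit out-of the-hat")
;; => "a-magician out-of pulled the-hat the-rabbit"
```

In a mode whose syntax table treats - as punctuation, the symbol boundaries would fall at the hyphens and the result would differ.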

nanny
  • Where does one find documentation for symbol-boundary atoms? – David J. Jan 19 '15 at 20:31
  • [Emacs manual, section 15.7](http://www.gnu.org/software/emacs/manual/html_node/emacs/Regexp-Backslash.html#Regexp-Backslash). There's a concise table on the Emacs Wiki: [Regular Expression](http://www.emacswiki.org/emacs/RegularExpression). – nanny Jan 19 '15 at 20:34
1

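The behavior comes from - not being a word constituent, so "\\w+" does not match it. In a-magician, a and magician are treated as separate words, and likewise the and rabbit in the-rabbit. The sorted words are then written back into those word slots in order, while the hyphens and spaces stay where they were.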

Just use this instead:

(defun sort-words (reverse beg end)
  "Sort words in region alphabetically, in REVERSE if negative.
  When prefixed with negative \\[universal-argument], sorts in
  reverse.

  The variable `sort-fold-case' determines whether alphabetic
  case affects the sort order.

  See also `sort-regexp-fields'."
  (interactive "*P\nr")
  (sort-regexp-fields reverse "[a-zA-Z0-9-]+" "\\&" beg end))

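
To check the explanation at the top of this answer, one can list what "\\w+" actually matches in the example. This is only a sketch: my/regexp-matches is a made-up helper, and it runs in a temporary buffer with the standard syntax table, where - is not a word constituent.

```elisp
(defun my/regexp-matches (regexp text)
  "Return a list of all matches of REGEXP in TEXT, in order.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let (matches)
      ;; Collect each match in turn until the search fails.
      (while (re-search-forward regexp nil t)
        (push (match-string 0) matches))
      (nreverse matches))))

(my/regexp-matches "\\w+" "a-magician pulled the-rabbit out-of the-hat")
;; => ("a" "magician" "pulled" "the" "rabbit" "out" "of" "the" "hat")
```

Sorting those nine pieces and putting them back into the nine slots, with the hyphens left in place, yields a-hat magician of-out pulled-rabbit the-the.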
abo-abo
  • This works, but it seems to me that it 'fights' (or put another way, disregards) the mode-specific definition of a word. – David J. Jan 19 '15 at 19:53
  • My answer is almost the same as your answer, unless you're coding in APL, or French. – abo-abo Jan 19 '15 at 19:59
  • @DavidJames When you use `M-f` and `M-b` to move the cursor, does it ignore hyphens and continue on, or does it stop at the hyphenated parts? – nanny Jan 19 '15 at 19:59
  • @nanny I'm glad you mention that; I added some commentary about how words should be handled in my answer. – David J. Jan 19 '15 at 20:03
1

It seemed simple and sensible to just put - (and _ for good measure) in a bracket expression alongside the [:word:] character class:

(defun sort-words (reverse beg end)
  (interactive "*P\nr")
  (sort-regexp-fields reverse "[-_[:word:]]+" "\\&" beg end))

Now, this:

a-magician pulled the-rabbit out-of the-hat

becomes this:

a-magician out-of pulled the-hat the-rabbit

Commentary (related but independent of the answer above): The more I think about this, the more it comes down to defining a word. I'm beginning to think that the notion of a 'word' is not universal, even for a specific mode. Here are my thoughts. In a Lisp mode, I can think of two different ways to define a word, based on three possible use cases:

  1. For display, there are certain conventions in play. In Lisp mode, symbols can contain hyphens.
  2. For navigation (e.g. moving the cursor around), it may be reasonable to split words by "-".
  3. For syntax manipulation (e.g. paredit mode), words should not be split by "-". (If paredit did it this way, it would give some unhappy results.)

My tentative conclusion: Defining a word using one syntax table per mode is a flawed design. Instead, the definition of a word should be able to vary based on the use case. In the example above, two word definitions would suffice, but I would not expect two to be enough in the general case.

Update: Thanks to the other answers, I see that Emacs also has syntax awareness of symbols. In the Lisp case (and maybe in general as well) having words and symbols is a sufficient syntax distinction to handle the use cases I mention above.
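
The word/symbol distinction can be seen directly with the Emacs Lisp syntax table. The following is a minimal sketch, not a definitive test: my/lisp-words-and-symbols is an assumed name, "\\sw+" matches runs of word constituents, and "\\(?:\\sw\\|\\s_\\)+" matches runs of word or symbol constituents.

```elisp
(defun my/lisp-words-and-symbols (text)
  "Return a list (WORDS SYMBOLS) found in TEXT under the Emacs Lisp syntax table.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert text)
    (with-syntax-table emacs-lisp-mode-syntax-table
      (let (result)
        ;; First pass collects words, second pass collects symbols.
        (dolist (regexp '("\\sw+" "\\(?:\\sw\\|\\s_\\)+") (nreverse result))
          (goto-char (point-min))
          (let (matches)
            (while (re-search-forward regexp nil t)
              (push (match-string 0) matches))
            (push (nreverse matches) result)))))))

(my/lisp-words-and-symbols "(sort-regexp-fields reverse)")
;; => (("sort" "regexp" "fields" "reverse") ("sort-regexp-fields" "reverse"))
```

Word-based navigation works on the first list, while symbol-aware commands work on the second.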

David J.