Sorting lines based on numbers in unicode

Question

I want to sort some text in Emacs based on a field that contains verse numbers written in Unicode (Devanagari) digits. The text looks like this:

Verse text bla १०.३ #10.3 
Verse text blah This is १.१९  #1.19 
Verse text ble १०.१३ #10.13 
Verse text bleh ६.२७ #6.27 
Verse text blu १९.२  #19.2 
Verse text bluh ४.७ #4.7

I've added the corresponding Arabic numerals after # at the end of each line (these do not appear in the original text). I've been able to do this with Python: first, I wrote a function get_num() that converts the Devanagari digits into Arabic numerals; then I used sorted() with a custom key function.

Is it possible to achieve this level of customized sorting with an Elisp function? I looked at sort-regexp-fields and sort-fields but haven't understood whether they are as customizable as Python's sorted(). Below is the Python code for reference:

In [87]: inp
Out[87]: 
['Verse text bla १०.३ #10.3 ',
 'Verse text blah This is १.१९  #1.19 ',
 'Verse text ble १०.१३ #10.13 ',
 'Verse text bleh ६.२७ #6.27 ',
 'Verse text blu १९.२  #19.2 ',
 'Verse text bluh ४.७ #4.7 ']

In [88]: myre = re.compile(r'([०१२३४५६७८९]+\.[०१२३४५६७८९]+)')

In [90]: def get_num(inp):
    ...:     parts = inp.split('.')
    ...:     p1 = ''.join([str(ord(x) - 2406) for x in parts[0]])
    ...:     p2 = ''.join([str(ord(x) - 2406) for x in parts[1]])
    ...:     return '{}.{}'.format(p1, p2)

In [91]: sorted(inp,  key=lambda x: [int(i) for i in get_num(myre.search(x).group()).rstrip(".").split('.')])
Out[91]: 
['Verse text blah This is १.१९  #1.19 ',
 'Verse text bluh ४.७ #4.7 ',
 'Verse text bleh ६.२७ #6.27 ',
 'Verse text bla १०.३ #10.3 ',
 'Verse text ble १०.१३ #10.13 ',
 'Verse text blu १९.२  #19.2 ']

dalanicolai · Answer 1 · 2021-09-22T08:12:08.407

EDIT

To also sort a string with the pattern you gave in the comments (numbers alternating with a variable number of words), you can use the following function to split the text, and pass its result to the code in Tobias's answer:

(defun split-string-on-devanagari ()
  (interactive)
  (let (substrings
        (start (goto-char (point-min))))
    (while (search-forward-regexp "\\([०१२३४५६७८९]+\\)\\(?:\\.\\([०१२३४५६७८९]+\\)\\)?" nil t)
      (push (buffer-substring-no-properties start (match-end 0)) substrings)
      (unless (eobp)
        (forward-char)
        (setq start (point))))
    (nreverse substrings)))
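
For comparison with the Python code in the question, the same splitting can be sketched in Python. This is only an illustration: split_on_devanagari is a hypothetical name, and the sketch assumes, like the Elisp function above, that each chunk ends at a verse number and that the single separator character after the number is dropped.

```python
import re

# Same pattern as in the Elisp function: digits, optionally followed by ".digits".
# The Devanagari digits occupy the contiguous range U+0966..U+096F.
VERSE_RE = re.compile(r'[०-९]+(?:\.[०-९]+)?')

def split_on_devanagari(text):
    """Split TEXT into chunks that each end with a Devanagari verse number."""
    chunks = []
    start = 0
    for m in VERSE_RE.finditer(text):
        chunks.append(text[start:m.end()])
        # Skip one separator character after the number, as forward-char does.
        start = m.end() + 1
    return chunks

print(split_on_devanagari('hello ३.१ world १.१२ again १.९'))
```

As in the Elisp version, any text after the last number is not returned.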

END EDIT

Well, there are many ways to do this. From your Python code (and because the Latin numbers do not appear in the original text), I infer that the Devanagari numbers are what should be used for sorting.

One way to achieve this is to first replace the Devanagari digits with Latin digits, then use sort-numeric-fields on the last field (i.e. with a negative field number), and finally replace the Latin digits with Devanagari digits again. You can achieve that with the following code:

(require 'cl-lib)

(defun replace-all (from to)
  (goto-char (point-min))
  (while (search-forward from nil t)
    (replace-match to)))

(defun sort-lines-by-devanagari-nums ()
  (interactive)
  ;; create number pairs (uses cl-lib)
  (let ((num-pairs (cl-mapcar (lambda (x y) (cons x y))
                              (split-string "0123456789" "" t)
                              ;; create list of devanagari number strings
                              (mapcar 'char-to-string (number-sequence 2406 2415)))))
    ;; replace
    (dolist (x num-pairs)
      (replace-all (cdr x) (car x)))
    ;; sort
    (sort-numeric-fields -1 (point-min) (point-max))
    ;; replace
    (dolist (x num-pairs)
      (replace-all (car x) (cdr x)))))

After evaluating the above code, run M-x sort-lines-by-devanagari-nums to sort the text in your original buffer (the one without the Latin numbers). If the Latin numbers are present, change -1 to -2, although the Latin numbers would then also be replaced by Devanagari numbers.
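
The replace, sort, replace-back idea can also be sketched in Python, the language of the question. This is a minimal sketch under stated assumptions: the lines do not carry the trailing # annotations, the number is the last whitespace-separated field, and chapter and verse are compared as two integers (as in the question's code; this is not a claim about how sort-numeric-fields compares fields). The Python function name and the two table names are hypothetical.

```python
# Translation tables between Devanagari digits (U+0966..U+096F) and ASCII digits.
TO_ASCII = {0x0966 + i: ord('0') + i for i in range(10)}
TO_DEVANAGARI = {v: k for k, v in TO_ASCII.items()}

def sort_lines_by_devanagari_nums(lines):
    """Sort LINES by the "chapter.verse" number in their last field."""
    def key(line):
        chapter, verse = line.split()[-1].split('.')
        return int(chapter), int(verse)

    ascii_lines = [line.translate(TO_ASCII) for line in lines]       # replace
    ascii_lines.sort(key=key)                                        # sort
    return [line.translate(TO_DEVANAGARI) for line in ascii_lines]   # replace back

lines = ['Verse text bla १०.३',
         'Verse text blah This is १.१९',
         'Verse text ble १०.१३',
         'Verse text bleh ६.२७']
for line in sort_lines_by_devanagari_nums(lines):
    print(line)
```

The same caveat applies as for the Elisp code: any Latin digits that were already in the text are turned into Devanagari digits by the final replacement.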

For an alternative approach, that would be more similar to your given python example, you could hack something using e.g. seq-sort-by (see very basic example here).

Were you able to test this against the text I posted? I'm using emacs with evil mode. After evaluating both the functions, M-x `sort-lines-by-devanagari-nums` doesn't work for me on the selected text. — linuxfan, Sep 21 '21 at 18:24
Ah, sorry. Somehow, I exchanged the `cdr` and the `car` in the last line when posting the answer here, which is sloppy :| So I have corrected it now (but also note my comment after the code block). — dalanicolai, Sep 21 '21 at 23:44
I ran the function against a block like below: `hello ३.१ world १.१२ again १.९` It simply changes the numbers but performs no sorting. Am I missing something? `hello 3.1 world 1.12 again 1.9` — linuxfan, Sep 22 '21 at 00:04
Okay, so now you would also like to sort different fields on a single line? My solution works on the example you gave originally, which suggested a different pattern (with or without the latin number). Also your python answer suggests a different pattern because it strongly suggests that you split on the newline character, so that the pattern you are trying now would not work with your own python example. Your new pattern suggests that you would also like to split "single" words alternated with number, but I guess you would want to sort also strings of arbitrary lengths. Am I right? — dalanicolai, Sep 22 '21 at 06:45
Or do you mean that the answer also does not work when you put the fields on different lines? Because here it does work in that case. — dalanicolai, Sep 22 '21 at 06:55
Hi @dalancolai, my formatting of the sample text in the comment was off. I used multiple-lines when I tested your code. Thanks for the help :) — linuxfan, Sep 22 '21 at 12:26

Tobias · Accepted Answer · 2021-09-22T08:27:30.597

Use seq-sort-by instead of sort-regexp-fields. With it, you can specify a function that extracts the key from each string, and a comparison function.
In your case string< fits as a lexicographic comparison function if each number is encoded as one character of the key string.
That works as long as the numbers don't exceed (max-char), which is 4194303 in my case.

(defun devanagari-to-num (devanagari)
  "Convert a devanagari-string encoded number into a number.
Return 0 if DEVANAGARI is not a devanagari-string encoded number."
  (condition-case nil
      (string-to-number (apply #'string (seq-map (lambda (x) (+ x (- ?0 ?०))) devanagari)))
    (error 0)))
;; Test: (devanagari-to-num "१०")

(seq-sort-by
 (lambda (s)
   (if (string-match "\\([०१२३४५६७८९]+\\)\\(?:\\.\\([०१२३४५६७८९]+\\)\\)?" s)
       (string (devanagari-to-num (match-string 1 s))
           (devanagari-to-num (or (match-string 2 s) "")))
     ""))
 #'string<
 (split-string "Verse text bla १०.३ #10.3
Verse text blah This is १.१९  #1.19
Verse text ble १०.१३ #10.13
Verse text bleh ६.२७ #6.27
Verse text blu १९.२  #19.2
Verse text bluh ४.७ #4.7 " "\n"))

If the numbers exceed (max-char) in your case, you need to replace #'string< and the call to string with other appropriate functions.
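
To illustrate the two kinds of key in the language of the question: a key that packs chapter and verse into a two-character string compares by character code, so it only works while each number is a valid character code, whereas a key made of a pair of integers has no such limit. The following is a hedged sketch, not a translation of the Elisp above; the names verse_numbers, packed_key and tuple_key are hypothetical, and in Python the character limit is sys.maxunicode rather than Emacs's (max-char).

```python
import re

# Chapter digits, optionally followed by "." and verse digits (U+0966..U+096F).
VERSE_RE = re.compile(r'([०-९]+)(?:\.([०-९]+))?')

def verse_numbers(line):
    """Return (chapter, verse) as integers, or () if LINE has no number."""
    m = VERSE_RE.search(line)
    if not m:
        return ()
    # int() accepts Unicode decimal digits, so no manual conversion is needed.
    return (int(m.group(1)), int(m.group(2) or '0'))

def packed_key(line):
    """One character per number; chr() raises ValueError above sys.maxunicode."""
    return ''.join(chr(n) for n in verse_numbers(line))

def tuple_key(line):
    """Compare the integers directly; no limit on their size."""
    return verse_numbers(line)

lines = ['Verse text bla १०.३ #10.3',
         'Verse text blah This is १.१९  #1.19',
         'Verse text ble १०.१३ #10.13',
         'Verse text bleh ६.२७ #6.27',
         'Verse text blu १९.२  #19.2',
         'Verse text bluh ४.७ #4.7']
for line in sorted(lines, key=tuple_key):
    print(line)
```

For the sample lines both keys give the same order; only tuple_key keeps working when a chapter or verse number is larger than the largest character code.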

I think this solution is slightly more elegant than mine, but, as suggested in my answer, you can use `seq-sort-by` for it. No need to import cl-lib. — dalanicolai, Sep 22 '21 at 07:04
@dalanicolai Thank you for the hint. `seq.el` is newer than `cl-lib.el`. Even if I know some of the `seq.el` functions, I didn't know `seq-sort-by` yet. I changed `cl-sort` by `seq-sort-by` in my answer. — Tobias, Sep 22 '21 at 08:29
