Let's say I have a directory with the files _b, a, c, č, d. I would like to sort the files according to the cs_CZ.UTF8 locale but without ignoring the underscore, i.e. like this: _b a c č d.

Currently, ls (and ls | sort as well) sorts the files like this: a _b c č d. All answers I have found suggest using LC_COLLATE=C, but that changes the ordering to this: _b a c d č (notice that č is now at the end, not between c and d as it is supposed to be).

Is there any way to achieve this goal?

Note that I also care about characters other than the underscore, i.e. I would like a-n.pdf a-p.pdf a.pdf c č d to sort in this order and not as a-n.pdf a.pdf a-p.pdf c č d. (EDIT: Actually, a.pdf a-n.pdf a-p.pdf c č d is also fine, as long as the non-alphanumeric characters are not ignored.)

The following are not the answers I am looking for:

  • using LC_COLLATE=C as explained above,
  • using shell expansion such as ls _*; ls [^_]* because the question is not only about underscores.
  • That's an interesting challenge; it might be helpful to either copy, or move your sample data / expected results into a table to make it a little easier to understand how hyphens and underscores should be handled. The help page https://unix.stackexchange.com/editing-help#tables shows the markdown for tables. – Mark Stewart Oct 13 '21 at 22:46
  • When you edit your question, you will probably want to include the [locale] tag also. Have you looked at https://unix.stackexchange.com/q/421908/90290 and https://unix.stackexchange.com/q/39827/90290 yet? – Mark Stewart Oct 13 '21 at 23:37
  • @MarkStewart I don't care exactly how non-alphanumeric characters are ordered as long as they are not ignored completely. Thus a.pdf a-n.pdf a-p.pdf is OK, as is a-n.pdf a-p.pdf a.pdf, but not a-n.pdf a.pdf a-p.pdf as that clearly ignores the hyphens/periods. I have tried to clarify this in the question. – Nikola Benes Oct 14 '21 at 13:58
  • @MarkStewart Thanks for the pointers, I understand there is no easy solution and I would have to learn how to write my own LC_COLLATE rules. I might do that once I have the free time… It seems strange, however, that no one's ever thought about this before (even though this is how collation works e.g. in Windows Explorer). – Nikola Benes Oct 14 '21 at 14:09
  • Also, this might be a case of the Mandela effect, but I think I remember the sorting in ls working like this in some Linux distros some 10–15 years ago. – Nikola Benes Oct 14 '21 at 14:11
  • It's funny you mention Windows Explorer; it does sort filenames quite well, even version1.txt, version2.txt, ... version10.txt does sort "correctly" with version10.txt at the end. – Mark Stewart Oct 14 '21 at 14:18

1 Answer

On a GNU system, appending a NUL to each non-alphabetic character may help:

$ ls | sed 's/[^[:alpha:]]/&\x0/g' | sort | tr -d '\0'
_b
a
c
č
d

That assumes file names don't contain newline characters. Generally, you can't sort lists of file names with sort as file names can very well be made of several lines themselves.
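To see what the sed step hands to sort, the NULs can be made visible. The sketch below wraps the same sed expression in a hypothetical helper, show_key (a name introduced here only for illustration), and renders each NUL as @. It assumes GNU sed, since \x0 in the replacement is a GNU extension.

```shell
# Hypothetical helper (not part of the pipeline above): print the key that
# the sed step produces for one name, with each NUL displayed as "@".
# Assumes GNU sed, which understands \x0 in the replacement text.
show_key() {
  printf '%s\n' "$1" | sed 's/[^[:alpha:]]/&\x0/g' | tr '\0' '@'
}

show_key '_b'        # _@b
show_key 'a-n.pdf'   # a-@n.@pdf
show_key 'a.pdf'     # a.@pdf
```

Every non-alphabetic character ends up directly in front of a NUL, so the names are cut into pieces right after each underscore, hyphen or period.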

You could replace the newlines within filenames with / here before sorting. With zsh:

print -rNC1 -- *(N) | # print NUL-delimited
  tr '\n\0' '/\n' |
  sed 's/[^[:alpha:]]/&\x0/g' |
  sort |
  tr -d '\0' |
  tr '/' '\n'

Or to keep the list NUL-delimited so it can be post-processed:

print -rNC1 -- *(N) | # print NUL-delimited
  tr '\n\0' '/\n' |
  sed 's/[^[:alpha:]]/&\x0/g' |
  sort |
  tr -d '\0' |
  tr '/\n' '\n\0'

The strcoll() API that is used for sorting takes two NUL-terminated strings. Traditional sort implementations only support text input, and text input precludes NULs, so they're fine. GNU sort, however, like the GNU implementations of most standard text utilities, does support NULs in its input.

I don't know how exactly GNU sort handles lines with NULs, but my guess would be that it breaks the lines on NULs and compares segments one to one. So, when comparing foo_\0car with foobar_\0more for instance, foo_ is compared with foobar_ and comes first.
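That guess can be checked with a small experiment. The sketch below sorts the two example lines once with the embedded NUL and once without it; the function names with_nul and without_nul are made up for this illustration, and NULs are shown as @ in the output.

```shell
# Hypothetical experiment: sort the two example lines with and without
# the embedded NUL. NULs are displayed as "@".
with_nul() {
  printf 'foobar_\0more\nfoo_\0car\n' | sort | tr '\0' '@'
}
without_nul() {
  printf 'foobar_more\nfoo_car\n' | sort
}

with_nul      # foo_@car is expected to come first
without_nul   # order depends on the locale, so no output is claimed here
```

In a locale that ignores punctuation at the first comparison level, without_nul should compare foocar with foobarmore and put foobar_more first, while with_nul should still put foo_ first if the segment-by-segment guess holds. Running both under the locale of interest (for example with LC_ALL=cs_CZ.UTF-8) shows whether that is the case on a given system.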

You can also define the order of zsh globs using arbitrary transformations with the oe (order based on the evaluation of some code) or o+function glob qualifiers. But before calling strcoll(), zsh removes the NULs, so you can't use the same transformation as used for GNU sort above.

Instead, you could add a 0 before each sequence of non-alphas and a 1 before each sequence of alphas:

In ~/.zshrc:

set -o extendedglob
mysort() {
  REPLY=${REPLY//(#m)[^[:alpha:]]##/0$MATCH}
  REPLY=${REPLY//(#m)[[:alpha:]]##/1$MATCH}
}

Then:

$ print -rC1 -- *(No+mysort)
_b
a
c
č
d
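To make the keys that mysort builds easier to inspect, here is a rough stand-in that applies the same two substitutions with sed rather than with zsh parameter expansion. The helper name mysort_key is hypothetical, and sed is swapped in only so the keys can be printed from any shell; the actual sorting still happens inside zsh as shown above.

```shell
# Stand-in for mysort, for illustration only: the same two substitutions
# done with sed -E instead of zsh's ${REPLY//(#m).../...} expansions.
mysort_key() {
  printf '%s\n' "$1" |
    sed -E 's/[^[:alpha:]]+/0&/g; s/[[:alpha:]]+/1&/g'
}

mysort_key '_b'        # 0_1b
mysort_key 'a'         # 1a
mysort_key 'a-n.pdf'   # 1a0-1n0.1pdf
mysort_key 'a.pdf'     # 1a0.1pdf
```

Because digits are not ignored by the collation, the 0 and 1 markers record where each run of non-alphabetic or alphabetic characters starts, which is what keeps _b (key 0_1b) ahead of a (key 1a).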