I use the following command on Ubuntu with rename (installed with sudo apt-get install rename) to replace, in all file names, every character that is not listed in the regex:

find . -execdir rename 's/[^A-Za-z0-9_.@+,#!?:&%~\(\)\[\]\/ \-]/?/g' * {} \;

This is working very well and all other characters are changed to ?. Now I want to include French characters like àèìòù and so on. So I added À-ÿ to my regex:

find . -execdir rename 's/[^A-Za-zÀ-ÿ0-9_.@+,#!?:&%~\(\)\[\]\/ \-]/?/g' * {} \;

But somehow the files are not getting renamed and they seem to be corrupted after running this command with À-ÿ because I can't delete them anymore.

What is the right way to include them in the rename regex?

Mango D
  • What variant of rename are you using (even among the perl ones, there are many)? What's the output of locale charmap? – Stéphane Chazelas Aug 14 '19 at 10:42
  • By French character, do you mean any letter from the Latin script with any number of diacritics, or only those that are typically used in the French languages (for instance, no á), or would any alphabetic character (in any script, not only latin) do? – Stéphane Chazelas Aug 14 '19 at 10:44
  • @StéphaneChazelas I updated my question. How can I check which variant I am using? Is there a specific one for Ubuntu? – Mango D Aug 14 '19 at 10:44
  • apt-cache show rename should tell you which it is. On Ubuntu 18.04, it seems to be the one from https://metacpan.org/release/File-Rename version 0.20 – Stéphane Chazelas Aug 14 '19 at 10:46
  • @StéphaneChazelas I think any letters will do, not only Latin ones. The reason is that I want to prepare those files for a PHP script which handles them and can't read certain special characters like line breaks, symbols and so on. I want to exclude all of those. – Mango D Aug 14 '19 at 10:46
  • I believe the problem is that these characters are encoded in a Unicode encoding (most likely, UTF-8) and the rename command expects unibyte characters. – Ned64 Aug 14 '19 at 10:47
  • @Ned64 Is there a way to change these expectations for the rename command? – Mango D Aug 14 '19 at 10:57
  • @MangoD I wouldn't know an efficient way. Personally I would probably re-map them by writing a short snippet in sed, then re-map back afterwards. That works OK if there are unused Unibyte characters in your names. However, there are probably better ways! – Ned64 Aug 14 '19 at 11:16
  • It would be better to use the same criteria as your PHP script does to exclude characters. – Stéphane Chazelas Aug 14 '19 at 11:56

1 Answer

Assuming those file names are encoded in UTF-8, use:

find . -depth -execdir rename -n '
  utf8::decode$_ or die "cannot decode $_\n";
  s{[^\w.\@+,#!?:&%~()\[\]/ -]}{?}gs;
  utf8::encode$_;
  ' {} +

(remove the -n when happy).

Beware that some BSD implementations of find do not prefix the file names with ./ with -execdir, so that command could fail if there are file names that start with -. With your variant of rename, you should be able to work around it by changing rename -n to rename -n -- (that doesn't work with all other perl rename variants).

In modern versions of perl, \w (for word character) matches any alphanumeric character (in any alphabetic script, not just Latin), the underscore and other connector punctuation characters, and Unicode marks (so, for instance, it includes the combining acute accent character that follows e in the decomposed form of é).

If you wanted to be more restrictive, instead of \w, you could use \p{latin}\p{mark}0-9_ to only include letters in the Latin script (and not Cyrillic, Greek...), the combining diacritics (though not limited to those typically used with Latin letters), and only the Hindu–Arabic decimal digits (and not other kinds of digits) and underscore (and not other connector punctuation characters).

If you don't use utf8::decode, perl will assume the characters are encoded in the iso8859-1 unibyte character set (where, for instance, 0xc3 0xa9, the UTF-8 encoding of the pre-composed form of é, is read as the two characters Ã and ©).

Alternatively, you can use zsh (which will decode characters as per the locale's encoding (see the output of locale charmap)):

autoload zmv # best in ~/.zshrc
zmv -n '(**/)(*)(#qD)' '$1${2//[^][:alnum:]_.@+,#!?:&%~()[\/ -]/?}'

Each byte of any byte sequence that doesn't form a valid character in your locale will also be turned into a ? (where rename above would die with a cannot decode error).

Its [[:alnum:]] uses your locale's alnum category so is unlikely to include other Unicode connector punctuation or marks characters.

In both perl and zsh (but often not in other tools), ranges like [a-zÀ-ÿ] are based on the code points of the characters. For instance, a, z, À and ÿ are \u0061, \u007A, \u00C0 and \u00FF, so that range matches abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ. That includes non-alphabetic characters (× and ÷) and misses some characters of the Latin script, including ones used in French like œ. In perl, you'd also need to add use utf8 to be able to write À and ÿ literally (UTF-8 encoded) in the perl code.