Your incorrect files appear to be double-UTF-8 encoded.
For instance, the ä character (U+00E4) has been encoded as:
- U+00E4 -> 0xc3 0xa4 (UTF-8 encoding)
- 0xc3 -> 0xc3 0x83 (iso8859-1 Ã -> UTF-8), 0xa4 -> 0xc2 0xa4 (iso8859-1 ¤ -> UTF-8)
That is, each byte of the UTF-8 encoding of U+00E4 has been interpreted as if it were the encoding of some other character in a single-byte charset (here likely iso8859-1 or windows-1252) and then encoded again in UTF-8.
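The round trip described above can be reproduced in a few lines of Python. This is only an illustrative sketch: the variable names are made up for the example, and it assumes the intermediate charset was iso8859-1 (windows-1252 gives the same bytes for this particular character).

```python
original = "ä"  # U+00E4

# First (correct) UTF-8 encoding of U+00E4
once = original.encode("utf-8")

# Misread each byte as an ISO 8859-1 character, then encode to UTF-8 again
twice = once.decode("iso8859-1").encode("utf-8")

print(once.hex(" "))   # c3 a4
print(twice.hex(" "))  # c3 83 c2 a4

# Undoing the damage is the same conversion the other way round
repaired = twice.decode("utf-8").encode("iso8859-1").decode("utf-8")
print(repaired == original)  # True
```

The last step is the same kind of conversion as convmv -f utf8 -t iso-8859-1: decode as UTF-8, then write the characters back out as single iso8859-1 bytes.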
So you're right to use convmv -f utf8 -t iso-8859-1 for that. To leave the files that are not double-encoded alone, convmv has a special option for that: --fixdouble, so it should just be:
convmv --fixdouble -f utf8 -t iso-8859-1 -r --notest .
There's a section dedicated to that in convmv's manual:
How to undo double UTF-8 (or other) encoded filenames
Sometimes it might happen that you "double-encoded" certain filenames, for example the file names already were UTF-8 encoded and you
accidentally did another conversion from some charset to UTF-8. You can simply undo that by converting that the other way round. The
from-charset has to be UTF-8 and the to-charset has to be the from-charset you previously accidentlly used. If you use the
"--fixdouble" option convmv will make sure that only files will be processed that will still be UTF-8 encoded after conversion and it
will leave non-UTF-8 files untouched. You should check to get the correct results by doing the conversion without "--notest" before,
also the "--qfrom" option might be helpful, because the double utf-8 file names might screw up your terminal if they are being printed -
they often contain control sequences which do funny things with your terminal window. If you are not sure about the charset which was
accidentally converted from, using "--qfrom" is a good way to fiddle out the required encoding without destroying the file names finally.
Files that are double-UTF-8-encoded via iso8859-1 (which covers code points U+0000 to U+00FF) will contain sequences of non-ASCII characters consisting of one character in the U+00C2 to U+00F4 range (ÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóô) followed by one or more characters in the U+0080 to U+00BF range (U+0080 to U+009F being control characters, followed by the non-breaking space and ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿). Those sequences are relatively unlikely to occur in non-double-encoded text, especially considering that characters from U+00E0 upward (the lower case ones in the first set above) have to be followed by at least 2 characters from the second set, so convmv --fixdouble is unlikely to get it wrong.
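To see why that heuristic is safe, here is a rough Python approximation of the idea. It is not convmv's actual implementation; undo_double_utf8 is a hypothetical helper name, and iso8859-1 is assumed as the intermediate charset. A name is only changed if converting it back yields valid UTF-8, otherwise it is returned untouched.

```python
def undo_double_utf8(name: str, charset: str = "iso8859-1") -> str:
    """Return name with one layer of double UTF-8 encoding removed.

    Hypothetical helper for illustration: the name is left unchanged
    unless the reverse conversion produces valid UTF-8.
    """
    try:
        # Map each character back to the single byte it was misread from
        raw = name.encode(charset)
        # Those bytes must themselves form valid UTF-8 to count as a fix
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in the charset, or not valid UTF-8 afterwards:
        # treat the name as not double-encoded
        return name


print(undo_double_utf8("Ã¤.txt"))    # double-encoded: repaired to ä.txt
print(undo_double_utf8("ä.txt"))     # correctly encoded: unchanged
print(undo_double_utf8("plain.txt")) # ASCII: unchanged
```

A correctly encoded name such as ä.txt is left alone because the lone byte 0xe4 is a UTF-8 lead byte that requires two continuation bytes, which a following "." does not provide.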