
I have filenames like KÃ¤yttÃ¶ohje.pdf. These should be Käyttöohje.pdf.

I can convert all files in subdirectories with the command:

convmv -f utf8 -t iso-8859-1 -r --notest *

This converts KÃ¤yttÃ¶ohje.pdf to Käyttöohje.pdf.

The problem arises if the file name is already in the correct form, Käyttöohje.pdf:

File Käyttöohje.pdf is converted to K'$'\344''ytt'$'\366''ohje.pdf

How can I change the above command so that:

  • Käyttöohje.pdf is converted to Käyttöohje.pdf (left alone) and
  • KÃ¤yttÃ¶ohje.pdf is still converted to Käyttöohje.pdf
George

1 Answer


Your incorrect files appear to be double-UTF-8 encoded.

For instance, the ä U+00E4 has been encoded as:

  1. U+00E4 -> 0xc3 0xa4 (UTF-8 encoding)
  2. 0xc3 -> 0xc3 0x83 (iso8859-1 Ã -> UTF-8), 0xa4 -> 0xc2 0xa4 (iso8859-1 ¤ -> UTF-8): each byte of the UTF-8 encoding of U+00E4 has been interpreted as if it were the encoding of some other character in a single-byte charset (here likely iso8859-1 or windows-1252) and then encoded again in UTF-8.
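
To make the two steps above concrete, here is a small Python sketch that reproduces the accident for ä and for the whole file name. It is not part of convmv: the function names are made up for illustration, and it assumes the wrong intermediate charset was iso8859-1.

```python
def double_encode(name: str, wrong_charset: str = "iso-8859-1") -> str:
    """Reproduce the accident: the UTF-8 bytes of `name` are misread
    as `wrong_charset` (an assumption; it could also be windows-1252)."""
    return name.encode("utf-8").decode(wrong_charset)


def undo_double_encode(name: str, wrong_charset: str = "iso-8859-1") -> str:
    """Reverse the accident for a single name (hypothetical helper)."""
    return name.encode(wrong_charset).decode("utf-8")


def hex_bytes(text: str) -> str:
    """UTF-8 bytes of `text` as space-separated hex."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


good = "Käyttöohje.pdf"
bad = double_encode(good)

print(hex_bytes("ä"))                 # c3 a4
print(hex_bytes(double_encode("ä")))  # c3 83 c2 a4
print(bad)                            # KÃ¤yttÃ¶ohje.pdf
print(undo_double_encode(bad))        # Käyttöohje.pdf
```

Calling undo_double_encode on a name that was never double-encoded raises UnicodeDecodeError, which is the same failure mode as the K'$'\344''ytt'$'\366''ohje.pdf result in the question.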

So you're right to use convmv -f utf8 -t iso-8859-1 for that. To leave alone the files that are not double-encoded, convmv has a dedicated option, --fixdouble, so the command should just be:

convmv --fixdouble -f utf8 -t iso-8859-1 -r --notest .

There's a section dedicated to that in convmv's manual:

How to undo double UTF-8 (or other) encoded filenames

Sometimes it might happen that you "double-encoded" certain filenames, for example the file names already were UTF-8 encoded and you accidentally did another conversion from some charset to UTF-8. You can simply undo that by converting that the other way round. The from-charset has to be UTF-8 and the to-charset has to be the from-charset you previously accidentally used. If you use the "--fixdouble" option convmv will make sure that only files will be processed that will still be UTF-8 encoded after conversion and it will leave non-UTF-8 files untouched. You should check to get the correct results by doing the conversion without "--notest" before, also the "--qfrom" option might be helpful, because the double utf-8 file names might screw up your terminal if they are being printed - they often contain control sequences which do funny things with your terminal window. If you are not sure about the charset which was accidentally converted from, using "--qfrom" is a good way to fiddle out the required encoding without destroying the file names finally.
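
The rule described in that section (only convert when the result is still UTF-8) can be sketched in a few lines of Python. This is only an illustration of the idea, not convmv's actual code; fix_double is a hypothetical name, and iso8859-1 is assumed as the charset that was wrongly used.

```python
def fix_double(raw_name: bytes, wrong_charset: str = "iso-8859-1") -> bytes:
    """Return the repaired name, or `raw_name` unchanged when the
    conversion would not yield valid UTF-8 (hypothetical helper)."""
    try:
        # Same direction as `-f utf8 -t iso-8859-1`.
        candidate = raw_name.decode("utf-8").encode(wrong_charset)
        # Accept the result only if it is itself valid UTF-8.
        candidate.decode("utf-8")
    except UnicodeError:
        # Not UTF-8 to begin with, not representable in the target
        # charset, or not UTF-8 afterwards: leave the name alone.
        return raw_name
    return candidate


good = "Käyttöohje.pdf".encode("utf-8")
bad = good.decode("iso-8859-1").encode("utf-8")  # the double-encoded form

print(fix_double(bad) == good)   # True: repaired
print(fix_double(good) == good)  # True: left alone
```

A correctly encoded Käyttöohje.pdf is left alone here because converting it would produce the lone bytes 0xe4 and 0xf6, which are not valid UTF-8 on their own.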

File names that are double-UTF-8-encoded via iso8859-1 (which covers code points U+0000 to U+00FF) will contain sequences of non-ASCII characters consisting of one character in the U+00C2 to U+00F4 range (ÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóô) followed by one or more characters in the U+0080 to U+00BF range (U+0080 to U+009F being control characters, then the non-breaking space, then ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿). Those sequences are relatively unlikely to occur in non-double-encoded text, especially considering that characters from U+00E0 upwards (the lower case ones in the first set above) have to be followed by at least 2 characters in the second set, so convmv --fixdouble is unlikely to get it wrong.
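
If you want to list suspicious names before renaming anything, the pattern described above can be written as a regular expression. The following Python sketch is a simplified heuristic of my own (looks_double_encoded is a made-up name) and assumes the iso8859-1 case; it only checks the lead character and the number of following characters, not every UTF-8 well-formedness rule.

```python
import re

# Characters that stand in for UTF-8 continuation bytes (0x80-0xBF).
CONT = "[\u0080-\u00bf]"

DOUBLE_ENCODED = re.compile(
    "[\u00c2-\u00df]" + CONT            # lead of a 2-byte sequence
    + "|[\u00e0-\u00ef]" + CONT + "{2}"  # lead of a 3-byte sequence
    + "|[\u00f0-\u00f4]" + CONT + "{3}"  # lead of a 4-byte sequence
)


def looks_double_encoded(name: str) -> bool:
    """True if `name` contains a sequence typical of double UTF-8."""
    return DOUBLE_ENCODED.search(name) is not None


print(looks_double_encoded("KÃ¤yttÃ¶ohje.pdf"))  # True
print(looks_double_encoded("Käyttöohje.pdf"))    # False
```

The second name does not match because ä (U+00E4) would need two characters from the second set after it, and ö (U+00F6) is outside the lead range altogether.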