
I have some files containing Finnish text in mixed encodings, the kind of thing one would get from (echo Mäntysalo ; echo Mäntysalo | recode utf-8..iso-8859-1) > problem.txt. Is there a "right" way to convert such files to a single encoding from the command line? In practice, handling ÄäÖö covers almost everything I need, but I assume there is some command that also fixes Å, É and so on.

  • This is a difficult problem: it is very hard to decide automatically which encoding a piece of text uses. I would write a small program that checks the file line by line: if a line is valid UTF-8, keep it as is; otherwise assume it is Latin-1. Latin-1 byte sequences outside the ASCII range that also happen to be valid UTF-8 are very rare. But for code where several authors have touched the same line, or where other encodings are involved, the task is very difficult. – Giacomo Catenazzi Nov 15 '22 at 11:57
  • Good point! It is a very good guess that every single line contains just one encoding! – Jori Mäntysalo Nov 18 '22 at 06:41
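
The line-by-line idea from the comments could be sketched roughly as follows in Python. This is only a minimal sketch, not a tested tool: it assumes every line is entirely UTF-8 or entirely ISO-8859-1 (Latin-1), and the names fix_line, fix_bytes and main are made up for illustration.

```python
import sys


def fix_line(raw: bytes) -> str:
    """Decode one line: try strict UTF-8 first, fall back to Latin-1.

    Assumption: a line that is not valid UTF-8 is ISO-8859-1.
    Latin-1 decoding never fails, since every byte maps to a character.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


def fix_bytes(data: bytes) -> bytes:
    """Re-encode mixed UTF-8 / Latin-1 input as UTF-8, line by line."""
    # bytes.splitlines only splits on ASCII line endings, so Latin-1
    # bytes such as 0x85 do not cause spurious line breaks.
    lines = data.splitlines(keepends=True)
    return "".join(fix_line(line) for line in lines).encode("utf-8")


def main() -> None:
    """Filter: read mixed-encoding bytes on stdin, write UTF-8 to stdout."""
    sys.stdout.buffer.write(fix_bytes(sys.stdin.buffer.read()))
```

Saved as a script (say, a hypothetical fix_encoding.py) with a call to main() added at the end, it could be used as python3 fix_encoding.py < problem.txt > fixed.txt. A Latin-1 line whose non-ASCII bytes happen to form valid UTF-8 would be misread, but as noted above such combinations are very rare in ordinary text.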

0 Answers