I have some files containing Finnish text in mixed encodings, something one would get from (echo Mäntysalo ; echo Mäntysalo | recode utf-8..iso-8859-1) > problem.txt. Is there a "right" way to convert such files to a single encoding from the command line? In practice, handling just ÄäÖö covers almost everything I need, but I guess there is some command that also fixes Å, É and so on.

Jori Mäntysalo
-
The problem is hard: deciding automatically which encoding a piece of text uses is very difficult. I would write a small program that checks the file line by line: if a line is valid UTF-8, keep it as it is; otherwise assume it is Latin-1. Latin-1 byte sequences outside the ASCII range that also happen to be valid UTF-8 are very rare. But in code where text from different authors is mixed within the same line, or where other encodings are involved, the task is very difficult. – Giacomo Catenazzi Nov 15 '22 at 11:57
-
Good point! It is a very good guess that every single line contains just one encoding! – Jori Mäntysalo Nov 18 '22 at 06:41
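
The line-by-line check suggested in the comments could look roughly like the Python sketch below. It is only an illustration, under the assumption that every line is entirely UTF-8 or entirely ISO-8859-1 (Latin-1); the names fix_line and convert_file are made up for this example and do not come from any existing tool.

```python
def fix_line(raw: bytes) -> bytes:
    """Return one line as UTF-8 bytes.

    Assumption: the line is either valid UTF-8 or Latin-1, never a mix.
    """
    try:
        # Valid UTF-8 (this includes plain ASCII): keep the bytes unchanged.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Not valid UTF-8: treat the bytes as Latin-1 and re-encode.
        return raw.decode("iso-8859-1").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, converting each line to UTF-8."""
    # Binary mode is required: the input has no single encoding.
    # Splitting on b"\n" is safe because both encodings are ASCII-compatible.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(fix_line(line))
```

For the example in the question one would call convert_file("problem.txt", "fixed.txt"), where fixed.txt is a hypothetical output name. A Latin-1 line whose bytes happen to form valid UTF-8 is left untouched by this sketch, which is the rare case mentioned in the comment above.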