How do I re-encode a mixed encoded text file

Question

I've got a log file that is ASCII, except for a few UTF-8 characters (which I can fix for a future version).

For the moment, I need to figure out how to get this file to a viewable/searchable/editable state by gedit/less etc.

enca -L none file returns 7bit ASCII characters Surrounded by/intermixed with non-text data.

enconv -L none -X ASCII file and enconv -L none -X UTF-8 file "succeed" but do not actually change anything.

How do I go about fixing this file?

Update (after some answers):

Actually, as stated below (upvotes to all :)), ASCII + UTF-8 is UTF-8. What I have is

0003bbc0  28 4c 6f 61 64 65 72 29  20 50 61 74 69 65 6e 74  |(Loader) Patient|
0003bbd0  20 00 5a 00 5a 00 5a 00  38 00 31 00 30 00 34 00  | .Z.Z.Z.8.1.0.4.|
0003bbe0  20 6e 6f 74 20 66 6f 75  6e 64 20 69 6e 20 64 61  | not found in da|
0003bbf0  74 61 62 61 73 65 0d 0a  32 36 20 53 65 70 20 32  |tabase..26 Sep 2|

I believed it would be ~~a cp1252-type encoding~~. Actually, I don't know what it is: cp1252 would use a single byte for each ASCII character, wouldn't it?

Incidentally, the fact that Linux barfs on this helped me figure out that an input file (where the IDs came from) was badly encoded...

"005a 005a 005a 0038 0031 0030 0034 0020" is UTF-16BE for "ZZZ8104 "... That looks like a reasonable name/number for a not-found patient ID. Windows uses UTF-16LE (little-endian) for its native Unicode encoding... In UTF-16BE (big-endian), the first 256 characters all begin with 00; in UTF-16LE the 00 is trailing instead of leading... It could be either, depending on whether you take the starting point to be "2000" (' ') or "005a" ('Z')... but none of that is known for sure. It might be some in-house encoding. You have to pick the most likely... — Peter.O, Oct 26 '11 at 17:34
This is a log file (logged in ASCII or UTF-8) that writes out IDs pulled in from a text file, which is a list exported from a SQL Server result set. — Stephen, Nov 02 '11 at 08:22

score 4 · Answer 1 · answered Oct 25 '11 at 00:10

A file which is "ASCII, except for a few UTF-8 characters" is, well, simply a UTF-8 file.

It is viewable/searchable/editable as long as you are using a UTF-8 locale.

You cannot convert it to ASCII, as the latter has no equivalent representation for your UTF-8 special characters.

You might want to convert it to ISO Latin-1 with

iconv -f UTF-8 -t ISO-8859-1

score 2 · Answer 2 · answered Oct 25 '11 at 00:09

If you have a file that contains ASCII with a few UTF-8 characters, then it is by definition a UTF-8 file. A pure ASCII file is also valid UTF-8.
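
One way to test that claim on actual data is to round-trip it through iconv, which (at least in the glibc implementation) exits non-zero when it meets a byte sequence that is not valid UTF-8. The sketch below assumes iconv is installed; the helper names is_valid_utf8 and check are made up for illustration.

```sh
# Hypothetical helper: succeeds only if stdin decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Report the verdict for a labelled sample read from stdin.
check() {
    if is_valid_utf8; then
        echo "$1: valid UTF-8"
    else
        echo "$1: not valid UTF-8"
    fi
}

printf 'plain ASCII\n' | check "ascii"    # pure ASCII
printf 'caf\303\251\n' | check "utf8"     # e-acute encoded as C3 A9
printf 'caf\351\n'     | check "latin1"   # e-acute as a lone E9 byte
```

To check a real file, feed it on stdin, e.g. is_valid_utf8 <file. Bear in mind that a NUL byte is itself valid UTF-8 (it encodes U+0000), so a check like this may well pass a file that still contains stray NULs.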

It sounds like what you have is a mix of ASCII, UTF-8, and some other single-byte encoding like Latin-1. That's hard to clean up. But it's difficult to give good advice without knowing what the file actually contains. Try posting the output of hexdump -C file (cutting it down to a few lines that contain problem characters).
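
Before posting a dump, a rough byte census can already narrow things down: NUL bytes hint at UTF-16/UCS-2 text, while bytes with the high bit set hint at UTF-8 or a single-byte charset. This is a minimal sketch using only tr and wc; the function names and the sample data are invented for illustration.

```sh
# Count NUL bytes on stdin (a hint that 16-bit text is mixed in).
count_nuls() {
    LC_ALL=C tr -cd '\000' | wc -c | tr -d ' '
}

# Count bytes with the high bit set (0x80-0xFF), i.e. anything non-ASCII.
count_high_bytes() {
    LC_ALL=C tr -cd '\200-\377' | wc -c | tr -d ' '
}

# Made-up sample: ASCII, two NUL-padded letters, and one lone 0xE9 byte.
sample() {
    printf 'id Z\000Z\000 caf\351\n'
}

echo "NUL bytes:  $(sample | count_nuls)"
echo "high bytes: $(sample | count_high_bytes)"
```

On a real file you would run count_nuls <file and count_high_bytes <file; if both come back zero, the file is plain 7-bit ASCII.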

score 2 · Accepted Answer · answered Oct 26 '11 at 22:47

What you have is in fact ASCII (in its usual encoding in 8-bit bytes) with a bit of UCS-2 (Unicode restricted to the Basic Multilingual Plane (BMP), where each character is encoded as two 8-bit bytes), or perhaps UTF-16 (an extension of UCS-2 that can encode all of Unicode by using a pair of 16-bit words, a surrogate pair, for code points above U+FFFF).
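
To see what those two-byte encodings look like on disk, you can encode a short ASCII string with iconv and dump the bytes. This is only a sketch: hex_of, utf16be_hex and utf16le_hex are hypothetical helper names, and it assumes an iconv that knows the UTF-16BE and UTF-16LE charset names (glibc's does).

```sh
# Hypothetical helper: print stdin as one continuous lowercase hex string.
hex_of() {
    od -An -v -tx1 | tr -d ' \n'
}

# Encode an ASCII argument in each UTF-16 byte order and show the bytes.
utf16be_hex() { printf '%s' "$1" | iconv -f ASCII -t UTF-16BE | hex_of; }
utf16le_hex() { printf '%s' "$1" | iconv -f ASCII -t UTF-16LE | hex_of; }

echo "UTF-16BE: $(utf16be_hex 'ZZZ8104')"   # 00 comes before each letter
echo "UTF-16LE: $(utf16le_hex 'ZZZ8104')"   # 00 comes after each letter

# A code point outside the BMP (U+1F600, given here as UTF-8 bytes)
# needs two 16-bit words in UTF-16: a surrogate pair.
echo "U+1F600:  $(printf '\360\237\230\200' | iconv -f UTF-8 -t UTF-16BE | hex_of)"
```

Comparing the two byte orders against the hex dump in the question shows why the run of 00 bytes can be read either way, depending on where you assume the 16-bit text starts.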

I doubt you'll find a tool that can handle such an unholy mixture out of the box. There is no way to decode the file in full generality. In your case, what probably happened is that some ASCII data was encoded into UTF-16 at some point (Windows and Java are fond of UTF-16; it's practically unheard of in the Unix world). If you go by the assumption that the original data was all ASCII, you can recover a usable file by stripping off all null bytes.

<bizarre tr -d '\000' >ascii
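
A quick way to convince yourself that this works is to rebuild a line like the one in the question's hex dump and run it through the same tr command. The sample bytes below are reconstructed by hand from that dump, and strip_nuls and sample are hypothetical names.

```sh
# Hypothetical helper: delete every NUL byte from stdin.
strip_nuls() {
    tr -d '\000'
}

# Rebuild the log line from the hex dump: ASCII text with the ID
# " ZZZ8104" stored as 16-bit units (each character followed by a NUL).
# Note that printf reads '\0008' as the octal escape \000 followed by '8'.
sample() {
    printf '(Loader) Patient'
    printf ' \000Z\000Z\000Z\0008\0001\0000\0004\000'
    printf ' not found in database\r\n'
}

sample | strip_nuls
```

This is only safe under the stated assumption that everything was ASCII to begin with; any genuine non-ASCII UTF-16 character would be mangled rather than converted.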

Peter.O · Answer 4 · 2011-10-25T02:11:41.363

Try chardet from the package python-chardet. I just tried it on a file which enca couldn't recognize, and chardet detected a charset type. (According to the man page, enca stands for Extremely Naive Charset Analyser :)

If you can't detect the type, then re-encoding is rather futile, as the re-encoder needs to know the input format (see Detecting character-sets, below).

You can try to open the file in another text editor, e.g. emacs, vim, jedit, etc.

gedit has a Choose/Add/Remove option in its File Open dialog. You can choose/add charsets to the charset list (once you know what the charset is). gedit only opens types shown in that list.

Further, it may be a word-processor file. Try opening it with OpenOffice.org.

Another (desperate?) option is to use strings.
strings will print the strings of printable characters in files.

Detecting character-sets is fraught with problems. For the many Latin-script based languages (which yours seems to be), there are many charset variations. The only common theme for these charsets is the baseline 7-bit ASCII charset, which consists of 128 possibilities, hexadecimal \x00 to \x7F.

Each of the many single-byte charsets that use the 8th bit (a further 128 values) uses this upper range differently; there are as many layouts as there are charsets.

Unless you know what the encoding is, detecting it is often a game of statistical probability (reverse engineering), because the detecting program has no idea what letter it is looking at; it only sees byte values. When no uniquely defining difference is detected (not a simple task), the only resort is to choose the most commonly used charset that matches.

The bottom line is that even if the file contains a completely valid charset A, it can look just as valid to a detecting program as charset B does. This is the very reason why one needs to know the character encoding, especially for character sets which use only a single byte.
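
To make that concrete, here is a small sketch that decodes one and the same byte, 0xE9, under three single-byte charsets with iconv. It assumes your iconv supports ISO-8859-1, ISO-8859-5 and ISO-8859-7; hex_of and decode_e9_as are hypothetical helper names.

```sh
# Hypothetical helper: print stdin as one continuous lowercase hex string.
hex_of() {
    od -An -v -tx1 | tr -d ' \n'
}

# Decode the single byte 0xE9 as the given charset and show the
# resulting UTF-8 bytes in hex.
decode_e9_as() {
    printf '\351' | iconv -f "$1" -t UTF-8 | hex_of
}

for charset in ISO-8859-1 ISO-8859-5 ISO-8859-7; do
    echo "$charset: $(decode_e9_as "$charset")"
done
```

All three decodings succeed, giving é, щ and ι respectively, so nothing in the byte itself tells a detector which charset was intended.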

Multi-byte character sets have a much more obvious fingerprint, but even then, if the sample is not large enough, it is again a guessing game...

Re: statistical analysis: I answered a similar question on StackOverflow once, and included some sample PHP code as a starting point: Best way to correct garbled data caused by false encoding — janmoesen, Oct 25 '11 at 11:18
@janmoesen: I like your class name, Utf8Voodoo. That sums it up perfectly :) — Peter.O, Oct 25 '11 at 20:30
