I've got a log file that is ASCII, except for a few UTF-8 characters (which I can fix for a future version).
For the moment, I need to figure out how to get this file into a state where I can view, search, and edit it with gedit, less, etc.
enca -L none file
returns "7bit ASCII characters, Surrounded by/intermixed with non-text data".
enconv -L none -X ASCII file
and
enconv -L none -X UTF-8 file
both "succeed" but do not actually change anything.
How do I go about fixing this file?
Update (after some answers):
Actually, as stated below (upvotes to all :)), ASCII + UTF-8 is UTF-8. What I have is
0003bbc0 28 4c 6f 61 64 65 72 29 20 50 61 74 69 65 6e 74 |(Loader) Patient|
0003bbd0 20 00 5a 00 5a 00 5a 00 38 00 31 00 30 00 34 00 | .Z.Z.Z.8.1.0.4.|
0003bbe0 20 6e 6f 74 20 66 6f 75 6e 64 20 69 6e 20 64 61 | not found in da|
0003bbf0 74 61 62 61 73 65 0d 0a 32 36 20 53 65 70 20 32 |tabase..26 Sep 2|
I believed it would be a cp1252-type encoding. Actually, I don't know what it is: cp1252 would use a single byte for ASCII characters, wouldn't it?
Incidentally, the fact that Linux barfs on this helped me figure out that an input file (where the IDs came from) was encoded badly...
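As a rough sketch of one way to get the log into a viewable state: the dump above suggests the odd bytes are ASCII-range characters stored as two bytes each, with 0x00 as the second byte. Under that assumption (which may not hold for the whole file), dropping every NUL byte leaves plain ASCII. The names strip_nul_bytes and clean_file are hypothetical helpers, and the sample bytes are copied from the hexdump.

```python
def strip_nul_bytes(data: bytes) -> bytes:
    """Drop every NUL byte.

    Assumption: the embedded 2-byte characters are all below 0x80, so each
    one is an ASCII byte paired with 0x00. Anything else would be damaged.
    """
    return data.replace(b"\x00", b"")


def clean_file(src: str, dst: str) -> None:
    """Write a NUL-free copy of src to dst (hypothetical helper)."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(strip_nul_bytes(data))


# Bytes taken from the hexdump in the question.
sample = bytes.fromhex(
    "284c6f61646572292050617469656e74"  # (Loader) Patient
    "20005a005a005a003800310030003400"  #  .Z.Z.Z.8.1.0.4.
    "206e6f7420666f756e6420696e206461"  #  not found in da
    "7461626173650d0a"                  # tabase..
)
cleaned = strip_nul_bytes(sample)
print(cleaned.decode("ascii"))

# cp1252 encodes ASCII characters as single bytes, so it cannot explain
# the 0x00 bytes in the dump.
print("Z".encode("cp1252"))
```

Writing to a separate output file keeps the original log intact in case the assumption turns out to be wrong for other parts of the file.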
"005a 005a 005a 0038 0031 0030 0034 0020"
is UTF-16BE for "ZZZ8104 "... That looks like a reasonable name/number for a not-found patient ID. Windows uses UTF-16 for its Unicode encoding standard (natively little-endian, UTF-16LE)... In UTF-16BE (big-endian) the first 256 characters all begin with 00. If it was UTF-16LE (little-endian), the 00 would be trailing instead of leading... It could be either, because it depends on where you choose the starting point: 2000 (' ') or 005a ('Z')... but none of that is known for sure. It might be some in-house encoding. You have to pick the most likely... – Peter.O Oct 26 '11 at 17:34
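To illustrate why the byte order cannot be settled from this fragment alone, here is a small sketch that decodes the 17 bytes from the dump (the space at offset 0003bbd0 through the space at 0003bbe0) both ways. The variable names are made up for the example.

```python
# 20, then 00 5a 00 5a 00 5a 00 38 00 31 00 30 00 34 00, then 20
raw = bytes.fromhex("20005a005a005a00380031003000340020")

# Reading 1: one ASCII space, then eight big-endian code units
# (00 5a, 00 5a, ..., 00 20).
reading_be = raw[:1].decode("ascii") + raw[1:].decode("utf-16-be")

# Reading 2: eight little-endian code units (20 00, 5a 00, ..., 34 00),
# then one ASCII space.
reading_le = raw[:-1].decode("utf-16-le") + raw[-1:].decode("ascii")

print(repr(reading_be))
print(repr(reading_le))
```

Both readings produce the same text, so the only difference is which of the two surrounding spaces belongs to the 2-byte run.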