I'm confused by character sets on Unix. I have a CSV file that I downloaded via SFTP:
$ file -ib myfile
text/plain; charset=us-ascii
The reason I'm asking about character sets is that the data in the file is displayed like this:
Flyers:Â VideoÂ Center
While I want:
Flyers: Video Center
I tried:
iconv -f us-ascii -t utf-8 myfile
which throws the following error:
iconv: illegal input sequence at position 528666
Could someone clarify what's going on with the character sets? Can I have the file delivered as UTF-8 when fetching it via SFTP? How does one usually decide which bytes count as junk for a given character set?
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ LC_ALL=C sed -n l
Zimbabwe,175,Unknown Network,-1,Unknown,-1,Unknown,-1,US: Flyers: Video Center:,854088,Standard Display,-998,10/28/2014
$ iconv -f utf-8 -t l1
iconv: illegal input sequence at position 1228354
When I set the terminal's character set to UTF-8 (under Translation), I am able to see clean data.
But when I read the file as UTF-8 with an ETL tool, the data comes out as junk.
When I grep my file for
"Flyers: Video Center"
I get no results, because the data is actually stored as
"Flyers:Â VideoÂ Center"
Can the file's encoding be changed so that I see what I want?
hexdump for junk characters:
0000000: 4e42 4353 3a20 4e48 4c2e 636f 6d3a 2055 NBCS: NHL.com: U
0000010: 533a 2046 6c79 6572 733a c2a0 5669 6465 S: Flyers:..Vide
0000020: 6fc2 a043 656e 7465 723a 2057 6861 7427 o..Center: What'
0000030: 7320 486f 740a s Hot.
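The `c2a0` pairs in this dump are the UTF-8 encoding of U+00A0 (no-break space), which would explain why a grep for the text with ordinary spaces finds nothing. As a minimal sketch, assuming the only goal is to turn those bytes into plain spaces, something like the following could be used; `nbsp_to_space` is a hypothetical helper name, not part of any tool:

```shell
# NBSP (U+00A0) is the two bytes 0xC2 0xA0 in UTF-8; printf accepts octal escapes.
nbsp=$(printf '\302\240')

# Replace every NBSP with an ordinary space (0x20).
# LC_ALL=C makes sed operate on raw bytes, whatever the terminal locale is.
nbsp_to_space() {
    LC_ALL=C sed "s/$nbsp/ /g"
}

# Sample line built from the bytes shown in the hexdump.
printf 'US: Flyers:\302\240Video\302\240Center:\n' | nbsp_to_space
# prints: US: Flyers: Video Center:
```

Applied to the real file this would be `nbsp_to_space < myfile > myfile.clean`. To search the file as stored instead, the same variable can be embedded in the pattern, for example `grep "Flyers:${nbsp}Video${nbsp}Center" myfile`.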
$ dd bs=1 skip=1228300 count=100 < temp1.csv | xxd
100+0 records in
100+0 records out
100 bytes (100 B) copied, 0.000141 seconds, 709 kB/s
0000000: 3031 342c 320a 556e 6b6e 6f77 6e20 436f 014,2.Unknown Co
0000010: 756e 7472 792c 2d31 2c48 756c 7520 4c69 untry,-1,Hulu Li
0000020: 7665 2c33 3738 3834 312c 4e42 433a 2041 ve,378841,NBC: A
0000030: 6d65 7269 6361 e280 9973 2047 6f74 2054 merica...s Got T
0000040: 616c 656e 743a 2053 686f 7274 666f 726d alent: Shortform
0000050: 2c33 3230 3631 3332 2c55 6e6b 6e6f 776e ,3206132,Unknown
0000060: 2053 6974 Sit
Some of the garbled text:
Americaâs
should have been (note that the apostrophe is ’, not the plain '):
America’s
And
BMW â Golden
should have been (note that the dash is an en dash, not the plain hyphen -):
BMW – Golden
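The "â" in both examples is what one would expect if valid UTF-8 were decoded as a single-byte charset. The sketch below reproduces that, assuming the reader treats the bytes as ISO-8859-1 (the actual charset used by the ETL tool is not known from the post); `misread_as_latin1` and `hex` are hypothetical helper names:

```shell
# Decode the input as ISO-8859-1 on purpose and re-encode it as UTF-8,
# which is effectively what a latin1-configured reader does to UTF-8 data.
misread_as_latin1() {
    iconv -f iso-8859-1 -t utf-8
}

# Print stdin as one compact hex string.
hex() {
    od -An -tx1 | tr -d ' \n'
}

# U+2019 (right single quotation mark) is e2 80 99 in UTF-8.
printf '\342\200\231' | misread_as_latin1 | hex; echo
# prints: c3a2c280c299  (c3 a2 is "â"; the other two are invisible C1 controls)

# U+2013 (en dash) is e2 80 93 in UTF-8.
printf '\342\200\223' | misread_as_latin1 | hex; echo
# prints: c3a2c280c293
```

In both cases the first byte `e2` becomes a visible "â" and the remaining two bytes map to control characters that most displays hide, which matches "Americaâs" and "BMW â Golden".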
Can you post a hexdump of the file (for example, the output of xxd TheFile.csv | head)? Guessing the correct encoding from your question is difficult. It is definitely not us-ascii; it could be, for example, latin1. – jofel Dec 03 '14 at 10:49
Please run LC_ALL=C sed -n l on the part of your file that covers that Â character. file uses simple heuristics to determine a charset; you can't trust it for that. – Stéphane Chazelas Dec 03 '14 at 11:47
Please also include the output of locale in your question. My bet would be that the file is in UTF-8 already (and that's a non-breaking space character, encoded as 0xc2 0xa0) but your locale is not (for instance, it may be iso8859-1, where 0xc2 is Â and 0xa0 is nbsp). – Stéphane Chazelas Dec 03 '14 at 11:53
Run LC_ALL=C sed -n l somefile and post the results as an edit to your question. – slm Dec 03 '14 at 12:01
[...] LC_ALL=C sed -n l < file.csv. Either way, an iconv -f utf-8 -t l1 would probably fix it (for your terminal at least). – Stéphane Chazelas Dec 03 '14 at 12:03
Is LC_ALL=C sed -n l your-file what you really ran? Having a non-ASCII character in the output of that would be a sed bug. We still don't know what byte sequence those Â are really made of. And it sounds possible that you have both UTF-8 and non-UTF-8 characters. Does position 1228354 correspond to those Â or to something else? – Stéphane Chazelas Dec 03 '14 at 13:00
If iconv -f utf-8 complains, there are some other byte sequences elsewhere in the file that don't form valid UTF-8 characters. What's the output of dd bs=1 skip=1228300 count=100 < file.csv | hd? (Or xxd if you don't have hd.) – Stéphane Chazelas Dec 03 '14 at 13:14
iconv -f utf-8 -t l1 didn't work because that character doesn't exist in l1 (aka latin1 aka iso-8859-1). So your file seems to be valid UTF-8. The problem (if any) is probably with your ETL tool. If it expects iso8859-1 encoding, you can try to get an approximation with iconv -f utf-8 -t l1//TRANSLIT – Stéphane Chazelas Dec 03 '14 at 14:33
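The distinction drawn in the last comment, between a file that is not valid UTF-8 and one that is valid but contains characters missing from latin1, can be checked separately. This is a rough sketch that assumes an iconv which exits non-zero both on malformed input and on unrepresentable characters (as GNU iconv does without //TRANSLIT); `is_valid_utf8`, `fits_latin1` and `sample` are hypothetical names:

```shell
# Succeeds only if stdin is well-formed UTF-8.
is_valid_utf8() {
    iconv -f utf-8 -t utf-8 >/dev/null 2>&1
}

# Succeeds only if every character on stdin also exists in ISO-8859-1.
# Assumption: iconv fails rather than substituting when a character is missing.
fits_latin1() {
    iconv -f utf-8 -t iso-8859-1 >/dev/null 2>&1
}

# Sample text containing U+2019, taken from the second hexdump in the question.
sample() {
    printf 'America\342\200\231s Got Talent\n'
}

sample | is_valid_utf8 && echo "valid UTF-8"
sample | fits_latin1 || echo "valid UTF-8, but not representable in latin1"
```

On the real file, `is_valid_utf8 < temp1.csv` succeeding while `fits_latin1 < temp1.csv` fails would be consistent with the diagnosis above: the data is fine and the consumer's expected encoding is what needs attention.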