4

I'm confused by character-sets in Unix. I have a CSV file downloaded via SFTP:

$ file -ib myfile
text/plain; charset=us-ascii

The purpose for this character-set quest is that the data within file is seen like:

Flyers: Video Center

While I want:

Flyers: Video Center

I tried:

iconv -f us-ascii -t utf-8 myfile

Which is throwing the following error:

iconv: illegal input sequence at position 528666

Please clarify what's going on regarding character-sets? Can I download in UTF-8 while getting a file via SFTP? How do we usually decide on what is junk within a character set?

$Locale  
LANG=en_US.UTF-8  
LC_CTYPE="en_US.UTF-8"  
LC_NUMERIC="en_US.UTF-8"  
LC_TIME="en_US.UTF-8"  
LC_COLLATE="en_US.UTF-8"  
LC_MONETARY="en_US.UTF-8"  
LC_MESSAGES="en_US.UTF-8"  
LC_PAPER="en_US.UTF-8"  
LC_NAME="en_US.UTF-8"  
LC_ADDRESS="en_US.UTF-8"  
LC_TELEPHONE="en_US.UTF-8"  
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=  

$  LC_ALL=C sed -n l  
Zimbabwe,175,Unknown Network,-1,Unknown,-1,Unknown,-1,US: Flyers: Video Center:,854088,Standard Display,-998,10/28/2014

$ iconv -f utf-8 -t l1   
iconv: illegal input sequence at position 1228354  

When set Terminal (Under Transalation, character set to UTF-8), I am able to see clean data.
But, when I read this with UTF-8 encoding using a ETL tool; the data is read as junk.

When I grep my file for data

"Flyers: Video Center" 

I don't see result for the fact that data is stored as

"Flyers: Video Center"

Can the file coding be changed so as to see what I want?

hexdump for junk characters:

0000000: 4e42 4353 3a20 4e48 4c2e 636f 6d3a 2055  NBCS: NHL.com: U  
0000010: 533a 2046 6c79 6572 733a c2a0 5669 6465  S: Flyers:..Vide  
0000020: 6fc2 a043 656e 7465 723a 2057 6861 7427  o..Center: What'  
0000030: 7320 486f 740a                           s Hot.  


$dd bs=1 skip=1228300 count=100 < temp1.csv | xxd  
100+0 records in  
100+0 records out  
100 bytes (100 B) copied, 0.000141 seconds, 709 kB/s  
0000000: 3031 342c 320a 556e 6b6e 6f77 6e20 436f  014,2.Unknown Co  
0000010: 756e 7472 792c 2d31 2c48 756c 7520 4c69  untry,-1,Hulu Li  
0000020: 7665 2c33 3738 3834 312c 4e42 433a 2041  ve,378841,NBC: A  
0000030: 6d65 7269 6361 e280 9973 2047 6f74 2054  merica...s Got T  
0000040: 616c 656e 743a 2053 686f 7274 666f 726d  alent: Shortform    
0000050: 2c33 3230 3631 3332 2c55 6e6b 6e6f 776e  ,3206132,Unknown  
0000060: 2053 6974                                 Sit  

Some garbled text:

Junk Americaâs   

must have been (Note that apostrophe is not this ' but ’)

America’s

And

BMW â Golden  

must have been (Note that hyphen is long hyphen not this -):

BMW – Golden 
Jeff Schaller
  • 67,283
  • 35
  • 116
  • 255
  • 1
    Can you paste a hexdump of the relevant parts somewhere, please (xxd TheFile.csv | head)? Guessing the correct encoding from your question is difficult. It is definitely not us-ascii, it could be for example latin1. – jofel Dec 03 '14 at 10:49
  • 0000000: 7265 706f 7274 206e 616d 653a 2048 6973 report name: His 0000010: 746f 7269 6361 6c5f 436f 6e73 7472 6169 torical_Constrai 0000020: 6e65 645f 4176 6169 6c73 0a64 6174 6520 ned_Avails.date 0000030: 7261 6e67 653a 2032 3031 342d 3130 2d32 range: 2014-10-2 – Abhishek Dec 03 '14 at 11:06
  • 0000040: 3865 6474 2d32 3031 342d 3130 2d32 3865 8edt-2014-10-28e 0000050: 6474 0a43 6f75 6e74 7279 2c4d 524d 2043 dt.Country,MRM C 0000060: 6f75 6e74 7279 2049 442c 4469 7374 7269 ountry ID,Distri 0000070: 6275 746f 722c 4d52 4d20 4469 7374 7269 butor,MRM Distri 0000080: 6275 746f 7220 4944 2c53 6572 6965 732c butor ID,Series, 0000090: 4d52 4d20 5365 7269 6573 2049 442c 5369 MRM Series ID,Si – Abhishek Dec 03 '14 at 11:06
  • Why does my file -ib command says us-ascii then? Also how do i find the characters in a file that are not supported by UTF-8 assuming file is in UTF-8 or any other character encoding – Abhishek Dec 03 '14 at 11:19
  • 1
    Please edit your question to add the portion of the output of LC_ALL=C sed -n l on your file that covers that  character. file uses simple heuristics to determine a charset, you can't trust it for that. – Stéphane Chazelas Dec 03 '14 at 11:47
  • Also, please include the output of locale in your question. My bet would be that the file is in UTF-8 already (and that's a non-breaking-space character (encoded as 0xc2 0xa0) but your locale is not (for instance it may be iso8859-1 where 0xc2 is  and 0xa0 is nbsp). – Stéphane Chazelas Dec 03 '14 at 11:53
  • sorry Stephane, what does "output of LC_ALL=C sed -n l on your file" mean? – Abhishek Dec 03 '14 at 11:59
  • @Abhishek - please move this new information into your question. Do not merely add the output as comments. It's difficult to decipher what you're showing in the above. – slm Dec 03 '14 at 12:00
  • @Abhishek - he's asking you to run this command: LC_ALL=C sed -n l somefile and post the results as an edit to your question. – slm Dec 03 '14 at 12:01
  • 2
    Please add that to your question (click on edit). If you see that  in a terminal, then it could be that your terminal is not in UTF-8 despite your locale, or it could be that the file has undergone a iso-8859-1 -> utf-8 conversion twice (we could confirm if you run LC_ALL=C sed -n l < file.csv. Either way, a iconv -f utf-8 -t l1 would probably fix it (for your terminal at least). – Stéphane Chazelas Dec 03 '14 at 12:03
  • @slm - Thanks for the learnings. I am new to forum & happy to see quick response. – Abhishek Dec 03 '14 at 12:26
  • @Abhishek - yup. Help everyone here to help you 8-). – slm Dec 03 '14 at 12:27
  • @Stéphane Chazelas - Thanks for explanation & patience. – Abhishek Dec 03 '14 at 12:28
  • BTW, my question is still open & have added further details by editing my post – Abhishek Dec 03 '14 at 12:51
  • Thanks for the hexdump. The beginning of the file really seems to be us-ascii (7bit). Can you post a hexdump of the relevant lines (where the strange characters appear)? – jofel Dec 03 '14 at 12:57
  • What system is that? Was it LC_ALL=C sed -n l your-file you really ran? Having a non-ASCII character in the output of that would be a sed bug. We still don't know what byte sequence those  are really made of. And it sounds possible you have both UTF-8 and non-UTF-8 character. Does position 1228354 correspond to those  or something else? – Stéphane Chazelas Dec 03 '14 at 13:00
  • Your hexdump confirms that's a UTF-8 encoded non-breaking-space character. Now since iconv -f utf-8 complains, there are some other byte sequences elsewhere in the file that don't form valid UTF-8 characters. What's the output of dd bs=1 skip=1228300 count=100 < file.csv | hd? (or xxd if you don't have hd). – Stéphane Chazelas Dec 03 '14 at 13:14
  • @StéphaneChazelas - Red Hat Enterprise Linux Server release 5.3 – Abhishek Dec 03 '14 at 13:17
  • @StéphaneChazelas - "NBC: Americaâs" is the junk at 1228354 – Abhishek Dec 03 '14 at 13:42
  • 에이바 - Isn't working – Abhishek Dec 03 '14 at 14:21
  • iconv -f CP1252 -t UTF-8 myfile | dos2unix > myoutfile – 에이바 Dec 03 '14 at 14:25
  • 1
    That's a valid UTF-8 character. The iconv -f utf-8 -t l1 didn't work because that character doesn't exist in l1 (aka latin1 aka iso-8859-1). So your file seems to be valid UTF-8. The problem (if any) probably is with your ETL tool. If it expects iso8859-1 encoding, you can try and get an approximation with iconv -f utf-8 -t l1//TRANSLIT – Stéphane Chazelas Dec 03 '14 at 14:33

2 Answers2

1

Issue #1: grepping "Flyers: Video Center"... I don't see the result :

In the hexadecimal dump of the file, notice the two bytes C2A0 between the words Flyers: and Video. This is a the UTF8 encoding for Non-breaking space. grepping NBSP is known to fail For more information, read How to remove special 'M-BM-' character with sed and use sed to replace ...Hex c2a0. Short answer is:

sed -i.bak -e 's/\xc2\xa0/ /' /path/to/file

Issue #2 `America’s' shows as 'Americaâs' (??):

Here, the dump contains three bytes e28099, known as RIGHT SINGLE QUOTATION MARK (’). Actually, there should be no problem here ! You probably got distracted by the problem above (could you confirm?)

If you use grep, sed and other tools with expression that respect your locale (UTF8!), then it will work:

printf 'America\xe2\x80\x99s\n' | grep --only-matching "[[:punct:]]"
printf 'America\xe2\x80\x99s\n' | sed -e "s/[[:punct:]]/?/"

If you want to get rid of all those UTF-8 "special" characters, use can use the tips above or iconv (but nowadays, there are few excuses not to support UTF8).

Drop all non-ascii chars:

type a.txt | iconv -f utf8 -t ASCII//TRANSLIT

Or to preserve chars from one locale:

type a.txt | iconv -f utf8 -t iso8859-15//TRANSLIT | iconv -f iso8859-15 -t utf8
1

Unlike XML, which should contain an encoding tag that describes the original content encoding you have to get that piece of information for csv with other means.

While type works pretty well on binary data, it does not work well on text data.

This is due to the way the command works. see man text. Roughly: It looks at the first couple of bytes and tries to guess the content looking up what was found in a lookup-table.

So usually it is best practice to ask the originator of the content which charset was used.

If that is not possible - for whatever reason, iconv is your best friend here.

Nils
  • 18,492