6

I have an upload process that reads a file and uses sqlldr to load the data into the DB. I was getting an "invalid number" error while processing the file in sqlldr. I found the file was in UTF-16 format; after converting it to UTF-8 in Notepad++, it started working fine. Now I am trying to do the conversion systematically, like below.

iconv -f UTF-16 -t UTF-8 file_name >output_file_name

The file may come in different encodings, so before converting I want to detect what encoding the file is in, and then convert based on that. Something like using the file command to read just the "UTF-16" from the output below and then passing it to the -f option.

bash-4.2$ file "/FILE_UPLOADS/Relationship (4).txt"
/FILE_UPLOADS/Relationship (4).txt: Little-endian UTF-16 Unicode text, with CRLF line terminators

How do I do that?

Pat
  • Do you have a subset of possible source encodings to consider? For example UTF8 ISO8859-1 UTF16-LE UTF16-BE – Chris Davies Sep 12 '22 at 12:05
  • Why the [ksh] tag when your prompt indicates you're using bash? – Stéphane Chazelas Sep 12 '22 at 13:42
  • UTF-16 and UTF-8 are transfer formats, not encodings. The encoding for both of them is Unicode. Are you talking about encodings or transfer formats? Note that detecting an encoding is impossible in general. – Jörg W Mittag Sep 12 '22 at 20:25
  • @JörgWMittag I'd call UTF-* different encodings for the same character set (Unicode). – Paŭlo Ebermann Sep 12 '22 at 22:36
  • Do you specifically have to use file for the detection? – Toby Speight Sep 13 '22 at 07:51
  • @Paŭlo, we have abstract characters (a character set), expressed as Unicode code-points (character encoding), which are represented in a byte stream using UTF-16 (transfer format). Does that make it clear? – Toby Speight Sep 13 '22 at 07:56
  • @TobySpeight I found that file gives me the transfer formats; I'm not aware of any other commands, so I'm using the file command – Pat Sep 14 '22 at 05:50

3 Answers

7

vim is able to automatically detect a few file encodings by itself and do conversion to UTF-8, so you could try and process your files in ex mode with:

vim --clean -E -s -c 'argdo set fileencoding=utf-8 nobomb | update' -c q -- *.txt

With update we also only rewrite files that have been modified in the process.
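
To check the result on a single file first, the same idea could look like this (minus --clean, which needs a fairly recent vim; the detection relies on the UTF-16 BOM, as with file). The path is just the one from the question, and the check with file afterwards is only a sanity check:

vim -E -s -c 'set fileencoding=utf-8 nobomb | update' -c q -- "/FILE_UPLOADS/Relationship (4).txt"
file --mime-encoding "/FILE_UPLOADS/Relationship (4).txt"   # expect utf-8 (or us-ascii if the content is plain ASCII)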

  • Thanks. I am using VIM "IMproved 7.4 (2013 Aug 10, compiled Sep 30 2020 08:08:00)" and --clean is not an option. I executed this without --clean but nothing happened; I had to kill the process after a couple of minutes. – Pat Sep 14 '22 at 05:48
  • @Pat, I don't have access to such an ancient version, but apart from --clean, I don't think the rest would require a recent version. It's likely vim is not happy about something and wants you to do something about it, but -s hides that. Remove the -s to see. – Stéphane Chazelas Sep 14 '22 at 08:17
4

You can use file -i; it returns the MIME encoding of the file.

Something like:

iconv -f "$(file -i "$file" | grep -Po 'charset=\K.*')" -t UTF-8 "$file" > "$file_converted"
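
For illustration, on a UTF-16 file like the one in the question, the pieces of that pipeline resolve roughly like this (output paraphrased):

file -i "/FILE_UPLOADS/Relationship (4).txt"
# /FILE_UPLOADS/Relationship (4).txt: text/plain; charset=utf-16le
file -i "/FILE_UPLOADS/Relationship (4).txt" | grep -Po 'charset=\K.*'
# utf-16le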

Another way is to use a more dedicated tool, for example:
https://gitlab.freedesktop.org/uchardet/uchardet
Then the command becomes even simpler:

iconv -f "$(uchardet "$file")" -t UTF-8 "$file" > "$file_converted"

But you need to install it.
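
If you want to guard against uchardet failing or reporting a name that iconv doesn't accept, a rough sketch (file and file_converted are just placeholders):

if enc=$(uchardet "$file") && iconv -f "$enc" -t UTF-8 "$file" > "$file_converted"; then
  printf '%s\n' "converted $file from $enc"
else
  printf >&2 '%s\n' "could not convert $file (detected: ${enc:-unknown})"
fi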

White Owl
1

When file says Little-endian UTF-16 Unicode text (or utf-16le with --mime-encoding), it means that the file is encoded in UTF-16 with a BOM that indicates it's little-endian.

file cannot detect UTF-16 text files (little or big endian) without BOM.

For UTF-16 text, it needs the first two bytes to be either 0xff 0xfe (little endian) or 0xfe 0xff (big endian), and then checks that the rest of the first 64KiB of data looks like text (only looking for UTF-16-encoded ASCII control characters that are not expected in text files).
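
For illustration, a quick way to see that behaviour, using two throwaway sample files made with bash's printf (the second is the same text minus the two BOM bytes):

printf '\xff\xfeh\0i\0\r\0\n\0' > with-bom.txt      # UTF-16LE "hi" + CRLF, with BOM
printf 'h\0i\0\r\0\n\0' > without-bom.txt           # same text, no BOM
file --mime-encoding with-bom.txt without-bom.txt
# with-bom.txt:    utf-16le
# without-bom.txt: binary (or similar; the point is it's not reported as utf-16le)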

For iconv, utf-16le means little-endian UTF-16 without BOM, while utf-16 means utf-16 with BOM, whether that's big or little endian.

So if you use the output of file -b --mime-encoding as the from charset in iconv, you'll end up with a UTF-8 encoded BOM in the output.
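
To see the difference, using the with-bom.txt sample from the sketch above (hex dumps via od):

iconv -f UTF-16LE -t UTF-8 with-bom.txt | od -An -tx1
# ef bb bf 68 69 0d 0a     <- starts with a UTF-8 encoded BOM
iconv -f UTF-16 -t UTF-8 with-bom.txt | od -An -tx1
# 68 69 0d 0a              <- BOM consumed, only used to determine the byte order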

Here, you'd probably want something like:

encoding=$(file -b --mime-encoding - < "$file") &&
  case $encoding in
    (utf-16[bl]e) iconv -f UTF-16 < "$file" -t UTF-8 > "$newfile";;
    (us-ascii | utf-8) ;; # already utf-8
    (*) printf >&2 '%s\n' "don't know what to do with a $encoding encoding"
  esac
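
A rough sketch of how that could be wrapped around all the uploaded files (the /FILE_UPLOADS path and the .utf8.txt suffix are just placeholders):

for file in /FILE_UPLOADS/*.txt; do
  newfile=${file%.txt}.utf8.txt
  encoding=$(file -b --mime-encoding - < "$file") || continue
  case $encoding in
    (utf-16[bl]e) iconv -f UTF-16 < "$file" -t UTF-8 > "$newfile";;
    (us-ascii | utf-8) cp -- "$file" "$newfile";;   # already fine, just copy
    (*) printf >&2 '%s\n' "don't know what to do with a $encoding encoding";;
  esac
done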

If those are Microsoft files, as the CRLF suggests, you might want to use dos2unix instead of iconv. dos2unix (at least current versions) should be able to detect UTF-16 and translate it to the locale's charset (run it as LC_ALL=C.UTF-8 dos2unix if you want the output to be UTF-8 regardless of the locale), change the CRLFs to LFs and fix other quirks in Microsoft files.
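
For the file from the question, a minimal invocation could look like this (assuming a reasonably recent dos2unix; -n writes to a new file instead of converting in place):

LC_ALL=C.UTF-8 dos2unix "/FILE_UPLOADS/Relationship (4).txt"
LC_ALL=C.UTF-8 dos2unix -n "/FILE_UPLOADS/Relationship (4).txt" "/FILE_UPLOADS/Relationship (4).utf8.txt"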

  • This one worked. But the converted file has a new blank line after every non-blank line, and the encoding is utf-8 – Pat Sep 14 '22 at 05:59
  • @Pat, iconv transcodes, it doesn't add characters. Maybe what you're seeing is the CRLFs from those files (those shouldn't show as blank lines though; what are you looking at those files with?). Best may be to use dos2unix. See edit. – Stéphane Chazelas Sep 14 '22 at 08:14
  • dos2unix is not installed on our servers, so I used tr -d '\r' and it worked. Thanks. – Pat Sep 14 '22 at 10:28
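
Putting those pieces together, the pipeline that ended up working was presumably something along these lines (a sketch; file and newfile are placeholders):

iconv -f UTF-16 -t UTF-8 "$file" | tr -d '\r' > "$newfile"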