
I was running a little bash script on some text files:

find . -name "*.dat" -exec iconv --from-code='UTF-8' --to-code='ASCII//TRANSLIT' {} --output={} \;

My machine runs Ubuntu 14.04 LTS.

After a while, I found that about half of the data in some files had disappeared: the content was simply cut off in the middle of a line or word. From what I have heard, this is the mysterious signal 7 (bus error) with a core dump. The problem seems to occur only when the files are too large: some of my files were larger than 60 kB, but iconv left them at around 30 kB.

What can I do about this? Is this a bug? Is there a workaround? Is there any other convenient way to transliterate diacritics?

MERose
  • The man page doesn't say what happens when the input file is the same as the output file, and it looks like nothing special is done. So when iconv truncates the file in preparation for writing the first block, any reads after that may be zero length. – Mark Plotnick Dec 26 '14 at 15:05
  • Yes, it seems that you are right. However, the problem only occurs for files larger than 30 kB. Smaller files are handled correctly. Also, the text that remains in the truncated files is converted correctly. – MERose Dec 26 '14 at 15:19
  • I ran strace on it. It looks like iconv starts to write to the output file when it has amassed 32768 bytes. – Mark Plotnick Dec 26 '14 at 15:30
  • If I run iconv on a single file larger than 32 kB (and overwrite the original file), I get the "Bus error" message, with exit code 135. – Guillaume Husta Feb 04 '22 at 10:58

1 Answer


As pointed out in the comments to my question, the problem occurs when two conditions are met:

  1. Source and target file are the same.
  2. File is larger than 32768 bytes.

There are two solutions: either write the output to a temporary file that then replaces the source file, or use recode.

For the first solution, see for example https://unix.stackexchange.com/a/10243/94483. sponge can take care of the temporary storage for you: there is a very good question about it on SO (https://stackoverflow.com/q/64860/362146) and also an answer here: https://unix.stackexchange.com/a/19980/94483
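If sponge is not available, the temporary-file approach can be sketched by hand roughly as follows. This is only a minimal sketch: the function name translit_in_place is made up for illustration, and it assumes a mktemp that accepts a template argument. The temporary file is created next to the source so that mv does not have to cross filesystems.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not part of iconv):
# convert one file to ASCII via a temporary file, then replace the original.
translit_in_place() {
  src=$1
  # Create the temporary file beside the source file.
  tmp=$(mktemp "${src}.XXXXXX") || return 1
  if iconv --from-code='UTF-8' --to-code='ASCII//TRANSLIT' "$src" > "$tmp"; then
    # Replace the original only after iconv has finished successfully.
    mv -- "$tmp" "$src"
  else
    # Leave the original untouched if the conversion failed.
    rm -f -- "$tmp"
    return 1
  fi
}
```

It could then be called once per file, for example translit_in_place ./some.dat. Note that the replaced file takes on the permissions of the temporary file, so they may need to be restored afterwards.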

I will stick with iconv, since recode supports fewer character sets (and I also failed to get it to run):

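# Read NUL-separated names so that paths containing spaces or newlines survive (requires bash).
find . -type f -name "*.dat" -print0 |
while IFS= read -r -d '' file
do
  # sponge collects all of iconv's output before it overwrites "$file".
  iconv --from-code='UTF-8' --to-code='ASCII//TRANSLIT' "$file" | sponge "$file"
done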

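sponge, which is part of moreutils, does the replacing: it soaks up all of its input before it writes the output file, so iconv has finished reading by the time the file is overwritten.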

MERose