cat vs. view and execution of file containing diacritics uploaded from Windows to Linux with WinSCP

Question

I have a DE_CopyOldToNew.sh file that was created in Windows and then uploaded to Linux using WinSCP. The file contains a whole bunch of cp commands that copy files to a new folder under new filenames. The commands reference folders and files with diacritics, like Gewährleistungsbürgschaft.

When I do a cat DE_CopyOldToNew.sh, the diacritics are displayed in a "corrupted" way, like Gew▒hrleistungsb▒rgschaft. When I do a view DE_CopyOldToNew.sh, the diacritics are displayed as they should be, like Gewährleistungsbürgschaft. When I execute the script I get cp: cannot stat errors, and the diacritics in the folder and file names are displayed as Gew\344hrleistungsb\374rgschaft.

I have uploaded the file in both binary and text mode, and I have also run dos2unix DE_CopyOldToNew.sh. When I copy the content of my script in Windows and paste it into a new file in Linux, I am able to run the new script without issues.

What is causing the uploaded version to be "corrupted" (for lack of a better word)?

Which editor are you using? It's probably saving as latin1 / iso8859 / cp1252 instead of utf8. When saving as UTF-8, if there is an option to save with or without Byte-Order-Mark, pick UTF-8 without BOM. You can also use iconv to convert charsets. file might display the charset (but it's a guess). — frostschutz, Mar 07 '24 at 09:39

Chris Davies · Accepted Answer · 2024-03-07T12:47:53.397


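Your file is written in a single-byte encoding of the ISO-8859 family (probably Windows CP1252 or ISO-8859-15), whereas your Linux-based system is set up to expect UTF-8. The \344 and \374 in the cp error messages are the octal values of the bytes 0xE4 and 0xFC, which are ä and ü in those encodings but are not valid on their own in UTF-8.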

You can verify this easily enough:

# Original text
printf 'Gew\344hrleistungsb\374rgschaft\n'
Gew�hrleistungsb�rgschaft

# What character set
printf 'Gew\344hrleistungsb\374rgschaft\n' | file -
/dev/stdin: ISO-8859 text

# Transcoded text
printf 'Gew\344hrleistungsb\374rgschaft\n' | iconv -f iso-8859-15 -t utf-8
Gewährleistungsbürgschaft

# What character set
printf 'Gew\344hrleistungsb\374rgschaft\n' | iconv -f iso-8859-15 -t utf-8 | file -
/dev/stdin: UTF-8 Unicode text
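
Since file only guesses, a stricter check may be useful: asking iconv to decode the data as UTF-8 fails on the first invalid byte sequence. The following is a minimal sketch, assuming an iconv that exits non-zero on invalid input (glibc's does); the helper name is_utf8 and the throwaway file names are made up for illustration.

```shell
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f utf-8 -t utf-8 "$1" >/dev/null 2>&1
}

# Demo on two throwaway files holding the same word in two encodings.
tmpdir=$(mktemp -d)
printf 'Gew\344hrleistungsb\374rgschaft\n' >"$tmpdir/latin.txt"          # single-byte ä and ü
printf 'Gew\303\244hrleistungsb\303\274rgschaft\n' >"$tmpdir/utf8.txt"   # UTF-8 ä and ü

for f in "$tmpdir/latin.txt" "$tmpdir/utf8.txt"; do
    if is_utf8 "$f"; then
        echo "${f##*/}: valid UTF-8"
    else
        echo "${f##*/}: not valid UTF-8"
    fi
done

rm -rf "$tmpdir"
```

Note that this only tells you the file is not UTF-8. It cannot tell you which single-byte encoding it actually is, so the source encoding passed to iconv -f is still a judgment call.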

Solutions?

Create your file as UTF-8 on the source system (Windows applications support this character set)
Downgrade your Linux-based system back to ISO-8859. Not recommended (but possible)

Convert the file once it's been transferred:

iconv -f iso-8859-15 -t utf-8 DE_CopyOldToNew.sh >DE_CopyOldToNew.sh.tmp &&
    mv -f DE_CopyOldToNew.sh.tmp DE_CopyOldToNew.sh
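
If the conversion might be run more than once, it could be wrapped in a guard, because converting a file that is already UTF-8 would double-encode the diacritics. This is only a sketch under some assumptions: the helper name to_utf8 is hypothetical, the default source encoding is iso-8859-15 as above (cp1252 may be the better guess for a file written on Windows), and iconv is assumed to fail on invalid UTF-8 input.

```shell
# Hypothetical helper: convert FILE to UTF-8 in place unless it already is UTF-8.
# Usage: to_utf8 FILE [SOURCE_ENCODING]
# SOURCE_ENCODING defaults to iso-8859-15, which is an assumption about the file.
to_utf8() {
    f=$1
    from=${2:-iso-8859-15}

    # A file that already decodes as UTF-8 is left alone, so a second
    # run does not double-encode the diacritics.
    if iconv -f utf-8 -t utf-8 "$f" >/dev/null 2>&1; then
        return 0
    fi

    tmp=$(mktemp) || return 1
    if iconv -f "$from" -t utf-8 "$f" >"$tmp"; then
        # Overwrite the contents instead of using mv, so the original
        # file keeps its permissions (for example the executable bit).
        cat "$tmp" >"$f"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Demo on a throwaway file containing one cp command.
demo=$(mktemp)
printf 'cp "Gew\344hrleistungsb\374rgschaft.txt" new.txt\n' >"$demo"
to_utf8 "$demo"
to_utf8 "$demo"    # second call finds valid UTF-8 and changes nothing
cat "$demo"
rm -f "$demo"
```

For the script in the question the call would be to_utf8 DE_CopyOldToNew.sh, optionally followed by cp1252 as the second argument.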

edited Mar 07 '24 at 12:47

answered Mar 07 '24 at 12:29

Chris Davies


Thanks for the response @Chris Davies. I have recreated my file in Windows in UTF-8 and uploaded it again. Running file DE_CopyOldToNew.sh now reports: DE_CopyOldToNew.sh: UTF-8 Unicode (with BOM) text, with very long lines, with CR line terminators. When doing a cat DE_CopyOldToNew.sh, the copy commands now each start on a new line. When doing a view DE_CopyOldToNew.sh, the commands are all wrapped and delimited with a ^M, like <copy command 1>^M<copy command 2>. When executing the script I am getting ./DE_CopyOldToNew.sh: line 1: cp: command not found – Rico Strydom Mar 11 '24 at 06:45
Your file is still in the wrong format. Can you create it on the target system instead? So much easier. If not, then see "How can I remove the BOM from a UTF-8 file?" and "txt File from Mac not converting properly". – Chris Davies Mar 11 '24 at 07:47
In my VBA code I have removed the UTFStream.LineSeparator = 10 line of code and in Linux I performed a dos2unix DE_CopyOldToNew.sh. This solved the problem. – Rico Strydom Mar 11 '24 at 09:09
If you put the line separator code back in, it'll probably work without needing dos2unix – Chris Davies Mar 11 '24 at 10:50
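
The two problems from these comments (a UTF-8 BOM and CR line terminators) can also be cleaned up on the Linux side without dos2unix. The BOM is most likely what broke line 1: the shell reads its three bytes as part of the command name, so it looks for a command that is not plain cp. What follows is a rough sketch, not a definitive tool; the helper name clean_script is made up, and it assumes a sed that copes with very long lines (GNU sed does).

```shell
# Hypothetical helper: read a script on stdin, write a cleaned copy to stdout.
clean_script() {
    bom=$(printf '\357\273\277')   # UTF-8 byte-order mark: EF BB BF
    cr=$(printf '\r')
    # 1. Drop a BOM at the very start of the file.
    # 2. Drop a CR at the end of each line (CRLF becomes LF).
    # 3. Turn any remaining lone CR into LF (CR-only files).
    LC_ALL=C sed -e "1s/^$bom//" -e "s/$cr\$//" | tr '\r' '\n'
}

# Demo: a BOM plus CRLF line endings, as a Windows editor might write them.
printf '\357\273\277cp a b\r\ncp c d\r\n' | clean_script
```

For the file in question this would be something like clean_script <DE_CopyOldToNew.sh >DE_CopyOldToNew.sh.tmp, followed by the same mv step as in the answer.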
