A pipe (|) delimited text file is transferred from a Windows application for processing. During processing, a special character shows up in the first column of the first line of the file. This is how the file looks in Notepad before it is transferred from Windows:

Sector|Name|Manager|...

When I read it with IFS='|' read -r -a fields < "/uploads/file_data.txt", the first column is read as "Sector" with special characters prefixed.

When I run head -1 "/uploads/file_data.txt" | od -c, the value printed is:

0000000 357 273 277   S   e   c   t   o   r   |

I tried tr -d < //uploads/file_data.txt > /uploads/file_data_temp.txt, but it didn't help. How do I remove these special characters, and not only these but any unknown characters that turn up in files uploaded in the future?

Pat

1 Answer

You probably have a BOM (byte order mark), which Unicode-aware systems put at the start of a text stream to signal its byte order ("little-endian" or "big-endian").

see https://en.wikipedia.org/wiki/Byte_order_mark

Thankfully, that one seems to be the UTF-8 BOM, which is a good thing if you expect only ASCII characters (octal 001-177), since UTF-8 encodes those as the same single bytes.
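
If you want to confirm that a file really starts with the UTF-8 BOM before changing it, one option is to dump its first three bytes in hex. The following is only a sketch: the sample file is a hypothetical temporary file created for the demonstration, and it assumes a head that supports -c.

```shell
# Hypothetical sample file, created here only for illustration.
sample=$(mktemp)
printf '\357\273\277Sector|Name\n' > "$sample"

# Dump the first three bytes as hex and squeeze out the whitespace;
# a UTF-8 BOM shows up as "efbbbf" (octal 357 273 277).
first3=$(head -c 3 "$sample" | od -An -tx1 | tr -d ' \n')

if [ "$first3" = "efbbbf" ]; then
    echo "UTF-8 BOM found"
else
    echo "no UTF-8 BOM"
fi

rm -f "$sample"
```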

You could take it out by interposing a sed that has been forced to use (temporarily) the C locale in order to "see" this:

LC_ALL=C sed '1s/^\xEF\xBB\xBF//' 

used for example as :

incoming program | LC_ALL=C sed '1s/^\xEF\xBB\xBF//' | somecmd
 # or
< incomingfile LC_ALL=C sed '1s/^\xEF\xBB\xBF//' > outputfile
  #  <incomingfile  : will give "incomingfile" content as stdin to sed 
  # then sed modifies only the first line, replacing the BOM with ""
  #    (the rest is not touched by sed and is transmitted as-is)
  #  > outputfile : directs sed output (ie, incomingfile without the BOM) to "outputfile"
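
To try the whole round trip without touching real data, here is a small sketch under a few assumptions: the file names and sample rows are made up, and the BOM bytes are built with printf octal escapes instead of the \xEF\xBB\xBF escapes shown above, because not every sed implementation supports \x escapes. The idea is otherwise the same: delete the three BOM bytes from line 1 only.

```shell
# Hypothetical temporary files; names and contents are for illustration only.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
infile="$tmpdir/file_data.txt"
outfile="$tmpdir/file_data_nobom.txt"

# Sample input that starts with the UTF-8 BOM (octal 357 273 277).
printf '\357\273\277Sector|Name|Manager\nIT|Alice|Bob\n' > "$infile"

# Build the BOM as raw bytes so the sed pattern does not rely on \x escapes.
bom=$(printf '\357\273\277')

# Remove the BOM from the first line only; the C locale makes sed work on bytes.
LC_ALL=C sed "1s/^${bom}//" < "$infile" > "$outfile"

# Read the header line, splitting on "|".
IFS='|' read -r sector name manager < "$outfile"
printf 'first column: %s\n' "$sector"

# Show the first bytes of the cleaned file to check that the BOM is gone.
head -c 6 "$outfile" | od -c
```

The same read with -a fields from the question works on the cleaned file in bash; plain variables are used here only so the sketch also runs in a POSIX sh.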
  • it is the byte order mark. The octal values 357 273 277 are ef bb bf in hex, that's the UTF-8 encoding of U+FEFF. And their terminal is set incorrectly for UTF-8. – ilkkachu Oct 29 '21 at 15:15
  • @olivier-dulac thanks for answering. I tried this /UPLOADS/File_upload_10-29.txt LC_ALL=C sed '1s/^\xEF\xBB\xBF//' > /UPLOADS/File_FINAL.txt but it gave me each column names and then command not found. line 1: Sector: command not found – Pat Oct 29 '21 at 15:23
  • @Pat try LC_ALL=C sed '1s/^\xEF\xBB\xBF//' /UPLOADS/File_upload_10-29.txt. – terdon Oct 29 '21 at 16:07
  • @Pat you forgot the "<" before /UPLOADS/.... , so it tried to execute that file instead of using it as stdin... – Olivier Dulac Oct 29 '21 at 16:39
  • Works perfectly after correcting the syntax.. Thank you both. – Pat Oct 29 '21 at 17:51