2

We are getting new line character in one the column of a CSV file. The data for the column is coming in consecutive rows. Eg:

ID,CODE,MESSAGE,DATE,TYPE,OPER,CO_ID
12202,INT_SYS_OCS_EX_INT-0000,"""OCSSystemException: HTTP transport error: java.net.ConnectException: Tried all: '1' addresses, but could not connect over HTTP to server: '10.244.166.9', port: '8080'
 failed reasons:
  [0] address:'/10.244.166.9',port:'8080' : java.net.ConctException: Connection refused
""",06-09-2021 05:52:32,error,BillCycle,6eb8642aa4b
20840,,,06-09-2021 16:17:18,response,changeLimit,1010f9ea05ff

The issue is for column Messageand id 12202 , in which data is coming in triple quotes and in consecutive rows.

My requirement is that for the column Message, the data should come in a single row rather than multiple rows, because my etl loader fails to import an embedded newline.

  • The problem is with column 3 (Column Name : Message). The data is coming in triple quotes and in consecutive rows. For e.g. 1244,,"""Exception error: java connection error :8080 Connection refused """ Here the data is coming not in a single row but in multiple rows because of which my etl loader is not able to read data properly – mansi bajaj Sep 22 '21 at 12:04
  • sed 's/"""/"/g' not help for you? – K-attila- Sep 22 '21 at 12:25
  • Hi Katt, the above comand will replace triple quotes to single quotes only. It will not remove the space and new line characters insidw the column – mansi bajaj Sep 22 '21 at 12:28
  • 2
    The data that you show is a perfectly legal CSV file. A CSV parser would handle it with no issue. How are you trying to read this file? If you were to replace the newline characters, with what would you want to replace them? – Kusalananda Sep 22 '21 at 12:30
  • 1
    The problem is obviously, that not every software that claims to be able to import csv is really able to import each legal csv around. – Philippos Sep 22 '21 at 12:43
  • @mansibajaj please do not edit or change the sample text you first presented when first posting this question. Instead, add "ADDENDUM: Sample Text 2" at the bottom of your post. [Many (if not all) of the below answers have used that sample text to craft a solution for you. Changing it now (many days later) means that future readers will be confused]. – jubilatious1 Sep 28 '21 at 15:20
  • @jubilatious1: Sure. Will take care of this. Thankyou. – mansi bajaj Sep 29 '21 at 09:28

6 Answers6

3

If your CSV was generated by MS tools such as Excel then the "newline" in the middle of the field is a LF alone while the "newline" at the end of each record is a CRLF like this (note the LF $ mid-field vs the CRLF ^M$ at the end of the records):

$ cat -Ev file
ID,Code,Message,date^M$
1244,,"""Exception error : java connection error$
:8080 Connection refused""",01-09-2021^M$

and, if so, you can just tell GNU awk (for multi-char RS) that records end in CRLF and to replace mid-record LFs with blanks:

$ awk -v RS='\r\n' '{gsub(/\n/," ")} 1' file
ID,Code,Message,date
1244,,"""Exception error : java connection error :8080 Connection refused""",01-09-2021

If you only have a POSIX awk and no tools that work on CSVs, then see whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for how to handle them or if CR doesn't appear anywhere else in your file you can do this with any awk:

$ awk -v RS='\r' 'NR>1{print prev} {sub(/^\n/,""); gsub(/\n/," "); prev=$0}' file
ID,Code,Message,date
1244,,"""Exception error : java connection error :8080 Connection refused""",01-09-2021
Ed Morton
  • 31,617
  • Hi, Thankyou so much everyone for your reply and I really appreciate it. The above awk command seems working perfectly fine in removing new line characters. [ awk -v RS='\r\n' '{gsub(/\n/," ")} 1' filename] . However what I am seeing is after running this command a blank line is getting created at the end of the file because of which number of record count is coming as one extra. – mansi bajaj Sep 27 '21 at 12:27
  • Also, I understand that inside gsub we are replacing new line character with blank after providing record seperater . But what I don't understand is why we are providing a 1 before file name. Thanks again for your help. – mansi bajaj Sep 27 '21 at 12:32
  • The script cannot create a blank line at the end of the output as all it does is replace all \r\ns with \ns and change all other \ns to blanks so if there's an empty line at the end of your output then that's because there was an empty record at the end of your input. Please edit your question to show brief sample input/output (showing both run through cat -Ev) that demonstrate the problem if you'd like help. – Ed Morton Sep 27 '21 at 12:32
  • Regarding the 1 - see why-does-1-in-awk-print-the-current-line. To understand it better, try removing the 1 and see what happens. – Ed Morton Sep 27 '21 at 12:34
  • Hang on, what is the output of awk --version? If it's not GNU awk as I said is required to use a multi-char RS THEN you could get a blank line at the end of the output as other awks will treat RS='\r\n' as if you wrote just RS='\r'. – Ed Morton Sep 27 '21 at 12:38
  • If you don't have GNU awk (gawk) then get it as it has a ton of extremely useful extensions. If you can't get it for some reason then use the other script I provided in my answer that I said will work in any awk. – Ed Morton Sep 27 '21 at 12:46
  • -Ed Morton : Thankyou for your reply. 1. I have added the file structure in the subject description above. 2. The cat -Ev is not working as it gives error illegal option. 3. Ater running the command : awk -v RS='\r\n' '{gsub(/\n/," ")} 1' file it creates a new blank row at the end. So for e.g with the above source file it will add one blank row at the end of last record i.e. after id : 60020840 Attaching output in next comment – mansi bajaj Sep 28 '21 at 08:56
  • Also the output of awk --version : Usage: awk [-F fs][-v Assignment][-f Progfile|Program][Assignment|File] ... I tied multiple other command as well but same output – mansi bajaj Sep 28 '21 at 09:21
  • OK, then you're clearly not using GNU awk so use the command I provided that doesn't require GNU awk. I'm sorry, I don't know what I'm supposed to make of all that text in the previous comments. – Ed Morton Sep 28 '21 at 12:20
2

One simple way is to just remove newline characters on lines with only 3 fields:

$ perl -F','  -pane 's/\n// if $#F==2' file 
ID,Code,Message,date
1244,,"""Exception error : java connection error:8080 Connection refused""",01-09-2021
1245,,"""Exception error :""",01-09-2021
1246,,"ffadsdasd",01-09-2021

Of course, this assumes you can never have a , inside a field, something that is allowed in CSV files. So if your Message field contains something like """foo,bar""", it can fail. This is why it is always better to use a dedicated parser.

This approach should work for any valid CSV file:

$ perl -MText::CSV -le '$csv = Text::CSV->new({binary=>1}); while ($row = $csv->getline(STDIN)){ $row->[2]=~s/\n//; $csv->print(STDOUT,$row)}' < file
ID,Code,Message,date
1244,,"""Exception error : java connection error:8080 Connection refused""",01-09-2021
1245,,"""Exception error :""",01-09-2021
1246,,ffadsdasd,01-09-2021
terdon
  • 242,166
2

If you have csvkit utilities you can fix up lines with embedded newlines, for example by transforming the newline into the literal two characters \n:

csvformat -M $'\r' datafile |                    # temporarily end lines with $'\r'
    sed -e ':a' -e 'N;$!ba' -e 's/\n/\\n/g' |    # transform $'\n' into '\n'
    tr '\r' '\n'                                 # convert the line endings back to $'\n'

Thanks to a poster on the csvkit github for their solution, which in turn links back to an answer on StackOverflow

Chris Davies
  • 116,213
  • 16
  • 160
  • 287
1

Using Raku (formerly known as Perl_6)

raku -MText::CSV -e 'my $csv=Text::CSV.new;  .perl.put for $csv.getline_all(open($*ARGFILES, :r, :!chomp));'

Sample Input:

ID,Code,Message,date
1244,,"""Exception error : java connection error
:8080 Connection refused""",01-09-2021

Sample Output:

$["ID", "Code", "Message", "date"]
$["1244", "", "\"Exception error : java connection error\n:8080 Connection refused\"", "01-09-2021"]

You can accomplish what you hope by using a dedicated module (e.g. Text::CSV) and the Raku programming language. I've added a call to .perl so you can visualize the \n character (FYI, .raku also works). Once you get the data structure above it's simple enough to alter the code, mapping into the fields to eliminate embedded newlines:

raku -MText::CSV -e 'my $csv=Text::CSV.new;  .put for $csv.getline_all(open($*ARGFILES, :r, :!chomp)).map(*.subst("\n"," ", :g));'

Updated Output:

ID Code Message date
1244  "Exception error : java connection error :8080 Connection refused" 01-09-2021

https://modules.raku.org/dist/Text::CSV:cpan:HMBRAND
https://github.com/Tux/CSV
https://raku.org

jubilatious1
  • 3,195
  • 8
  • 17
0

As first step run through your CSV file on and convert to Unix line endings:

dos2unix your_csvfile

This changes \r\n to \n Then in the next step :

Using GNU sed we keep track of even/odd numbers of double quotes as shown here:

sed -Ee '
  h;s/[^"]*//g
  /^(..)*$/!{
    z;G;N;D
  }
  g;s/\n//g
' your_csvfile

perl -pe 's/\n/<>/e while y/"// % 2' your_csvfile

We can use the awk utility to count the number of double quotes and keep accumulating lines until they become even.

awk '{
  t = $0
  while ((gsub(/"/,"&",t) ~ /[13579]$/) && (getline nxt > 0)) 
    t = t nxt
  print t
}' your_csvfile

Output:

     1  ID,Code,Message,date$
     2  1244,,"""Exception error : java connection error:8080 Connection refused""",01-09-2021$
guest_7
  • 5,728
  • 1
  • 7
  • 13
0

Assuming the input is a Unix text file and that you want to replace literal newlines in the MESSAGE field with the string \n:

mlr --csv put '$MESSAGE = gsub($MESSAGE,"\n","\\n")' file

The output, given the CSV in the question:

ID,CODE,MESSAGE,DATE,TYPE,OPER,CO_ID
12202,INT_SYS_OCS_EX_INT-0000,"""OCSSystemException: HTTP transport error: java.net.ConnectException: Tried all: '1' addresses, but could not connect over HTTP to server: '10.244.166.9', port: '8080'\n failed reasons:\n  [0] address:'/10.244.166.9',port:'8080' : java.net.ConctException: Connection refused\n""",06-09-2021 05:52:32,error,BillCycle,6eb8642aa4b
20840,,,06-09-2021 16:17:18,response,changeLimit,1010f9ea05ff

The same thing using csvformat from csvkit to replace all record-ending newlines with the "RS" character (octal ASCII code 036, a character that does not exist in the document), then swap the remaining literal newlines with this "RS" character, and then finally replace all "RS" characters with the string \n:

csvformat -M $'\036' file | tr '\n\036' '\036\n' | sed $'s/\036/\\\\n/g'

The result is the same as shown above.

Kusalananda
  • 333,661