2

I have a large (2 Million rows) csv file exported from a SQL Server database. I don't have access to the database, and there's some newline character within a column, which makes it difficult to process in R.

Sample data like this:

playerID,yearID,gameNum,gameID,teamID,lgID,GP,startingPos
gomezle01,1933,0,ALS193307060,NYA,AL,1,1
ferreri01,1933,0,ALS193307060,BOS,AL,1,2
gehrilo01,1933,0
,ALS193307060,NYA,AL,1,3
gehrich01,1933,0,ALS193307060,DET,AL,1,4
dykesji01,1933,0,ALS193307060,CHA,AL,1,5
cronijo01,1933,0,ALS193307060
,WS1,AL,1,6
chapmbe01,1933,0,ALS193307060,NYA,AL,1,7
simmoal01,1933,0,ALS193307060,CHA,AL,1,8
ruthba01,1933,0,ALS193307060,NYA,AL,1,9

In above sample data, some lines breaks into two lines. How to clean it up?

Update

  1. The csv file was generated on Microsoft platform. So the line ending is ^m. I made up the sample data on Linux - ended in \n. It's my bad, I didn't make it clear. But I guess I can substitute \n with \r\n.
  2. The line break doesn't always appear before a comma, it appears randomly even within a field like

.

Bill 
Gates.

Solved

Step 1: Remove ^M in the middle of line:

perl -pe 's/\r(?!\n)//g'

Ref: https://stackoverflow.com/questions/6081465/how-to-remove-carriage-returns-in-the-middle-of-a-line

Step 2: Substitute \n, with , (See @jimmij 's answer below.)

perl -p00e 's/\n,/,/g' 
Nick
  • 1,127

2 Answers2

4

Let's take this as the test file:

$ cat file
playerID,yearID,gameNum,gameID,teamID,lgID,GP,startingPos
gomezle01,1933,0,ALS193307060,NYA,AL,1,1
ferreri01,1933,0,ALS193307060,BOS,AL,1,2
gehrilo01,1933,0
,ALS193307060,NYA,AL,1,3
gehrich01,1933,0,ALS193307060,DET,AL,1,4
dykesji01,1933,0,ALS193307060,CHA,AL,1,5
cronijo01,1933,0,ALS193307060
,WS1,AL,1,6
chapmbe01,1933,0,ALS193307060,NYA,AL,1,7
simmoal01,1933,0,ALS193307060,CHA,AL,1,8
ruthba01,1933,0,ALS193307060,NYA,AL,1,9
Bill
Gates,1933,0,ALS193307060,NYA,AL,1,9

This puts the lines back together:

$ awk 'NR==1{printf "%s",$0; gsub(/[^,]/,""); nlast=n=length($0); next;} nlast==n{printf "\n";nlast=0} {printf "%s",$0; gsub(/[^,]/,""); nlast+=length($0)} END{print ""}' file
playerID,yearID,gameNum,gameID,teamID,lgID,GP,startingPos
gomezle01,1933,0,ALS193307060,NYA,AL,1,1
ferreri01,1933,0,ALS193307060,BOS,AL,1,2
gehrilo01,1933,0,ALS193307060,NYA,AL,1,3
gehrich01,1933,0,ALS193307060,DET,AL,1,4
dykesji01,1933,0,ALS193307060,CHA,AL,1,5
cronijo01,1933,0,ALS193307060,WS1,AL,1,6
chapmbe01,1933,0,ALS193307060,NYA,AL,1,7
simmoal01,1933,0,ALS193307060,CHA,AL,1,8
ruthba01,1933,0,ALS193307060,NYA,AL,1,9
BillGates,1933,0,ALS193307060,NYA,AL,1,9

The line-ending requirements are not clear to me. We could add code to handle it here. Or, for greatest flexibility, you can run the file through dos2unix or unix2dos as needed.

Answer for first version of question

I have a large (2 Million rows) csv file

This is a line-oriented solution that does not require reading the entire file into memory at once:

$ awk 'NR>1 && !/^,/{printf "\n";} {printf "%s",$0} END{print ""}' file
playerID,yearID,gameNum,gameID,teamID,lgID,GP,startingPos
gomezle01,1933,0,ALS193307060,NYA,AL,1,1
ferreri01,1933,0,ALS193307060,BOS,AL,1,2
gehrilo01,1933,0,ALS193307060,NYA,AL,1,3
gehrich01,1933,0,ALS193307060,DET,AL,1,4
dykesji01,1933,0,ALS193307060,CHA,AL,1,5
cronijo01,1933,0,ALS193307060,WS1,AL,1,6
chapmbe01,1933,0,ALS193307060,NYA,AL,1,7
simmoal01,1933,0,ALS193307060,CHA,AL,1,8
ruthba01,1933,0,ALS193307060,NYA,AL,1,9

How it works

  • NR>1 && !/^,/{printf "\n";}

    If we are not on the first line, NR>1, and the current line does not begin with a comma, !/^,/, print a newline.

  • {printf "%s",$0}

    Print the current line without a newline.

  • END{print ""}

    When we reach the end of the file, print one more newline to terminate the last line.

John1024
  • 74,655
1

perl to the rescue:

$ perl -p00e 's/\n,/,/g' file
playerID,yearID,gameNum,gameID,teamID,lgID,GP,startingPos
gomezle01,1933,0,ALS193307060,NYA,AL,1,1
ferreri01,1933,0,ALS193307060,BOS,AL,1,2
gehrilo01,1933,0,ALS193307060,NYA,AL,1,3
gehrich01,1933,0,ALS193307060,DET,AL,1,4
dykesji01,1933,0,ALS193307060,CHA,AL,1,5
cronijo01,1933,0,ALS193307060,WS1,AL,1,6
chapmbe01,1933,0,ALS193307060,NYA,AL,1,7
simmoal01,1933,0,ALS193307060,CHA,AL,1,8
ruthba01,1933,0,ALS193307060,NYA,AL,1,9

I assumed here, that the split is always just before comma ,.

jimmij
  • 47,140
  • Thank you very much. The newline character was generated on Microsoft platform, so it should be carriage return. ^M or \r\n. I didn't make it clear, my bad. (I made the data sample on Linux.) No. The split is not always before comma, it even appears within a field like Bill \r\n Gates. I know this is really stupid. Data cleaning is always frustrated. – Nick Aug 09 '15 at 01:26
  • The problem solved by first remove all the ^M in the middle of the line, and then using your perl code. The 00 option is not so clear, can you explain it briefly? What does a paragraph mean? – Nick Aug 09 '15 at 02:37
  • @Nick You can read about 0 nuances in http://unix.stackexchange.com/questions/192485/how-exactly-does-perls-0-option-work . In fact in this case single 0 should be enough. – jimmij Aug 09 '15 at 10:21