I have a large (2 Million rows) csv
file exported from a SQL Server
database. I don't have access to the database, and there's some newline character within a column, which makes it difficult to process in R
.
Sample data like this:
playerID,yearID,gameNum,gameID,teamID,lgID,GP,startingPos
gomezle01,1933,0,ALS193307060,NYA,AL,1,1
ferreri01,1933,0,ALS193307060,BOS,AL,1,2
gehrilo01,1933,0
,ALS193307060,NYA,AL,1,3
gehrich01,1933,0,ALS193307060,DET,AL,1,4
dykesji01,1933,0,ALS193307060,CHA,AL,1,5
cronijo01,1933,0,ALS193307060
,WS1,AL,1,6
chapmbe01,1933,0,ALS193307060,NYA,AL,1,7
simmoal01,1933,0,ALS193307060,CHA,AL,1,8
ruthba01,1933,0,ALS193307060,NYA,AL,1,9
In above sample data, some lines breaks into two lines. How to clean it up?
Update
- The csv file was generated on Microsoft platform. So the line ending is
^m
. I made up the sample data on Linux - ended in\n
. It's my bad, I didn't make it clear. But I guess I can substitute\n
with\r\n
. - The line break doesn't always appear before a comma, it appears randomly even within a field like
.
Bill
Gates.
Solved
Step 1: Remove ^M
in the middle of line:
perl -pe 's/\r(?!\n)//g'
Ref: https://stackoverflow.com/questions/6081465/how-to-remove-carriage-returns-in-the-middle-of-a-line
Step 2: Substitute \n,
with ,
(See @jimmij 's answer below.)
perl -p00e 's/\n,/,/g'
^M
or\r\n
. I didn't make it clear, my bad. (I made the data sample on Linux.) No. The split is not always before comma, it even appears within a field likeBill \r\n Gates
. I know this is really stupid. Data cleaning is always frustrated. – Nick Aug 09 '15 at 01:26^M
in the middle of the line, and then using your perl code. The00
option is not so clear, can you explain it briefly? What does a paragraph mean? – Nick Aug 09 '15 at 02:370
nuances in http://unix.stackexchange.com/questions/192485/how-exactly-does-perls-0-option-work . In fact in this case single0
should be enough. – jimmij Aug 09 '15 at 10:21