You appear to have a properly formatted CSV file that uses tab-characters as field delimiters. Newline characters are allowed to be embedded in fields as long as such fields are properly quoted, which the example that you show is. Any CSV parser would have no issues reading your data.
If you want to remove these newlines anyway, you may use a CSV parser such as csvkit
.
I'll be working on an example file that looks like this:
$ cat -t file.csv
col1^Icol2^Icol3
col1^I"ABC
123"^Icol3
col1^Icol2^Icol3
Each ^I
is a tab character. The second field on the second line contains two consecutive newline characters and we want to safely replace these with a single space character.
I'm using the csvjson
from csvkit
, which converts the CSV data to a JSON document. This makes it slightly easier to modify the data using jq
, which could also be used to turn the data back into CSV format:
$ csvjson -t -H file.csv | jq -r '.[] | [ .[] | values |= gsub("\n\n";" ") ] | @csv'
"col1","col2","col3"
"col1","ABC 123","col3"
"col1","col2","col3"
The csvjson
command used here converts each row of the CSV file into a JSON object. The -t
option tells the tool that the input uses tab characters as delimiter and with -H
we tell it that there are no column headers.
The JSON objects are put into an array and are read by jq
which extracts the values (the data will be assigned to keys like a
, b
, c
, etc. since the original CSV file does not have headers, or at least none that are mentioned in the question) and applies a simple substitution using gsub()
to replace each pair of consecutive newline characters with a space.
You could obviously change the regular expression used with gsub()
above to \n+
to have it replace any run of consecutive newlines with a single space character.
The @csv
operator then receives the data as a set of arrays, which are formatted for CSV output.
Would you want to change the default field delimiters from commas back to tabs, pipe the result through csvformat
with its -T
(for tab-delimited output) and -H
(no header in the CSV input) options:
$ csvjson -t -H file.csv | jq -r '.[] | [ .[] | values |= gsub("\n\n";" ") ] | @csv' | csvformat -T -H
col1 col2 col3
col1 ABC 123 col3
col1 col2 col3
csvformat
will automatically quote fields that needs quoting.
The csvformat
tool is also part of csvkit
.
As a reference, the intermediate JSON document created by csvjson
looks like this (prettified by jq
):
[
{
"a": "col1",
"b": "col2",
"c": "col3"
},
{
"a": "col1",
"b": "ABC\n\n123",
"c": "col3"
},
{
"a": "col1",
"b": "col2",
"c": "col3"
}
]
123
on line 3? Is there a tab after the$
on line two? How can we distinguish between the\n
that should be removed and those that shouldn't? – terdon Jun 07 '21 at 14:21