
I have a tab-delimited text file in which the string in the 23rd column contains \n characters, which cause the record to break onto the next line.

I opened the text file in vi, enabled the display of whitespace characters, and can see that the values in the DESCR2 field have blank lines breaking the string.

The string sits between tab delimiters, and I am trying to remove the \n characters so that the string concatenates to ABC 123 while remaining in one field.

I have tried tr -d '\n' < file.txt, but that joins all the lines into one. I want to remove only the \n\n in that column.

I also tried sed 's/\n\n//' file.txt, but it has no effect. I am able to search and replace \n\n in vi, but I can't get the same result with sed.
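To illustrate why tr is too blunt here, it deletes every occurrence of the listed character in the entire stream, not just the ones inside a particular field:

```shell
# tr -d '\n' removes every newline, so record boundaries
# disappear along with the unwanted line breaks:
printf 'a\tb\nc\td\n' | tr -d '\n'
# all four fields end up on a single line: a<TAB>bc<TAB>d
```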

Example:

\t"ABC\n
\n
123"\t

Desired output:

\t"ABC 123"\t
ntropi
    Please [edit] your question and include i) an example of your input file and ii) the output you expect from it. We need this not only to test our solutions but also to understand the problem. For example, is there a tab before the 123 on line 3? Is there a tab after the $ on line two? How can we distinguish between the \n that should be removed and those that shouldn't? – terdon Jun 07 '21 at 14:21
    CSV/TSV files are allowed to have new lines inside the records. Use a proper CSV parser. – pLumo Jun 07 '21 at 14:22
  • @terdon I tried my best to give a sample but the format nature of my sample does not render properly in the editor. The string value that I want to change has double \n\n so I am not certain that knowing which \n to remove is a concern. To answer your question, there is tab encapsulating the entire string. – ntropi Jun 07 '21 at 16:52

7 Answers


You appear to have a properly formatted CSV file that uses tab characters as field delimiters. Newline characters are allowed to be embedded in fields as long as such fields are properly quoted, as the example that you show is. Any CSV parser should have no issue reading your data.
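As a quick check (assuming python3 is available), Python's standard csv module reads such a quoted field, embedded newlines and all, as a single value:

```shell
# Build a one-row sample with a quoted, newline-containing field,
# then parse it with Python's csv module using tab as the delimiter.
printf 'col1\t"ABC\n\n123"\tcol3\n' > sample.tsv
python3 -c '
import csv
with open("sample.tsv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        print(row)
'
# prints: ['col1', 'ABC\n\n123', 'col3']
```

The three columns come back as one row; the middle value simply contains the two newline characters.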

If you want to remove these newlines anyway, you may use a CSV parser such as csvkit.

I'll be working on an example file that looks like this:

$ cat -t file.csv
col1^Icol2^Icol3
col1^I"ABC

123"^Icol3
col1^Icol2^Icol3

Each ^I is a tab character. The second field on the second line contains two consecutive newline characters and we want to safely replace these with a single space character.
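If you want to reproduce this sample file, printf can generate it; the quoting around the second field is what makes the embedded newlines legal CSV:

```shell
# Recreate the sample: three tab-delimited columns; the second
# field of the second record is quoted and spans three lines.
printf 'col1\tcol2\tcol3\ncol1\t"ABC\n\n123"\tcol3\ncol1\tcol2\tcol3\n' > file.csv
cat -t file.csv   # cat -t shows each tab as ^I, as in the listing above
```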

I'm using the csvjson from csvkit, which converts the CSV data to a JSON document. This makes it slightly easier to modify the data using jq, which could also be used to turn the data back into CSV format:

$ csvjson -t -H file.csv | jq -r '.[] | [ .[] | values |= gsub("\n\n";" ") ] | @csv'
"col1","col2","col3"
"col1","ABC 123","col3"
"col1","col2","col3"

The csvjson command used here converts each row of the CSV file into a JSON object. The -t option tells the tool that the input uses tab characters as the delimiter, and with -H we tell it that there are no column headers.

The JSON objects are put into an array and are read by jq which extracts the values (the data will be assigned to keys like a, b, c, etc. since the original CSV file does not have headers, or at least none that are mentioned in the question) and applies a simple substitution using gsub() to replace each pair of consecutive newline characters with a space.

You could obviously change the regular expression used with gsub() above to \n+ to have it replace any run of consecutive newlines with a single space character.

The @csv operator then receives the data as a set of arrays, which are formatted for CSV output.

Should you want to change the default field delimiters from commas back to tabs, pipe the result through csvformat with its -T (tab-delimited output) and -H (no header in the CSV input) options:

$ csvjson -t -H file.csv | jq -r '.[] | [ .[] | values |= gsub("\n\n";" ") ] | @csv' | csvformat -T -H
col1    col2    col3
col1    ABC 123 col3
col1    col2    col3

csvformat will automatically quote fields that need quoting.

The csvformat tool is also part of csvkit.


As a reference, the intermediate JSON document created by csvjson looks like this (prettified by jq):

[
  {
    "a": "col1",
    "b": "col2",
    "c": "col3"
  },
  {
    "a": "col1",
    "b": "ABC\n\n123",
    "c": "col3"
  },
  {
    "a": "col1",
    "b": "col2",
    "c": "col3"
  }
]
Kusalananda

GoCSV can do this.

Convert TSV to CSV and replace the newlines

I'm starting with a TSV file that looks like this, trying to mock up your data:

+--------+--------+--------+--------+--------+
| Col21  | Col22  | DESCR2 | Col24  | Col25  |
+--------+--------+--------+--------+--------+
| data21 | data22 | ABC    | data24 | data25 |
|        |        |        |        |        |
|        |        | 123    |        |        |
+--------+--------+--------+--------+--------+

The first step is to convert TSV to CSV, which is the format all GoCSV commands work on. I'm also adding a new column at the end that has the values of DESCR2 with the newlines replaced. -n is the name of the new column, -t is the Sprig template with the replace function we need (.DESCR2 | replace reads like "pipe the DESCR2 column into the replace function"):

gocsv delim   \
    -i "\t"   \
    -o ","    \
    input.tsv | 
  gocsv add                               \
    -n DESCR2_replaced                    \
    -t '{{ .DESCR2 | replace "\n" " " }}' \
  > replaced.csv

replaced.csv

+--------+--------+--------+--------+--------+-----------------+
| Col21  | Col22  | DESCR2 | Col24  | Col25  | DESCR2_replaced |
+--------+--------+--------+--------+--------+-----------------+
| data21 | data22 | ABC    | data24 | data25 | ABC  123        |
|        |        |        |        |        |                 |
|        |        | 123    |        |        |                 |
+--------+--------+--------+--------+--------+-----------------+

Swap-in new column and rename to old column

With the data normalized in the _replaced column, I'm going to "select out" the old DESCR2 and "select in" the new DESCR2_replaced in its place, then rename DESCR2_replaced back to DESCR2. In my example I only have 6 columns, so the -c column indexes differ from those in your 23+ column file(s):

gocsv select     \
    -c 1-2,6,4-5 \
    replaced.csv |
  gocsv rename    \
    -c 3          \
    -names DESCR2 \
  > final.csv

final.csv

+--------+--------+----------+--------+--------+
| Col21  | Col22  | DESCR2   | Col24  | Col25  |
+--------+--------+----------+--------+--------+
| data21 | data22 | ABC  123 | data24 | data25 |
+--------+--------+----------+--------+--------+

Convert back to TSV

gocsv delim   \
    -i ","    \
    -o "\t"   \
    final.csv \
  > final.tsv

One big pipeline

gocsv delim                                \
    -i "\t"                                \
    -o ","                                 \
    input.tsv                              \
| gocsv add                                \
    -n DESCR2_replaced                     \
    -t '{{ .DESCR2 | replace "\n" " " }}'  \
| gocsv select                             \
    -c 1-2,6,4-5                           \
| gocsv rename                             \
    -c 3                                   \
    -names DESCR2                          \
| gocsv delim                              \
    -i ","                                 \
    -o "\t"                                \
> final.tsv
Zach Young

Have you tried: sed ':a;N;$!ba;s/\n\n/ /g' file.txt?

I found this answer that details how to remove a newline using sed; the :a;N;$!ba loop pulls the whole file into the pattern space, after which the substitution can match the pair of consecutive newlines and replace it with a space.
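A minimal sketch of the N-loop approach on a fragment mocked up from the question (GNU sed):

```shell
# The :a;N;$!ba cycle accumulates every line in the pattern space,
# after which s/\n\n/ /g can see and replace the double newline.
printf 'x\t"ABC\n\n123"\ty\n' | sed ':a;N;$!ba;s/\n\n/ /g'
# output (tab-separated): x  "ABC 123"  y
```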

crow

Using Miller, and giving your fragment a proper TSV header:

$ printf '%b\n' 'foo\tbar\tbaz' '\t"ABC\n\n123"\t'
foo     bar     baz
        "ABC

123"

then

$ printf '%b\n' 'foo\tbar\tbaz' '\t"ABC\n\n123"\t' | 
    mlr --tsv put -S 'for(k,v in $*){$[k] = sub(v,"\n+"," ")}'
foo     bar     baz
        ABC 123

or (converting the output to CSV to make the field separation less ambiguous)

$ printf '%b\n' 'foo\tbar\tbaz' '\t"ABC\n\n123"\t' | 
    mlr --itsv --ocsv put -S 'for(k,v in $*){$[k] = sub(v,"\n+"," ")}'
foo,bar,baz
,ABC 123,

Miller has a -I switch for in-place modification of files.

steeldriver

A simple run of sed only holds a single line at a time in its pattern space, which is why your sed syntax does not work. A (similar, if not duplicate) question has an answer explaining how to handle multiline edits using sed here. The TL;DR is that it sucks and has painful syntax.

Similarly, tr will fail, since it translates or deletes single characters across the whole stream and cannot target only the newlines inside that one field.

The easiest way to handle multiple lines, in my opinion is to use perl instead:

perl -0777 -pe 's/\n\n/ /igs' file.txt

Here -0777 tells perl to slurp the entire file as a single string, and -pe applies the substitution to it and prints the result.

This version will do in-place edits when given -i:

perl -0777 -pe 's/\n\n/ /igs' -i file.txt

EDIT: If you intend to do the replacement on literal two-character \n sequences (as displayed) rather than on real newlines, you will need to escape the backslashes properly in your regex.

ReedGhost

sed by default uses newlines as record separators. With -z, however, it uses NUL as the separator instead, so you can easily act on \n\n:

sed -z 's/\n\n/ /g'

Note that g is needed because you now act on a pseudo-single line (the whole zero-delimited input), which may contain multiple occurrences of \n\n.
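Applied to a fragment mocked up from the question (GNU sed, which provides -z):

```shell
# With -z the whole NUL-free input is one record, so the
# embedded \n\n is visible to the substitution.
printf 'x\t"ABC\n\n123"\ty\n' | sed -z 's/\n\n/ /g'
# output (tab-separated): x  "ABC 123"  y
```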

FelixJN

I was able to achieve my goal with vi.

vi -c '%s/\n\n/\n/ | wq' sample.txt
ntropi
  • You are probably using Vim rather than Vi, right? When I tried this with Vim, it inserts a nul character, not a newline. Did you not want to replace them with a space? – Kusalananda Jun 07 '21 at 20:10
  • Yeah I meant to replace \n\n with a space – ntropi Jun 07 '21 at 20:11
  • @Kusalananda / @ntropi When you replace line-feed \n in Vim you have to replace with carriage-return \r. One of the quirks in vim. https://stackoverflow.com/a/12388814/3342816 - search differs from substitute. Again in substitute, \0 is whole match, \n is NULL and for \r one would . . . – ibuprofen Jun 24 '21 at 09:00