2

I need to convert CSV to TSV in bash. I found this solution and it works well, but not for all datasets as I will show below.

For example, for a.txt:

a,"test, part2 ""the start""",b

sed format it badly:

[ nir ]$ cat a.txt | sed -E 's/("([^"]*)")?,/\2\t/g' 
a    "test    Op. 15 ""the start"    b
#^ tab....^ tab..................^ tab

Issues here: missing ,, extra tab, extra quotes.

Just to reference even python code format it badly:

[ nir ]$ cat a.txt | csv2tsv.py
a    "test, part2 ""the start"""    b
#^ tab..........................^ tab

Issues here: extra quotes.

csv2tsv.py is: csv.writer(sys.stdout, dialect='excel-tab').writerows(csv.reader(sys.stdin))

Where the true convert should looks like:

a    test, part2 "the start"    b
#^ tab......................^ tab

Would love to get some feedback how to solve this in bash. I went over many solution in the internet but non managed to handle those quotes on quotes inside quotes :)

muru
  • 72,889
Nir
  • 1,335

3 Answers3

4

bash 5.1 comes with a loadable CSV module

BASH_LOADABLES_PATH=${BASH/\/bin\//\/lib\/}
enable -f csv csv
csv -a fields "$line"
new_line=$(IFS=$'\t'; echo "${fields[*]}")
declare -p line fields new_line

outputs

declare -- line="a,\"test, part2 \"\"the start\"\"\",b"
declare -a fields=([0]="a" [1]="test, part2 \"the start\"" [2]="b")
declare -- new_line="a  test, part2 \"the start\"   b"
#.....................^ tab......................^ tab

This is not valid is there's a field containing a tab.


In a pipeline:

IFS=$'\t'
cat file |
while IFS= read -r line; do
    csv -a fields "$line"
    echo "${fields[*]}"
done |
tail

Although this is more idiomatic bash

IFS=$'\t'
while IFS= read -r line; do
    csv -a fields "$line"
    echo "${fields[*]}"
done < file | tail
glenn jackman
  • 85,964
  • Can you please show how to use it as part of pipe? for example cat a.txt | <convert> | tail – Nir Sep 06 '21 at 05:19
  • Note that few systems ship those loadable bash builtins by default as they're more like example code (and not much tested). That BASH_LOADABLES_PATH=${BASH/\/bin\//\/lib\/} will likely not work if bash is found in /bin (which it is most of the time). – Stéphane Chazelas Sep 06 '21 at 07:46
  • That csv builtin was added in 5.1. It was not there in 5.0 – Stéphane Chazelas Sep 06 '21 at 07:50
4

With mlr:

mlr -N --icsv --otsvlite cat < file.csv > file.tsv

Or:

mlr -N --c2t --quote-none cat < file.csv > file.tsv

But note that if a csv field contains a tab character, it will end up not escaped in the output and therefore introducing an extra field.

With GNU sed, you could do the same with:

sed -E '
  # append next line as long as there is not an even number
  # of "s, to handle fields with newline. You can omit this line
  # if the fields are guaranteed not to contain newlines:
  :1; /^([^"]*"[^"]*")*[^"]*$/! {N;b1}

s/$/,/ s/(([^,"])|"((""|[^"]))"),/\2\3\t/g s/\t$// s/""/"/g' < file.csv > file.tsv

That assumes the input is valid text in the current locale. Start sed as LC_ALL=C sed... to disable localisation and treat the input as if it was binary input to avoid decoding issues (and likely speed things up if speed is a concern)

0

Or use csvformat from the csvkit -- this tool takes care of quoting any field containing the delimiter: I added a line to the input file that has tabs in it.

$ cat a.txt
a,"test, part2 ""the start""",b
c,d,e   with    tabs

$ csvformat -D $'\t' a.txt a "test, part2 ""the start""" b c d "e with tabs"

glenn jackman
  • 85,964
  • The " should not be in the result – Nir Sep 09 '21 at 09:20
  • 1
    The second line needs quotes to contain the inner delimiters. For the first line, the standard says "If fields are not enclosed with double quotes, then double quotes may not appear inside the fields." But the standard is merely advisory: if you don't want to follow it, then don't, but common CSV tools (like csvkit) may not meet your requirements – glenn jackman Sep 09 '21 at 13:14