Is there a command line spell to drop a column in a CSV-file?

Question

Having a file of the following contents:

1111,2222,3333,4444
aaaa,bbbb,cccc,dddd

I seek to get a file equal to the original but lacking a n-th column like, for n = 2 (or may it be 3)

1111,2222,4444
aaaa,bbbb,dddd

or, for n = 0 (or may it be 1)

2222,3333,4444
bbbb,cccc,dddd

A real file can be gigabytes long having tens thousands columns.

As always in such cases, I suspect command line magicians can offer an elegant solution... :-)

In my actual real case I need to drop 2 first columns, which can be done by dropping a first column twice in a sequence, but I suppose it would be more interesting to generalise a bit.

Are the fields guaranteed to not contain ,? (I.e., , is only ever used as a field separator.) — user, Mar 21 '12 at 11:59
@MichaelKjörling, it would be nice to have a more flexible solution, but in my case - yes: the separator is , and it never occurs inside a field. — Ivan, Mar 21 '12 at 16:38

Scott McClung · Accepted Answer · 2012-03-21T16:58:36.677

58

I believe this is specific to cut from the GNU coreutils:

$ cut --complement -f 3 -d, inputfile
1111,2222,4444
aaaa,bbbb,dddd

Normally you specify the fields you want via -f, but by adding --complement you reverse the meaning, naturally. From 'man cut':

--complement
    complement the set of selected bytes, characters or fields

One caveat: if any of the columns contain a comma, it will throw cut off, because cut isn't a CSV parser in the same way that a spreadsheet is. Many parsers have different ideas about how to handle escaping commas in CSV. For the simple CSV case, on the command line, cut is still the way to go.

edited Mar 21 '12 at 16:58

answered Mar 21 '12 at 05:08

Scott McClung

1,695

5

That works fine as long as it's a simple CSV file. If any of the columns is a string with a comma in it, it'll throw cut off because it's not a CSV parser. If a CSV field has a field separator in its value, it's wrapped in quotes. Btw, on the subject of cut, -f takes field ranges. cut -f, -d3- will output the third field on, removing the first two. – Alexios Mar 21 '12 at 11:37
3

You mean cut -d, -f3- – Useless Mar 21 '12 at 11:54
@Alexios that's a good point. I never really deal with "real" CSV, only the simple subset. I will edit my answer to reflect that. – Scott McClung Mar 21 '12 at 16:46
@Useless: Damn, yes. That's what I call my ‘cut dyslexia’ striking again. sigh. Scott: CSV files are tricky beasts. Far too many different sub-formats, some of which aren't even CSV, but are conventionally called that anyway. – Alexios Mar 21 '12 at 20:41
This prints the new CSV to my terminal - how do I get it to overwrite the input (or perhaps write to a new file, seems like OP was looking for either)? – Max Ghenis Oct 22 '14 at 16:33
@MaxGhenis: I think a shell redirect is your only option with cut. However, you would probably want to write your output to a temporary file, and not attempt to directly overwrite it. – Scott McClung Oct 27 '14 at 02:20
It gives me an error "cut: illegal option -- -" – Nic Scozzaro Oct 28 '20 at 17:26

score 13 · Answer 2 · edited Apr 13 '17 at 12:36

If the data is simply made of comma-separated columns:

cut -d , -f 1-2,4-

You can also use awk, but it's a bit awkward because while clearing a field is easy, removing the separator takes some work. If you have no empty field, it's not too bad:

awk -F , 'BEGIN {OFS=FS}  {$3=""; sub(",,", ","); print}'

If you have actual CSV, where commas can appear inside fields if properly quoted, you need a real CSV library.

score 3 · Answer 3 · answered Nov 23 '22 at 11:40

Using a CSV-aware tool to remove the two first columns from a header-less CSV input file:

$ cat file
1111,2222,3333,4444
aaaa,bbbb,cccc,dddd

$ mlr --csv -N cut -x -f 1,2 file
3333,4444
cccc,dddd

The -x option to the cut operation in Miller (mlr) causes the operation to exclude the named fields (in this case, fields number 1 and 2). Had the CSV data had headers, we would have been able to use named fields with -f (the -N option would also need to be dropped in this scenario).

Since Miller is CSV-aware, it copes with properly quoted fields containing embedded commas, quotes, and newlines.

score 1 · Answer 4 · answered Nov 05 '20 at 18:06

1

To remove the 3rd column and save a new file, you can do:

cut -d , -f 1-2,4- > output_file.csv

Note that 1-2,4- means "keep columns 1 to 2, and 4 to the end".

answered Nov 05 '20 at 18:06

Nic Scozzaro

111

score -1 · Answer 5 · answered May 19 '20 at 04:07

Try the command below to drop columns using index.

dropColumnCSV --index=0 --file=file.csv

This would work if the columns are separated by comma, as sed commands are used inside the function to remove strings.

dropColumnCSV() {
  # argument check
  while [ $# -gt 0 ]; do
    case "$1" in
      --index=*)
        index="${1#*=}"
        ;;
      --file=*)
        file="${1#*=}"
        ;;
      *)
        printf "* Error: Invalid argument. *\n"
        return
    esac
    shift
  done

  # file check
  if [ ! -f $file ]; then
        printf "* Error: $file not found.*\n"
        return
  fi

  # sed remove command index zero
  if [[ $index == 0 ]]; then
    sed -i 's/\([^,]*\),\(.*\)/\2/' $file

  # sed remove command index greater than zero
  elif [[ $index > 0 ]]; then
    pos_str=$(for i in {1..$(seq "$index")}; do echo -n '[^,]*',; done| sed 's/,$//') ;
    sed -i 's/^\('$pos_str'\),[^,]*/\1/' $file
  fi
}

Is there a command line spell to drop a column in a CSV-file?

5 Answers5

Linked