1

Suppose I have a crazy data set in CSV:

SOS_VOTERID,COUNTY_NUMBER,COUNTY_ID,LAST_NAME,FIRST_NAME,MIDDLE_NAME,SUFFIX,DATE_OF_BIRTH,REGISTRATION_DATE,PARTY_AFFILIATION,RESIDENTIAL_ADDRESS1,RESIDENTIAL_SECONDARY_ADDR,city,RESIDENTIAL_STATE,zip,RESIDENTIAL_ZIP_PLUS4,RESIDENTIAL_COUNTRY,RESIDENTIAL_POSTALCODE,MAILING_ADDRESS1,MAILING_SECONDARY_ADDRESS,MAILING_CITY,MAILING_STATE,MAILING_ZIP,MAILING_ZIP_PLUS4,MAILING_COUNTRY,MAILING_POSTAL_CODE,CAREER_CENTER,CITY,CITY_SCHOOL_DISTRICT,COUNTY_COURT_DISTRICT,CONGRESSIONAL_DISTRICT,COURT_OF_APPEALS,EDUCATIONAL_SERVICE_CENTER_DISTRICT,EXEMPTED_VILLAGE_SCHOOL_DISTRICT,LIBRARY,LOCAL_SCHOOL_DISTRICT,MUNICIPAL_COURT_DISTRICT,PRECINCT_NAME,PRECINCT_CODE,STATE_BOARD_OF_EDUCATION,STATE_REPRESENTATIVE_DISTRICT,STATE_SENATE_DISTRICT,TOWNSHIP,VILLAGE,WARD,PRIMARY-03/07/2000,GENERAL-11/07/2000,SPECIAL-05/08/2001,GENERAL-11/06/2001,PRIMARY-05/07/2002,GENERAL-11/05/2002,SPECIAL-05/06/2003,GENERAL-11/04/2003,PRIMARY-03/02/2004,GENERAL-11/02/2004,SPECIAL-02/08/2005,PRIMARY-05/03/2005,PRIMARY-09/13/2005,GENERAL-11/08/2005,SPECIAL-02/07/2006,PRIMARY-05/02/2006,GENERAL-11/07/2006,PRIMARY-05/08/2007,PRIMARY-09/11/2007,GENERAL-11/06/2007,PRIMARY-11/06/2007,GENERAL-12/11/2007,PRIMARY-03/04/2008,PRIMARY-10/14/2008,GENERAL-11/04/2008,GENERAL-11/18/2008,PRIMARY-05/05/2009,PRIMARY-09/08/2009,PRIMARY-09/15/2009,PRIMARY-09/29/2009,GENERAL-11/03/2009,PRIMARY-05/04/2010,PRIMARY-07/13/2010,PRIMARY-09/07/2010,GENERAL-11/02/2010,PRIMARY-05/03/2011,PRIMARY-09/13/2011,GENERAL-11/08/2011,PRIMARY-03/06/2012,GENERAL-11/06/2012,PRIMARY-05/07/2013,PRIMARY-09/10/2013,PRIMARY-10/01/2013,GENERAL-11/05/2013,PRIMARY-05/06/2014,GENERAL-11/04/2014,PRIMARY-05/05/2015
OH0012781511,87,26953,HOUSEHOLDER,SHERRY,LEIGH,,11/26/1965,08/19/1988,,211 N GARFIELD ST ,   ,BLOOMDALE,OH,44817,,,,  PO BOX 222  ,   ,BLOOMDALE,OH,44817,,,,PENTA JVSD,,,,05,06,WOOD EDUC SRV CTR,,,ELMWOOD LOCAL SD,BOWLING-GREEN,BLOOM TWP BLOOMDALE,87-P-ABO,02,03,02,,BLOOMDALE VILLAGE,,D,,,X,,X,,,,X,,,,,,,,,,,,,D,,X,,,,,,X,,,,,,,,,X,,,,,,,
OH0012781528,87,31122,KEATON,JENNIFER,KAI,,11/27/1968,07/13/2015,,110 N GARFIELD ST ,   ,BLOOMDALE,OH,44817,,,,  PO BOX 16  ,   ,BLOOMDALE,OH,44817,,,,PENTA JVSD,,,,05,06,WOOD EDUC SRV CTR,,,ELMWOOD LOCAL SD,BOWLING-GREEN,BLOOM TWP BLOOMDALE,87-P-ABO,02,03,02,,BLOOMDALE VILLAGE,,,X,,,,,,,,X,,,,,,,,,,,,,,,X,,,,,,X,,,,,,,X,,X,,,,,,,

And I wish only to save the city and zip columns; Is there a simple command line spell, that I can specify the two names as arguments and save only those two columns in a new CSV file?

JZ.
  • 111

2 Answers2

1

If you have simple, comma-separated columns, you can do it with awk. Parse the first line to determine the columns you want, and then print those columns.

wanted_columns=city,zip
wanted_columns=",$wanted_columns," awk -F, '
    NR==1 {
        for (i=1; i<=NF; i++) {
            if (index(ENVIRON["wanted_columns"], ","$i",")) {last=i; columns[i]=","}
            columns[last]="\n"
        }
    }
    {
        for (i=1; i<=NF; i++) {
            if (columns[i]) printf "%s%s", $i, columns[i]
        }
    }'

Alternatively, you can use cut; this will be faster if the file is large. First parse the header line with different tools to determine the column numbers.

wanted_columns=city,zip
{
  IFS= read header;
  cut_spec=$(printf %s "$header" |
             wanted_columns=",$wanted_columns," awk -v RS=, '
                 index(ENVIRON["wanted_columns"], ","$0",") {printf "%d,", NR}'
             );
  { printf %s\\n "$header"; cat; } | cut -d , -f "${cut_spec%,}";
}

If your CSV file is an actual CSV file with some quoted columns that can have embedded columns or newlines, use a proper CSV tool. For example, in Python:

#!/usr/bin/env python2
import csv, sys
wanted_columns = set(sys.argv[1:])
reader = csv.reader(sys.stdin)
header = reader.next()
columns = [i for i in range(len(header)) if header[i] in wanted_columns]
writer=csv.writer(sys.stdout)
writer.writerow(sys.argv[1:])
for row in reader: writer.writerow([row[i] for i in columns])

(Pass the desired column names as arguments to the script.)

0

If you don't have embedded , (meaning no ...,"foo,bar",...) you can use cut:

cut -d, -f13,15

selects only columns 13 and 15 where columns are delimited by ,. Why are columns 13 and 15 the right ones? I counted the commas with

head -n 1 data.csv | sed 's/zip,.*/,/' | tr -dc , | wc -c

Explanation: take the first line of input (head), find "zip" and replace it and the rest of the line with a comma (sed), then throw away everthing that is not a comma (tr) and count (wc). So "a,zip,b" is transformed to "a,,", to ",," to 2. "zip" is the second field.

Thomas Erker
  • 2,857
  • The poster specified columns by name, I'm guessing they don't want to use column numbers. – forquare Sep 06 '15 at 18:43
  • 1
    Sure, but the poster didn't say explicitly what selection by number wouldn't be ok. – Thomas Erker Sep 06 '15 at 18:46
  • It is explicitly written that I can specify the two names as arguments, it is unreasonable to assume that someone is going to list all the ways he does not want the problem to be solved (assuming that exhaustively listing them is even possible), since he clearly indicated in which way he wanted it to be. – To마SE Sep 09 '15 at 07:06