How to extract columns by names from CSV file

Question

I have over 150 CSV files with inconsistent columns on MAC (Unix) e.g:

Sheet A

ID	Col	mobile	email	secondary mobile	Name
v	v	v	v	v	v
v	v	v	v	v	v

Sheet B

ID	Col	Col	Col	Col	Col
v	v	v	v	v	v
v	v	v	v	v	v

Sheet C

col	Email	MOBILE	name	Col	col
v	v	v	v	v	v
v	v	v	v	v	v

Sheet D

ID	col	col	col	Mobile	col
v	v	v	v	v	v
v	v	v	v	v	v

I need to DERIVE a new-sheet from each of the aformentioned sheet with the following columns (in case any exists)

ID, NAME, EMAIL, MOBILE, SECONDARY_MOBILE

Then I will combine the newly derived sheets into one sheet using CAT. So that I can upload the data into Oracle DB.

So, When it comes to SHEET D, I suppose, the derived sheet should be

ID	NAME	EMAIL	MOBILE	SECONDARY_MOBILE
			V
			V

I hope it is clear. Thanks in advance.

You can either use a tool such as csvcut or mlr (Miller) that can work with column names directly, or loop over the header line to find the indices of the desired columns. See for example How to print certain columns by name? — steeldriver, Aug 07 '21 at 16:51
We really can't help you parse data we cannot see. Please [edit] your question and show us a few example input files and the output you would need from that example. Try and cover all possibilities. Without this, we can't know what you need. Do you have a header? Will it always be name or can it also be Name? How are fields defined? By commas? Spaces? Tabs? — terdon, Aug 07 '21 at 16:53
@steeldriver thanks for your reply. I have clarified the question as instructed by you. — Test2, Aug 07 '21 at 20:19
@terdon thanks for your reply. I have clarified the question as instructed by you. — Test2, Aug 07 '21 at 20:20
Thank you, but I am afraid that is considerably more confusing :(. Please show us some of the actual contents of the file. Do not format them as a nice table, we need to know what the actual data look like. You can see this by opening a terminal and running head file.csv on the files. You are asking for a text-based solution, so we need a text file to work with. We need to know how your columns are defined, if there are IDs, how we can know if a column is present or not etc. — terdon, Aug 07 '21 at 21:12
You can use the formatting tools to format your examples as code. — terdon, Aug 07 '21 at 21:21
you have provided your sample data in a completely useless format. 1, it's formatted, pipe-separated output, not a CSV file; 2. it has duplicate column names in several of the files (i.e. several are called just "col" or "Col"); 3. all values are just "v", which does not provide any information about what type of data each field is (some, like email and name, are obviously string fields, but one of the most important, the ID field, is not - is that a string or an integer?). Anonymising data is a good thing to do, but you've gone way too far and made it completely useless as a sample. — cas, Aug 08 '21 at 02:50

score 1 · Answer 1 · answered Aug 07 '21 at 23:33

CSV, despite being having a IETF RFC is a family of formats, not just one. So even if you find somethin which claims to do what you want, it might not understand the particular variant of CSV your files are formatted in. There is no CSV parser for UnixODBC. Both Python and PHP have very robust and flexible parsers. I'm more familiar with the latter / estimate it would take me about 30 minutes to write the required code. OTOH it'd probably be as much work to import that data into Oracle as-is then normalize the structure thereafter.

Thanks,
I will do it with a Python developer for now.
– Test2 Aug 08 '21 at 05:18 — Test2, Aug 08 '21 at 05:18

score 0 · Answer 2 · answered Aug 08 '21 at 10:27

0

Using Miller and starting in example from these 3 files

#input_01.csv
a,b,c
0,2,5
#input_02.csv
b,a,c
0,2,5
#input_03.csv
b,a
0,2

you can run

mlr --csv cut -f a,b then unsparsify input_0*.csv >output.csv

to have

a,b
0,2
2,0
2,0

answered Aug 08 '21 at 10:27

aborruso

2,855

How to extract columns by names from CSV file

2 Answers2