0

I have over 150 CSV files with inconsistent columns on MAC (Unix) e.g:

Sheet A

ID Col mobile email secondary mobile Name
v v v v v v
v v v v v v

Sheet B

ID Col Col Col Col Col
v v v v v v
v v v v v v

Sheet C

col Email MOBILE name Col col
v v v v v v
v v v v v v

Sheet D

ID col col col Mobile col
v v v v v v
v v v v v v

I need to DERIVE a new-sheet from each of the aformentioned sheet with the following columns (in case any exists)

ID, NAME, EMAIL, MOBILE, SECONDARY_MOBILE

Then I will combine the newly derived sheets into one sheet using CAT. So that I can upload the data into Oracle DB.

So, When it comes to SHEET D, I suppose, the derived sheet should be

ID NAME EMAIL MOBILE SECONDARY_MOBILE
V
V

I hope it is clear. Thanks in advance.

Test2
  • 1
  • You can either use a tool such as csvcut or mlr (Miller) that can work with column names directly, or loop over the header line to find the indices of the desired columns. See for example How to print certain columns by name? – steeldriver Aug 07 '21 at 16:51
  • 1
    We really can't help you parse data we cannot see. Please [edit] your question and show us a few example input files and the output you would need from that example. Try and cover all possibilities. Without this, we can't know what you need. Do you have a header? Will it always be name or can it also be Name? How are fields defined? By commas? Spaces? Tabs? – terdon Aug 07 '21 at 16:53
  • @steeldriver thanks for your reply. I have clarified the question as instructed by you. – Test2 Aug 07 '21 at 20:19
  • @terdon thanks for your reply. I have clarified the question as instructed by you. – Test2 Aug 07 '21 at 20:20
  • 2
    Thank you, but I am afraid that is considerably more confusing :(. Please show us some of the actual contents of the file. Do not format them as a nice table, we need to know what the actual data look like. You can see this by opening a terminal and running head file.csv on the files. You are asking for a text-based solution, so we need a text file to work with. We need to know how your columns are defined, if there are IDs, how we can know if a column is present or not etc. – terdon Aug 07 '21 at 21:12
  • You can use the formatting tools to format your examples as code. – terdon Aug 07 '21 at 21:21
  • you have provided your sample data in a completely useless format. 1, it's formatted, pipe-separated output, not a CSV file; 2. it has duplicate column names in several of the files (i.e. several are called just "col" or "Col"); 3. all values are just "v", which does not provide any information about what type of data each field is (some, like email and name, are obviously string fields, but one of the most important, the ID field, is not - is that a string or an integer?). Anonymising data is a good thing to do, but you've gone way too far and made it completely useless as a sample. – cas Aug 08 '21 at 02:50
  • 1
  • you've given no indication of how you expect to resolve missing data - e.g. some of the input files have no fields for name or email address. One of them ("SheetB") is missing everything but the ID field, the rest are just generic "Col".
  • – cas Aug 08 '21 at 02:52