I have a bunch of CSV files that I'm importing into a database. I'd like to get a preview of the unique values in each column to help me create the tables. I've written a script that takes an input CSV file and an output text file, and I want it to write the column headers and their unique values to the output file. Here are some of the criteria I haven't been able to implement:

  1. I want to skip columns that are all numbers, but allow strings that contain numbers, like "Unit 7".
  2. I want to skip strings that are only whitespace, like ' ', but allow strings that contain spaces, like "Unit 7".
  3. I don't want timestamp or time objects like.
#!/usr/bin/env bash
set -o errexit
set -o nounset

main() {

    if [[ ${1:-} != *.csv ]] ; then
            echo "${1:-} is not a csv file"
            exit 1
    elif [[ -z ${2:-} ]] ; then
            echo "Usage: univals <csvfile.csv> <outputfile.txt>"
            exit 1
    else
            # Count and read the tab-separated column names from the header row
            header_length=$(head -n 1 "$1" | wc -w)
            headers=( $(head -n 1 "$1" | tr '\t' '\n') )
            for ((i=1 ; i <= header_length ; i++)) ; do

This code facilitates printing unique values on one line: https://stackoverflow.com/questions/19274695/sorting-on-same-line-bash

                    # Unique values in column i, minus anything containing a digit or whitespace
                    b=( $(cut -f "$i" "$1" | grep -v '[0-9]\|\s' | sort -u) )
                    echo "${headers[i-1]}" >> "$2"
                    printf "%s " "${b[@]}" >> "$2"
                    printf "\n" >> "$2"
            done
    fi

}

main "$@"

This has helped me skip the numbers, but it also drops every value that has a number or a space anywhere in it. Thanks in advance for any help/advice.
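One possible way to narrow the filter, as a minimal sketch rather than a definitive fix: anchoring the patterns with `^` and `$` makes `grep` reject only values that are entirely numeric or entirely whitespace, so "Unit 7" is kept. The helper name `filter_values` is hypothetical, and the sketch assumes one value per line and plain integer or decimal numbers. Timestamps are left out because their format isn't specified above.

```bash
#!/usr/bin/env bash

# filter_values: hypothetical helper, reads one value per line on stdin.
filter_values() {
    # Drop lines that are empty or whitespace-only, and lines that are
    # entirely a number (optional sign, digits, optional decimal part).
    # The ^...$ anchors match whole values only, so "Unit 7" survives.
    # "|| true" keeps errexit from aborting when grep outputs nothing.
    grep -Ev '^[[:space:]]*$|^[[:space:]]*[-+]?[0-9]+([.][0-9]+)?[[:space:]]*$' || true
}

# Usage: only "Unit 7" and "north" pass the filter.
printf '%s\n' 'Unit 7' ' ' '42' '3.14' 'north' | filter_values
```

The unanchored pattern `'[0-9]\|\s'` matches a digit or space anywhere in the line, which is why it removes mixed values as well.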

I got some help for this script from here and here.

  • This honestly shouts "do it in a more versatile programming / scripting language". Not a job for a shell script. – Marcus Müller Jul 26 '22 at 23:41
  • Also, the process of identifying unique values: exactly what you want a database for. So, import e.g. in sqlite, do your sql on that, be done. Really, not a job for a shell. – Marcus Müller Jul 26 '22 at 23:43
  • @steeldriver It is popular to accept that kind of argument as self-evident, but there are no strong guidelines in that referenced SO answer. If you were to follow that advice, you would never write a line of bash again. Just use your common sense. If you're writing a really complex script that needs a lot of structure or performs a lot of string manipulation, then bash is probably not a good fit. If you care about speed, then you need to run benchmarks and assess what is acceptable performance. None of that involves whether your script runs a shell loop or not. – r_31415 Jul 27 '22 at 01:51
  • Having said that, I agree with @MarcusMüller, particularly due to the 3rd criterion. Once you have to parse dates (or worse, figure out how to define "time objects"), then a scripting language is a better fit. – r_31415 Jul 27 '22 at 01:55
  • It would help if you gave example input and desired output instead of just your code... But I would start by defining a regex that matches the values you don't want and using it with sed to remove them from your input file. Then, for each column, select the column (i.e. via cut) and then sort -u – TAAPSogeking Jul 27 '22 at 21:47
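The per-column approach suggested in the last comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the input is tab-separated (as in the question's `cut` and `tr` calls), the first row holds the headers, and no field contains an embedded tab or newline. The function name `unique_values` and the comma used to join values are hypothetical choices, and `grep` stands in for the `sed` step.

```bash
#!/usr/bin/env bash

# unique_values: hypothetical helper; prints each header followed by a
# comma-separated line of that column's unique, non-numeric values.
unique_values() {
    local file=$1 ncols i header values
    # Number of columns = number of tab-separated fields in the header row
    ncols=$(head -n 1 "$file" | awk -F '\t' '{ print NF }')
    for ((i = 1; i <= ncols; i++)); do
        header=$(head -n 1 "$file" | cut -f "$i")
        # Data rows only: drop blank and purely numeric values, dedupe,
        # then join the survivors onto one line.
        values=$(tail -n +2 "$file" | cut -f "$i" |
                 grep -Ev '^[[:space:]]*$|^[-+]?[0-9]+([.][0-9]+)?$' |
                 sort -u | paste -sd ',' -)
        # A column with nothing left (for example, all numbers) is skipped
        [[ -n $values ]] || continue
        printf '%s\n%s\n' "$header" "$values"
    done
}

# Usage (file names are placeholders):
#   unique_values input.tsv > preview.txt
```

Joining with a comma instead of a space keeps values such as "Unit 7" distinguishable in the output. Real CSV files with quoted fields would need a proper parser, which is where the comments' advice about a scripting language or sqlite applies.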

0 Answers