
I'm trying to find all the unique values in a column. However, with this command I'll also get the header row. How do I skip that?

awk -vFPAT='([^,]*)|("[^"]+")|","' '{if ($2!~/NULL/) {print $2}}' Files/* | sort | uniq -c | sort -n | wc -l

Sample data is as:

"link","shared_story","101","52
"link","published_story","118","100
"link","published_story","134","51
"link",NULL,"152","398
"link","shared_story","398","110
Emma
Is Sample data your sample input or expected output? I thought it was sample input but then you talk about a "header row" in your question and that doesn't seem to be present in your "Sample data". Whichever it is, input or output, please [edit] your question to add the other one too. – Ed Morton Oct 15 '20 at 17:02

Also: 1) in your FPAT definition you have ("[^"]+")|"," - what string are you trying to match with "," that isn't matched with "[^"]+"? 2) The last field on each line of your "Sample data" starts with a double quote but has no terminating double quote - is that really what your data looks like? If not then please fix your example. – Ed Morton Oct 15 '20 at 17:04

2 Answers


The line number within the current file is stored in awk's special FNR variable (NR holds the overall record number across all input files, not the per-file one). So, if your header is on the first line of each file, you can simply add FNR>1 to your condition to skip it:

awk -vFPAT='([^,]*)|("[^"]+")|","' '{if (FNR>1 && $2!~/NULL/) {print $2}}' Files/*

Alternatively, you can skip any line matching a string that appears only in your header:

awk -vFPAT='([^,]*)|("[^"]+")|","' '{if ($2!~/NULL/ && !/headerString/) {print $2}}' Files/*
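For example, with two small files that each carry a header (hypothetical names f1.csv and f2.csv, and a plain -F, split since these sample fields contain no embedded commas), FNR>1 drops both headers:

```shell
# Two sample files, each with its own header line (hypothetical data)
printf '%s\n' 'type,story,id,count' '"link","shared_story","101","52"' > f1.csv
printf '%s\n' 'type,story,id,count' '"link",NULL,"152","398"'          > f2.csv

# FNR resets to 1 at the start of each input file, so FNR>1
# skips the first line of every file, not just the first file
awk -F, '{if (FNR>1 && $2!~/NULL/) {print $2}}' f1.csv f2.csv
# prints: "shared_story"
```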
terdon

You can do the whole task with one awk command like this:

awk -vFPAT='([^,]*)|("[^"]+")|","' 'FNR > 1 && $2 !~ /NULL/ && !seen[$2]++ {cnt++}
                                    END {print cnt}' *.csv

This is a common awk idiom: !seen[$2]++ is true only the first time a given value appears, because the array entry is incremented after the test. It is useful for various tasks, like removing duplicates, keeping only duplicated lines, or allowing at most N occurrences of a value.
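A minimal sketch of the idiom on its own (the input values here are made up):

```shell
# !seen[$1]++ is true only on a value's first appearance,
# so cnt ends up as the number of distinct values
printf '%s\n' a b a c b a |
    awk '!seen[$1]++ {cnt++} END {print cnt}'
# prints: 3
```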

FNR is one of the standard awk built-in variables: it holds the record (line) number within the current file, resetting for each new file (with a single input file, FNR==NR on every line).


Also, in general, if you want to process all the files excluding each file's first line, with any command, you can use tail. With multiple files, add -q so tail does not print a ==> filename <== banner between them. This excludes the first line of every file and prints the rest:

tail -q -n +2 *.csv | command
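A quick sketch with two throwaway files (the names are made up); without -q, GNU and BSD tail insert a ==> name <== banner before each file when given more than one:

```shell
printf 'header\n1\n2\n' > a.txt
printf 'header\n3\n'    > b.txt

# -n +2 starts output at line 2 of each file;
# -q suppresses the per-file "==> name <==" banners
tail -q -n +2 a.txt b.txt
# prints:
# 1
# 2
# 3
```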
thanasisp