I need to filter a very large csv file as efficiently as possible. I already tried csvgrep, but it is too slow, so I'm now trying to use AWK for this task.

The csv file has a ; separator, and I need to keep all rows whose 48th column starts with a certain string, which I will pass in from a script. So it will be something along the lines of:


pattern='59350'

awk -F ";" '$48 ~ /^59350/' input.csv > output.csv # This works

However, I need to pass $pattern inside the regex statement, rather than explicitly write the pattern.

I have tried several combinations but all give me an empty output.csv file.

Here are some of my failed attempts:


awk -F ";" -v var="$pattern" '$48 ~ /^var/' input.csv > output.csv

awk -F ";" -v var="$pattern" '$48 ~ /^$var/ {print}' input.csv > output.csv

awk -F ";" -v var="$pattern" '$48 ~ /^${var}/ {print}' input.csv > output.csv

How do I do this?

Also, if you know a more efficient way, one that is faster or does not load the whole csv file in memory, please share it (I was thinking of grep, but I am not sure whether it is suitable or how to implement it).

Thank you in advance

  • The linked answer doesn't answer my question, as I need to search for a text that STARTS WITH a pattern. It may be obvious for you but not for me how to get there. @JeffSchaller – Margherita Di Leo May 26 '21 at 15:09
  • @MargheritaDiLeo what you're trying to match isn't a "pattern", it's a regular expression. The word "pattern" is vastly overused and highly ambiguous and code written thinking of "patterns" is often buggy because of that so don't use it in the context of matching text - use "string" or "regexp", whichever you want. Anyway, you already show in your code that you know the regexp you want to match is ^59350 and now you know that to pass a regexp in in a variable is awk -v var='<regexp>' '$42 ~ var' so just do that - awk -v var='^59350' '$42 ~ var'. – Ed Morton May 26 '21 at 15:20
  • FWIW though I think what you REALLY want is a string match at the start of the field and that'd be awk -v var='59350' 'index($42,var) == 1'. – Ed Morton May 26 '21 at 15:25
  • @EdMorton Thank you for your answer. I have tried with awk -F ";" -v var='^$pattern' '$48 ~ var' input.csv > output.csv but output.csv is empty. I don't mind if you want to call it differently, but I need to pass it as a variable in my workflow, not hardcoded. Thanks – Margherita Di Leo May 26 '21 at 15:29
  • @EdMorton awk -F ";" -v var=$pattern 'index($48,var) == 1' worked! Thank you! PS I still believe this question is totally different from the one linked and deserves to be reopened as more people can read the correct answer. Thanks again for your time. – Margherita Di Leo May 26 '21 at 15:37
  • @MargheritaDiLeo if pattern is a shell variable, you need double quotes rather than single quotes to allow the value to be expanded before it is passed to awk: -v var="^$pattern". When you do -v var='^$pattern', awk is trying to match $pattern literally. – steeldriver May 26 '21 at 15:42
  • @steeldriver thank you for pointing that out! Actually in my code the variable is called differently, I changed it for exemplification, like input.csv and output.csv, but didn't think that was a protected word. Thank you! – Margherita Di Leo May 27 '21 at 07:35
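
The two approaches from the comments above (a dynamic regexp passed with -v, and a plain string match with index()) can be sketched as follows. This is only a minimal sketch, not a definitive script: the sample rows, the temporary directory and the out_regex.csv / out_index.csv file names are made up for illustration, and the real input.csv is assumed to have at least 48 ;-separated fields.

```shell
#!/bin/sh
# Hypothetical sample data: three rows of 48 ';'-separated fields,
# with the value under test in column 48.
tmpdir=$(mktemp -d)
input="$tmpdir/input.csv"
for last in 59350123 12359350 59351; do
    awk -v last="$last" 'BEGIN {
        for (i = 1; i < 48; i++) printf "f%d;", i
        print last
    }'
done > "$input"

pattern='59350'

# Regexp match: the shell expands $pattern inside double quotes, so awk
# receives ^59350 in var and uses it as a dynamic regexp.
awk -F ';' -v var="^$pattern" '$48 ~ var' "$input" > "$tmpdir/out_regex.csv"

# String match: index() returns 1 only when field 48 starts with var,
# and no character in var is treated as a regexp metacharacter.
awk -F ';' -v var="$pattern" 'index($48, var) == 1' "$input" > "$tmpdir/out_index.csv"

# Summarise the results: number of matching rows and the matched value.
regex_count=$(awk 'END { print NR }' "$tmpdir/out_regex.csv")
index_count=$(awk 'END { print NR }' "$tmpdir/out_index.csv")
regex_value=$(cut -d ';' -f 48 "$tmpdir/out_regex.csv")
index_value=$(cut -d ';' -f 48 "$tmpdir/out_index.csv")
echo "regexp match: $regex_count row(s), column 48 = $regex_value"
echo "string match: $index_count row(s), column 48 = $index_value"

rm -rf "$tmpdir"
```

Inside /.../ awk treats var, $var and ${var} as regexp text rather than as the awk variable, which is why the attempts in the question produced an empty file. awk reads its input one record at a time, so neither variant needs to load the whole csv file in memory.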
