A snippet of a typical TSV file I have used:

10  Interstellar    Main Theme Extended UDVtMYqUAyw
11  Journey XvG78AmBLc4
12  Jurassic Park Music & Ambience  Amazing Soundscapes and Music   PPl__iyIg6w
13  Lord of the Rings   Sound of The Shire  chLZQtCold8
14  Lord of the Rings   The Shire: Sunset at Bag End    uBmbI8dzc-M

The following searches for lord (case insensitively) in 2nd column of all tsv files:

awk '$2~IGNORECASE = 1;/lord/{print}' *.tsv

13  Lord of the Rings   Sound of The Shire  chLZQtCold8
14  Lord of the Rings   The Shire: Sunset at Bag End    uBmbI8dzc-M

Now, I want to pass Lord in as a variable (here set from the shell with awk's -v option):

$ awk -v Pattern="Lord" '$2~Pattern{print}' *.tsv 
13      Lord of the Rings       Sound of The Shire      chLZQtCold8
14      Lord of the Rings       The Shire: Sunset at Bag End    uBmbI8dzc-M

Problem

How do I match Pattern case-insensitively?

I tried the following, but none of them work:

awk -v Pattern="lord" '$2~IGNORECASE = 1;Pattern{print}' *.tsv

awk -v Pattern="lord" 'IGNORECASE = 1;$2~Pattern{print}' *.tsv

awk -v Pattern="lord" 'BEGIN {IGNORECASE = 1} {$2~Pattern{print}}' *.tsv

awk -v Pattern="Lord" '{IGNORECASE = 1; $2~Pattern}' *.tsv

Refer

Porcupine
  • awk -v Pattern="Lord" '{IGNORECASE = 1; $2~Pattern}' does not work. awk -W version GNU Awk 5.0.1 – Porcupine Jun 15 '21 at 01:22
  • ... sorry I initially misread that as awk -v Pattern="Lord" '{IGNORECASE = 1;} $2~Pattern' *.tsv which also works but unnecessarily re-assigns IGNORECASE = 1 for every record – steeldriver Jun 15 '21 at 01:30

4 Answers

First of all, I doubt that $2~IGNORECASE = 1;/lord/{print} works the way you think it does - AFAIK it assigns value 1 to the variable IGNORECASE; compares the value of $2 to the result (i.e. $2 ~ 1) and by default prints $0 if the result is true; then compares $0 case-insensitively against /lord/ and also prints $0 if that is true.
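
To see the two rules act separately, here is a small sketch with the grouping written out explicitly and an ordinary variable x standing in for IGNORECASE, so that it runs in any POSIX awk. The three input lines are made up for illustration.

```shell
# Rule 1: assign 1 to x, then test whether $2 matches the regexp "1".
# Rule 2: independently of rule 1, test the whole line against /lord/.
out=$(printf '%s\n' 'a 1' 'b 2' 'c lord' |
  awk '$2 ~ (x = 1) {print "rule 1:", $0}
       /lord/       {print "rule 2:", $0}')
printf '%s\n' "$out"
# rule 1: a 1
# rule 2: c lord
```

Only the line whose second field contains a 1 satisfies the first rule; the /lord/ rule never looks at $2 at all.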

If your intent is to compare $2 case-insensitively, you can use

gawk 'BEGIN{IGNORECASE = 1} $2 ~ /lord/{print}' *.tsv

or just

gawk 'BEGIN{IGNORECASE = 1} $2 ~ /lord/' *.tsv

The equivalent with a variable would be

gawk -v Pattern="lord" 'BEGIN{IGNORECASE = 1} $2 ~ Pattern' *.tsv

Note that IGNORECASE is not a standard awk feature - as far as I know, only GNU awk (gawk) supports it - for portability you can use toupper or tolower to get the input into a specific case.
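
A minimal sketch of that portable variant, assuming the fields are tab-separated (hence -F'\t') and that lower-casing both sides is good enough for your data. The temp directory and the file names a.tsv and b.tsv are made up; the rows are taken from the sample in the question.

```shell
# Hypothetical layout: two small TSV files in a temp directory.
dir=$(mktemp -d)
printf '12\tJurassic Park Music & Ambience\tAmazing Soundscapes and Music\tPPl__iyIg6w\n' > "$dir/a.tsv"
printf '13\tLord of the Rings\tSound of The Shire\tchLZQtCold8\n' >> "$dir/a.tsv"
printf '14\tLord of the Rings\tThe Shire: Sunset at Bag End\tuBmbI8dzc-M\n' > "$dir/b.tsv"

Pattern='LORD'
# Lower-case both the field and the regexp, then do an ordinary match.
hits=$(awk -F'\t' -v Pattern="$Pattern" \
  'tolower($2) ~ tolower(Pattern) {print $1}' "$dir"/*.tsv)
printf '%s\n' "$hits"
# 13
# 14
rm -rf "$dir"
```

This works in any POSIX awk because it does not rely on IGNORECASE at all.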

steeldriver
  • Note though that the portable version will also be wrong in some (more unusual) cases, because comparing upper- or lower-cased text is not the same as comparing case-insensitively in Unicode, and under some locales. (It doesn't sound like gawk does exactly the right thing either, but it's more often right.) – Michael Homer Jun 15 '21 at 04:32

Regarding "The following searches for lord (case insensitively) in 2nd column of all tsv files: awk '$2~IGNORECASE = 1;/lord/{print}' *.tsv": no, it doesn't do that at all. It does a regexp comparison of $2 against the result of assigning 1 to IGNORECASE (i.e. $2 ~ 1), which is true only when $2 contains the character 1, and prints the current line if so. It then looks for any string matching the regexp lord anywhere on the line and, if found, prints the line (a second time if the first rule already printed it). You probably meant awk 'BEGIN{IGNORECASE = 1} $2~/lord/' *.tsv, as that would do what you describe.

Don't use the word "pattern" in this context as it's highly ambiguous. You're using Pattern as a partial regexp match but describing it as if you want a full-word string match. So, please replace "pattern" with all 3 of string-or-regexp and partial-or-full and word-or-line everywhere it occurs in your question so we can help you come up with the right solution. See how-do-i-find-the-text-that-matches-a-pattern for more information.

Here are some possible solutions for what you may be trying to do:

Partial string match:

$ awk -v var="$var" -F'\t' 'index(tolower($2),tolower(var))' file.tsv
13  Lord of the Rings   Sound of The Shire  chLZQtCold8
14  Lord of the Rings   The Shire: Sunset at Bag End    uBmbI8dzc-M

Full-word string match:

$ awk -v var="$var" -F'\t' 'index(" "tolower($2)" "," "tolower(var)" ")' file.tsv
13  Lord of the Rings   Sound of The Shire  chLZQtCold8
14  Lord of the Rings   The Shire: Sunset at Bag End    uBmbI8dzc-M

Full-line string match:

$ awk -v var="$var" -F'\t' 'tolower($2) == tolower(var)' file.tsv
$

Partial regexp match:

$ awk -v var="$var" -F'\t' 'tolower($2) ~ tolower(var)' file.tsv
13  Lord of the Rings   Sound of The Shire  chLZQtCold8
14  Lord of the Rings   The Shire: Sunset at Bag End    uBmbI8dzc-M

Full-word regexp match:

$ awk -v var="$var" -F'\t' '(" "tolower($2)" ") ~ (" "tolower(var)" ")' file.tsv
13  Lord of the Rings   Sound of The Shire  chLZQtCold8
14  Lord of the Rings   The Shire: Sunset at Bag End    uBmbI8dzc-M

Full-line regexp match:

$ awk -v var="$var" -F'\t' 'tolower($2) ~ ("^"tolower(var)"$")' file.tsv
$
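
With var=lord the partial and full-word variants print the same rows, so here is a small sketch using other values to show where they differ. The values 'ord' and 'LORD OF THE RINGS' are picked purely for illustration, and the full-word test assumes words are delimited by single spaces, padding both the field and var with a space.

```shell
tsv=$(mktemp)
printf '13\tLord of the Rings\tSound of The Shire\tchLZQtCold8\n' > "$tsv"

# Partial string match: "ord" occurs inside "Lord".
partial=$(awk -v var='ord' -F'\t' \
  'index(tolower($2), tolower(var)) {print $1}' "$tsv")
# Full-word string match: "ord" is not a whole word, so nothing is printed.
word=$(awk -v var='ord' -F'\t' \
  'index(" " tolower($2) " ", " " tolower(var) " ") {print $1}' "$tsv")
# Full-line string match: succeeds once var covers the entire field.
line=$(awk -v var='LORD OF THE RINGS' -F'\t' \
  'tolower($2) == tolower(var) {print $1}' "$tsv")

printf 'partial=%s word=%s line=%s\n' "$partial" "$word" "$line"
# partial=13 word= line=13
rm -f "$tsv"
```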

The above assumes your shell variable does not contain escape sequences or if it does you want them expanded. If that's not the case then use ENVIRON[] or ARGV[] to pass the value of the shell variable to awk instead of -v, see how-do-i-use-shell-variables-in-an-awk-script for details.
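
A quick sketch of the difference, using a made-up value that contains the two characters backslash and t:

```shell
# -v processes escape sequences: backslash + t collapses into one tab.
via_v=$(awk -v var='a\tb' 'BEGIN {print length(var)}')
# ENVIRON hands the value over untouched: a, backslash, t, b.
via_env=$(var='a\tb' awk 'BEGIN {print length(ENVIRON["var"])}')
printf '%s %s\n' "$via_v" "$via_env"
# 3 4
```

So if the search text can contain backslashes that should be kept as typed, ENVIRON is the safer route.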

Ed Morton

With perl:

Searching a pattern in second field of file:

perl -F"\t" -lane '$F[1] =~ /(?i)lord/ and print' input.tsv
  • -F"\t" sets the field separator to a tab, because the file is TSV
  • $F[1] is the second field of the record, because fields are zero-indexed
  • (?i) is the case-insensitive option inside the regex
  • alternatively, the i modifier may be used for case insensitivity, as in
perl -F"\t" -lane '$F[1] =~ /lord/i and print' input.tsv

To match against a shell variable, export it and read it through %ENV, as in

export p=lord
perl -F"\t" -lane '$F[1] =~ /(?i)$ENV{p}/ and print' input.tsv
perl -F"\t" -lane '$F[1] =~ /$ENV{p}/i and print' input.tsv

Searching in all .tsv files of a folder:

perl -F"\t" -lane '$F[1] =~ /$ENV{p}/i and print' *.tsv

If you want filename with records, then following would do:

perl -F"\t" -lane '$F[1] =~ /$ENV{p}/i and print $ARGV. ":" .$_' *.tsv

If you don't have to use awk and can use a tool dedicated to processing tabular data, like GoCSV, this is a snap.

Starting with the sample of data you provided, I made up some names and took a guess at "Journey":

input.tsv

ID Album Track Hash
10 Interstellar Main Theme Extended UDVtMYqUAyw
11 Journey XvG78AmBLc4
12 Jurassic Park Music & Ambience Amazing Soundscapes and Music PPl__iyIg6w
13 Lord of the Rings Sound of The Shire chLZQtCold8
14 Lord of the Rings The Shire: Sunset at Bag End uBmbI8dzc-M
  1. set the shell variable pattern
  2. delim to convert the TSV to CSV
  3. filter on column 2 with the -i case-invariant --regex of that shell variable
  4. behead to get just the matching rows
  5. convert back to TSV:
pattern='lord'
gocsv delim -i "\t" input.tsv              \
| gocsv filter -c 2 -i --regex "$pattern"  \
| gocsv behead                             \
| gocsv tsv

13 Lord of the Rings Sound of The Shire chLZQtCold8
14 Lord of the Rings The Shire: Sunset at Bag End uBmbI8dzc-M

Zach Young