I'm trying to match filenames that contain two text patterns, and the match should ignore case. Neither of the following regular expressions works:

Setting the awk variable 'IGNORECASE' to a nonzero value (as recommended in info awk) so that all regular expression and string operations ignore case, and then building a logical "and" operation using two regular expressions prints all files:

$ ls -R | awk 'IGNORECASE = 1;/bingo/ && /number/;'


I tried converting the data to lowercase before using lookaheads (I know the second lookahead is not needed) to match both the text patterns "bingo" and "number". However, awk does not print any output, even though printing matching lines is its default action:
$ ls -R | awk 'tolower($0) ~ /(?=.*bingo)(?=.*number)/'

Which part of the awk or regular expression syntax is wrong (or what is missing) and what is the correct way to do a case-independent search that is only successful when the additional pattern appears on the same line?

Update:

from running

$ ls -R | awk '/bingo/'

it seems that awk may be performing the match against the lines in each file in the output of ls -R due to filenames not containing the string constant "bingo" being matched by awk. If this is the case, how do you get awk to have the same behavior as grep when receiving output from (i.e. sent through) a pipe?

bit
    IGNORECASE is only supported by GNU awk (gawk) -- so if you have that, use gawk instead of awk. Lookaheads/behinds are not supported by any awk implementation. –  Aug 04 '19 at 18:22
    If you're trying to recursively match filenames then perhaps find . -iname '*bingo*' -iname '*number*' would be more suitable? – steeldriver Aug 04 '19 at 18:25
  • @mosvy I was so focused on getting the regular expression to work that I didn't consider that awk does not support lookaheads or lookbehinds, thanks. I tried replacing awk with gawk in the first code example, but gawk still lists all files – bit Aug 04 '19 at 19:39
  • @steeldriver good suggestion, does that perform an implicit "and" operation? Is it possible that awk is searching each file for the text pattern instead of searching the output from ls like grep does? – bit Aug 04 '19 at 19:43
  • Of course it will, because gawk will treat IGNORECASE=1 as a pattern, which will be always true (same as a simple 1;). Use gawk -vIGNORECASE=1 '/bingo/ && /number/' or put the IGNORECASE=1 assignment in a BEGIN block. –  Aug 04 '19 at 19:54
  • @mosvy ls -R | awk 'BEGIN {IGNORECASE=1} /bingo/ && /number/' and ls -R | awk -vIGNORECASE=1 '/bingo/ && /number/' both work, so it seems that regular awk also supports IGNORECASE. Do you know why your version, i.e. ls -R | gawk -vIGNORECASE=1 '/foo/ || /bar/', also prints filenames that do not contain foo or bar? For example, baz.txt is also printed. Maybe you can turn your comment into a question. – bit Aug 04 '19 at 20:05
  • No, regular awk doesn't support it. It's probably that the awk on your system is actually gawk -- awk --version will tell you if that's the case. That cannot be assumed even on a per-distro basis -- on debian the user can change it with update-alternatives --set awk /usr/bin/gawk. –  Aug 04 '19 at 20:12
  • @MyWrathAcademia (regarding find) yes it does: from man find: Where an operator is missing, -a is assumed. (-a being logical AND) – steeldriver Aug 04 '19 at 21:12
  • The || instead of && was an error in a previous version of the comment, but even with ||, touch baz.txt; ls -R | gawk -vIGNORECASE=1 '/foo/ || /bar/' will not print the baz.txt file. –  Aug 04 '19 at 22:54

2 Answers

wrt your first script:

awk 'IGNORECASE = 1;/bingo/ && /number/;'
  1. IGNORECASE is gawk-only as pointed out in the comments, and
  2. your awk code is equivalent to:

    awk '(IGNORECASE = 1){print}; (/bingo/ && /number/){print}'
    

so it will do a case-insensitive match in GNU awk but not in other awks. It will also always print the current line, since the assignment IGNORECASE = 1 evaluates to 1, which is a true condition, and then any line containing both bingo and number will be printed a second time.
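
A quick way to see both effects is to feed the script two made-up lines. The sample input below is an assumption for illustration only, and it is deliberately lowercase so the result does not depend on whether the awk in use honours IGNORECASE:

```shell
# Hypothetical sample input: the first line contains both words, the second neither.
out=$(printf '%s\n' 'bingo number' 'baz' |
    awk 'IGNORECASE = 1;/bingo/ && /number/;')

# The assignment pattern is always true, so every line is printed once;
# the second pattern then prints the matching line a second time.
printf '%s\n' "$out"
```

This prints "bingo number" twice and "baz" once, which matches the behaviour described above.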

wrt your second script:

awk 'tolower($0) ~ /(?=.*bingo)(?=.*number)/'

That ?= stuff is PCRE lookaround syntax. awk supports EREs, not PCREs, so I'd have to think about what it would really mean in an ERE, but whatever it is, it's not what you wanted it to mean.
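
If a single regular expression is still preferred, one possible ERE substitute for the two lookaheads is to spell out both orders with alternation. This is only a sketch of that idea, and the sample names are invented for illustration:

```shell
# Hypothetical sample names; only the first two contain both words.
out=$(printf '%s\n' 'Bingo-Number.txt' 'NUMBER_bingo.log' 'bingo.txt' 'baz.txt' |
    awk 'tolower($0) ~ /bingo.*number|number.*bingo/')

# Prints the lines where both words occur, in either order.
printf '%s\n' "$out"
```

The alternation grows quickly with more than two words, so chaining separate matches with && is usually easier to read.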

wrt your statement that:

It seems that awk may be performing the match against the lines in each file in the output of ls -R

I don't know why you think that but no, it's not.
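
One way to check this is a small experiment in a scratch directory. The directory layout and file names below are assumptions made up for the test:

```shell
# Hypothetical scratch directory: baz.txt has "bingo" in its content but not
# in its name, while bingo.txt has it in its name and is empty.
dir=$(mktemp -d)
printf 'bingo\n' > "$dir/baz.txt"
: > "$dir/bingo.txt"

# awk only sees the text that ls writes to the pipe, one name per line.
out=$(cd "$dir" && ls -R | awk '/bingo/')
printf '%s\n' "$out"

rm -rf "$dir"
```

Only bingo.txt is printed, so the match is made against the names in the ls output, exactly as grep would do with the same pipe.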

Here's what I think you want in GNU awk:

awk 'BEGIN{IGNORECASE=1}; /bingo/ && /number/'

Or:

awk -v IGNORECASE=1 '/bingo/ && /number/'

and in any awk:

awk '{lc=tolower($0)}; (lc ~ /bingo/) && (lc ~ /number/)'
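
As a rough check of the portable version, with invented sample names standing in for the ls -R output:

```shell
# Hypothetical sample names in mixed case.
out=$(printf '%s\n' 'BINGO number 1.txt' 'Number-Bingo.pdf' 'bingo.txt' 'number.txt' |
    awk '{lc=tolower($0)}; (lc ~ /bingo/) && (lc ~ /number/)')

# Only the lowercased copy (lc) is tested; $0 is printed unchanged.
printf '%s\n' "$out"
```

The first two names are printed with their original capitalisation, because only the copy in lc is lowercased.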
Ed Morton

If you want to find names in the current directory or below that contain the strings bingo and number in any case, you should not pass the output of ls -R through awk but instead use find:

find . -iname '*bingo*' -iname '*number*'

The -iname predicate is non-standard but commonly implemented and will match the filename currently being examined against the given globbing pattern case-insensitively.
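
A minimal sketch of how the two tests combine, assuming a find that implements -iname and using a made-up scratch directory:

```shell
# Hypothetical directory tree for illustration.
dir=$(mktemp -d)
mkdir "$dir/sub"
: > "$dir/Bingo_Number_01.txt"
: > "$dir/sub/NUMBER-bingo.log"
: > "$dir/bingo.txt"

# Adjacent tests are ANDed, so a name must match both patterns.
# sort is only there to make the output order predictable.
out=$(cd "$dir" && find . -iname '*bingo*' -iname '*number*' | LC_ALL=C sort)
printf '%s\n' "$out"

rm -rf "$dir"
```

This lists ./Bingo_Number_01.txt and ./sub/NUMBER-bingo.log but not ./bingo.txt, which matches only one of the two patterns.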

If you would like to get the filename only, and not the complete pathname to the found files, then use

find . -iname '*bingo*' -iname '*number*' -exec basename {} \;

With GNU find, you could use

find . -iname '*bingo*' -iname '*number*' -printf '%f\n'

which will be quicker than using basename.

If you know that the order of the two words is "bingo followed by number", then use -iname '*bingo*number*' with find instead of two -iname tests.

If you know that this is the order of the words that you'd like to find, you may also use bash directly:

shopt -s globstar      # enable ** to match across / in pathnames
shopt -s nocaseglob    # enable case-insensitive globbing
shopt -s failglob      # error when a pattern does not match anything

printf '%s\n' **/*bingo*number*

To get the filename portion of the pathnames:

shopt -s globstar nocaseglob failglob

for name in **/*bingo*number*; do
    basename -- "$name"
done

or, if you have GNU basename and don't expect to ever match thousands of files,

shopt -s globstar nocaseglob failglob

basename -a -- **/*bingo*number*

where -a tells the utility to show the filename portion of each argument (multiple arguments).

Stéphane points out in comments that to ignore the order of the two substrings in e.g. bash, you could use the extended globbing pattern

!(!(*bingo*)|!(*number*))

This works by matching every name except those that lack at least one of the two strings, which leaves only the names containing both. So you'll get

shopt -s globstar nocaseglob failglob
shopt -s extglob  # for extended globbing patterns in bash

for name in **/!(!(*bingo*)|!(*number*)); do
    basename -- "$name"
done
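
To see the double negation at work without touching the filesystem, the same pattern can be tried against a few strings in a bash [[ ]] test. This is only a sketch: it needs bash, the names are invented, and it uses nocasematch because nocaseglob applies to pathname expansion rather than to [[ ]]:

```shell
#!/usr/bin/env bash
shopt -s extglob nocasematch   # nocasematch: case-insensitive matching in [[ ]]

matched=
for name in Bingo_Number.txt number-BINGO.log bingo.txt number.txt baz.txt; do
    # The inner list matches names lacking at least one word;
    # the outer !() rejects those, so only names with both words pass.
    if [[ $name == !(!(*bingo*)|!(*number*)) ]]; then
        matched="$matched$name "
    fi
done
printf '%s\n' "$matched"
```

Only Bingo_Number.txt and number-BINGO.log should pass the test, regardless of the order of the two words.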

Kusalananda