Make grep work for special filenames

Question

I have a set of txt files whose names may contain spaces or special characters like #.

I have a grep solution, grep -L "cannot have" $(grep -l "must have" *.txt), to list all the files that contain must have but not cannot have.

For instance, there is a file abc defg.txt which contains only 1 line: must have.

So normally the grep solution should find abc defg.txt, but it returns:

grep: abc: No such file or directory
grep: defg.txt: No such file or directory

I think the grep solution also fails for filenames containing #.

Could anyone help me amend the grep solution?

# would not be a problem (except with zsh -o globsubst -o extendedglob), but *, ?, [, space, tab, newline would be with bash. — Stéphane Chazelas, May 08 '14 at 08:46
find . -type f -name \*.txt -exec grep -qF 'must have' {} \; ! -exec grep -qF 'cannot have' {} \; -print — don_crissti, Aug 19 '16 at 11:21

Stéphane Chazelas · Answer 1 · 2016-08-19T06:31:31.853

Since you're already using GNU-specific options (-L), you could do:

grep -lZ -- "must have" *.txt | xargs -r0 grep -L -- "cannot have"

The idea being to use -Z to print the list of file names NUL-delimited and use xargs -r0 to pass that list as arguments to the second grep.
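To see the NUL-delimited pipeline handle awkward names, here is a small self-contained sketch. The scratch directory and the sample file names are made up for illustration, and it assumes GNU grep and GNU xargs are installed.

```shell
# Scratch directory so the demo does not touch real files (hypothetical setup).
dir=$(mktemp -d)
cd "$dir" || exit 1

printf 'must have\n' > 'abc defg.txt'            # space in the name
printf 'must have\n' > 'note #1.txt'             # '#' in the name
printf 'must have\ncannot have\n' > 'both.txt'   # contains both strings
printf 'nothing relevant\n' > 'neither.txt'      # lacks "must have"

# -Z ends each file name printed by the first grep with NUL instead of newline;
# xargs -0 splits its input on NUL only, and -r skips the second grep
# entirely when the first one found nothing.
result=$(grep -lZ -- "must have" *.txt | xargs -r0 grep -L -- "cannot have")
printf '%s\n' "$result"
```

With these sample files, only abc defg.txt and note #1.txt should be listed, each as a single name despite the space and the #.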

Command substitution, by default, splits on space, tab and newline (and NUL in zsh). Bourne-like shells other than zsh also perform globbing upon each word resulting from that splitting.

You could do:

IFS='
' # split on newline only
set -f # disable globbing
grep -L -- "cannot have" $(
    set +f # we need globbing for *.txt in this subshell though
    grep -l -- "must have" *.txt
  )

But that would still break on filenames containing newline characters.
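One way to keep the IFS and set -f changes from leaking into the rest of a script is to run the whole thing inside a subshell. This is only a sketch of that idea; the scratch directory and file names are invented for the demo, and the newline limitation still applies.

```shell
# Scratch directory with sample files (hypothetical names).
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'must have\n' > 'abc defg.txt'
printf 'must have\n' > 'star *.txt'
printf 'must have\ncannot have\n' > 'both.txt'

result=$(
    # This body runs in the command-substitution subshell, so the IFS and
    # set -f changes below are gone once it exits.
    IFS='
'
    set -f   # no globbing on the words produced by the inner substitution
    grep -L -- "cannot have" $(
        set +f   # globbing is needed again here to expand *.txt
        grep -l -- "must have" *.txt
    )
)
printf '%s\n' "$result"
```

Here the names containing a space and a * are passed to the outer grep intact, because splitting happens on newline only and the resulting words are not globbed.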

In zsh (and zsh only), you can do:

IFS=$'\0'
grep -L -- "cannot have" $(grep -lZ -- "must have" *.txt)

Or:

grep -L -- "cannot have" ${(ps:\0:)"$(grep -lZ -- "must have" *.txt)"}

What's the point in specifying *.txt as the argument after disabling globbing? — devnull, May 08 '14 at 08:49

score 2 · Answer 2 · answered May 08 '14 at 10:48

If you're willing to go further afield, awk can do it in one pass:

awk 'function s(){if(a&&!b){print f}} FNR==1{s();f=FILENAME;a=b=0} 
  /must have/{a=1} /cannot have/{b=1} END{s()}' filepattern

For recentish gawk you can simplify with BEGINFILE and ENDFILE. (Like all awk answers you can put the awk commands in a file with -f, and like most you can easily convert to perl if you prefer.)
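A rough sketch of what the BEGINFILE/ENDFILE variant might look like follows. It assumes GNU awk is installed as gawk; the scratch directory and sample files are made up for the demo, and ./*.txt is used in place of filepattern so that a name containing = is not taken as an awk variable assignment.

```shell
# Scratch directory with sample files (hypothetical names).
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'must have\n' > 'abc defg.txt'
printf 'must have\n' > 'note #1.txt'
printf 'must have\ncannot have\n' > 'both.txt'

if command -v gawk >/dev/null 2>&1; then
    # BEGINFILE resets the two flags before each file is read;
    # ENDFILE reports the file once all of its lines have been seen.
    result=$(gawk 'BEGINFILE     { a = b = 0 }
                   /must have/   { a = 1 }
                   /cannot have/ { b = 1 }
                   ENDFILE       { if (a && !b) print FILENAME }' ./*.txt)
    printf '%s\n' "$result"
else
    echo 'gawk not found; skipping the demo' >&2
fi
```

Compared with the FNR==1 version, the per-file bookkeeping moves into the two special patterns, so no helper function or END rule is needed.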

answered May 08 '14 at 10:48

dave_thompson_085

Note however that grep -l/L stops reading at the first match so is likely to be more efficient (also because of the general awk code interpretation overhead). With GNU awk, you could use nextfile to avoid reading the whole file when it can be avoided (when cannot have is found). – Stéphane Chazelas May 08 '14 at 11:24

score -1 · Answer 3 · answered May 08 '14 at 08:20

Consider using find instead, running grep through a shell command:

find . -name '*.txt' -print0 | xargs -0 -I{} sh -c 'grep -q "must have" -- "{}" && grep -L "cannot have" -- "{}"'

answered May 08 '14 at 08:20

devnull

never embed {} in the shell code! – Stéphane Chazelas May 08 '14 at 08:31
@StéphaneChazelas I trust that is a meaningful rule, but could you give a hint to find an explanation? – Volker Siegel Aug 18 '16 at 16:53
@VolkerSiegel, that's a classic code injection vulnerability, it's on the level of passing unsanitized data to eval, as here, you're passing arbitrary file names as code to sh (think of a $(reboot).txt lurking in the directory for instance). You'll find it discussed in many Q&As here. See for instance Do I need to encapsulate awk variables in quotes in order to sanitize them? – Stéphane Chazelas Aug 19 '16 at 06:30
@StéphaneChazelas Ah, makes sense - I was just thinking about something directly related to xargs, because xargs option -i without argument defaults to -I{} (the {} will not be part of the shell code) – Volker Siegel Aug 19 '16 at 12:41
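The usual way to avoid the injection problem discussed in these comments is to hand the file names to sh as positional parameters rather than pasting them into the code string. The following is a sketch of that approach, not the answerer's own fix; the scratch directory, the sample files and the INJECTED marker name are invented for the demo.

```shell
# Scratch directory with sample files (hypothetical names).
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'must have\n' > 'abc defg.txt'
printf 'must have\n' > '$(touch INJECTED).txt'   # hostile-looking name
printf 'must have\ncannot have\n' > 'both.txt'

# The inline script is a fixed string. find appends the file names after the
# "sh" placeholder (which becomes $0), so they arrive as "$1", "$2", ... and
# are only ever used as data, never parsed as shell code.
result=$(find . -name '*.txt' -exec sh -c '
    for f do
        grep -q -- "must have" "$f" && grep -L -- "cannot have" "$f"
    done' sh {} + | LC_ALL=C sort)
printf '%s\n' "$result"
```

If the name were interpreted as code, a file called INJECTED would appear in the directory; with the names passed as arguments it should not, and the odd name is simply listed alongside ./abc defg.txt.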
