How to search for lines that fulfill more than one criterion in Unix?

Question

I want to find the number of lines that contain both of the words/patterns "gene" and "+". Is it possible to do this with grep?

Will the word gene always occur before + on the lines that you are interested in? Would the basic regular expression gene.*+ be enough? Do you need to filter out lines that contain words like genes or thegene (i.e. where gene is just a substring and not its own word)? Can you show some example data? — Kusalananda, Oct 15 '20 at 14:41
You can just look for the first word and pipe the result to another grep that looks for the second word: grep gene | grep +. That acts as a kind of AND operator. You also need to consider all the questions Kusalananda is asking. — nobody, Oct 15 '20 at 15:20
@glennjackman I think the goal is to get number of lines and not line numbers. — nobody, Oct 15 '20 at 15:26
I read the question more carefully after I commented: I agree. — glenn jackman, Oct 15 '20 at 15:26
As glenn jackman pointed out, the question is about the number of lines, so wc should be used at the end to count them: grep gene | grep + | wc -l. — nobody, Oct 15 '20 at 15:28
@nobody or just make the 2nd grep grep -c + to count matching lines — steeldriver, Oct 15 '20 at 15:36
@Kusalananda the word gene always appears as "gene" and is its own word and it always comes before '+'. — Parnian, Oct 15 '20 at 15:48
Parnian, I gave you an answer assuming gff/gtf files. You might also be interested in our sister site, [bioinformatics.se]. — terdon, Oct 15 '20 at 15:49

score 1 · Accepted Answer · answered Oct 15 '20 at 15:48

Yes, you can do this with grep:

grep -c '\bgene\b.*+' file

That will look for lines where gene appears first and as a separate word (the \b means "word boundary"; it is a GNU grep extension), followed later on the same line by a literal +. In a basic regular expression, an unescaped + is an ordinary character. The -c flag tells grep to print the number of matching lines. If you also need to find cases where the + comes before gene, you can do:

grep -Ec '(gene.*\+)|(\+.*gene)' file

This, however, will also match things like "Eugene+Mary came for dinner", which is probably not what you want. Given the words you are looking for, I am guessing that you are looking at gff/gtf files, so you might want to do something more sophisticated: only look for gene in the third field of each line and + in the seventh, on lines that don't start with a # (the gff headers). If this is indeed what you need, you can do:

awk -F'\t' '!/^#/ && $3=="gene" && $7=="+"{c++} END{print c+0}' file
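To see why the field-based check can matter, here is a small sketch run against a made-up GFF-like file. The sample records, the IDs and the temporary file are all invented for illustration; real gff/gtf files may differ. It compares the awk count with the looser either-order grep pattern shown earlier.

```shell
# Hypothetical sample data: a tiny tab-separated GFF-like file (invented for this example).
tmp=$(mktemp)
printf '##gff-version 3\n' > "$tmp"
printf 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene1\n' >> "$tmp"
printf 'chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n' >> "$tmp"
printf 'chr1\tsrc\tgene\t1500\t2400\t.\t-\t.\tID=gene2\n' >> "$tmp"
printf 'chr2\tsrc\tgene\t50\t700\t.\t+\t.\tID=gene3\n' >> "$tmp"

# Field-based count: skip header lines, require field 3 to be exactly "gene"
# and field 7 to be exactly "+". c+0 prints 0 instead of an empty line when nothing matches.
gff_count=$(awk -F'\t' '!/^#/ && $3=="gene" && $7=="+" {c++} END {print c+0}' "$tmp")

# Loose text-based count: "gene" and "+" anywhere on the line, in either order.
loose_count=$(grep -Ec '(gene.*\+)|(\+.*gene)' "$tmp")

echo "awk (fields 3 and 7): $gff_count"    # 2 on this sample
echo "grep (anywhere):      $loose_count"  # 3 on this sample

rm -f "$tmp"
```

On this sample the grep count is one higher because the mRNA line has + in the strand column and the substring gene inside its Parent=gene1 attribute.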

For the Eugene case and grep, we can use word boundary markers: grep -Ec '(\<gene\>.*\+)|(\+.*\<gene\>)' file — glenn jackman, Oct 15 '20 at 17:55
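For plain text rather than gff/gtf, the variants discussed in this thread can be compared side by side. This is only a sketch: the sample sentences are invented, and it assumes GNU grep, since \< and \> are extensions. The -w pipeline at the end is a suggested variation on the two-grep idea from the comments, using grep's -w (whole word) and -F (fixed string) options.

```shell
# Hypothetical sample lines, invented to exercise each pattern.
sample=$(mktemp)
cat > "$sample" <<'EOF'
the gene is on the + strand
Eugene+Mary came for dinner
strand + carries this gene
genes on the + strand
a gene on the - strand
EOF

# Either order, but "gene" may be part of a longer word (Eugene, genes).
loose=$(grep -Ec '(gene.*\+)|(\+.*gene)' "$sample")

# Either order, with "gene" required to be a whole word (GNU \< and \>).
strict=$(grep -Ec '(\<gene\>.*\+)|(\+.*\<gene\>)' "$sample")

# Two greps chained as an AND: whole-word "gene", then count lines with a literal "+".
piped=$(grep -w 'gene' "$sample" | grep -cF '+')

echo "loose:  $loose"   # 4 on this sample
echo "strict: $strict"  # 2 on this sample
echo "piped:  $piped"   # 2 on this sample

rm -f "$sample"
```

The chained form does not care which of the two comes first on the line, and -F avoids having to remember whether + needs escaping in the current regex flavour.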
