-1

I want to count the lines that contain both of the words/patterns "gene" and "+". Is it possible to do this with grep?

DopeGhoti
  • 76,081
Parnian
  • 103
  • 5
    Will the word gene always occur before + on the lines that you are interested in? Would the basic regular expression gene.*+ be enough? Do you need to filter out lines that contain words like genes or thegene (i.e. where gene is just a substring and not its own word)? Can you show some example data? – Kusalananda Oct 15 '20 at 14:41
  • 1
  • You can just look for the first word and pipe that list to another grep for the second word: grep gene file | grep +. That acts as a kind of "and" operator. You also need to consider all the questions Kusalananda is asking. – nobody Oct 15 '20 at 15:20
  • @glennjackman I think the goal is to get number of lines and not line numbers. – nobody Oct 15 '20 at 15:26
  • I read the question more carefully after I commented: I agree. – glenn jackman Oct 15 '20 at 15:26
  • As glenn jackman pointed out, the question is about the number of lines. So wc should be used at the end to count the lines: grep gene file | grep + | wc -l. – nobody Oct 15 '20 at 15:28
  • @nobody or just make the 2nd grep grep -c + to count matching lines – steeldriver Oct 15 '20 at 15:36
  • @nobody there's no need for wc, you can use grep -c. – terdon Oct 15 '20 at 15:48
  • @Kusalananda the word gene always appears as "gene" and is its own word and it always comes before '+'. – Parnian Oct 15 '20 at 15:48
  • Parnian, I gave you an answer assuming gff/gtf files. You might also be interested in our sister site, [bioinformatics.se]. – terdon Oct 15 '20 at 15:49

1 Answer

1

Yes, you can do this with grep:

grep -c '\bgene\b.*+' file

That will count the lines where gene appears as a separate word (the \b means "word boundary" and is a GNU grep extension) and is followed, somewhere later on the same line, by a literal +. In a basic regular expression an unescaped + is an ordinary character, so it needs no backslash here. The -c flag tells grep to print the number of matching lines instead of the lines themselves. If you also need to find cases where the + comes before gene, you can do:

grep -Ec '(gene.*\+)|(\+.*gene)' file

This, however, will also match things like "Eugene+Mary came for dinner", which is probably not what you want. Given the words you are looking for, I am guessing that you are working with gff/gtf files. In that case you might want to do something more sophisticated: only look for gene in the third field of each line and + in the seventh, and skip lines that start with a # (the gff headers). If this is indeed what you need, you can do:

awk -F"\t" '!/^#/ && $3=="gene" && $7=="+"{c++} END{print c+0}' file
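To see why the field-based count can differ from a plain text match, here is a small sketch you can run. The sample file and its contents are made up for illustration (they are not from the question), and the -w option of grep is assumed to be available, as it is in GNU and BSD grep.

```shell
#!/bin/sh
# Hypothetical sample data: a tiny tab-separated, GFF-like file.
sample=$(mktemp)
printf '%s\n' '##gff-version 3' > "$sample"
printf 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1\n' >> "$sample"
printf 'chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tParent=g1;Note=gene\n' >> "$sample"
printf 'chr1\tsrc\tgene\t1500\t2000\t.\t-\t.\tID=g2;Note=a+b\n' >> "$sample"
printf 'chr2\tsrc\tgene\t50\t400\t.\t+\t.\tID=g3\n' >> "$sample"

# Text match: lines containing the whole word "gene" and a literal "+",
# in any order. -w matches whole words, -F treats "+" as a fixed string,
# -c prints the number of matching lines.
text_count=$(grep -w 'gene' "$sample" | grep -cF '+')

# Field match: feature type (column 3) is "gene" and strand (column 7) is "+",
# skipping header lines that start with "#". c+0 prints 0 when nothing matches.
field_count=$(awk -F'\t' '!/^#/ && $3=="gene" && $7=="+"{c++} END{print c+0}' "$sample")

echo "text match:  $text_count"
echo "field match: $field_count"

rm -f "$sample"
```

With this sample the text match reports 4 and the field match reports 2: the mRNA line and the minus-strand gene line both contain "gene" and "+" somewhere in their text, but only two lines are actually genes on the + strand.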
terdon
  • 242,166