I need to check if two (specified) words exist on any line in a text file. There are no limits for the characters of the words. For example:

I want to find lines of a text file that contain the two words “cat” and “elephant” together (i.e., on the same line; not necessarily side-by-side):

Cat is smaller than elephant
Elephant is larger than cat
Cats are cute!
Elephants are very strong
Cat and elephants live in different environments
cats are friendly

In the previous examples, how can I find the lines containing both words?

Cat is smaller than elephant
Elephant is larger than cat
Cat and elephants live in different environments

I tried grep and awk without success. The problem is that the words can appear in upper or lower case, so how can I match both words regardless of letter case?

Kusalananda

4 Answers

8

With grep

grep -i "cat" file | grep -i "elephant"

Cat is smaller than elephant
Elephant is larger than cat
Cat and elephants live in different environments

The -i flag makes grep ignore case (upper/lower):

 -i, --ignore-case         ignore case distinctions

or awk

awk 'BEGIN{IGNORECASE=1} /cat/&&/elephant/{print $0}' file

@glenn jackman suggested that the awk statement can be shortened as follows (note that IGNORECASE is specific to GNU awk):

awk '/cat/&&/elephant/' IGNORECASE=1 file
  • The {print $0} block is optional since it is the default action. – glenn jackman Oct 17 '18 at 01:54
  • I would leave your answer as it is. FYI the awk command can be "golfed" to awk '/cat/&&/elephant/' IGNORECASE=1 file -- also I believe IGNORECASE is specific to GNU awk – glenn jackman Oct 17 '18 at 01:58
  • 1
    @glenn: IGNORECASE is indeed GNU; tolower($0)~/cat/ (or similarly with toupper) is standard, but may give undesired results in some (non-English) locales with accented letters and especially Turkish with its dotted and dotless i's. – dave_thompson_085 Oct 17 '18 at 04:49
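As the comment above points out, tolower() is standard awk, so the GNU-only IGNORECASE variable can be avoided. The following is a minimal sketch of that idea, not a definitive solution: the temporary sample file (created here with mktemp, assumed to be available) and the variable names are made up for illustration, and the locale caveat about tolower() still applies.

```shell
# Build a throwaway sample file holding the lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Cat is smaller than elephant
Elephant is larger than cat
Cats are cute!
Elephants are very strong
Cat and elephants live in different environments
cats are friendly
EOF

# Lowercase a copy of each line, then test both substrings against the copy.
# The second rule has no action, so awk prints the original, unmodified line.
matches=$(awk '{ line = tolower($0) } line ~ /cat/ && line ~ /elephant/' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

Like the IGNORECASE version, this matches substrings, so "catch" would still count as a match for "cat".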
6
$ grep -Fiw cat <file | grep -Fiw elephant
Cat is smaller than elephant
Elephant is larger than cat

We first extract all lines from the file named file that contain the word cat, and then narrow those lines down to the ones that also contain the word elephant.

This is done using grep -F -i -w where

  • -F makes grep treat the pattern as a fixed string, not as a regular expression,
  • -i makes grep do case-insensitive matching, and
  • -w makes grep match complete words only.

The -w option is an extension to the POSIX standard for grep, but it is implemented by most common grep implementations. It disallows a match of the given pattern when the matching string is part of a longer word.

Note that I'm not matching the line

Cat and elephants live in different environments

This is due to the final s in elephants. I would also not match the line

elephantiasis is catastrophic

for the same reason.

If you want to allow for a plural s at the end of the words, use

$ grep -Eiw 'cats?' <file | grep -Eiw 'elephants?'
Cat is smaller than elephant
Elephant is larger than cat
Cat and elephants live in different environments

Here, we use an (extended) regular expression instead of a fixed string in both invocations of grep. The expressions will match an optional s at the end of the two words. Now we match cat and cats (case-insensitively), but would not match catnip, catsup, or scat.
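Since the question says the two words may contain arbitrary characters, the fixed-string pipeline can be wrapped in a small shell function. This is only a sketch under stated assumptions: both_words is a hypothetical helper name, the sample file is created with mktemp purely for the demonstration, and -w is the non-POSIX extension described above.

```shell
# both_words is a hypothetical helper, not a standard utility.
# Usage: both_words WORD1 WORD2 FILE
both_words() {
    # -F: fixed string, -i: ignore case, -w: complete words only.
    # -e keeps a word that starts with "-" from being parsed as an option,
    # and -- does the same for the file name.
    grep -Fiw -e "$1" -- "$3" | grep -Fiw -e "$2"
}

# Throwaway sample file holding the lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Cat is smaller than elephant
Elephant is larger than cat
Cats are cute!
Elephants are very strong
Cat and elephants live in different environments
cats are friendly
EOF

matches=$(both_words cat elephant "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

Because the function uses -F and -w, it does not match plurals such as "elephants"; swapping in the -E variant trades the safety of fixed strings for that flexibility.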

Kusalananda
3

with GNU sed:

sed -n '/cat/I {/elephant/I p}' file

or perl

perl -ne 'print if /cat/i and /elephant/i' file

or a single grep

grep -i -e 'cat.*elephant' -e 'elephant.*cat' file
glenn jackman
3

You can do it in non-GNU awk by using the “poor man’s” trick to get case insensitivity:

awk  '/[Cc][Aa][Tt]/ && /[Ee][Ll][Ee][Pp][Hh][Aa][Nn][Tt]/'  file
where, just as [aeiou] matches any one of a, e, i, o or u, [Ee] matches either E or e; that is, a case-insensitive match for “e”.

Note that this approach (like all the other answers posted here so far) will match the line

There are many ways to catch an elephant.
because the word “catch” contains the string “cat”.  If you want to avoid this, try
awk  '/(^|\W)[Cc][Aa][Tt](\W|$)/ && /(^|\W)[Ee][Ll][Ee][Pp][Hh][Aa][Nn][Tt](\W|$)/'  file
where you constrain each word to be preceded by a non-word character (or the beginning of the line) and followed by a non-word character (or the end of the line). Here \W matches a non-word character, i.e., a space, a tab, or any other non-alphanumeric * character.

(I’m not sure whether this is POSIX-compliant.)

Note that this will now not match

Cat and elephants live in different environments
because the word “elephants” is not the same as the word “elephant”.
__________________________
* In this context, underscore (the “_” character) counts as a letter.
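
The answer above is unsure whether \W is POSIX-compliant. One way to sidestep the question is to spell out the non-word class as an ordinary bracket expression, which is part of POSIX extended regular expressions. The sketch below assumes ASCII text (hence LC_ALL=C, so the ranges behave predictably) and uses a throwaway sample file created with mktemp, assumed to be available; it is an illustration, not a tested replacement for every awk.

```shell
# Throwaway sample file: the lines from the question plus the "catch" example.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Cat is smaller than elephant
Elephant is larger than cat
Cats are cute!
Elephants are very strong
Cat and elephants live in different environments
cats are friendly
There are many ways to catch an elephant.
EOF

# [^A-Za-z0-9_] stands in for \W: any character that is not a letter,
# a digit, or an underscore. (^|...) and (...|$) also accept line edges.
matches=$(LC_ALL=C awk '/(^|[^A-Za-z0-9_])[Cc][Aa][Tt]([^A-Za-z0-9_]|$)/ && /(^|[^A-Za-z0-9_])[Ee][Ll][Ee][Pp][Hh][Aa][Nn][Tt]([^A-Za-z0-9_]|$)/' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

As with the \W version, the "catch an elephant" line and the "elephants" line are both rejected, so only the two lines containing the exact words are printed.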