2

I use the following command to search recursively for the string ELF in plain text files under the current directory:

grep ELF -r .

but it also searches binary files (e.g. zip and PDF files) and code files such as HTML and .js files.

How can I restrict the search to plain text files that are not source code?

Tim
  • 101,790
  • Yes, it outputs "Binary file matches", which I don't even want. For matches in js and html files, the output is too long because their contents are long, which makes matches in plain text files difficult to see. – Tim Mar 15 '15 at 01:04
  • Thanks for the suggestions. How can I limit my query to only files within a certain length, and how can I do an initial query which displays only a limited number of matches per file? – Tim Mar 15 '15 at 01:11
  • HTML and JS files are text files. If you only want to search in a subset of text files, tell us which subset. – Gilles 'SO- stop being evil' Mar 15 '15 at 21:12
  • @Gilles: only plain text files which are not source code. – Tim Mar 15 '15 at 21:16

2 Answers

5

With GNU grep, pass --binary-files=without-match to ignore binary files. Source code files are text files, so they will be included in the results.
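As a minimal sketch of that option, assuming GNU grep and a throwaway demo directory (the file names below are hypothetical and created only for illustration), a file containing a NUL byte is silently skipped instead of producing a "Binary file ... matches" line:

```shell
# Hypothetical demo tree: one plain text file and one file containing a NUL byte.
demo=$(mktemp -d)
printf 'ELF in plain text\n' > "$demo/notes.txt"
printf 'ELF\0binary\n' > "$demo/blob.bin"

# -I is the short form of --binary-files=without-match in GNU grep.
matches=$(cd "$demo" && grep -r --binary-files=without-match -e ELF .)
printf '%s\n' "$matches"
```

Only the match from notes.txt should be printed; blob.bin is treated as binary and ignored.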

If you want to ignore text files with certain extensions, you can use the --exclude option, e.g.

grep -r --exclude='*.html' --exclude='*.js' …

or you can instead include only explicitly-matching files, e.g.

grep -r --include='*.txt' …

If you want to ignore text files that are source code, you can use the file command to guess which files are source code. It relies on heuristics, so it may classify source code as non-source-code or vice versa.

find . -type f -exec sh -c '
  # $0 is the regexp; the remaining arguments are the files found
  for x do
    case $(file - <"$x") in
      *source*) :;; # looks like source code
      *text*) grep -H -e "$0" "$x";; # looks like text
      # else: looks like binary
    esac
  done
' "REGEXP" {} +

or

find . -type f -exec sh -c '
  # -b omits the "/dev/stdin:" prefix so the MIME type comes first
  for x do
    case $(file -b -i - <"$x") in
      text/plain\;*) grep -H -e "$0" "$x";; # looks like text
      # else: looks like source code or binary
    esac
  done
' "REGEXP" {} +
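Because the classification is heuristic, it can help to check what file reports for a few representative files before relying on it. The following is only a sketch, assuming a file implementation that supports -b and --mime-type (the common libmagic-based one does); the sample files are hypothetical and created just for the demonstration:

```shell
# Hypothetical sample files, created only for this demonstration.
demo=$(mktemp -d)
printf 'Just a few words of prose.\n' > "$demo/notes.txt"
printf '#include <stdio.h>\nint main(void) { return 0; }\n' > "$demo/main.c"

# -b drops the file name prefix; --mime-type prints only the MIME type.
for f in "$demo"/*; do
  printf '%s\t%s\n' "${f##*/}" "$(file -b --mime-type "$f")"
done

# Keep the type of the prose file for later inspection.
txt_type=$(file -b --mime-type "$demo/notes.txt")
```

Files reported as text/plain would be searched by the second script above; the exact type reported for source files depends on the version of file and its magic database.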

Alternatively, you may use ack instead of grep. Ack integrates a file classification system based on file names. It's geared towards searching in source code by default, but you can tell it to search different types by passing the --type option. The related question "Search ALL files with ack" may help.

1

If you want to restrict the search by file extension only, you can use grep's --include option:

grep -R --include="*.txt" "pattern" /path/to/dir/

Another approach is to eliminate files that are not text. This still includes html and js files, so (after an update to this answer) they are excluded with the --exclude option, for example:

find /path/to/dir -type f -print | xargs file | grep text | cut -f1 -d: | xargs grep --exclude=\*.{js,html} "pattern"

As mentioned in a comment, you can also use the --exclude-from=FILE option.
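A minimal sketch of --exclude-from, assuming GNU grep: the option reads file-name globs, one per line, from the given file. The demo tree and file names here are hypothetical:

```shell
# Hypothetical demo tree, created only for this demonstration.
demo=$(mktemp -d)
printf 'pattern in prose\n' > "$demo/notes.txt"
printf 'var s = "pattern";\n' > "$demo/app.js"
printf '<p>pattern</p>\n' > "$demo/index.html"

# One glob per line; the list lives outside the searched tree.
globs=$(mktemp)
printf '%s\n' '*.js' '*.html' > "$globs"

result=$(grep -r --exclude-from="$globs" -e "pattern" "$demo")
printf '%s\n' "$result"
```

Keeping the globs in a file makes a long exclusion list easier to maintain than repeating --exclude on the command line.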

taliezin
  • 9,275