20

Is there a command-line tool to text-search a docx file? I tried grep, but it doesn't work with docx even though it works fine with txt and xml files. I could convert the docx to txt first, but I'd prefer a tool that operates directly on docx files. I need the tool to work under Cygwin.

OP edit: I later found that the easiest way to do this is to convert the docx files to txt and then grep over them.
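A minimal sketch of that convert-then-grep approach, assuming unzip and GNU sed (the sed that Cygwin ships) are installed. The function names docx_xml_to_text, docx_to_text and docx_convert_all are made up for illustration. The conversion is crude: it reads only word/document.xml (so headers, footers and footnotes are skipped), ends each paragraph with a newline, strips every other tag, and decodes only the &lt;, &gt; and &amp; entities.

```shell
# Hypothetical helper: turn WordprocessingML on stdin into rough plain text.
# Assumes GNU sed, which accepts \n in the replacement text.
docx_xml_to_text() {
  sed -e 's/<\/w:p>/\n/g' \
      -e 's/<[^>]*>//g' \
      -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}

# Hypothetical helper: print the rough plain text of one .docx file.
docx_to_text() {
  unzip -p "$1" word/document.xml 2>/dev/null | docx_xml_to_text
}

# Hypothetical helper: write a .txt file next to every .docx in the
# current directory.
docx_convert_all() {
  local f
  for f in ./*.docx; do
    [ -e "$f" ] || continue   # the glob matched nothing
    docx_to_text "$f" > "${f%.docx}.txt"
  done
}
```

After running docx_convert_all in a directory, something like grep -il 'my example sentence' ./*.txt lists the matching files.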

RoundPi
  • 301

5 Answers

7

My grep solution, as a function you can paste into your .bashrc:

docx_search(){ local arg terms=() root=${root:-/}; for arg; do terms+=(-e "$arg"); done; find 2>/dev/null "${root%/}/" -iname '*.docx' -exec bash -c "$(declare -p terms)"'; for arg; do unzip -p "$arg" 2>/dev/null | grep --quiet --ignore-case --fixed-strings "${terms[@]}" && printf %s\\n "$arg"; done' _ {} +; }

It looks for any (case-insensitive) occurrence of its arguments and prints the location of each matching docx file.


Examples:

$ docx_search 'my example sentence'
/cygdrive/d/example sentences.docx
/cygdrive/c/Users/my user/Documents/example sentences.docx
$ root='/cygdrive/c/Users/my user/' docx_search 'seldom' 'full sentence'
/cygdrive/c/Users/my user/Documents/example sentences.docx
$ 

Readable version:

docx_search(){
  local arg terms=() root=${root:-/}
  # Setting 'root' lets you search a specific location such as /cygdrive/c/ instead of the whole machine
  for arg; do terms+=(-e "$arg"); done
  # declare -p injects the search terms into the command string run by the child bash
  find 2>/dev/null "${root%/}/" -iname '*.docx' -exec \
    bash -c "$(declare -p terms)"';
      for arg; do
        unzip -p "$arg" 2>/dev/null |
          grep --quiet --ignore-case --fixed-strings "${terms[@]}" &&
          printf %s\\n "$arg"
      done' _ {} +
}
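
To see what the declare -p step does on its own, here is a small sketch. It needs bash, like the function above. The name show_injected_terms is hypothetical; it builds the same terms array, but the child bash just prints each element in angle brackets instead of running grep.

```shell
# Hypothetical demo of passing an array to a child bash via declare -p.
show_injected_terms() {
  local arg terms=()
  for arg; do terms+=(-e "$arg"); done
  # declare -p prints a quoted 'declare -a terms=(...)' statement. The child
  # bash runs that statement first, which recreates the array word for word.
  bash -c "$(declare -p terms)"'; printf "<%s>\n" "${terms[@]}"'
}
```

Running show_injected_terms 'full sentence' "it's" prints four lines: <-e>, <full sentence>, <-e> and <it's>. Arguments containing spaces or quotes survive because declare -p quotes them.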
4

I know of several indexing tools that support Word documents. Such tools allow you to index documents, then efficiently search words in the index. They don't permit full text searches.

2

DOCX is a compressed format, not plain text, so you need a converter first. After that you can use the find command on the converted file(s).
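
One quick way to check this yourself: a docx file is a ZIP archive, and ZIP archives begin with the two bytes PK. The sketch below tests for that signature. looks_like_zip is a hypothetical helper name, and head -c is assumed to be available (it is part of GNU coreutils, which Cygwin ships).

```shell
# Hypothetical helper: succeed if the file starts with the ZIP signature "PK".
looks_like_zip() {
  [ "$(head -c 2 -- "$1" 2>/dev/null)" = "PK" ]
}
```

For a real document, unzip -l file.docx lists the archive members; the body text lives in word/document.xml.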

Nils
  • 18,492
  • Or you can use a search tool that can read inside compressed files. In your last sentence, I suppose you meant grep? – Gilles 'SO- stop being evil' Jan 06 '12 at 23:32
  • @Gilles - look at the original title of the question before Michael edited it. This seemed to be a question about DOS (and I flagged it off-topic). – Nils Jan 07 '12 at 20:14
1

Here's an updated version optimized for performance.

It requires ripgrep and fd-find. Here's how to install them on Debian or Ubuntu if you do not have them.

fd-find:

sudo apt install fd-find

ripgrep:

curl -LO https://github.com/BurntSushi/ripgrep/releases/download/13.0.0/ripgrep_13.0.0_amd64.deb
sudo apt install ./ripgrep_13.0.0_amd64.deb

Paste this in your .bashrc:


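docxgrep() {
    # Keep the variables local so they don't leak into the interactive shell
    local keyword="$1" arg

    # IFS= and -r preserve whitespace and backslashes in file names
    /usr/bin/fdfind -t f -e docx . | while IFS= read -r arg; do
        # -e keeps a keyword that starts with '-' from being read as an option
        if unzip -p "$arg" 2>/dev/null | rg -q --ignore-case --fixed-strings -e "$keyword"; then
            printf '%s\n' "$arg"
        fi
    done
}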

Run source ~/.bashrc. Now you can search:

$ docxgrep 'hello'
./Document.docx
Cyrill
  • 11
0

Have you looked at openoffice ninja?
(I don't know whether it supports Cygwin.)

bsd
  • 11,036