
I want to create a custom list of (scientific) words, for purposes like spell checking and OCR, based on my collection of scientific papers in PDF format. Using pdftotext I can easily create a text file which contains the words I want for my scientific field. However, the file will be polluted with

  • words which are not specific to science (and which would also be contained in a common dictionary)
  • words which result from improper conversion of e.g. formulas (including words that contain special characters, etc.)

I want to get rid of the latter by requiring that individual words have a minimum length, contain no special characters, and appear several times in the list. Secondly, I want to get rid of the former by comparing against a second word list. My questions:

Does this sound like a good plan to you? Are there existing tools for this task? How would you do it?

highsciguy

2 Answers

3

To select words of at least 4 characters that appear at least 5 times in the PDF files in the current directory and are not found in /usr/share/dict/words:

 find . -name '*.pdf' -exec pdftotext {} - \; |
   tr -cs '[:alpha:]' '[\n*]' |
   tr '[:upper:]' '[:lower:]' |
   grep -E '.{4}' |
   sort |
   uniq -c |
   awk '$1 > 4 {print $2}' |
   comm -23 - <(tr '[:upper:]' '[:lower:]' < /usr/share/dict/words|sort -u)

You need a shell with support for process substitution (ksh, zsh or bash).
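If your shell lacks process substitution, the same comparison works with a temporary file for the lowercased dictionary (a sketch under that assumption; the temporary path is arbitrary):

 tr '[:upper:]' '[:lower:]' < /usr/share/dict/words | sort -u > /tmp/words.lc

and then replace the last stage of the pipeline with comm -23 - /tmp/words.lc (comm reads the first list from stdin when given -).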

If you're going to use perl anyway, you can also do the whole thing in perl:

find . -name '*.pdf' -exec pdftotext {} - \; |
  perl '-Mopen ":locale"' -nle '
     s/^\S+//;s/\S+$//;y/ \t/ /s;
     next unless length > 40;
     $w{lc$_}++ for /[[:alpha:]]{4,}/g;
     END{open W,"</usr/share/dict/words";
     while(<W>){chomp;delete $w{lc$_}};
     print for grep {$w{$_}>4} keys %w}'
  • I've had good experience with this one. However, I find some words in the result which are incomplete or run together. I think some of this pollution may come from words at the beginning or end of lines, since lines are often broken in the PDF files. I know how I would remove them in perl, but how would I do it in awk, which you use anyway? – highsciguy May 19 '13 at 12:34
  • @highsciguy, does pdftotext -raw make any difference? You can always use perl -pe to insert whatever perl processing you want to do. If you give some sample of the problematic output in your question, maybe we can assist. – Stéphane Chazelas May 19 '13 at 12:47
  • I added a line perl -lne '$_=~s/^[^\s]+\s+//; $_=~s/\s+[^\s]+$//; $_=~s/\s{2,}/ /g; if(length($_) > 40){ print $_; }' | after the initial pdftotext, which seems to clean up many corrupted words. For instance, I found that words with umlauts such as 'Jürgen' are sometimes broken in the pdftotext output as 'J"\nurgen'. 'urgen' is then added to the word list unless I filter it out by removing the first and last word of each line, as this perl line does. The pdftotext man page recommends against using the -raw switch. – highsciguy May 19 '13 at 14:18
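For reference, splicing the cleanup line from the comments above into the first pipeline gives something like this (a sketch assembled from the answer and comments, not code posted in the thread; the length and frequency thresholds are the ones already discussed):

 find . -name '*.pdf' -exec pdftotext {} - \; |
   perl -lne '$_=~s/^[^\s]+\s+//; $_=~s/\s+[^\s]+$//; $_=~s/\s{2,}/ /g; if(length($_) > 40){ print $_; }' |
   tr -cs '[:alpha:]' '[\n*]' |
   tr '[:upper:]' '[:lower:]' |
   grep -E '.{4}' |
   sort |
   uniq -c |
   awk '$1 > 4 {print $2}' |
   comm -23 - <(tr '[:upper:]' '[:lower:]' < /usr/share/dict/words|sort -u)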
1

Sounds like a very typical plan. I would use shell scripts to do this. You're not dealing with outrageously large quantities of text, so performance should be adequate, and shell scripts are easy to write and re-run. My first cut would be a script like this:

pdf2text files |
tr -cs '[A-Za-z]' '\n' |  
tr '[A-Z]' '[a-z]' |
awk '{ if (length > 6) {print $1;}}' |
fgrep -v -f /usr/share/groff/current/eign |
sort | 
uniq -c |
awk '{print $2, $1}' |
sort -nr +1 -2 |
head -20

That will get you the 20 most frequent words of length greater than 6.

You can add steps, take out steps, adjust parameters to see what you get.

The fgrep step is the only odd one, and requires that GNU troff be installed. The file /usr/share/groff/current/eign is something like the 100 highest-frequency words in English. The "-v" flag only passes words that don't appear in the "eign" file, so it's using "eign" as a stop list. If you don't like what GNU troff uses as common words, you can make your own list and use that file in the fgrep step.
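A larger stop list can be built the same way from a system word list; a sketch, assuming /usr/share/dict/words is present:

 # lowercased, de-duplicated copy of the system dictionary to use as a stop list (file name is arbitrary)
 tr '[:upper:]' '[:lower:]' < /usr/share/dict/words | sort -u > my.stop

Then swap the fgrep step for grep -F -v -x -f my.stop; the -x flag matches whole lines (here, whole words) rather than the substring matching that fgrep -f does by default.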

  • Amazing how close our solutions are (I swear I didn't read yours before coming up with mine). A few issues with yours: that's the non-standard SysV tr syntax, so in most implementations that would include the [ and ] characters. With fgrep, you're matching substrings (also note that comm on sorted files is going to be a lot more efficient). That's also the old non-standard syntax for sort and head. – Stéphane Chazelas May 18 '13 at 21:01
  • Thank you for pointing out the non-standard flags. As something of a dinosaur, I feel more familiar with them. –  May 18 '13 at 21:13