I want to create a custom list of (scientific) words for purposes like spell checking and OCR, based on my collection of scientific papers in PDF format. Using `pdftotext` I can easily create a text file which contains the wanted words for my scientific field. However, the file will be polluted with:
- words which are not specific to science (and which would also be contained in a common dictionary)
- words which result from improper conversion of e.g. formulas (including words that contain special characters, etc.)
I want to get rid of the latter by requiring that individual words have a minimum length, contain no special characters, and appear several times in the list. Secondly, I want to get rid of the former by comparing against a second word list. My questions:
Does this sound like a good plan to you? Are there existing tools for this task? How would you do it?
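As a rough sketch of what I have in mind (the dictionary path, minimum length, and frequency threshold below are only placeholders), something along these lines with standard tools might work:

```sh
#!/bin/sh
# Convert all PDFs to text and split the output into one word per line.
for f in *.pdf; do
    pdftotext "$f" -
done |
tr -cs '[:alpha:]' '\n' |            # drop digits, punctuation, special characters
tr '[:upper:]' '[:lower:]' |         # normalise case
grep -E '^[[:alpha:]]{4,}$' |        # require a minimum word length (4 here)
sort | uniq -c |                     # count how often each word appears
awk '$1 >= 3 { print $2 }' |         # keep words that appear several times (3 here)
sort -u > candidates.txt

# Drop words that a common dictionary already knows (path is a placeholder).
tr '[:upper:]' '[:lower:]' < /usr/share/dict/words | sort -u > common.txt
comm -23 candidates.txt common.txt > scientific-words.txt
```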
Does `pdftotext -raw` make any difference? You can always use `perl -pe` to insert whatever perl processing you want to do. If you give some sample of the problematic output in your question, maybe we can assist. – Stéphane Chazelas May 19 '13 at 12:47

I now insert `perl -lne '$_=~s/^[^\s]+\s+//; $_=~s/\s+[^\s]+$//; $_=~s/\s{2,}/ /g; if(length($_) > 40){ print $_; }' |` after the initial `pdftotext`, which seems to clean many corrupted words. For instance, I found that words with umlauts such as 'Jürgen' are sometimes broken in the pdftotext output as 'J"\nurgen'; 'urgen' is then added to the word list unless I filter it out by removing the first and last word of each line, as this perl line does. The man page of `pdftotext` recommends not to use the `-raw` switch. – highsciguy May 19 '13 at 14:18
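For context, the filter from the comment above might sit in the pipeline roughly like this (the input file name and the final word-splitting step are my own additions):

```sh
# Strip the first and last word of each line (often fragments created by
# line breaks such as 'J"\nurgen'), keep only reasonably long lines,
# then split the remainder into one word per line.
pdftotext paper.pdf - |
perl -lne '$_=~s/^[^\s]+\s+//; $_=~s/\s+[^\s]+$//; $_=~s/\s{2,}/ /g; if(length($_) > 40){ print $_; }' |
tr -s '[:space:]' '\n' > words.txt
```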