I use the pdftotext utility to convert PDF files to text:
pdftotext *.pdf *.txt
It works, but it does not handle line breaks properly: it inserts newlines where the original had none. Is there another utility that does this job better? If not, would sed help to remove the extra newlines?
Depends on what you want.
If you want to retain the layout, so that tables of contents are faithfully represented, for example, you have the -layout flag. If you want a raw stream, you have the -raw flag, though it doesn't work as efficiently as I suspect you want it to.
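For reference, both flags drop into the same invocation (test.pdf here is just a placeholder name):
$ pdftotext -layout test.pdf layout.txt   # preserve the page layout (columns, tables of contents)
$ pdftotext -raw test.pdf raw.txt         # emit text in content-stream order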
What I would suggest is to first convert it to a text file. Let us take test.pdf as an example. Then,
$ pdftotext test.pdf test.txt
This creates a file called test.txt which contains the output from the pdftotext utility. Then, we put the newly created text file through a bit of Perl code:
$ perl -0pe 's/([^\n])\n([^\n])/$1 $2/g' test.txt > final.txt
And there you have it. The final.txt file should have what you want. You can copy the Perl code as-is and change the file names to your liking. Hope that helps.
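To see what the one-liner does, here is a quick check on a made-up two-paragraph sample (the sample text and file name are invented for illustration): single newlines inside a paragraph become spaces, while the blank line between paragraphs survives.
$ printf 'first line\nwrapped line\n\nnew paragraph\n' > sample.txt
$ perl -0pe 's/([^\n])\n([^\n])/$1 $2/g' sample.txt
first line wrapped line

new paragraph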
$ pdftotext test.pdf - | sed -r ':a;/[a-zA-Z]$/N;s/(.)\n/\1 /;ta' > test.txt
And if that doesn't work, try:
$ pdftotext test.pdf - | fmt > test.txt
– nahorov Feb 21 '22 at 18:49
$ pdftotext 30545.pdf and then $ cat 30545.txt | sed -r ':a;/[a-zA-Z]$/N;s/(.)\n/\1 /;ta' > test.txt?
Or perhaps combine the pdftotext command with one of the flags we talked about earlier?
– nahorov Feb 23 '22 at 13:46
The Calibre e-book converter does what you want. It has a graphical user interface (GUI) and a command-line interface, which works with:
ebook-convert myfile.input_format myfile.output_format --enable-heuristics [other-options]
It can convert PDF to EPUB or raw .txt format while guessing the original paragraph structure. There are many options that help fine-tune the process; see: https://manual.calibre-ebook.com/generated/en/ebook-convert.html#heuristic-processing
The "Remove unnecessary hyphens" function is activated with `--enable-heuristics; analysis of hyphenated words is made based on a dictionary which is the text itself (if it finds the word "document" somewhere, it knows that "docu-ment" hyphenated at the margin should be de-hyphenated).
There is also --unsmarten-punctuation, which converts fancy quotes, dashes and ellipses to their plain equivalents (namely "'-...).
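That switch can be combined with the heuristics in a single call (mydoc.pdf is a placeholder, as above):
ebook-convert mydoc.pdf mydoc.txt --enable-heuristics --unsmarten-punctuation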
There is also the --html-unwrap-factor parameter, described as: "Scale used to determine the length at which a line should be unwrapped. Valid values are a decimal between 0 and 1. The default is 0.4, just below the median line length. If only a few lines in the document require unwrapping this value should be reduced".
For my test document, the default worked fine; still, results were even better with lower values:
ebook-convert mydoc.pdf mydoc.txt --enable-heuristics --html-unwrap-factor 0.2
The result is good enough for further processing (e.g. text alignment of pairs of documents to create translation memory .tmx files).
On Linux, to process all .pdf files in (sub-)folders, do:
find . -name "*.pdf" | while IFS= read -r file; do
  if [ ! -e "${file}.txt" ]; then
    ebook-convert "$file" "${file}.txt" --enable-heuristics --html-unwrap-factor 0.2
  fi
done
(Note: the filenames must not start with a hyphen.)
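If filenames might contain newlines, a null-delimited variant of the same loop should be more robust (this assumes GNU find and bash):
find . -name "*.pdf" -print0 | while IFS= read -r -d '' file; do
  if [ ! -e "${file}.txt" ]; then
    ebook-convert "$file" "${file}.txt" --enable-heuristics --html-unwrap-factor 0.2
  fi
done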
The GUI manual is at: https://manual.calibre-ebook.com/conversion.html#heuristic-processing
Note that the result is awful for sentences in (multi-column) tables, where tools like Tabula (https://tabula.technology/) will help.
[Screenshot of an example use: the .epub output makes it clearer that paragraphs are correctly detected.]
ebook-convert works great and produces a much better txt! I also gave a minimal conversion example at: https://superuser.com/questions/207603/how-to-extract-text-from-pdf-in-script-on-linux/1810994#1810994 – Ciro Santilli OurBigBook.com Oct 03 '23 at 06:57
pdftotext makes a best-effort attempt to output the text layer, but it has no way of knowing where paragraphs begin or end because that information is NOT in the PDF. I've been using pdftotext for a very long time now (years, probably decades) and I have just come to accept that I'll have to do a LOT of manual text editing with vi to clean it up. – cas Feb 22 '22 at 02:00
You can use fmt, par, or a script written in perl or awk or whatever to reformat the paragraphs. I personally do a lot of manual editing with vi and automate a lot of common stuff with perl scripts I wrote for particular "text cleaning" jobs (like dealing with multiple columns from pdftotext -layout, nuking abominations like "smart" quotes, fixing weird character set issues, reformatting tables, etc). It's non-trivial, and will likely never be possible to 100% automate. – cas Feb 22 '22 at 02:05
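As a rough sketch of the kind of "text cleaning" one-liner cas describes (assuming UTF-8 input; the file names and the exact rule set are illustrative, and a real cleanup script would accumulate many more rules):
$ perl -CSD -pe '
    s/[\x{2018}\x{2019}]/\x27/g;   # curly single quotes -> straight apostrophe
    s/[\x{201C}\x{201D}]/"/g;      # curly double quotes -> straight quotes
    s/[\x{2013}\x{2014}]/-/g;      # en and em dashes -> hyphen
    s/\x{2026}/.../g;              # ellipsis character -> three dots
  ' test.txt > cleaned.txt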