
I use the pdftotext utility to convert a PDF file to text.

pdftotext *.pdf *.txt

It works, but it does not handle line breaks properly: it inserts newlines where there were none in the original. Is there any other utility that does this job? If not, will sed help to kill the extra newlines?

  • PDF text layers don't really work in either paragraphs OR lines. Each text string (and often each individual letter) is precisely placed with PostScript-style instructions so that the text layer roughly aligns with the image on that page. pdftotext makes a best-effort attempt to output the text layer, but it has no way of knowing where paragraphs begin or end because that information is NOT in the PDF. I've been using pdftotext for a very long time now (years, probably decades) and I have just come to accept that I'll have to do a LOT of manual text editing with vi to clean it up. – cas Feb 22 '22 at 02:00
  • e.g. edit the text file created by pdftotext and add extra newlines between each paragraph. Then you could use fmt, par, or a script written in perl or awk or whatever to reformat the paragraphs (a minimal sketch of this follows these comments). I personally do a lot of manual editing with vi and automate a lot of common stuff with perl scripts I wrote for particular "text cleaning" jobs (like dealing with multiple columns from pdftotext -layout, nuking abominations like "smart" quotes, fixing weird character set issues, reformatting tables, etc). It's non-trivial, and it will likely never be possible to 100% automate. – cas Feb 22 '22 at 02:05
  • Thanks for clarifying that – Untitled Feb 22 '22 at 16:11
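
A minimal sketch of the workflow cas describes, assuming blank lines have already been inserted between paragraphs by hand (marked.txt and the output file names are placeholders):

$ fmt -w 72 marked.txt > reflowed.txt

fmt joins the lines inside each blank-line-separated paragraph and rewraps them at 72 columns. To instead unwrap each paragraph onto a single long line, a perl one-liner that joins a newline to the following line only when both sides are non-blank does the job:

$ perl -0777 -pe 's/([^\n])\n(?=[^\n])/$1 /g' marked.txt > unwrapped.txt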

2 Answers


Depends on what you want.
If you want to retain the layout so that, for example, tables of contents are faithfully represented, there is the -layout flag. If you want a raw stream, there is the -raw flag, though it does not work as well as I suspect you want it to. What I would suggest is to first convert the PDF to a text file. Let us take an example, test.pdf. Then,

$ pdftotext test.pdf test.txt

This creates a file called test.txt which contains the output from the pdftotext utility. Then, we put the newly created text file through a bit of perl code:

$ perl -0777 -pe 's/([^\n])\n(?=[^\n])/$1 /g' test.txt > final.txt

And there you have it. The final.txt file should have what you want. You can copy the perl code as is and change the file names to your liking. Hope that helps.
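
Single newlines inside a paragraph become spaces, while blank lines (paragraph breaks) survive. A quick illustration with made-up input:

$ printf 'one\ntwo\n\nthree\nfour\n' | perl -0777 -pe 's/([^\n])\n(?=[^\n])/$1 /g'
one two

three four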

nahorov
  • Thanks for the help. But this perl code killed all newlines, even the ones that were in the original PDF file. To be more precise: pdftotext puts a newline at the end of every line, so if a paragraph consists of five lines, it produces five newlines. – Untitled Feb 21 '22 at 15:38
  • Then try the -layout flag? $ pdftotext -layout test.pdf test.txt – nahorov Feb 21 '22 at 15:54
  • It doesn't work. The structure of the text is the same. – Untitled Feb 21 '22 at 17:14
  • Ehhh, I wish I had access to the PDF you wanted to convert, I could then get a better idea of what you wanted. Perhaps we might need a better designed sed program. Try this: $ pdftotext test.pdf | sed -r ':a /[a-zA-Z]$/N;s/(.)\n/\1 /;ta' > test.txt And if that doesn't work, try: $ pdftotext test.pdf - | fmt > test.txt – nahorov Feb 21 '22 at 18:49
  • Thanks. The first script returns an empty file (see the note after these comments), and the second one makes it even worse :(( Here is the link to the PDF file: http://kompek.rada.gov.ua/uploads/documents/30545.pdf – Untitled Feb 22 '22 at 08:45
  • The file works on my computer as well, maybe do it this way? $ pdftotext 30545.pdf and then $ cat 30545.txt | sed -r ':a /[a-zA-Z]$/N;s/(.)\n/\1 /;ta' > test.txt ? Or perhaps combine the pdftotext command with one of the flags we talked about earlier? – nahorov Feb 23 '22 at 13:46
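
The empty file from the first suggested pipeline is expected: without a trailing - argument, pdftotext writes its output to test.txt rather than to standard output, so sed receives no input at all. The same sed program should behave as intended once pdftotext is told to write to stdout (the output name is changed to joined.txt here, a placeholder, to keep it apart from pdftotext's own default output file):

$ pdftotext test.pdf - | sed -r ':a /[a-zA-Z]$/N;s/(.)\n/\1 /;ta' > joined.txt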

The Calibre e-book Converter does what you want. It has a graphical user interface (GUI) and a command-line interface, used as follows:

ebook-convert myfile.input_format myfile.output_format --enable-heuristics [other-options]

It can convert PDF to EPUB or to raw .txt format while guessing the original paragraph structure. There are many options that help fine-tune the process; see: https://manual.calibre-ebook.com/generated/en/ebook-convert.html#heuristic-processing

The "Remove unnecessary hyphens" function is activated with `--enable-heuristics; analysis of hyphenated words is made based on a dictionary which is the text itself (if it finds the word "document" somewhere, it knows that "docu-ment" hyphenated at the margin should be de-hyphenated).

There is also --unsmarten-punctuation, which converts fancy quotes, dashes and ellipses to their plain equivalents (namely ", ', - and ...).

There is also the --html-unwrap-factor parameter, described as: "Scale used to determine the length at which a line should be unwrapped. Valid values are a decimal between 0 and 1. The default is 0.4, just below the median line length. If only a few lines in the document require unwrapping this value should be reduced".

For my test document, the default worked fine; still, results were even better with lower values:

ebook-convert mydoc.pdf mydoc.txt --enable-heuristics --html-unwrap-factor 0.2

The result is good enough for further processing (e.g. text alignment of pairs of documents to create translation memory .tmx files).

On Linux, to process all .pdf files in (sub-)folders, do:

find . -name "*.pdf" | while IFS= read -r file; do
  if [ ! -e "${file}.txt" ]; then
    ebook-convert "$file" "${file}.txt" --enable-heuristics --html-unwrap-factor 0.2
  fi
done

(Note: the filenames must not start with a hyphen.)

The GUI manual is at: https://manual.calibre-ebook.com/conversion.html#heuristic-processing

Note that the result is awful for sentences in (multi-column) tables, where tools like Tabula (https://tabula.technology/) will help.

Below is a screenshot of an example run (here, the .epub output makes it clearer that paragraphs are correctly detected). [screenshot]

mayeulk
  • Yes, ebook-convert works great and produces a much better txt! I also gave a minimal conversion example at: https://superuser.com/questions/207603/how-to-extract-text-from-pdf-in-script-on-linux/1810994#1810994 – Ciro Santilli OurBigBook.com Oct 03 '23 at 06:57