3

OS: Ubuntu.

I need to extract links and other data (for example, the binding layer from a QuarkXPress application) from a PDF as text, in the terminal.

I tried pdftotext, but it seems the links are not exported; pdfgrep is the same.

Is there any solution?

Thanks.

Jeff Schaller

5 Answers

4

You could try extracting the /URI(...) PDF directives by hand, after removing compression (if any) with pdftk:

pdftk file.pdf output - uncompress | grep -aPo '/URI *\(\K[^)]*'
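
Here pdftk writes an uncompressed copy of the PDF to standard output; grep -a treats that partly binary stream as text, -P enables Perl-compatible regular expressions, -o prints only the matched part, and \K discards the /URI ( prefix so just the link target is left.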
2

Using pdfx and keeping only the lines that start with - http:

pdfx -v file.pdf | sed -n 's/^- \(http\)/\1/p'
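
pdfx is a Python command-line tool (commonly installable with pip install pdfx); with -v it also prints the references it found, and the sed expression keeps only the lines that begin with - http, stripping that prefix so bare URLs remain.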
Freddy
0

Test this:

pdftotext -raw "filename.pdf" && file=$(ls -tr | tail -1); grep -E "https?://.*" "${file}" && rm "${file}"

This converts the PDF to plain text, picks up the most recently modified file in the directory (the generated .txt), greps it for URLs and deletes it afterwards.


0

You can use pdftohtml and then use lynx to pull the links from the HTML.
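
A minimal sketch, assuming the file is called file.pdf and that pdftohtml (from poppler-utils) and lynx are installed:

pdftohtml -i -stdout file.pdf > file.html    # convert the PDF to HTML on stdout, skipping images
lynx -dump -listonly file.html               # print only the list of links found in that HTML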

Roh
0

First, check whether your PDF is compressed or not; see:

How to know if a PDF file is compressed or not and to (un)compress it

If it's compressed, you need to uncompress it.
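
A minimal sketch of the uncompression step with pdftk (the file names are placeholders):

pdftk compressed.pdf output uncompressed.pdf uncompress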

Then, you can extract links using grep and sed:

strings uncompressed.pdf | grep -Eo '/URI \(.*\)' | sed 's/^\/URI (//g; s/)$//g'
bardom