3

OS: Ubuntu.

I need to extract links and other data (for example, the binding layer from a QuarkXPress application) from a PDF as text, in the terminal.

I tried pdftotext, but it seems the links are not exported; pdfgrep is the same.

Is there any solution?

Thanks.

Jeff Schaller

5 Answers

4

You could try extracting the /URI(...) PDF directives by hand, after removing compression (if any) with pdftk:

pdftk file.pdf output - uncompress | grep -aPo '/URI *\(\K[^)]*'
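
Here pdftk writes an uncompressed copy of the PDF to standard output; grep -a treats that partly binary stream as text, -P enables Perl-compatible regular expressions, -o prints only the matched part, and \K discards the /URI ( prefix so just the link target is left.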
2

Using pdfx and keeping only the lines that start with - http:

pdfx -v file.pdf | sed -n 's/^- \(http\)/\1/p'
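
pdfx is a Python command-line tool (commonly installable with pip install pdfx); with -v it also prints the references it found, and the sed expression keeps only the lines that begin with - http, stripping that prefix so bare URLs remain.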
Freddy
0

Test this:

pdftotext -raw "filename.pdf" && file=$(ls -tr | tail -1); grep -E "https?://.*" "${file}" && rm "${file}"

This converts the PDF to plain text, picks up the most recently modified file in the directory (the generated .txt), greps it for URLs and deletes it afterwards.


0

You can use pdftohtml and then use lynx to pull the links from the HTML.
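
A minimal sketch, assuming the file is called file.pdf and that pdftohtml (from poppler-utils) and lynx are installed:

pdftohtml -i -stdout file.pdf > file.html    # convert the PDF to HTML on stdout, skipping images
lynx -dump -listonly file.html               # print only the list of links found in that HTML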

Roh
0

First, check whether your PDF is compressed or not; see:

How to know if a PDF file is compressed or not and to (un)compress it

If it's compressed, you need to uncompress it.
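
A minimal sketch of the uncompression step with pdftk (the file names are placeholders):

pdftk compressed.pdf output uncompressed.pdf uncompress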

Then, you can extract links using grep and sed:

strings uncompressed.pdf | grep -Eo '/URI \(.*\)' | sed 's/^\/URI (//g; s/)$//g'
bardom