21

I was wondering how to view and edit the code of a PDF file?

  1. By viewing, I mean that I don't want to see the raw binary format, so I think hexdump is probably not what I want. I tried gedit, but none of its encodings can decode the PDF content.

  2. By editing, I mean that I would like to search for /Fit and change every occurrence to /XYZ with, for example, sed. But my command sed s/\/Fit/\/XYZ/ < 1.pdf > 2.pdf does not seem to change the appearance of my PDF as I expected, although it doesn't report any error. Can sed actually work on PDF files as if they were plain text?

The context of my questions can be found in this question. My OS is Ubuntu 10.10.

Kurt Pfeifle
  • 1,461
Tim
  • 101,790

4 Answers

39

Regarding your 1st question ("viewing source code, but no binary"): you have a few options for de-compressing the internal binary streams that are attached to many objects.

My favorite tool for this is QPDF, available on all major OS platforms. The following command de-compresses all streams and all object streams:

 qpdf --qdf --object-streams=disable orig.pdf expanded.pdf

Now you can open your PDF in any text editor. (There may still be some binary blobs in there: for example, font files and ICC profiles, which wouldn't make sense for QPDF to expand).

To re-compress the expanded.pdf again after editing, you can run:

 qpdf expanded.pdf orig2.pdf

(Be careful when manually editing PDFs! You need to know a lot about their internal syntax to do this right. As soon as you add or delete a single byte, PDF readers may report errors or refuse to open the file, because the PDF's internal cross-reference table, which is based on byte offsets, no longer matches the file. Simply replacing /Fit with /XYZ should be fine, though, since both names have the same length.)
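
The expand, edit and re-compress steps above can be strung together roughly as follows. This is only a sketch under some assumptions: qpdf and its companion program fix-qdf (brought up in the comments on this answer) are installed, retarget_pdf is a made-up helper name, and the file names are placeholders. The perl substitution is borrowed from another answer on this page so that longer names such as /FitH are left alone.

```shell
# Sketch of a round trip: expand, edit, repair offsets, re-compress.
# retarget_pdf is a hypothetical helper name, not part of qpdf.
retarget_pdf() {
    in=$1
    out=$2
    command -v qpdf >/dev/null 2>&1 || { echo "qpdf not found" >&2; return 127; }
    [ -f "$in" ] || { echo "no such file: $in" >&2; return 1; }
    tmp=$(mktemp -d) || return 1

    # 1. Expand all streams and object streams into editable QDF form.
    qpdf --qdf --object-streams=disable "$in" "$tmp/expanded.pdf" &&
    # 2. Replace the name /Fit (but not /FitH, /FitV, ...) with /XYZ.
    perl -pe 's!/Fit\b!/XYZ!g' "$tmp/expanded.pdf" > "$tmp/edited.pdf" &&
    # 3. Let fix-qdf recompute byte offsets in case any lengths changed.
    fix-qdf < "$tmp/edited.pdf" > "$tmp/fixed.pdf" &&
    # 4. Re-compress, then ask qpdf to check the result.
    qpdf "$tmp/fixed.pdf" "$out" &&
    qpdf --check "$out"
    status=$?

    rm -rf "$tmp"
    return "$status"
}
```

It would be called as retarget_pdf orig.pdf orig2.pdf. For a same-length replacement like this one, step 3 should not be strictly necessary, but it does no harm and covers edits that change the length of an object.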

K7AAY
  • 3,816
Kurt Pfeifle
  • 1,461
  • 9
    You can also add or remove text. When the length of an object stream changes the byte offsets can be recomputed by using the fix-qdf program that is part of qpdf. You still have to be a bit careful, though. See http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.qdf – H. Rittich Oct 08 '19 at 12:04
  • @H.Rittich: Thx for the comment... In what way do you think does this open a new perspective on the problem? Did you think we don't know that we can add or remove text this way? – Kurt Pfeifle Oct 08 '19 at 13:10
  • 2
    @KurtPfeifle: I do not make any assumptions on what you know. The answer states that editing a PDF this way needs to preserve the byte offsets of the objects in the file. It is, however, possible to change the byte offsets when later correcting them by using fix-qdf. Hence, if you want to replace a string by a string of different length, it is possible, but you need to use the fix-qdf tool. I would say that this is a useful addition to the answer. – H. Rittich Oct 10 '19 at 13:22
  • @H.Rittich: Thx for giving your perspective. When I stressed the need to preserve byte offsets of objects I did not want to advise people about HOW they should do this. Had you worded your comment slightly differently, I would have understood the intention of your comment faster. – Kurt Pfeifle Oct 10 '19 at 15:55
11

You can use sed with binary files (at least GNU sed; some implementations may have trouble with files containing null characters or not ending with a newline character). But the command you used only replaces the first occurrence of /Fit on each line, and lines are pretty much meaningless in a PDF file. You need to replace all occurrences:

 sed 's/\/Fit/\/XYZ/g'

It would be more robust to replace /Fit only if it's not followed by a word constituent (e.g. not replacing /Fitness; I don't know if your file contains occurrences of /Fit that would cause trouble). Here's one way:

perl -pe 's!/Fit\b!/XYZ!g'
  • Thanks! It now works! (1) I was wondering how sed searches for characters in binary content? Does sed first encode the query characters before searching? (2) In the last command, what do !, \b and g mean? Can it be done without perl, just with sed? – Tim Jul 22 '11 at 14:00
  • 1
    @Tim (1) Sed loads the data into memory, operates on it and prints it out. Why would it need to encode anything? (2) g means to replace all occurrences on each line, in both sed and perl. ! is the separator; you can choose (almost) any character as the separator for the s command (this goes for both sed and perl). \b means a word boundary; it exists in perl and, as an extension, in GNU sed, but not in POSIX sed. – Gilles 'SO- stop being evil' Jul 22 '11 at 14:05
  • About (1), because the characters that you give to sed in the command are human readable. If the content to search in is completely binary, how can sed find the query word there? – Tim Jul 22 '11 at 14:08
  • @Tim Text is binary data that happens to be human readable. – Gilles 'SO- stop being evil' Jul 22 '11 at 14:20
  • So if I understand correctly, regardless of whether a file in which sed searches is binary or text, it is always binary in memory. If the file to be searched in is not purely text, can the query word be specified as binary data to sed? – Tim Jul 22 '11 at 14:27
  • 1
    @Tim Yes, you can pass binary data in the query. You'll have to insert the characters literally in your sed or shell source code. – Gilles 'SO- stop being evil' Jul 22 '11 at 14:30
  • How is that different from specifying text to sed? – Tim Jul 22 '11 at 14:30
  • @Tim It isn't, that's the whole point. If you want to continue this, I suggest we move to chat. – Gilles 'SO- stop being evil' Jul 22 '11 at 14:33
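
The point of the exchange above, that sed just matches bytes whether or not the data around them is readable, can be checked on a tiny made-up input. This is a sketch assuming GNU sed and a grep that supports -a; the 14-byte sample (containing a NUL byte and a 0xFF byte) and the variable names are invented for the demonstration.

```shell
# Hypothetical 14-byte sample: "a", NUL, "/Fit", 0xFF, "b", newline, "/Fit", newline.
# LC_ALL=C makes sed and grep treat the input as single bytes, not UTF-8 text.
before=$(( $(printf 'a\000/Fit\377b\n/Fit\n' | wc -c) ))

after=$(( $(printf 'a\000/Fit\377b\n/Fit\n' |
    LC_ALL=C sed 's/\/Fit/\/XYZ/g' | wc -c) ))

# Count the lines that now contain /XYZ; -a tells grep not to skip binary input.
hits=$(printf 'a\000/Fit\377b\n/Fit\n' |
    LC_ALL=C sed 's/\/Fit/\/XYZ/g' | LC_ALL=C grep -a -c '/XYZ')

echo "bytes before: $before, bytes after: $after, lines with /XYZ: $hits"
```

Because /Fit and /XYZ have the same length, the byte count stays the same, and the non-text bytes around the matches pass through unchanged.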
2

Use LibreOffice or OpenOffice to open the PDF, view it, replace things, write a new PDF, etc. I think that you can even use it from the command line or programmatically if there are a lot of documents to process.

Note that PDFs from some sources, e.g. scanners, often contain the pages as images rather than as text, so search and replace will not work on them.

  • 4
    (1/2) Be aware of the following fact: LibreOffice is not a native PDF editor. When it opens a PDF, it converts all pages to a vector image (which may keep the raster parts from the original PDF as raster parts) and opens it in the LibreOffice *Draw* part of the LibreOffice suite. Then, when it saves the edited PDF file, it will be a PDF file which was exported from the native LibreOffice Draw format (with the suffix .odg) to PDF. – Kurt Pfeifle Apr 29 '17 at 16:28
  • 4
    (2/2) This workflow may have unexpected side effects. Moreover, the LibreOffice Draw application may not be able to correctly import all elements from the original PDF. However, in many cases it still may be a useful tool for all those folks who have no better means available. – Kurt Pfeifle Apr 29 '17 at 16:28
1

sed is line-oriented, which makes it poorly suited to binary files, which are structured as blocks rather than lines.
Try using bbe (bbe-.sourceforge.net) instead.

Alternatively, both Emacs (GNU and XEmacs) and vim open PDF files seamlessly. The result is not pretty-printed, of course, since it mixes text and binary, but it's sufficient for your editing purposes.
There is a Pdftk plugin for vim that makes everything easier; download here (zip file).
As you probably know, both of these editors have powerful search-and-replace capabilities.

Also, converting the PDF to QDF mode first makes editing PDF files really easy.

Philomath
  • 2,877
  • You may also try to edit with sed using the -b switch. If it works, I will add this to my answer. – Philomath Jul 22 '11 at 13:03
  • @Tim: what do you mean by "does not show anything", just empty? any error message? Also, can you try with XEmacs ? (all three of them worked for me). – Philomath Jul 22 '11 at 13:17
  • Never mind about -b, it's cygwin specific. – Philomath Jul 22 '11 at 13:18
  • Emacs says "File 1.pdf is large (9MB), really open? (y or n)". I chose "y", and then nothing is there. – Tim Jul 22 '11 at 13:30
  • Most probably an Emacs problem. Do you have XEmacs? (I just opened a 31 MB PDF without any problems.) – Philomath Jul 22 '11 at 13:36
  • (1) XEmacs works, thanks! So are Emacs and XEmacs different things? Is it that XEmacs can display mixture of readable characters and binary, while Emacs cannot? (2) In (X)Emacs, can the search and replace be done once for all, instead of one by each occurrence? (3) For bbe, is its website bbe-.sourceforge.net or bbe.sourceforge.net? The latter does not work either. What is it for? – Tim Jul 22 '11 at 14:14
  • (1) They have many differences, see here and here, this topic is quite heated, so take everything with a grain of salt. (2) Yes, when it asks the first time whether to replace enter '!' (exclamation mark). (3) It's the first one, if you can't get to it just search "bbe" on SourceForge. – Philomath Jul 22 '11 at 14:36
  • Thanks! (3) I looked at the cached page of bbe, but still wonder what bbe can do that sed cannot? I.e., what is the advantage of bbe? – Tim Jul 22 '11 at 15:23