84

I have a scanned PDF file in which two different real pages appear together on one virtual page.

The scan quality is good. The problem is that I have to zoom in to read and then drag from left to right.
Is there a command (convert, pdftk, ...) or script that can convert this PDF file into one with normal pages (one page from the book = one page in the PDF file)?

Olivier
  • 155
xralf
  • 15,415

12 Answers

88

Just an addition since I had issues with the python script (and several other solutions): for me mutool worked great. It's a simple and small addition shipped with the elegant mupdf reader. So you can try:

mutool poster -y 2 input.pdf output.pdf

For horizontal splits, replace y with x. And you can, of course, combine the two for more complex solutions.
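For anyone scripting this over many files, here is a minimal Python sketch that only builds the mutool command line. The function name mutool_poster_cmd and its columns/rows parameters are my own illustration, not part of mutool; the only things taken from above are the poster subcommand and its -x and -y options.

```python
def mutool_poster_cmd(src, dst, columns=1, rows=1):
    """Build the argument list for `mutool poster`.

    columns maps to -x (pieces across the page width),
    rows maps to -y (pieces down the page height).
    Hypothetical helper: only the mutool options come from the answer.
    """
    if columns < 1 or rows < 1:
        raise ValueError("columns and rows must be at least 1")
    cmd = ["mutool", "poster"]
    if columns > 1:
        cmd += ["-x", str(columns)]
    if rows > 1:
        cmd += ["-y", str(rows)]
    return cmd + [src, dst]


# Side-by-side spread: two columns, one row.
print(" ".join(mutool_poster_cmd("input.pdf", "output.pdf", columns=2)))
```

The returned list can be handed to subprocess.run(..., check=True) once mutool is installed.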

Really happy to have found this (after years of daily mupdf usage :)


Installing mupdf and mutool from source

(mutool comes shipped with mupdf starting from version 1.4: http://www.mupdf.com/news)

wget http://www.mupdf.com/downloads/mupdf-1.8-source.tar.gz
tar -xvf mupdf-1.8-source.tar.gz
cd mupdf-1.8-source
sudo make prefix=/usr/local install

Or go to the downloads page to find a newer version.

Installing mutool from a Linux distribution package

On Debian, the package containing mutool is mupdf-tools:

apt-get install mupdf-tools
Pierre H.
  • 105
martz
  • 981
  • 3
    I had a djvu... I turned it into a postscript (quite fast), then into a pdf (turtle slow) -- and finally mutool cut it so fast I thought it hadn't worked -- it had! – Julien Puydt Apr 16 '15 at 09:32
  • 2
    yes, i was also really pleased with the speed. – martz Apr 17 '15 at 00:51
  • 8
    This one is the easiest and better. mutool was made for this. Also, beware of -y, I think in most cases what you want is -x. – fiatjaf Dec 24 '15 at 20:13
  • mutool poster -x 2 input.pdf output.pdf does the trick really fast. You are my hero! – andrej Apr 13 '17 at 16:09
  • 2
    This utility is very fast, however I have a problem with the page order. The command allocates the right page at first position and the left page in the second one. Can somebody help me with this issue? – garciparedes Sep 23 '17 at 11:02
  • 1
    @garciparedes https://unix.stackexchange.com/questions/184799/exchange-odd-and-even-pages-in-a-pdf-file – Nobody Sep 26 '18 at 11:53
53

Here's a small Python script using the old PyPdf library that does the job neatly. Save it in a script called un2up (or whatever you like), make it executable (chmod +x un2up), and run it as a filter (un2up <2up.pdf >1up.pdf).

#!/usr/bin/env python
import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()
for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
    q = copy.copy(p)
    (w, h) = p.mediaBox.upperRight
    p.mediaBox.upperRight = (w/2, h)
    q.mediaBox.upperLeft = (w/2, h)
    output.addPage(p)
    output.addPage(q)
output.write(sys.stdout)

Ignore any deprecation warnings; only the PyPdf maintainers need be concerned with those.

If the input is oriented in an unusual way, you may need to use different coordinates when truncating the pages. See Why my code not correctly split every page in a scanned pdf?
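To make "different coordinates" concrete, here is a rough sketch of the box arithmetic on its own, with no PDF library involved. split_box is a hypothetical helper; it assumes the usual PDF convention that a box is given by its lower-left and upper-right corners.

```python
def split_box(lower_left, upper_right, axis="x"):
    """Return two (lower_left, upper_right) boxes, one per half.

    axis="x" cuts at the vertical midline: left half, then right half.
    axis="y" cuts at the horizontal midline: top half, then bottom half.
    """
    (x0, y0), (x1, y1) = lower_left, upper_right
    if axis == "x":
        mid = (x0 + x1) / 2
        return ((x0, y0), (mid, y1)), ((mid, y0), (x1, y1))
    if axis == "y":
        mid = (y0 + y1) / 2
        return ((x0, mid), (x1, y1)), ((x0, y0), (x1, mid))
    raise ValueError("axis must be 'x' or 'y'")


# Landscape A4 spread, cut into left and right pages.
left, right = split_box((0, 0), (842, 595), axis="x")
print(left, right)
```

Which half comes first in the output is a separate choice; swap the two results if your pages come out in the wrong order.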


Just in case it's useful, here's my earlier answer which uses a combination of two tools plus some manual intervention:

  • Pdfjam (at least version 2.0), based on the pdfpages LaTeX package, to crop the pages;
  • Pdftk, to put the left and right halves back together.

Both tools are needed because as far as I can tell pdfpages isn't able to apply two different transformations to the same page in one stream. In the call to pdftk, replace 42 by the number of pages in the input document (2up.pdf).

pdfjam -o odd.pdf --trim '0cm 0cm 14.85cm 0cm' --scale 1.141 2up.pdf
pdfjam -o even.pdf --trim '14.85cm 0cm 0cm 0cm' --scale 1.141 2up.pdf
pdftk O=odd.pdf E=even.pdf cat $(i=1; while [ $i -le 42 ]; do echo O$i E$i; i=$(($i+1)); done) output all.pdf

In case you don't have pdfjam 2.0, it's enough to have a PDFLaTeX installation with the pdfpages package (on Ubuntu: you need texlive-latex-recommended and perhaps texlive-fonts-recommended), and use the following driver file driver.tex:

\batchmode
\documentclass{minimal}
\usepackage{pdfpages}
\begin{document}
\includepdfmerge[trim=0cm 0cm 14.85cm 0cm,scale=1.141]{2up.pdf,-}
\includepdfmerge[trim=14.85cm 0cm 0cm 0cm,scale=1.141]{2up.pdf,-}
\end{document}

Then run the following commands, replacing 42 by the number of pages in the input file (which must be called 2up.pdf):

pdflatex driver
pdftk driver.pdf cat $(i=1; pages=42; while [ $i -le $pages ]; do echo $i $(($pages+$i)); i=$(($i+1)); done) output 1up.pdf
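The two shell loops above only generate page lists for pdftk's cat operation. As a sketch, the same lists in Python (both function names are made up for illustration):

```python
def interleave_handles(pages, first="O", second="E"):
    """Page list for two half-page files opened with handles: O1 E1 O2 E2 ..."""
    return [f"{h}{i}" for i in range(1, pages + 1) for h in (first, second)]


def interleave_offsets(pages):
    """Page list for one file holding all left halves followed by all
    right halves (as driver.pdf does): 1 N+1 2 N+2 ..."""
    return [str(n) for i in range(1, pages + 1) for n in (i, pages + i)]


print(" ".join(interleave_handles(3)))   # O1 E1 O2 E2 O3 E3
print(" ".join(interleave_offsets(3)))   # 1 4 2 5 3 6
```

For example, ["pdftk", "O=odd.pdf", "E=even.pdf", "cat", *interleave_handles(42), "output", "all.pdf"] reproduces the first pdftk call.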
  • The PyPdf library works perfect. I only changed it a little and run it with python conv_pdf.py res.pdf . How would you run your script shebang from commandline? – xralf May 03 '11 at 08:03
  • I'd like to try the version with pdfjam (because of slight scaling) too, but after the installation of pdfjam package my shell won't recognize pdfjam command. – xralf May 03 '11 at 08:34
  • @xralf: My python script just reads from standard input and writes to standard output. The pdfjam version requires pdfjam 2.0; it's only a small wrapper around pdfpages, and I've added the bit of LaTeX it generates so you can use that directly. The scaling issue is probably solvable with pypdf, it could be a page size issue (I may or may not be able to help if you give more details on what's happening and especially the page sizes involved). – Gilles 'SO- stop being evil' May 03 '11 at 10:22
  • Thank you, the difference is in very slightly worse resolution, but this doesn't matter. I will turn back to it when I know more about Latex (it's too complex for me now and the solution is really good with PyPdf). – xralf May 03 '11 at 11:10
  • @Gilles: Nice! (1) With some change, I guess your un2up script will work on my scanned pdf, where two paper pages sit side by side in a pdf page with the earlier page being the left half. un2up splits each pdf page in height and then put the later page in front of the earlier one. Do you have some suggestions? (2) I also wonder how to specify the MediaBox for the output pdf in the script? After splitting, I convert the split pdf into djvu, but I got the warning that "File has an empty MediaBox.Using the current page size instead." Thanks! – Tim Aug 08 '11 at 02:16
  • 1
    @Gilles Very useful script. I expected to see something like that in pdfjam or pdftk. Anyway, some people may want modifications to split pages along the other axis or to use a different ordering. This is possible by changing a few lines and using q.mediaBox.lowerRight = (w, h/2) – ony Feb 26 '12 at 20:44
  • Great script!! If someone is getting error decrypting pdf even if pdf is not encrypted, add following line before for loop: if (input.getIsEncrypted()): input.decrypt('') From https://bugs.launchpad.net/pypdf/+bug/355479 – cram1010 Mar 28 '12 at 18:27
  • Thanks! I had some trouble getting this script to work in 2020, and although I eventually got it work (change pyPdf to PyPDF2, open and write from files opened in binary mode instead of stdin/stdout, and this instead of copy.copy(p)), another alternative script is here. – ShreevatsaR Oct 06 '20 at 05:55
  • I'm the current maintainer of pypdf , PyPDF2 and pdfly: Please don't use PyPDF2. We upgraded it and moved to pypdf. And if you want a CLI program, I think pdfly will suit you well in many cases :-) – Martin Thoma Sep 09 '23 at 11:45
  • 1
    Note that this script no longer works with recent PyPDF2 2.x versions. I've posted an updated version of the script here: https://unix.stackexchange.com/a/758425/233460 – Matthijs Kooijman Oct 08 '23 at 18:58
20

Imagemagick can do it in one step:

$ convert in.pdf -crop 50%x0 +repage out.pdf
Michael Mrozek
  • 93,103
  • 40
  • 240
  • 233
tomas
  • 209
  • 2
    Thanks. If I add the -density 400 parameter, the quality is even better. – xralf May 20 '11 at 14:27
  • 16
    It looks like convert uses raster as an intermediate format. That causes blurish look even when original PDF contains vector objects. – ony Feb 26 '12 at 20:40
  • Does anyone know of a way to do this without rasterizing page contents along the way...or at least to set a higher resolution? – Tomislav Nakic-Alfirevic Feb 28 '13 at 21:47
  • 5
    this rendered texts into images and created pdf from images. Maybe nice for pics but useless for text extraction. – andrej Apr 13 '17 at 16:06
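On the rasterization point raised in these comments: a PDF point is 1/72 inch, so the pixel size of the rendered page follows directly from the density. A small sketch of that arithmetic (raster_size is a hypothetical name, and I'm assuming ImageMagick's default density of 72 when -density is not given):

```python
def raster_size(width_pt, height_pt, density=72):
    """Pixel dimensions of a page rendered at `density` pixels per inch.

    PDF page sizes are in points, and 1 point = 1/72 inch.
    """
    return round(width_pt / 72 * density), round(height_pt / 72 * density)


# Landscape A4 spread (842 x 595 pt) at the assumed default and at 400.
print(raster_size(842, 595))
print(raster_size(842, 595, density=400))
```

That is why -density 400 looks sharper, and also why the output file grows: every page becomes a much larger image.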
7

Based on the answer from Gilles and on how to find the PDF page count, I wrote:

#!/bin/bash

pdforiginal=$1
pdfodd=$pdforiginal.odd.pdf
pdfeven=$pdforiginal.even.pdf
pdfout=output_$1
margin=${2:-0}   # optional margin adjustment in points
scale=${3:-1}    # optional scale factor

pages=$(pdftk "$pdforiginal" dump_data | grep NumberOfPages | awk '{print $2}')

pagesize=$(pdfinfo "$pdforiginal" | grep "Page size" | awk '{print $5}')
margin=$(echo "$pagesize/2-$margin" | bc -l)

pdfjam -o "$pdfodd" --trim "0cm 0cm ${margin}pt 0cm" --scale "$scale" "$pdforiginal"
pdfjam -o "$pdfeven" --trim "${margin}pt 0cm 0cm 0cm" --scale "$scale" "$pdforiginal"

pdftk O="$pdfodd" E="$pdfeven" cat $(i=1; while [ $i -le $pages ]; do echo O$i E$i; i=$(($i+1)); done) output "$pdfout"

rm "$pdfodd" "$pdfeven"

So I can run

./split.sh my.pdf 50 1.2

where 50 adjusts the margin and 1.2 is the scale factor.
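In case the trim arithmetic is not obvious, here is a sketch of it in Python. It assumes pdfinfo prints a line of the form "Page size: 842 x 595 pts (A4)"; as far as I can tell, awk '{print $5}' picks the second number on that line. Both function names are hypothetical, and the sketch takes the dimension to halve as an explicit argument so you can check which one fits your file.

```python
import re


def parse_page_size(line):
    """Extract (first, second) dimensions in points from a pdfinfo 'Page size' line."""
    m = re.search(r"Page size:\s*([\d.]+)\s*x\s*([\d.]+)\s*pts", line)
    if m is None:
        raise ValueError("not a 'Page size' line: %r" % line)
    return float(m.group(1)), float(m.group(2))


def trim_amount(extent_pt, margin_pt=0.0):
    """Points to trim from one side: half the extent minus the margin."""
    return extent_pt / 2 - margin_pt


first, second = parse_page_size("Page size:      842 x 595 pts (A4)")
print(trim_amount(second, 50))   # what the script computes for margin=50
```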

Anton Bessonov
  • 151
  • 2
  • 5
6

ImageMagick's convert command can help you crop your file into 2 parts. See http://www.imagemagick.org/Usage/crop/

If I were you, I'd write a (shell) script like this:

  1. Split your file with pdfsam: 1 page = 1 file on disk. (Format doesn't matter. Choose one that ImageMagick knows. I'd just take PS or PDF.)
  2. For each page, crop the first half and put it to a file named ${PageNumber}A

  3. Crop the second half and put it to a file named ${PageNumber}B.

    You get 1A.pdf, 1B.pdf, 2A.pdf, 2B.pdf, etc.

  4. Now, assemble this again in a new PDF. There are many methods to do this.
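One thing to watch in step 4 is ordering: a plain alphabetical sort (as a shell glob gives you) puts 10A.pdf before 2A.pdf. A small sketch of the naming scheme and a numeric sort, with made-up helper names:

```python
import re


def half_page_names(pages, ext=".pdf"):
    """File names in reading order: 1A, 1B, 2A, 2B, ..."""
    return [f"{n}{half}{ext}" for n in range(1, pages + 1) for half in "AB"]


def reading_order(names):
    """Sort names like '10A.pdf' by page number first, then by half."""
    def key(name):
        m = re.match(r"(\d+)([AB])", name)
        return int(m.group(1)), m.group(2)
    return sorted(names, key=key)


print(half_page_names(2))
print(reading_order(["10A.pdf", "2B.pdf", "1B.pdf", "2A.pdf", "1A.pdf", "10B.pdf"]))
```

The sorted list can then be passed to whichever tool you use to concatenate the pieces.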
tiktak
  • 492
  • 3
  • 7
  • 1
    Wouldn't using ImageMagick rasterize the files? And you should explain that last part inline, especially for the benefit of the non-francophones in the audience. – Gilles 'SO- stop being evil' May 02 '11 at 22:17
  • 1
    Because you don't need to understand French. It just show how you can use ImageMagick's convert, pdftk, or ghostscript (gs) alone to achieve this goal. I like using pdftk. "Rastering" doesn't matter as it's a scanned document. – tiktak May 04 '11 at 09:29
5

The best solution for me was mutool (see above):

sudo apt install mupdf-tools pdftk

the split:

mutool poster -y 2 input.pdf output.pdf

but then you need to rotate the pages left:

pdftk output.pdf cat 1-endleft output rotated.pdf
5

Here's a variation of the PyPDF code posted by Gilles. This function will work no matter what the page orientation is:

import copy
import math
import pyPdf

def split_pages(src, dst):
    src_f = file(src, 'r+b')
    dst_f = file(dst, 'w+b')

    input = pyPdf.PdfFileReader(src_f)
    output = pyPdf.PdfFileWriter()

    for i in range(input.getNumPages()):
        p = input.getPage(i)
        q = copy.copy(p)
        q.mediaBox = copy.copy(p.mediaBox)

        x1, x2 = p.mediaBox.lowerLeft
        x3, x4 = p.mediaBox.upperRight

        x1, x2 = math.floor(x1), math.floor(x2)
        x3, x4 = math.floor(x3), math.floor(x4)
        x5, x6 = math.floor(x3/2), math.floor(x4/2)

        if x3 > x4:
            # horizontal
            p.mediaBox.upperRight = (x5, x4)
            p.mediaBox.lowerLeft = (x1, x2)

            q.mediaBox.upperRight = (x3, x4)
            q.mediaBox.lowerLeft = (x5, x2)
        else:
            # vertical
            p.mediaBox.upperRight = (x3, x4)
            p.mediaBox.lowerLeft = (x1, x6)

            q.mediaBox.upperRight = (x3, x6)
            q.mediaBox.lowerLeft = (x1, x2)

        output.addPage(p)
        output.addPage(q)

    output.write(dst_f)
    src_f.close()
    dst_f.close()
moraes
  • 151
3

I made a webpage for this purpose just now: here. From mobile or desktop, upload a PDF containing two-page spreads, and it will split each such page into two of half the width. Click on the new filename to download the resulting PDF.

Everything is self-contained in a single HTML file so if you don't want to rely on webpages that may go away (not that I have any intention of taking it down), you can save the webpage and use it offline. It also seems that the file size doesn't double, unlike some of the other solutions.


The main reason I wrote this, rather than using one of the existing answers on this page directly, is that I noticed that in many PDFs containing two-page spreads, there are also some pages that correspond to a single page. For example, the first page may be the book cover. We don't want to slice such pages in half; we want to split only the pages that are genuinely made of two pages side-by-side (identifiable as being wider than they are tall).

Alternatively, here are a couple of Python scripts implementing that feature (it shouldn't make a difference if your PDF contains only two-page spreads, but otherwise, replace the width > height check with True if you don't need it):

  1. A modification of the accepted answer above, updated to Python 3 (using PyPDF2) and avoiding copy.copy, which does not always seem to work:

    #!/usr/bin/env python3
    import copy
    import sys
    import PyPDF2

    '''Run as:
        python3 un2up.py foo.pdf foo-split.pdf
    Generates foo-split.pdf.'''

    input = PyPDF2.PdfFileReader(open(sys.argv[1], 'rb'))
    output = PyPDF2.PdfFileWriter()
    for p in [input.getPage(i) for i in range(0, input.getNumPages())]:
        (w, h) = p.mediaBox.upperRight
        if w > h:
            q = PyPDF2.pdf.PageObject.createBlankPage(
                None, p.mediaBox.getWidth(), p.mediaBox.getHeight())
            q.mergePage(p)
            p.mediaBox.upperRight = (w/2, h)
            q.mediaBox.upperLeft = (w/2, h)
            output.addPage(p)
            output.addPage(q)
        else:
            output.addPage(p)
    output.write(open(sys.argv[2], 'wb'))

  2. A light modification of an example from the pdfrw library here:

    '''Run as:
        python3 unspread.py foo.pdf
    Generates file `unspread.foo.pdf`.'''
    

    import sys
    import os
    from pdfrw import PdfReader, PdfWriter, PageMerge

    def splitpage(src):
        '''Split a page into two (left and right).'''
        # Yield a result for each half of the page
        x, y, width, height = src.MediaBox
        # Compare as numbers: the box values may be string-like objects
        if float(width) > float(height):
            for x_pos in (0, 0.5):
                yield PageMerge().add(src, viewrect=(x_pos, 0, 0.5, 1)).render()
        else:
            yield src

    inpfn = sys.argv[1]
    outfn = 'unspread.' + os.path.basename(inpfn)
    writer = PdfWriter(outfn)
    for page in PdfReader(inpfn).pages:
        writer.addpages(splitpage(page))
    writer.write()
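The width > height check used in both scripts can be isolated and tried without any PDF library. This is only a sketch with hypothetical names; the float() conversion is there because some libraries may hand back box values as strings, and string comparison gives the wrong answer for sizes like 1000 x 700.

```python
def is_spread(width, height):
    """True if the page is wider than tall, i.e. probably a two-page spread."""
    return float(width) > float(height)


def output_page_count(sizes):
    """Number of pages after splitting only the spreads."""
    return sum(2 if is_spread(w, h) else 1 for w, h in sizes)


# A portrait cover followed by two landscape spreads.
print(output_page_count([(595, 842), (842, 595), (842, 595)]))

# String pitfall: "1000" > "700" is False, while the numeric check is True.
print("1000" > "700", is_spread("1000", "700"))
```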

ShreevatsaR
  • 131
  • 5
2

moraes' solution did not work for me. The main problem was the calculation of x5 and x6: it has to account for an offset when lowerLeft is not at (0,0).

So here is another variation, with additional adaptations to use PyPDF2 and Python 3:

import copy
import math
import PyPDF2
import sys
import io 

def split_pages(src, dst):
    src_f = io.open(src, 'r+b')
    dst_f = io.open(dst, 'w+b')

    input = PyPDF2.PdfFileReader(src_f)
    output = PyPDF2.PdfFileWriter()

    for i in range(input.getNumPages()):
        p = input.getPage(i) 
        q = copy.copy(p)
        q.mediaBox = copy.copy(p.mediaBox)

        x1, x2 = p.cropBox.lowerLeft
        x3, x4 = p.cropBox.upperRight        

        x1, x2 = math.floor(x1), math.floor(x2)
        x3, x4 = math.floor(x3), math.floor(x4)

        x5 = math.floor((x3-x1) / 2 + x1)
        x6 = math.floor((x4-x2) / 2 + x2)

        if x3 > x4:        
            # horizontal
            p.mediaBox.upperRight = (x5, x4)
            p.mediaBox.lowerLeft = (x1, x2)

            q.mediaBox.upperRight = (x3, x4)
            q.mediaBox.lowerLeft = (x5, x2)
        else:
            # vertical        
            p.mediaBox.lowerLeft = (x1, x6)
            p.mediaBox.upperRight = (x3, x4)

            q.mediaBox.upperRight = (x3, x6)
            q.mediaBox.lowerLeft = (x1, x2)

        output.addPage(p)
        output.addPage(q)

    output.write(dst_f)
    src_f.close()
    dst_f.close()

if __name__ == "__main__":
    if ( len(sys.argv) != 3 ):
        print ('Usage: python3 double2single.py input.pdf output.pdf')
        sys.exit(1)

    split_pages(sys.argv[1], sys.argv[2])
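To see why the offset matters, here is the midpoint arithmetic alone, comparing the earlier formula with the one used above. The function names and the example box (x from 100 to 900) are made up for illustration.

```python
import math


def naive_midpoint(lower, upper):
    """Midpoint as computed without the offset: half of the upper coordinate."""
    return math.floor(upper / 2)


def offset_midpoint(lower, upper):
    """Midpoint that accounts for a box not starting at 0."""
    return math.floor((upper - lower) / 2 + lower)


lower, upper = 100, 900
for mid in (naive_midpoint(lower, upper), offset_midpoint(lower, upper)):
    # Print the cut position and the widths of the two resulting halves.
    print(mid, mid - lower, upper - mid)
```

With the naive formula the two halves come out 350 and 450 units wide; with the offset they are both 400.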
vbar
  • 41
2

Here is an updated version of @Gilles' accepted answer using PyPDF2, run with Python 3 on macOS.

#!/usr/bin/env python
# usage: pdf_split_double_page.py 2-up-file.pdf output-file.pdf

import copy, sys
from PyPDF2 import PdfFileReader, PdfFileWriter

input = PdfFileReader(sys.argv[1])
output = PdfFileWriter()
for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
    q = copy.copy(p)
    (w, h) = p.mediaBox.upperRight
    p.mediaBox.upperRight = (w/2, h)
    q.mediaBox.upperLeft = (w/2, h)
    output.addPage(p)
    output.addPage(q)
with open(sys.argv[2], "wb") as out_file:
    output.write(out_file)

Don't forget to run chmod u+x <script_name>.py on the file.

1

Based on the answer by Benjamin at AskUbuntu, I would recommend using the GUI tool called gscan2pdf.

  1. Import the PDF scan file to gscan2pdf. Note that non-image PDF files may not work. Scans are fine, so you don't have to worry.

  2. It might take a while depending on the size of the document. Wait till it loads up.

  3. Press Ctrl+A to select all pages and then rotate (Ctrl+Shift+C) them if necessary.

  4. Go to Tools >> Clean up. Select Layout as double and # output pages = 2.

  5. Hit OK and wait till the job is finished.

  6. Save the PDF file. Done.

1

Here's an updated version of the script from Gilles' answer, adapted to use the more modern PyPDF2 2.x (confusingly there's also a PyPDF2 1.x in addition to the original PyPDF version, but PyPDF2 2.x seems to be the current version and is what I get when I install python3-pypdf on Ubuntu 23.04).

The script below does the same as the original script, just adapted to the newer naming conventions. Also, it now accepts a filename as a command-line argument (and automatically derives a filename for the output) instead of reading from stdin and writing to stdout, as pypdf now seems to seek in the input file, which does not seem to work on stdin (even when redirected from a file).

#!/usr/bin/env python
import copy, sys, pathlib
from pypdf import PdfWriter, PdfReader

inp = pathlib.Path(sys.argv[1])
outp = inp.parent / (inp.stem + '-split' + inp.suffix)

input = PdfReader(inp)
output = PdfWriter()
first = False
for p in [input.pages[i] for i in range(0, len(input.pages))]:
    q = copy.copy(p)
    (w, h) = p.mediabox.upper_right
    p.mediabox.upper_right = (w/2, h)
    q.mediabox.upper_left = (w/2, h)
    output.add_page(p)
    output.add_page(q)
output.write(outp)