Split pages in pdf

Question

I have a scanned PDF file in which two different real pages appear together on each PDF page.

The scan quality is good. The problem is that I have to zoom in when reading and drag from left to right.
Is there a command (convert, pdftk, ...) or script that can convert this PDF file into one with normal pages (one page of the book = one page in the PDF file)?

For the record, the reverse operation (joining multiple pages) can be done from the command line (rather than with "print to file") using pdfnup, from the pdfjam suite. – Skippy le Grand Gourou, Sep 26 '17 at 08:19

score 88 · Answer 1 · edited Sep 01 '20 at 13:08

Just an addition since I had issues with the python script (and several other solutions): for me mutool worked great. It's a simple and small addition shipped with the elegant mupdf reader. So you can try:

mutool poster -y 2 input.pdf output.pdf

To split each page into left and right halves instead, replace y with x. And you can, of course, combine the two for more complex layouts.
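To make the effect of the two factors concrete, here is a rough sketch (not mutool's own code) of how an x and a y factor carve a page into tiles. The helper name `poster_tiles` and the tile ordering (rows from the top, columns from the left) are assumptions made for illustration; mutool's actual output order may differ.

```python
def poster_tiles(width, height, x=1, y=1):
    """Return (left, bottom, right, top) boxes for an x-by-y grid of tiles.

    Coordinates follow the PDF convention (origin at the bottom-left).
    Hypothetical helper: the tile order is an assumption, not mutool's
    documented behaviour.
    """
    tile_w = width / x
    tile_h = height / y
    tiles = []
    for row in range(y):            # row 0 is the top strip
        top = height - row * tile_h
        for col in range(x):        # column 0 is the leftmost strip
            left = col * tile_w
            tiles.append((left, top - tile_h, left + tile_w, top))
    return tiles
```

With a landscape A4 scan (842 x 595 points), `poster_tiles(842, 595, x=2)` gives a left and a right half, while `y=2` gives a top and a bottom half.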

Really happy to have found this (after years of daily mupdf usage :)

Installing `mupdf` and `mutool` from source

(mutool comes shipped with mupdf starting from version 1.4: http://www.mupdf.com/news)

wget http://www.mupdf.com/downloads/mupdf-1.8-source.tar.gz
tar -xvf mupdf-1.8-source.tar.gz
cd mupdf-1.8-source
sudo make prefix=/usr/local install

Or go to the downloads page to find a newer version.

Installing `mutool` from a Linux distribution package

On Debian, the package containing mutool is mupdf-tools:

apt-get install mupdf-tools

edited Sep 01 '20 at 13:08

Pierre H.

answered Jan 16 '15 at 05:34

martz

3

I had a djvu... I turned it into a postscript (quite fast), then into a pdf (turtle slow) -- and finally mutool cut it so fast I thought it hadn't worked -- it had! – Julien Puydt Apr 16 '15 at 09:32
2

yes, i was also really pleased with the speed. – martz Apr 17 '15 at 00:51
8

This one is the easiest and best. mutool was made for this. Also, beware of -y, I think in most cases what you want is -x. – fiatjaf Dec 24 '15 at 20:13
mutool poster -x 2 input.pdf output.pdf does the trick really fast. You are my hero! – andrej Apr 13 '17 at 16:09
2

This utility is very fast, however I have a problem with the page order. The command allocates the right page at first position and the left page in the second one. Can somebody help me with this issue? – garciparedes Sep 23 '17 at 11:02
1

@garciparedes https://unix.stackexchange.com/questions/184799/exchange-odd-and-even-pages-in-a-pdf-file – Nobody Sep 26 '18 at 11:53
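If the halves come out right-first, the fix amounts to swapping each adjacent pair of pages. A minimal sketch of that reordering, assuming pages are numbered from 1 (the helper name `swap_pairs` is made up for this example):

```python
def swap_pairs(pages):
    """Swap each adjacent pair: [1, 2, 3, 4] becomes [2, 1, 4, 3].

    A trailing page without a partner is left where it is.
    """
    result = list(pages)
    for i in range(0, len(result) - 1, 2):
        result[i], result[i + 1] = result[i + 1], result[i]
    return result
```

Joining the result with spaces, for example `" ".join(map(str, swap_pairs(range(1, 7))))`, yields a page list that could be passed to a tool such as pdftk's cat operation.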

Gilles 'SO- stop being evil' · Accepted Answer · 2021-01-28T17:31:59.810

53

Here's a small Python script using the old PyPdf library that does the job neatly. Save it in a script called un2up (or whatever you like), make it executable (chmod +x un2up), and run it as a filter (un2up <2up.pdf >1up.pdf).

#!/usr/bin/env python
# Written for Python 2 and the legacy pyPdf library.
import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()
for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
    q = copy.copy(p)                    # shallow copy of the page object
    (w, h) = p.mediaBox.upperRight
    p.mediaBox.upperRight = (w/2, h)    # p becomes the left half
    q.mediaBox.upperLeft = (w/2, h)     # q becomes the right half
    output.addPage(p)
    output.addPage(q)
output.write(sys.stdout)

Ignore any deprecation warnings; only the PyPdf maintainers need be concerned with those.

If the input is oriented in an unusual way, you may need to use different coordinates when truncating the pages. See "Why my code not correctly split every page in a scanned pdf?"
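As a rough illustration of what "different coordinates" can mean, the sketch below picks the two half-boxes from a page's media box and its /Rotate value. It assumes /Rotate (a clockwise display rotation in multiples of 90 degrees) is the only source of the unusual orientation; the function name `half_boxes` is hypothetical, and the result is worth checking against a sample page.

```python
def half_boxes(llx, lly, urx, ury, rotate=0):
    """Return (left_half, right_half) as (llx, lly, urx, ury) tuples.

    "Left" and "right" refer to the page as displayed, i.e. after the
    clockwise /Rotate has been applied; the boxes themselves are in the
    page's unrotated coordinates. Hypothetical helper for illustration.
    """
    mid_x = (llx + urx) / 2
    mid_y = (lly + ury) / 2
    west = (llx, lly, mid_x, ury)    # unrotated left half
    east = (mid_x, lly, urx, ury)    # unrotated right half
    south = (llx, lly, urx, mid_y)   # unrotated bottom half
    north = (llx, mid_y, urx, ury)   # unrotated top half
    rotate %= 360
    if rotate == 0:
        return west, east
    if rotate == 90:     # bottom edge is displayed on the left
        return south, north
    if rotate == 180:    # right edge is displayed on the left
        return east, west
    if rotate == 270:    # top edge is displayed on the left
        return north, south
    raise ValueError("rotate must be a multiple of 90")
```

Using the box midpoint rather than w/2 also covers media boxes whose lower-left corner is not at (0, 0).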

Just in case it's useful, here's my earlier answer which uses a combination of two tools plus some manual intervention:

Pdfjam (at least version 2.0), based on the pdfpages LaTeX package, to crop the pages;
Pdftk, to put the left and right halves back together.

Both tools are needed because as far as I can tell pdfpages isn't able to apply two different transformations to the same page in one stream. In the call to pdftk, replace 42 by the number of pages in the input document (2up.pdf).

pdfjam -o odd.pdf --trim '0cm 0cm 14.85cm 0cm' --scale 1.141 2up.pdf
pdfjam -o even.pdf --trim '14.85cm 0cm 0cm 0cm' --scale 1.141 2up.pdf
pdftk O=odd.pdf E=even.pdf cat $(i=1; while [ $i -le 42 ]; do echo O$i E$i; i=$(($i+1)); done) output all.pdf
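The shell loop in the pdftk call only builds the argument list O1 E1 O2 E2 and so on. For readers who find that loop hard to parse, the same list can be sketched in Python (the helper name `interleave_handles` is an assumption for this example):

```python
def interleave_handles(pages, first="O", second="E"):
    """Build the pdftk page list O1 E1 O2 E2 ... for `pages` input pages."""
    return [f"{handle}{i}"
            for i in range(1, pages + 1)
            for handle in (first, second)]
```

For a 42-page input, `" ".join(interleave_handles(42))` produces the text that the `$(...)` substitution expands to.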

In case you don't have pdfjam 2.0, it's enough to have a PDFLaTeX installation with the pdfpages package (on Ubuntu, you need texlive-latex-recommended and perhaps texlive-fonts-recommended), and use the following driver file driver.tex:

\batchmode
\documentclass{minimal}
\usepackage{pdfpages}
\begin{document}
\includepdfmerge[trim=0cm 0cm 14.85cm 0cm,scale=1.141]{2up.pdf,-}
\includepdfmerge[trim=14.85cm 0cm 0cm 0cm,scale=1.141]{2up.pdf,-}
\end{document}

Then run the following commands, replacing 42 by the number of pages in the input file (which must be called 2up.pdf):

pdflatex driver
pdftk driver.pdf cat $(i=1; pages=42; while [ $i -le $pages ]; do echo $i $(($pages+$i)); i=$(($i+1)); done) output 1up.pdf
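Here driver.pdf holds all the left halves first, followed by all the right halves, so the loop pairs page i with page pages+i. A small sketch of that ordering, with a hypothetical helper name:

```python
def interleave_offsets(pages):
    """Page order 1, pages+1, 2, pages+2, ... for a 2*pages-page file."""
    order = []
    for i in range(1, pages + 1):
        order += [i, pages + i]
    return order
```

For a three-page input this pairs pages 1 and 4, 2 and 5, then 3 and 6.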

edited Jan 28 '21 at 17:31

answered May 02 '11 at 20:51

Gilles 'SO- stop being evil'

The PyPdf library works perfect. I only changed it a little and run it with python conv_pdf.py res.pdf . How would you run your script shebang from commandline? – xralf May 03 '11 at 08:03
I'd like to try the version with pdfjam (because of slight scaling) too, but after the installation of pdfjam package my shell won't recognize pdfjam command. – xralf May 03 '11 at 08:34
@xralf: My python script just reads from standard input and writes to standard output. The pdfjam version requires pdfjam 2.0; it's only a small wrapper around pdfpages, and I've added the bit of LaTeX it generates so you can use that directly. The scaling issue is probably solvable with pypdf, it could be a page size issue (I may or may not be able to help if you give more details on what's happening and especially the page sizes involved). – Gilles 'SO- stop being evil' May 03 '11 at 10:22
Thank you, the difference is in very slightly worse resolution, but this doesn't matter. I will turn back to it when I know more about Latex (it's too complex for me now and the solution is really good with PyPdf). – xralf May 03 '11 at 11:10
@Gilles: Nice! (1) With some change, I guess your un2up script will work on my scanned pdf, where two paper pages sit side by side in a pdf page with the earlier page being the left half. un2up splits each pdf page in height and then put the later page in front of the earlier one. Do you have some suggestions? (2) I also wonder how to specify the MediaBox for the output pdf in the script? After splitting, I convert the split pdf into djvu, but I got the warning that "File has an empty MediaBox.Using the current page size instead." Thanks! – Tim Aug 08 '11 at 02:16
1

@Gilles Very useful script. I expected to see something like that in pdfjam or pdftk. Anyway, some people may want some modifications to split pages along the other axis and use a different ordering. This is possible by changing a few lines and using q.mediaBox.lowerRight = (w, h/2) – ony Feb 26 '12 at 20:44
Great script!! If someone is getting error decrypting pdf even if pdf is not encrypted, add following line before for loop: if (input.getIsEncrypted()): input.decrypt('') From https://bugs.launchpad.net/pypdf/+bug/355479 – cram1010 Mar 28 '12 at 18:27
Thanks! I had some trouble getting this script to work in 2020, and although I eventually got it work (change pyPdf to PyPDF2, open and write from files opened in binary mode instead of stdin/stdout, and this instead of copy.copy(p)), another alternative script is here. – ShreevatsaR Oct 06 '20 at 05:55
I'm the current maintainer of pypdf , PyPDF2 and pdfly: Please don't use PyPDF2. We upgraded it and moved to pypdf. And if you want a CLI program, I think pdfly will suit you well in many cases :-) – Martin Thoma Sep 09 '23 at 11:45
1

Note that this script no longer works with recent PyPDF2 2.x versions. I've posted an updated version of the script here: https://unix.stackexchange.com/a/758425/233460 – Matthijs Kooijman Oct 08 '23 at 18:58

score 20 · Answer 3 · edited May 20 '11 at 10:23

Imagemagick can do it in one step:

$ convert in.pdf -crop 50%x0 +repage out.pdf

edited May 20 '11 at 10:23

Michael Mrozek

answered May 20 '11 at 07:28

tomas

2

Thanks. If I add the -density 400 parameter it has even better quality. – xralf May 20 '11 at 14:27
16

It looks like convert uses raster as an intermediate format. That causes a blurry look even when the original PDF contains vector objects. – ony Feb 26 '12 at 20:40
Does anyone know of a way to do this without rasterizing page contents along the way...or at least to set a higher resolution? – Tomislav Nakic-Alfirevic Feb 28 '13 at 21:47
5

this rendered texts into images and created pdf from images. Maybe nice for pics but useless for text extraction. – andrej Apr 13 '17 at 16:06
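The effect of -density is easier to judge with numbers. PDF page sizes are given in points (72 per inch), and ImageMagick rasterizes at 72 dpi unless told otherwise, so the pixel size of the intermediate image can be estimated as below. This is a back-of-the-envelope sketch, and the helper name `raster_size` is made up for it.

```python
def raster_size(width_pt, height_pt, density=72):
    """Approximate pixel dimensions of a page rendered at `density` dpi."""
    points_per_inch = 72
    return (round(width_pt / points_per_inch * density),
            round(height_pt / points_per_inch * density))
```

A landscape A4 page (842 x 595 points) comes out at roughly 842 x 595 pixels by default and about 4678 x 3306 pixels with -density 400, which is why text looks much sharper but files grow.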

score 7 · Answer 4 · edited Apr 13 '17 at 12:37

Based on the answer from Gilles and on how to find the PDF page count, I wrote:

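#!/bin/bash
# Usage: split.sh input.pdf [margin] [scale]

pdforiginal=$1
pdfodd=$pdforiginal.odd.pdf
pdfeven=$pdforiginal.even.pdf
pdfout=output_$1
margin=${2:-0}
scale=${3:-1}

# Number of pages in the input
pages=$(pdftk "$pdforiginal" dump_data | grep NumberOfPages | awk '{print $2}')

# Fifth field of pdfinfo's "Page size" line
pagesize=$(pdfinfo "$pdforiginal" | grep "Page size" | awk '{print $5}')
# Amount to trim: half the page size minus the requested margin
margin=$(echo "$pagesize/2-$margin" | bc -l)

# Trim the right side for the odd pages, the left side for the even pages
pdfjam -o "$pdfodd" --trim "0cm 0cm ${margin}pt 0cm" --scale "$scale" "$pdforiginal"
pdfjam -o "$pdfeven" --trim "${margin}pt 0cm 0cm 0cm" --scale "$scale" "$pdforiginal"

# Interleave the two files: O1 E1 O2 E2 ...
pdftk O="$pdfodd" E="$pdfeven" cat $(i=1; while [ $i -le $pages ]; do echo O$i E$i; i=$(($i+1)); done) output "$pdfout"

rm "$pdfodd" "$pdfeven"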

So I can run

./split.sh my.pdf 50 1.2

where 50 adjusts the margin and 1.2 is the scale factor.
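The margin arithmetic in the script is short but easy to misread: the value passed to --trim is half the page size minus the margin, so each output page keeps half the page size plus the margin. A minimal sketch of that calculation, with a hypothetical helper name and all values in points:

```python
def trim_and_kept(page_size_pt, margin_pt=0):
    """Return (trimmed, kept) for one half of a two-up page.

    `trimmed` matches the value the script computes with bc;
    `kept` is what remains of the page along that dimension.
    """
    trimmed = page_size_pt / 2 - margin_pt
    kept = page_size_pt - trimmed
    return trimmed, kept
```

With a page size of 842 and a margin of 50, 371 points are trimmed and 471 are kept, so the two halves overlap by twice the margin around the centre line.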

score 6 · Answer 5 · answered May 02 '11 at 21:07

ImageMagick's convert command can help you crop your file into two parts. See http://www.imagemagick.org/Usage/crop/

If I were you, I'd write a (shell) script like this:

Split your file with pdfsam: 1 page = 1 file on disk. (The format doesn't matter; choose one that ImageMagick knows. I'd just take PS or PDF.)
For each page, crop the first half and put it to a file named ${PageNumber}A
Crop the second half and put it to a file named ${PageNumber}B.

You get 1A.pdf, 1B.pdf, 2A.pdf, 2B.pdf, etc.
Now, assemble this again in a new PDF. There are many methods to do this.
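One detail to watch when reassembling: a plain lexicographic sort of these names puts 10A.pdf before 1A.pdf, so the file list is better generated than globbed. A small sketch, assuming the naming scheme above (the helper name `assembly_order` is made up here):

```python
def assembly_order(pages, suffixes=("A", "B"), ext=".pdf"):
    """File names in reading order: 1A.pdf, 1B.pdf, 2A.pdf, 2B.pdf, ..."""
    return [f"{number}{suffix}{ext}"
            for number in range(1, pages + 1)
            for suffix in suffixes]
```

The resulting list can be handed, in order, to whichever tool does the final merge.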

answered May 02 '11 at 21:07

tiktak

1

Wouldn't using ImageMagick rasterize the files? And you should explain that last part inline, especially for the benefit of the non-francophones in the audience. – Gilles 'SO- stop being evil' May 02 '11 at 22:17
1

Because you don't need to understand French. It just shows how you can use ImageMagick's convert, pdftk, or ghostscript (gs) alone to achieve this goal. I like using pdftk. "Rastering" doesn't matter as it's a scanned document. – tiktak May 04 '11 at 09:29

score 5 · Answer 6 · answered Apr 08 '18 at 22:56

The best solution was mutool (see above):

sudo apt install mupdf-tools pdftk

the split:

mutool poster -y 2 input.pdf output.pdf

but then you need to rotate the pages left:

pdftk output.pdf cat 1-endleft output rotated.pdf

answered Apr 08 '18 at 22:56

Eduard Florinescu

Still no overlap... – MUY Belgium Apr 21 '18 at 23:20

score 5 · Answer 7 · answered Apr 01 '13 at 12:20

Here's a variation of the PyPDF code posted by Gilles. This function will work no matter what the page orientation is:

import copy
import math
import pyPdf

def split_pages(src, dst):
    src_f = file(src, 'r+b')
    dst_f = file(dst, 'w+b')

    input = pyPdf.PdfFileReader(src_f)
    output = pyPdf.PdfFileWriter()

    for i in range(input.getNumPages()):
        p = input.getPage(i)
        q = copy.copy(p)
        q.mediaBox = copy.copy(p.mediaBox)

        x1, x2 = p.mediaBox.lowerLeft
        x3, x4 = p.mediaBox.upperRight

        x1, x2 = math.floor(x1), math.floor(x2)
        x3, x4 = math.floor(x3), math.floor(x4)
        x5, x6 = math.floor(x3/2), math.floor(x4/2)

        if x3 > x4:
            # horizontal
            p.mediaBox.upperRight = (x5, x4)
            p.mediaBox.lowerLeft = (x1, x2)

            q.mediaBox.upperRight = (x3, x4)
            q.mediaBox.lowerLeft = (x5, x2)
        else:
            # vertical
            p.mediaBox.upperRight = (x3, x4)
            p.mediaBox.lowerLeft = (x1, x6)

            q.mediaBox.upperRight = (x3, x6)
            q.mediaBox.lowerLeft = (x1, x2)

        output.addPage(p)
        output.addPage(q)

    output.write(dst_f)
    src_f.close()
    dst_f.close()

ShreevatsaR · Answer 8 · 2020-10-09T21:59:24.767

I made a webpage for this purpose just now: here. From mobile or desktop, upload a PDF containing two-page spreads, and it will split each such page into two of half the width. Click on the new filename to download the resulting PDF.

Everything is self-contained in a single HTML file so if you don't want to rely on webpages that may go away (not that I have any intention of taking it down), you can save the webpage and use it offline. It also seems that the file size doesn't double, unlike some of the other solutions.

The main reason I wrote this, rather than using one of the existing answers on this page directly, is that I noticed that in many PDFs containing two-page spreads, there are also some pages that correspond to a single page. For example, the first page may be the book cover. We don't want to slice such pages in half; we want to split only the pages that are genuinely made of two pages side-by-side (identifiable as being wider than they are tall).

Alternatively, here are a couple of Python scripts implementing that feature (it shouldn't make a difference if your PDF contains only two-page spreads, but otherwise, replace the width > height check with True if you don't need it):

A modification of the accepted answer above, updated to Python 3 (using PyPDF2) and avoiding copy.copy, which does not always seem to work:

#!/usr/bin/env python3
import copy
import sys
import PyPDF2
'''Run as:
    python3 un2up.py foo.pdf foo-split.pdf
Generates foo-split.pdf.'''
input = PyPDF2.PdfFileReader(open(sys.argv[1], 'rb'))
output = PyPDF2.PdfFileWriter()
for p in [input.getPage(i) for i in range(0, input.getNumPages())]:
    (w, h) = p.mediaBox.upperRight
    if w > h:
        q = PyPDF2.pdf.PageObject.createBlankPage(
            None, p.mediaBox.getWidth(), p.mediaBox.getHeight())
        q.mergePage(p)
        p.mediaBox.upperRight = (w/2, h)
        q.mediaBox.upperLeft = (w/2, h)
        output.addPage(p)
        output.addPage(q)
    else:
        output.addPage(p)
output.write(open(sys.argv[2], 'wb'))

A light modification of an example from the pdfrw library here:

'''Run as:
    python3 unspread.py foo.pdf
Generates file `unspread.foo.pdf`.'''
import sys
import os
from pdfrw import PdfReader, PdfWriter, PageMerge
def splitpage(src):
    '''Split a page into two (left and right).'''
    # Yield a result for each half of the page
    x, y, width, height = src.MediaBox
    if width > height:
        for x_pos in (0, 0.5):
            yield PageMerge().add(src, viewrect=(x_pos, 0, 0.5, 1)).render()
    else:
        yield src
inpfn = sys.argv[1]
outfn = 'unspread.' + os.path.basename(inpfn)
writer = PdfWriter(outfn)
for page in PdfReader(inpfn).pages:
    writer.addpages(splitpage(page))
writer.write()

score 2 · Answer 9 · answered Oct 09 '19 at 12:00

moraes' solution did not work for me. The main problem was the x5 and x6 calculation: an offset has to be taken into account when lowerLeft is not at (0,0).

So here is another variation, with additional adaptations to use PyPDF2 and Python 3:

import copy
import math
import PyPDF2
import sys
import io 

def split_pages(src, dst):
    src_f = io.open(src, 'r+b')
    dst_f = io.open(dst, 'w+b')

    input = PyPDF2.PdfFileReader(src_f)
    output = PyPDF2.PdfFileWriter()

    for i in range(input.getNumPages()):
        p = input.getPage(i) 
        q = copy.copy(p)
        q.mediaBox = copy.copy(p.mediaBox)

        x1, x2 = p.cropBox.lowerLeft
        x3, x4 = p.cropBox.upperRight        

        x1, x2 = math.floor(x1), math.floor(x2)
        x3, x4 = math.floor(x3), math.floor(x4)

        x5 = math.floor((x3-x1) / 2 + x1)
        x6 = math.floor((x4-x2) / 2 + x2)

        if x3 > x4:        
            # horizontal
            p.mediaBox.upperRight = (x5, x4)
            p.mediaBox.lowerLeft = (x1, x2)

            q.mediaBox.upperRight = (x3, x4)
            q.mediaBox.lowerLeft = (x5, x2)
        else:
            # vertical        
            p.mediaBox.lowerLeft = (x1, x6)
            p.mediaBox.upperRight = (x3, x4)

            q.mediaBox.upperRight = (x3, x6)
            q.mediaBox.lowerLeft = (x1, x2)

        output.addPage(p)
        output.addPage(q)

    output.write(dst_f)
    src_f.close()
    dst_f.close()

if __name__ == "__main__":
    if ( len(sys.argv) != 3 ):
        print ('Usage: python3 double2single.py input.pdf output.pdf')
        sys.exit(1)

    split_pages(sys.argv[1], sys.argv[2])

score 2 · Answer 10 · answered Jul 05 '20 at 04:11

Perhaps an updated version of @Gilles accepted answer with PyPDF2. Using Python 3 on MacOS.

#!/usr/bin/env python
# usage: pdf_split_double_page.py 2-up-file.pdf output-file.pdf
import copy, sys
from PyPDF2 import PdfFileReader, PdfFileWriter
input = PdfFileReader(sys.argv[1])
output = PdfFileWriter()
for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
    q = copy.copy(p)
    (w, h) = p.mediaBox.upperRight
    p.mediaBox.upperRight = (w/2, h)
    q.mediaBox.upperLeft = (w/2, h)
    output.addPage(p)
    output.addPage(q)
with open(sys.argv[2], "wb") as out_file:
    output.write(out_file)

Don't forget to run chmod u+x <script_name>.py on the file.

score 1 · Answer 11 · answered Jul 03 '17 at 08:20

Based on the answer by Benjamin at AskUbuntu, I would recommend using the GUI tool called gscan2pdf.

Import the PDF scan file to gscan2pdf. Note that non-image PDF files may not work. Scans are fine, so you don't have to worry.
It might take a while depending on the size of the document. Wait till it loads up.
Press Ctrl+A to select all pages and then rotate (Ctrl+Shift+C) them if necessary.
Go to Tools >> Clean up. Select Layout as double and # output pages = 2.
Hit OK and wait till the job is finished.
Save the PDF file. Done.

Tested, failed with complex pdf documents with a huge number of images. – MUY Belgium, Apr 21 '18 at 23:28

Matthijs Kooijman · Answer 12 · 2023-10-08T19:14:24.700

Here's an updated version of the script from Gilles' answer, adapted to use the more modern PyPDF2 2.x (confusingly there's also a PyPDF2 1.x in addition to the original PyPDF version, but PyPDF2 2.x seems to be the current version and is what I get when I install python3-pypdf on Ubuntu 23.04).

The script below does the same as the original script, just adapted to the newer naming conventions. Also, it now accepts a filename as a command-line argument (and automatically derives a filename for the output) instead of reading from stdin and writing to stdout, as pypdf now seems to seek in the input file, which does not seem to work on stdin (even when redirected from a file).

#!/usr/bin/env python
import copy, sys, pathlib
from pypdf import PdfWriter, PdfReader
inp = pathlib.Path(sys.argv[1])
outp = inp.parent / (inp.stem + '-split' + inp.suffix)
input = PdfReader(inp)
output = PdfWriter()
for p in [input.pages[i] for i in range(0, len(input.pages))]:
    q = copy.copy(p)
    (w, h) = p.mediabox.upper_right
    p.mediabox.upper_right = (w/2, h)
    q.mediabox.upper_left = (w/2, h)
    output.add_page(p)
    output.add_page(q)
output.write(outp)
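If the filter-style usage of the original script is still wanted, one possible workaround for the seeking problem is to buffer stdin in memory first. This is only a sketch: it assumes the whole document fits in memory, and the helper name `seekable_copy` is made up for this example.

```python
import io


def seekable_copy(stream):
    """Read a binary stream to the end and return a seekable in-memory copy."""
    return io.BytesIO(stream.read())
```

The returned object could then be given to the reader in place of a filename, for example `PdfReader(seekable_copy(sys.stdin.buffer))`, on the assumption that the reader accepts file-like objects as well as paths.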
