Introduction to PyMuPDF's fitz module
In this article I show how I use the fitz module from the PyMuPDF library to save the first page of a PDF file as a PNG image file. fitz is the import name of PyMuPDF, the Python bindings for MuPDF, which is a lightweight PDF and XPS viewer and toolkit. The fitz module provides a high-level interface for working with PDF documents, including reading, editing, and creating PDF files.
Here are some of the features provided by the fitz module:
Reading PDF files
You can use fitz.open() to open a PDF file and retrieve a Document object, which provides access to the pages of the PDF document.
Working with pages
You can use methods of the Page object, such as get_text() and get_images(), to extract text and list the images used on a PDF page, or insert_image() and insert_text() to add new elements to the page.
Creating PDF files
You can call fitz.open() with no arguments to create a new, empty PDF document, and then use the Document object (for example its new_page() method) to add pages and elements to the document.
Manipulating PDF files
You can use methods of the Document object, such as delete_page() and insert_page(), to remove or add pages in a PDF document, then save() to write the document to a file and close() to release it.
In my case I only needed a way to "convert" the first page to a PNG image file, sort of like a screenshot.
Here is the script:
import sys
import fitz  # provided by the PyMuPDF package

pdf_path = sys.argv[1]
doc = fitz.open(pdf_path)
page = doc.load_page(0)  # page indexing is 0-based, so 0 is the first page
dpi = 300
# PDF pages are measured in points (72 per inch), so dpi/72 is the zoom factor
pix = page.get_pixmap(matrix=fitz.Matrix(dpi / 72, dpi / 72))
pix.set_dpi(dpi, dpi)  # store the resolution in the image metadata
image_bytes = pix.tobytes()  # PNG is the default output format
# Save image file next to the PDF, as <pdf_path>.png
with open(f"{pdf_path}.png", "wb") as f:
    f.write(image_bytes)
doc.close()
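The dpi/72 factor in the script comes from PDF page sizes being measured in points, 72 to the inch. Here is a small sketch of the arithmetic in plain Python. The function name pixel_size is my own, and for page sizes that do not divide evenly the real pixmap may differ by a pixel.

```python
def pixel_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI.

    width_pt and height_pt are the page size in PDF points (1/72 inch).
    """
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# A US Letter page is 612 x 792 points (8.5 x 11 inches).
letter_at_300 = pixel_size(612, 792, 300)
letter_at_72 = pixel_size(612, 792, 72)
print(letter_at_300, letter_at_72)
```

So at 300 DPI each side of the image is a bit more than four times the size in points, which is why the output files get large quickly.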
I call the Python script from a bash script because I did not want to install PyMuPDF globally on my machine; I have it installed in a local virtual environment instead.
A virtual environment is an isolated Python environment: packages installed in it are specific to one project and do not affect the global Python installation, which avoids package conflicts. The bash script activates that environment and then runs the Python script, so the fitz module is available to it.
The bash script
#!/bin/bash
source "${HOME}/path/to/myapp/venv/bin/activate"
python "${HOME}/bin/pdf2png.py" "$1"
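If you want to confirm that the wrapper really runs the script inside the virtual environment, a check like the following could go at the top of the Python script. This is a sketch using only the standard library; both helper names are my own.

```python
import importlib.util
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the venv and sys.base_prefix
    # at the Python installation the venv was created from.
    return sys.prefix != sys.base_prefix


def module_available(name):
    """True when a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


if not module_available("fitz"):
    print("fitz not found; is the virtual environment active?", file=sys.stderr)
print("running in a virtual environment:", in_virtualenv())
```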
And that's it. Now you have a script to extract PDF pages as images. You can add features to the script, for example to extract a specific page or to iterate over all of them.
Comments
Ahmed, March 2023
To install fitz: pip install pymupdf