Boost Productivity: Remove Text from PDFs Automatically with Python

Step-by-Step Tutorial: Removing Text from PDFs Using Python

Category: Python

Date: 8 months ago

Introduction

In today's digital world, PDF documents are used everywhere, serving as the go-to format for sharing and preserving content. However, there are times when we need to tweak these documents, whether it's removing repetitive headers, footers, or other unwanted text. Fortunately, with the power of Python scripting, this task becomes a breeze. In this article, I'll walk you through a simple yet effective method to remove text from PDF pages using Python.

Understanding the Technique

The technique is straightforward: we draw rectangles over the text we want to remove. Covering the unwanted text with filled rectangles conceals it from view without altering the rest of the document's structure. The rectangle's color should match the page background: white rectangles on a white page, yellow on a yellow one. Keep in mind that this hides the text rather than deleting it, so the text stays in the file and can still be selected or extracted.

Step-by-Step Guide

1. Setting Up the Environment

As always, let's begin by setting up a local environment for this script. On my machine, I keep a single local environment for all the scripts I use in my everyday operations. Open your terminal and execute the following commands:

    
python -m venv "${HOME}/bin/venv"
source "${HOME}/bin/venv/bin/activate"

With the environment active, install the one library the script needs, PyMuPDF. Execute the following command in the same terminal:

    
pip install PyMuPDF

2. Using the Python Script

Now, let's get to the heart of the matter. I've prepared a Python script that automates the text removal process. Simply input the coordinates of the text you want to remove, and let the script do the rest.


import sys
import os
import fitz  # PyMuPDF

def erase_pattern_or_draw_rectangle(pdf_path, color, pattern_rect, output_path=None, page_number=None):
    """
    Draws a filled rectangle of the given color over an area of each page of the PDF.

    Args:
        pdf_path (str): Path to the PDF file.
        color (str): Name of the rectangle's color, e.g. "white".
        pattern_rect (tuple): Tuple containing the coordinates (x0, y0, x1, y1) of the area to cover.
        output_path (str, optional): Path to save the modified PDF. If not provided, nothing is saved.
        page_number (int, optional): Zero-based page to modify. If not provided, all pages are modified.
    """
    # Open the PDF
    pdf_document = fitz.open(pdf_path)

    # Iterate over each page
    for page_num in range(len(pdf_document)):
        if page_number is not None and page_number != page_num:
            continue
        # Get the page
        page = pdf_document[page_num]
        page.draw_rect(pattern_rect,
                       color=fitz.utils.getColor(color),
                       fill=fitz.utils.getColor(color),
                       fill_opacity=1)


    # Save the modified PDF if output_path is provided
    if output_path:
        pdf_document.save(output_path)
    # Always release the document, whether or not it was saved
    pdf_document.close()

if len(sys.argv) >= 7:
    pdf_path = sys.argv[1]
    color = sys.argv[2]
    try:
        # Convert coordinates to integers
        pattern_rect = tuple(map(int, sys.argv[3:7]))
        page_number=None
        if len(sys.argv) == 8:
            page_number=int(sys.argv[7])
        input_filename, input_file_extension = os.path.splitext(pdf_path)
        output_filename = input_filename + "_modified" + input_file_extension
        erase_pattern_or_draw_rectangle(pdf_path, color, pattern_rect, output_filename,page_number)
        print("check output")
    except ValueError:
        print("Error: Invalid coordinates. Please provide four integers for the rectangle coordinates.")
else:
    print("Usage  : python script.py <pdf_path> <color> <x0> <y0> <x1> <y1> [page_number]")
    print("Example: python script.py <pdf_path> white 0 0 10 5")

The script takes the following arguments from the command line:

pdf_path: the path of the PDF document we want to edit.
color: the name of the rectangle's color. Use a color name that PyMuPDF recognizes, such as white, black, or yellow; an unrecognized name will not produce the color you expect.
x0 y0 x1 y1: four integers giving the coordinates of the rectangle, where (x0, y0) is the upper left corner and (x1, y1) is the lower right corner.
page_number (optional): a page number. If provided, the rectangle is drawn only on that page; otherwise it is drawn on all pages. Keep in mind that fitz numbers pages starting from zero.
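If you prefer not to index into sys.argv by hand, the same arguments could be parsed with argparse from the standard library. This is only a sketch of an alternative, not the script's actual parsing code: the function name parse_cli is hypothetical, and it accepts fractional coordinates (float) where the script above expects integers.

```python
import argparse

def parse_cli(argv):
    """Parse the same positional arguments the script expects."""
    parser = argparse.ArgumentParser(
        description="Cover a rectangle in a PDF with a colored box."
    )
    parser.add_argument("pdf_path", help="path of the PDF document to edit")
    parser.add_argument("color", help="color name, for example white")
    # Four separate positionals, one per rectangle coordinate.
    for name in ("x0", "y0", "x1", "y1"):
        parser.add_argument(name, type=float, help=f"rectangle coordinate {name}")
    # Optional trailing positional: zero-based page number.
    parser.add_argument("page_number", type=int, nargs="?", default=None,
                        help="zero-based page to modify (default: all pages)")
    return parser.parse_args(argv)

# Example with an explicit argument list instead of sys.argv[1:].
args = parse_cli(["input.pdf", "white", "0", "0", "200", "30"])
pattern_rect = (args.x0, args.y0, args.x1, args.y1)
print(args.pdf_path, args.color, pattern_rect, args.page_number)
```

In a real script you would call parse_cli(sys.argv[1:]); argparse then prints the usage message and reports bad values for you.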

3. Selecting the Target Text

Identifying the text you want to remove is the first step. Whether it's a pesky header, footer, or watermark, if you don't like it, just hide it. Run the script with the color black and some starting coordinates, such as 0 0 200 30. These coordinates draw a black rectangle starting at the upper left corner of the page, 200 points wide and 30 points high (a point is 1/72 of an inch). Keep adjusting the values until the rectangle covers exactly the text you want to hide, then change the color to white. And poof, it's gone, just like magic.

Example Use Cases

Removing repeated headers or footers from reports or invoices.
Eliminating watermarks or logos from PDF templates.
Cleaning up scanned documents by removing OCR-generated text.
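For the header and footer case, the rectangle is usually a full-width strip at the top or bottom of the page. The helpers below are a small sketch of that arithmetic, assuming coordinates in points with the origin at the upper left corner and an A4 page of roughly 595 x 842 points; the function names and the 30-point strip height are hypothetical.

```python
def header_rect(page_width, strip_height):
    """Full-width strip at the top of the page, as (x0, y0, x1, y1)."""
    return (0, 0, page_width, strip_height)

def footer_rect(page_width, page_height, strip_height):
    """Full-width strip at the bottom of the page, as (x0, y0, x1, y1)."""
    return (0, page_height - strip_height, page_width, page_height)

# Approximate A4 size in points (assumed; check your own document's page size).
A4_WIDTH, A4_HEIGHT = 595, 842

print(header_rect(A4_WIDTH, 30))
print(footer_rect(A4_WIDTH, A4_HEIGHT, 30))
```

The four numbers of each tuple are what the script expects as x0 y0 x1 y1. With PyMuPDF, the actual size of a page is available as page.rect, which avoids hard-coding A4.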

Enhancements and Further Exploration

Looking to take your PDF editing skills to the next level? Consider exploring advanced techniques such as text extraction, annotation, and metadata editing. With Python by your side, the possibilities are endless.

You can check my previous articles on the subject of editing PDF documents.

Conclusion

With Python scripting, removing text from PDF pages has never been easier. By following this step-by-step guide, you'll be able to tackle even the most stubborn text with confidence. So go ahead, give it a try, and unlock the full potential of your PDF documents.