How to extract text from images using Python?

Category: Python

Date: March 2023

As someone who loves programming and technology, I find the concept of extracting text from images incredibly fascinating. Whether it's digitizing a scanned document or reading the text off a screenshot, the applications are wide-ranging. One of the most widely used Python libraries for this purpose is Pytesseract, an open-source wrapper around the Tesseract optical character recognition (OCR) engine.

Have you ever watched a programming tutorial on YouTube and wished you could copy the code straight from the video? I do it all the time: I take a screenshot of the video and extract the text from it using Pytesseract.

I was amazed the first time I used Pytesseract. With just a few lines of code, I could pull text out of an image that I would otherwise have had to retype by hand.

Before you get started with Pytesseract, install it with the command:

   
pip install pytesseract

Also, make sure to install Pillow, the maintained fork of PIL (it is still imported as PIL), by running:

   
pip install pillow

Of course, Pytesseract is only a wrapper: it runs the tesseract program as a subprocess, so Tesseract itself must be installed on your system.

On my Arch Linux system, I installed it using:

   
sudo pacman -S tesseract tesseract-data

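Other distributions package Tesseract as well, but the package manager and package names differ. On Debian and Ubuntu, for example, the engine is packaged as tesseract-ocr.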
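Since Pytesseract depends on the external tesseract program, it can help to confirm that the executable is reachable before running any OCR. The following is a minimal sketch using only the standard library; the helper names find_tesseract and tesseract_version are made up for this example and are not part of Pytesseract.

```python
import shutil
import subprocess


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(binary_name)


def tesseract_version(path):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Depending on the release, the version banner goes to stdout or stderr.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; install it with your package manager")
    else:
        print(path, "->", tesseract_version(path))
```

If the executable lives somewhere that is not on PATH, Pytesseract can be pointed at it by setting pytesseract.pytesseract.tesseract_cmd to the full path.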

To extract text from an image using Pytesseract, load the image with Pillow's Image.open() and pass it to Pytesseract's image_to_string() function. This function processes the image and returns the extracted text as a string.

   
import pytesseract
from PIL import Image

#Load the image
image = Image.open('example.png')

#Extract text from the image
text = pytesseract.image_to_string(image)

#Print the extracted text
print(text)

It's that straightforward! There are also optional parameters you can use to improve the accuracy of text extraction. For instance, the lang parameter tells Tesseract which language to expect: pass lang="eng" for English or lang="fra" for French. Pytesseract defaults to English, but Tesseract supports many other languages, provided the matching language data is installed.
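As a rough sketch of the lang parameter in use: Tesseract accepts several language codes joined with a plus sign (for example "eng+fra"), which is handy when a screenshot mixes languages. The helper names join_languages and extract_text are hypothetical, and example.png is the same placeholder file used earlier.

```python
def join_languages(codes):
    """Build the value for the lang parameter, e.g. ["eng", "fra"] -> "eng+fra"."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def extract_text(image_path, codes=("eng",)):
    """Run OCR on image_path using the given Tesseract language codes."""
    # Imported here so join_languages() stays usable without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=join_languages(codes))


# Example call (assumes example.png exists and the French data is installed):
# print(extract_text("example.png", codes=["eng", "fra"]))
```

Running tesseract --list-langs in a terminal, or calling pytesseract.get_languages() in recent Pytesseract releases, shows which language codes are installed on your system.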

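It's important to note that Pytesseract is not perfect. Its accuracy depends on the image quality and the complexity of the text. If you're dealing with low-quality images or unusual fonts, you may need to preprocess the image (for example, convert it to grayscale, enlarge it, or increase the contrast) or adjust Tesseract's settings, such as the page segmentation mode passed through the config parameter, to achieve the best results.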
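One way to experiment is to clean up the image with Pillow first and then pass Tesseract options through the config parameter of image_to_string(). The sketch below is only a starting point: the helper names are hypothetical, and the scale factor, the threshold of 150, and page segmentation mode 6 (treat the image as a single uniform block of text) are assumptions you will likely need to tune for your own images.

```python
def build_config(psm=6, oem=None):
    """Return a Tesseract option string such as "--psm 6" or "--oem 1 --psm 6"."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = []
    if oem is not None:
        parts.append(f"--oem {oem}")
    parts.append(f"--psm {psm}")
    return " ".join(parts)


def preprocess(image, scale=2, threshold=150):
    """Grayscale, upscale, and binarize a PIL image before OCR."""
    # Imported here so build_config() stays usable without Pillow installed.
    from PIL import Image, ImageOps

    gray = ImageOps.grayscale(image)
    enlarged = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Pixels brighter than the threshold become white, everything else black.
    return enlarged.point(lambda value: 255 if value > threshold else 0)


def extract_text_tuned(image_path, psm=6):
    """Preprocess the image, then run OCR with the chosen page segmentation mode."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(preprocess(image), config=build_config(psm))


# Example call (assumes example.png exists):
# print(extract_text_tuned("example.png", psm=6))
```

For a screenshot of code, comparing the output with and without preprocess() and trying a couple of psm values is usually enough to see which settings suit the image.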

In summary, Pytesseract is a robust tool for extracting text from images. It's simple to use and can be a game-changer for automating text extraction tasks. However, keep in mind its limitations and experiment with different configurations to achieve the best results. Happy coding!