In this tutorial, I will walk through the steps needed to perform OCR on Ubuntu using Tesseract, Google’s open-source OCR engine. It was initially developed at HP Labs.
Install the dependencies.
We need the image processing toolkit Leptonica to compile Tesseract. Unlike older versions, recent versions will not compile without it.
Compile and install Tesseract.
Get and install the English language data.
That’s it! We are done installing Tesseract on Ubuntu. Now, let’s test it on an image.
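Before moving on, it can help to confirm that the binary and the English data are actually visible. Here is a minimal Python sketch of such a check. The helper names tesseract_path and has_language are my own, and I am assuming a Tesseract release that supports the --list-langs option.

```python
import shutil
import subprocess


def tesseract_path():
    """Return the full path of the tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")


def has_language(list_langs_output, code="eng"):
    """Check whether a language code appears on its own line in the listing."""
    lines = [line.strip() for line in list_langs_output.splitlines()]
    return code in lines


if __name__ == "__main__":
    path = tesseract_path()
    if path is None:
        print("tesseract not found on PATH")
    else:
        result = subprocess.run([path, "--list-langs"], capture_output=True, text=True)
        # Some releases print the listing to stderr, so look at both streams.
        listing = result.stdout + result.stderr
        print("English data installed:", has_language(listing))
```

If the script reports that the English data is missing, the language data step above is the one to revisit.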
Testing “HELLO WORLD!!”
Now I have got this pretty old scanned page of a poem eulogizing Sherlock.
We will run Tesseract from the command line as shown below.
tesseract image.png output
tesseract - is the command.
image.png - is the path to the image on which we are running OCR. I am assuming that image.png is in the current directory.
output - is the base name for the output. The recognized text will be stored in a text file named output.txt in the current directory.
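The same invocation can also be scripted. Below is a minimal Python sketch that builds the three-part command described above and reads the resulting text file back. The function names build_command and run_ocr are hypothetical helpers of mine, and the sketch assumes the output file is UTF-8 text.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image_path, output_base):
    """Return the argument list for: tesseract <image> <output base>."""
    return ["tesseract", str(image_path), str(output_base)]


def run_ocr(image_path, output_base):
    """Run Tesseract and return the text read from <output_base>.txt."""
    subprocess.run(build_command(image_path, output_base), check=True)
    # Tesseract adds the .txt extension to the output base name itself.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Only attempt OCR when the binary and the sample image are both present.
    if shutil.which("tesseract") and Path("image.png").exists():
        print(run_ocr("image.png", "output"))
    else:
        print("tesseract or image.png not found; nothing to do")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the image path contains spaces.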
Now let’s check the output in output.txt.
There are a lot of misspelled words in the output file, like rogether instead of together. These words can be corrected with a Python spell checker module such as PyEnchant.
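To give an idea of what such a correction pass could look like, here is a minimal sketch. Note that TinyDict is a hypothetical stand-in built on the standard library's difflib, not PyEnchant itself. It only mimics the check() and suggest() methods that a PyEnchant dictionary object provides, so that the example runs without extra packages. The word list is made up for the demonstration.

```python
import difflib
import re


class TinyDict:
    """Hypothetical stand-in exposing check() and suggest() over a small word list."""

    def __init__(self, words):
        self.words = sorted({w.lower() for w in words})

    def check(self, word):
        # A word is "correct" when it appears in the word list.
        return word.lower() in self.words

    def suggest(self, word):
        # Closest known words by difflib similarity, best match first.
        return difflib.get_close_matches(word.lower(), self.words, n=3, cutoff=0.8)


def correct_text(text, dictionary):
    """Replace each unknown word with the dictionary's first suggestion, if any."""

    def fix(match):
        word = match.group(0)
        if dictionary.check(word):
            return word
        suggestions = dictionary.suggest(word)
        return suggestions[0] if suggestions else word

    return re.sub(r"[A-Za-z]+", fix, text)


if __name__ == "__main__":
    vocab = TinyDict(["we", "walk", "together"])
    print(correct_text("we walk rogether", vocab))
```

Because correct_text only relies on check() and suggest(), a real PyEnchant dictionary such as enchant.Dict("en_US") should be usable in place of TinyDict. Blindly taking the first suggestion can also "fix" proper nouns and rare words, so the result is worth a manual review.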
- Unofficial Tesseract Documentation
- More Language packs
- Installation on Ubuntu