In this tutorial, I will walk through the steps needed to perform OCR on Ubuntu using Tesseract, Google’s open-source OCR engine, which was originally developed at HP Labs.

Installation

Install the dependencies.

sudo apt-get install libpng-dev libjpeg-dev libtiff-dev zlib1g-dev 
sudo apt-get install gcc g++ 
sudo apt-get install autoconf automake libtool checkinstall

Tesseract needs the image processing toolkit Leptonica in order to compile. Unlike older versions, it will not build without it.

cd ~
wget http://www.leptonica.org/source/leptonica-1.69.tar.gz
tar -zxvf leptonica-1.69.tar.gz
cd leptonica-1.69
./configure
make
sudo checkinstall
sudo ldconfig

Compile and install Tesseract.

cd ~ 
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
tar -zxvf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr
./autogen.sh
./configure
make # (this may take a while)
sudo make install
sudo ldconfig

Get and install the English language data.

wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
tar -zxvf tesseract-ocr-3.02.eng.tar.gz
sudo mv ./tesseract-ocr/tessdata/* /usr/local/share/tessdata/
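
Before moving on, it can help to confirm that the binary and the English data ended up where expected. The following is a minimal Python sketch, not part of Tesseract itself: the function name check_install is hypothetical, and it assumes the data directory /usr/local/share/tessdata used above and the file name eng.traineddata from the English package.

```python
import shutil
from pathlib import Path


def check_install(tessdata_dir="/usr/local/share/tessdata", lang="eng"):
    """Report whether the tesseract binary and a language file can be found.

    check_install is a hypothetical helper for this tutorial; the default
    directory is the one the mv command above copies the data into.
    """
    binary = shutil.which("tesseract")  # None if tesseract is not on PATH
    traineddata = Path(tessdata_dir) / f"{lang}.traineddata"
    return {
        "binary": binary,
        "traineddata": str(traineddata),
        "traineddata_present": traineddata.is_file(),
    }
```

Calling check_install() after the steps above should report a path for the binary and traineddata_present as True. If either is missing, revisit the corresponding install step.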

That’s it! We are done installing Tesseract on Ubuntu. Now, let’s test it on an image.

Testing “HELLO WORLD!!”

I have this pretty old scanned page of a poem eulogizing Sherlock Holmes.

Figure 1: Image to test [JPG]

We will run Tesseract from the command line as shown below.

tesseract image.png output

Here:

  1. tesseract  - is the command.
  2. image.png  - is the path to the image on which we are running OCR. I am assuming that image.png is in the current working directory.
  3. output  - is the base name of the output text file. Tesseract appends .txt to it, so the text ends up in output.txt.

By default, output.txt is stored in the current directory.
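
If you would rather drive the same command from a script, here is a minimal Python sketch. It only wraps the command line shown above with the standard subprocess module. The helper names build_command and run_ocr are hypothetical, and the sketch assumes that tesseract is on the PATH and that the text lands in a file named after the output base plus .txt.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image_path, output_base):
    """Return the argument list for: tesseract <image> <output_base>."""
    return ["tesseract", str(image_path), str(output_base)]


def run_ocr(image_path, output_base="output"):
    """Run Tesseract and return the recognised text.

    Returns None when no tesseract binary is found on PATH. Both helpers
    are hypothetical names used only for this illustration.
    """
    if shutil.which("tesseract") is None:
        return None
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_command(image_path, output_base), check=True)
    # Tesseract appends .txt to the output base name.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

For the example in this post, run_ocr("image.png") would run the same command as above and return the contents of output.txt.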

Now let’s check the output in output.txt:

2213
(rout wan w. suns)

HERE dwell rogether still two men of note
Who never lived and so can never die:
How very near they seem, ye: how remote
Tm age berm me world went all awry.
But still the game’: afoot for rhose with ears
Avtuned to catch the distant View-halloo:
England is England yer, for all our fenrs—
Only those lhlngs the heart ézlin/ex are true.

A yellow fog swirls pm the window-pane

A: night descends upon lhls fabled street:

A lonely hansom splashes through the rain,

The ghostly gas lamps ran at (Wenly feet.

Here, though the world explode, these two survive,
And it is always eighteen ninety-five.

MAW. H‘ .9“ Vmczwr Snuuus-rr

There are a lot of misspelled words in the output file, such as rogether instead of together. Many of these can be corrected with a Python spell-checking module such as PyEnchant.
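
To give a rough idea of what such a correction pass looks like, here is a small sketch. Note that it does not use PyEnchant: it swaps in difflib from the Python standard library so that it runs without extra packages. The vocabulary is a tiny hypothetical word list chosen for this one line of the poem, and the cutoff of 0.8 is an assumption you would need to tune.

```python
import re
from difflib import get_close_matches

# Hypothetical mini vocabulary for illustration only; a real run would load
# a full dictionary word list.
VOCABULARY = ["dwell", "here", "men", "note", "of", "still", "together", "two"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged."""
    if word.lower() in vocabulary:
        return word  # already a known word, keep its original casing
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocabulary=VOCABULARY, cutoff=0.8):
    """Apply correct_word to every alphabetic token, leaving the rest alone."""
    return re.sub(
        r"[A-Za-z]+",
        lambda m: correct_word(m.group(0), vocabulary, cutoff),
        text,
    )


print(correct_text("HERE dwell rogether still two men of note"))
# prints: HERE dwell together still two men of note
```

A similarity-based pass like this only fixes words that are close to a dictionary entry. Badly garbled tokens such as lhls are left unchanged at this cutoff, and lowering the cutoff risks "correcting" words that were right.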

Additional Notes

  1. Unofficial Tesseract Documentation
  2. More Language packs
  3. Installation on Ubuntu