In this tutorial, I will walk through the steps needed to perform OCR with Tesseract, the open source OCR engine sponsored by Google. It was originally developed at HP Labs.
Install the dependencies.
sudo apt-get install libpng-dev libjpeg-dev libtiff-dev zlib1g-dev
sudo apt-get install gcc g++
sudo apt-get install autoconf automake libtool checkinstall
Tesseract needs the image processing toolkit Leptonica in order to compile. Unlike older versions, it will not build without it.
cd ~
wget http://www.leptonica.org/source/leptonica-1.69.tar.gz
tar -zxvf leptonica-1.69.tar.gz
cd leptonica-1.69
./configure
make
sudo checkinstall
sudo ldconfig
Compile and install Tesseract.
cd ~
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
tar -zxvf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr
./autogen.sh
./configure
make # (this may take a while)
sudo make install
sudo ldconfig
Get and install the English language data.
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
tar -zxvf tesseract-ocr-3.02.eng.tar.gz
sudo mv ./tesseract-ocr/tessdata/* /usr/local/share/tessdata/
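Before moving on, it can help to confirm that the language data landed where Tesseract will look for it. The following is a minimal sketch in Python, assuming the data file is named eng.traineddata and lives in /usr/local/share/tessdata as in the command above. The helper name has_language_data is hypothetical and is not part of Tesseract.

```python
from pathlib import Path


def has_language_data(lang="eng", tessdata_dir="/usr/local/share/tessdata"):
    """Return True if <lang>.traineddata exists in the tessdata directory.

    Hypothetical helper: the default directory matches the install step
    in this tutorial and may differ on other systems.
    """
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()
```

Calling has_language_data() with no arguments checks for the English data in the default location. If it returns False, the mv step most likely did not copy the files.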
That’s it! We are done installing Tesseract on Ubuntu. Now, let’s test it on an image.
Testing “HELLO WORLD!!”
I have a pretty old scanned page of a poem eulogizing Sherlock.
We will run Tesseract from the command line as shown below.
tesseract image.png output
tesseract - the command.
image.png - the path to the image on which we are running OCR. I am assuming that the image is in the current directory.
output - the base name of the output file. The recognized text will be stored in a text file named output.txt in the current directory.
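The same call can also be made from a script. The following is a minimal sketch in Python, assuming the tesseract binary is on the PATH. The helper names build_command and run_ocr are hypothetical and are not part of Tesseract.

```python
import subprocess
from pathlib import Path


def build_command(image_path, output_base):
    """Return the argument list for: tesseract <image> <output base>."""
    return ["tesseract", str(image_path), str(output_base)]


def run_ocr(image_path, output_base="output"):
    """Run Tesseract on an image and return the recognized text.

    Hypothetical helper: assumes the tesseract binary is on the PATH.
    """
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_command(image_path, output_base), check=True)
    # Tesseract appends .txt to the output base name.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

With the image in the current directory, run_ocr("image.png") would run the command above and return the contents of output.txt as a string.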
Now let’s check the output in output.txt.
2213 (rout wan w. suns) HERE dwell rogether still two men of note Who never lived and so can never die: How very near they seem, ye: how remote Tm age berm me world went all awry. But still the game’: afoot for rhose with ears Avtuned to catch the distant View-halloo: England is England yer, for all our fenrs— Only those lhlngs the heart ézlin/ex are true. A yellow fog swirls pm the window-pane A: night descends upon lhls fabled street: A lonely hansom splashes through the rain, The ghostly gas lamps ran at (Wenly feet. Here, though the world explode, these two survive, And it is always eighteen ninety-ﬁve. MAW. H‘ .9“ Vmczwr Snuuus-rr
There are a lot of misspelled words in the output file, like rogether instead of together. Many of these words can be corrected with a Python spell checker module such as PyEnchant.
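As a rough illustration of that correction step, the sketch below uses Python's standard difflib module as a stand-in for PyEnchant, so that it runs without extra packages. The tiny word list and the helpers correct_word and correct_text are assumptions made up for this example. With PyEnchant, a dictionary object's check and suggest methods would play the same role against a full English dictionary.

```python
import difflib

# Tiny illustrative vocabulary (an assumption); a real spell checker
# would use a full dictionary instead.
KNOWN_WORDS = ["together", "those", "things", "yet", "attuned", "fears"]


def correct_word(word, vocabulary=KNOWN_WORDS, cutoff=0.8):
    """Return the closest known word, or the word itself if none is close."""
    if word.lower() in vocabulary:
        return word
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Correct each whitespace-separated token independently."""
    return " ".join(correct_word(token) for token in text.split())
```

Under these assumptions, correct_word("rogether") picks together because it is the only vocabulary entry similar enough to pass the cutoff. Words with no close match are left unchanged, so heavily garbled tokens in the OCR output would still need manual review.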