Tesseract

Tesseract is an open-source optical character recognition (OCR) engine with support for over a hundred languages.

Downloading language models

Tesseract needs language models to transcribe images. The models are distributed as .traineddata files (for example, eng.traineddata for English) at https://github.com/tesseract-ocr/tessdata. Download the ones you need and save them all in one directory.

Next, tell Tesseract where to find the models by setting the TESSDATA_PREFIX environment variable. For example, if you saved the models in ~/tessdata, run the following command:

export TESSDATA_PREFIX=~/tessdata
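If you would rather script the download than fetch files by hand, the step above can be sketched in Python using only the standard library. This is a minimal sketch, not an official tool: the helper names model_url and download_model are hypothetical, and the URL pattern assumes the models are served from the repository's main branch under the name <language code>.traineddata.

```python
import urllib.request
from pathlib import Path

# Assumption: models live on the "main" branch of the tessdata repository.
BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main"


def model_url(lang: str) -> str:
    """Build the download URL for a language code such as "eng" (hypothetical helper)."""
    return f"{BASE_URL}/{lang}.traineddata"


def download_model(lang: str, dest_dir: str = "~/tessdata") -> Path:
    """Download one language model into dest_dir unless it is already there.

    Returns the path of the local .traineddata file.
    """
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"{lang}.traineddata"
    if not target.exists():
        # Network access happens only when the file is missing.
        urllib.request.urlretrieve(model_url(lang), str(target))
    return target
```

Calling download_model("eng") would place eng.traineddata in ~/tessdata, the same directory that TESSDATA_PREFIX points to in the example above.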

Using from Python

Tesseract can be used from Python through the pytesseract wrapper package.

Tip

We recommend installing pytesseract in a virtual environment (virtualenv).
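As an illustration of what using the wrapper can look like, here is a minimal sketch. It assumes that pytesseract and Pillow are installed, that the tesseract executable is on your PATH, and that TESSDATA_PREFIX is set as described earlier. The helper names available_languages and transcribe are hypothetical; pytesseract.image_to_string is the wrapper's own function.

```python
from pathlib import Path


def available_languages(tessdata_dir: str) -> list[str]:
    """List the language codes that have a .traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir).expanduser()
    return sorted(p.stem for p in directory.glob("*.traineddata"))


def transcribe(image_path: str, lang: str = "eng") -> str:
    """Run OCR on one image and return the recognized text."""
    # Third-party imports are local so that available_languages() can be
    # used even when pytesseract and Pillow are not installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

For example, available_languages("~/tessdata") shows which models you have downloaded, and transcribe("scan.png", lang="eng") would return the text found in a hypothetical image file named scan.png.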