Tesseract
Tesseract is an open source text recognition (OCR) Engine with support for over a hundred languages.
Downloading language models¶
Tesseract needs language models to be able to transcribe images. These can be downloaded from https://github.com/tesseract-ocr/tessdata. Download the ones that you want and save them in a directory.
Then you need to tell Tesseract where to find the models using the environment variable TESSDATA_PREFIX. For example, if you save the models in ~/tessdata, then you would use the following command:
Using from Python¶
Tesseract can be used from Python using the wrapper package "pytesseract".
Tip
We recommend you install this in a Virtualenv.