Tesseract is a commercial-quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV.
Tesseract is one of the most powerful open source OCR engines available today. OCR stands for Optical Character Recognition, the process of extracting text from images. The main repository (Tesseract Open Source OCR Engine) is published on GitHub under the Apache-2.0 license, and the documentation lives in the separate tessdoc repository.
Monday, 24 August, 2020
Optical character recognition (OCR) is the conversion of images containing text to machine-encoded text. A popular tool for this is the open source project Tesseract. Tesseract can be used as a standalone application from the command line. Alternatively, it can be integrated into applications using its C++ API. For other programming languages, various wrapper APIs are available. In this post we will use the Java wrapper Tess4J.
Getting started
We start by adding the Tess4J Maven dependency (groupId net.sourceforge.tess4j, artifactId tess4j) to our project.
Next we need to make sure the native libraries required by Tess4J are accessible from our application. The Tess4J jar files ship with native libraries included, but these need to be extracted before they can be loaded. We can do this programmatically with the Tess4J utility method LoadLibs.extractTessResources(..).
LoadLibs.extractTessResources(..) extracts resources from the jar file to a local temp directory. Note that the argument (for example win32-x86-64 on 64-bit Windows) depends on the system you are using. You can see the available options by looking into the Tess4J jar file. We can then instruct Java to load native libraries from the temp directory by setting the Java system property java.library.path.
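Choosing the folder name by hand is easy to get wrong when the same code runs on several machines. As a rough sketch, the name could be derived from the os.name and os.arch system properties. The class and method names below are hypothetical and not part of Tess4J, and the mapping assumes the jar bundles only the Windows folders win32-x86 and win32-x86-64, which is worth checking against the jar you actually use.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper, not part of Tess4J. */
final class NativeFolder {

    /**
     * Maps os.name / os.arch values to a resource folder name.
     * Assumption: only win32-x86 and win32-x86-64 are bundled in the jar,
     * so every other platform yields an empty Optional.
     */
    static Optional<String> resourceFolder(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        if (!os.startsWith("windows")) {
            return Optional.empty();
        }
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            return Optional.of("win32-x86-64");
        }
        if (arch.equals("x86") || arch.equals("i386")) {
            return Optional.of("win32-x86");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> folder = resourceFolder(
                System.getProperty("os.name"), System.getProperty("os.arch"));
        System.out.println(folder
                .map(name -> "Folder to extract: " + name)
                .orElse("No bundled folder assumed for this platform"));
    }
}
```

When the result is empty, installing Tesseract on the system is the more likely route.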
Another way to provide the libraries is to install Tesseract on your system. If you do not want to change the java.library.path property, you can also load the libraries manually using System.load(..).
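For the manual route, keep in mind that System.load(..) expects an absolute file name. The following is a minimal sketch under stated assumptions: the class ManualLoader and its methods are hypothetical, the directory is whatever folder holds the extracted libraries, and loading in alphabetical order is assumed to satisfy any dependencies between them. If one library depends on another, an explicit order may be needed instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Tess4J. */
final class ManualLoader {

    /** True if the file name looks like a native library (.dll, .so, .dylib). */
    static boolean isNativeLibrary(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        return name.endsWith(".dll") || name.endsWith(".so") || name.endsWith(".dylib");
    }

    /** Lists the native libraries directly inside dir, sorted by path. */
    static List<Path> findLibraries(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .filter(Files::isRegularFile)
                    .filter(p -> isNativeLibrary(p.getFileName().toString()))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Loads every library found in dir; System.load needs absolute paths. */
    static void loadAll(Path dir) {
        for (Path library : findLibraries(dir)) {
            System.load(library.toAbsolutePath().toString());
        }
    }

    public static void main(String[] args) {
        // Only lists candidates; call loadAll(dir) to actually load them.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        findLibraries(dir).forEach(System.out::println);
    }
}
```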
Next we need to provide language-dependent data files to Tesseract. These data files contain trained models for Tesseract's LSTM OCR engine and can be downloaded from GitHub. For example, for detecting German text we have to download deu.traineddata (deu is the ISO 639-2 language code for German). We place one or more downloaded data files in the resources/data directory.
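A missing data file typically only shows up once recognition is attempted, so it can help to check the directory up front. The sketch below is one way to do that. The class TrainedData and its methods are hypothetical, and the path resources/data is taken relative to the working directory, which is an assumption about the project layout. Several languages can be combined with a plus sign, as in deu+eng.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Tess4J. */
final class TrainedData {

    /** Turns a language setting such as "deu+eng" into the expected file names. */
    static List<String> requiredFiles(String languages) {
        List<String> files = new ArrayList<>();
        for (String code : languages.split("\\+")) {
            String trimmed = code.trim();
            if (!trimmed.isEmpty()) {
                files.add(trimmed + ".traineddata");
            }
        }
        return files;
    }

    /** Returns the expected files that do not exist in dataDir. */
    static List<String> missingFiles(Path dataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String file : requiredFiles(languages)) {
            if (!Files.isRegularFile(dataDir.resolve(file))) {
                missing.add(file);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed location of the downloaded data files.
        Path dataDir = Paths.get("resources", "data");
        System.out.println("Missing data files: " + missingFiles(dataDir, "deu"));
    }
}
```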
Detecting Text
Now we are ready to use Tesseract within our Java application. A minimal example takes three steps.
First we create a new Tesseract instance and set the language we want to recognize (here: German, using the code deu). With setOcrEngineMode(1) we tell Tesseract to use the LSTM OCR engine.
Next we set the data directory with setDatapath(..) to the directory containing our downloaded LSTM models (here: resources/data).
Finally we load an example image from the classpath and use the doOCR(..) method to perform character recognition. As a result we get a String containing the detected characters.
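Loading the image is the step where two quiet failures can occur: getResourceAsStream(..) returns null when the resource is missing, and ImageIO.read(..) returns null when no registered reader can decode the data. The sketch below turns both cases into explicit errors before the image is handed on. The class ImageLoader and the resource name /ocrexample.jpg are placeholders chosen for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** Hypothetical helper for loading the input image. */
final class ImageLoader {

    /** Reads an image from the stream and fails loudly instead of returning null. */
    static BufferedImage readImage(InputStream in) {
        if (in == null) {
            throw new IllegalArgumentException("Image resource not found on the classpath");
        }
        try (InputStream stream = in) {
            BufferedImage image = ImageIO.read(stream);
            if (image == null) {
                throw new IllegalArgumentException("No ImageIO reader could decode the stream");
            }
            return image;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "/ocrexample.jpg" is a placeholder resource name.
        InputStream in = ImageLoader.class.getResourceAsStream("/ocrexample.jpg");
        if (in == null) {
            System.out.println("Put ocrexample.jpg on the classpath to try this example");
            return;
        }
        BufferedImage image = readImage(in);
        System.out.println("Loaded image: " + image.getWidth() + "x" + image.getHeight());
    }
}
```

The returned BufferedImage can then be passed to doOCR(..).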
For example, feeding Tesseract a photo from the German Wikipedia article on OCR might produce the following text output.
Text output:
Summary
Tesseract is a popular open source project for OCR. With Tess4J we can access the Tesseract API in Java. A little bit of setup is required for loading native libraries and downloading Tesseract's LSTM data. After that it is quite easy to perform OCR in Java. If you are not happy with the recognized text, it is a good idea to have a look at the "Improving the quality of the output" section of the Tesseract documentation.
You can find the source code for the shown example on GitHub.
Has anyone tried Tessa? The free version does OCR okay, but functionality is really handicapped. I wanted to get feedback from others before I upgraded to the paid version. Okay, it's only US$10, maybe I'll try it anyhow.
Here's the history behind this... For a while now, I've used a Windows 7 system with a Canon scanner and Nuance OmniPage 18 for OCR. But lately I've decided to eliminate the Windows 7 system if possible, so I needed to move scanning and OCR over to my Mac. Well, I got the scanner up and running in no time (excellent job, Canon, thanks!), but OmniPage turned out to be more of a headache. The OmniPage DVD did not come with a Mac version, and a call to Nuance Support informed me that this version of OmniPage doesn't run on the Mac. It turns out I could buy OmniPage Pro X for about $45 on Amazon. I might do that.
Has anyone used OmniPage Pro on a Mac? The OmniPage feature I liked on my Windows 7 box was the MS Word integration plugin. I hope that OmniPage Pro for Mac has a similar plugin for Pages. Can anyone confirm this?
I know there are other OCR options, but right now I like OmniPage because I'm used to it, and I like Tessa because I like to use software that's based on free and open source software when I have a chance to do so.
Any thoughts or opinions on either package would be appreciated.