Advanced Tesseract Configuration
Tesseract has many configuration parameters for controlling all sorts of aspects of the recognition process. These parameters are enumerated and described in G8TesseractParameters.h.
You can read the value of a given parameter using the variableValueForKey: method.
For example, to read the value of the whitelist parameter (the whitelist consists of only the characters Tesseract should recognize):
// Assuming "tesseract" is an already initialized `G8Tesseract` object
NSString *whitelist = [tesseract variableValueForKey:kG8ParamTesseditCharWhitelist];
You can set the value of a given parameter in one of three ways: individually (after initialization), using a dictionary (during initialization), or using one or more configuration files (during initialization). You can also combine these methods. The parameters will be set (and possibly overridden) in the order in which you use each method.
// Assuming "tesseract" is an already initialized `G8Tesseract` object
// Set the whitelist to recognize only the numbers 0 through 9
[tesseract setVariableValue:@"0123456789" forKey:kG8ParamTesseditCharWhitelist];
// During initialization, set the whitelist to recognize only the numbers 0 through 9
// and disable word dictionaries
G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"
configDictionary:@{
kG8ParamTesseditCharWhitelist: @"0123456789",
kG8ParamLoadSystemDawg : @"F",
kG8ParamLoadFreqDawg : @"F",
}
configFileNames:nil
cachesRelatedDataPath:nil
engineMode:G8OCREngineModeTesseractOnly];
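The ordering rule described earlier (later settings override earlier ones) can be illustrated by combining the dictionary and individual approaches. This is a minimal sketch that uses only the calls shown on this page; the function name `DemonstrateParameterOverride` is hypothetical, and the umbrella import `<TesseractOCR/TesseractOCR.h>` is an assumption about how the framework is linked in your project.

```objc
// Assumption: the framework is imported through its umbrella header.
#import <TesseractOCR/TesseractOCR.h>

// Hypothetical demo function, not part of the library.
static void DemonstrateParameterOverride(void) {
    // Step 1: set the whitelist through the dictionary during initialization
    G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"
                                                  configDictionary:@{
                                                      kG8ParamTesseditCharWhitelist: @"0123456789",
                                                  }
                                                   configFileNames:nil
                                             cachesRelatedDataPath:nil
                                                        engineMode:G8OCREngineModeTesseractOnly];

    // Step 2: set the same parameter individually after initialization;
    // because this call comes later, it is expected to replace the earlier value
    [tesseract setVariableValue:@"0123456789ABCDEF" forKey:kG8ParamTesseditCharWhitelist];

    // Step 3: read the value back to see which setting is in effect
    NSString *whitelist = [tesseract variableValueForKey:kG8ParamTesseditCharWhitelist];
    NSLog(@"Effective whitelist: %@", whitelist);
}
```

Reading the value back with variableValueForKey: is a cheap way to confirm which of several overlapping settings actually took effect.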
Let's say you have one or more Tesseract configuration files in your "tessdata" folder (under a subdirectory of "tessconfigs" or "configs", as required by Tesseract). You can initialize a G8Tesseract object with these files by providing an array of the absolute file paths to the configuration files:
debugConfig.txt
tessdata_manager_debug_level 1
recognitionConfig.txt
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
tessedit_char_whitelist 0123456789
Note that the above configuration files use the actual Tesseract parameter key strings instead of the variables defined in G8TesseractParameters.h.
ViewController.m
// Construct the paths to our config files
NSString *resourcePath = [NSBundle bundleForClass:G8Tesseract.class].resourcePath;
NSString *tessdataFolderName = @"tessdata";
NSString *tessdataFolderPathFromTheBundle = [[resourcePath stringByAppendingPathComponent:tessdataFolderName] stringByAppendingString:@"/"];
NSString *debugConfigFileName = @"debugConfig.txt";
NSString *recognitionConfigFileName = @"recognitionConfig.txt";
NSString *tessConfigsFolderName = @"tessconfigs";
NSString *debugConfigFilePath = [[tessdataFolderPathFromTheBundle stringByAppendingPathComponent:tessConfigsFolderName] stringByAppendingPathComponent:debugConfigFileName];
NSString *recognitionConfigFilePath = [[tessdataFolderPathFromTheBundle stringByAppendingPathComponent:tessConfigsFolderName] stringByAppendingPathComponent:recognitionConfigFileName];
// Initialize the `G8Tesseract` object using the config files
// (kG8Languages is assumed to be an NSString language constant defined elsewhere, such as @"eng")
G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:kG8Languages
configDictionary:nil
configFileNames:@[debugConfigFilePath, recognitionConfigFilePath]
cachesRelatedDataPath:nil
engineMode:G8OCREngineModeTesseractOnly];
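Because the configuration files are passed as absolute paths, a mistyped folder or file name is easy to miss. As a sketch, the paths can be checked with Foundation's NSFileManager before they are handed to the initializer. The helper name `MissingConfigFiles` is hypothetical and not part of the library.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the library): returns the subset of
// `paths` that does not point at an existing regular file.
static NSArray<NSString *> *MissingConfigFiles(NSArray<NSString *> *paths) {
    NSFileManager *fileManager = [NSFileManager defaultManager];
    NSMutableArray<NSString *> *missing = [NSMutableArray array];
    for (NSString *path in paths) {
        BOOL isDirectory = NO;
        BOOL exists = [fileManager fileExistsAtPath:path isDirectory:&isDirectory];
        // Treat both absent paths and directories as "missing" config files
        if (!exists || isDirectory) {
            [missing addObject:path];
        }
    }
    return [missing copy];
}
```

For example, calling `MissingConfigFiles(@[debugConfigFilePath, recognitionConfigFilePath])` and logging a non-empty result before initialization makes a wrong path visible right away.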
What if we want to download language/configuration files at runtime for use with Tesseract? Since the "tessdata" folder in our application's bundle is read-only, we can't store newly downloaded files there.
The solution is to use a custom path relative to your app's Caches directory for storing the "tessdata" folder. When you initialize your G8Tesseract object, set the option cachesRelatedDataPath to be a filepath string relative to the Caches directory.
Note that even if you use this option, you must still create a referenced folder in your Xcode project called "tessdata", even if you don't put any files in it.
For example, let's say we want our "tessdata" folder to be located at "Caches/foo/bar/tessdata":
G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"
configDictionary:nil
configFileNames:nil
cachesRelatedDataPath:@"foo/bar"
engineMode:G8OCREngineModeTesseractOnly];
Upon executing the code above, the directory "Caches/foo/bar/tessdata" will be created (if it doesn't already exist), and all of the contents of the referenced "tessdata" folder in the Xcode project will be copied there. Finally, Tesseract will be initialized to use "Caches/foo/bar/tessdata" as its tessdata location, and it will search for any language/configuration files there.
So if you later download a new language/configuration file, store it in "Caches/foo/bar/tessdata" and re-initialize Tesseract with the same cachesRelatedDataPath, this time specifying the new language/configuration file in the initWithLanguage and/or configFileNames options.
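The download-then-reinitialize flow above can be sketched as follows. This is an illustration under stated assumptions, not the library's own mechanism: the helper names `InstallTessdataFile` and `ReinitializeWithDownloadedLanguage` are hypothetical, the downloaded file is assumed to already exist at a local file URL, "deu" is only an example language, and the umbrella import is an assumption about how the framework is linked.

```objc
#import <Foundation/Foundation.h>
// Assumption: the framework is imported through its umbrella header.
#import <TesseractOCR/TesseractOCR.h>

// Hypothetical helper: moves an already downloaded file into
// Caches/foo/bar/tessdata under the given file name.
// Returns NO on failure; `error` is filled in by NSFileManager where available
// (it is left untouched if the Caches directory cannot be located).
static BOOL InstallTessdataFile(NSURL *downloadedFileURL, NSString *fileName, NSError **error) {
    NSString *cachesPath = NSSearchPathForDirectoriesInDomains(NSCachesDirectory, NSUserDomainMask, YES).firstObject;
    if (cachesPath == nil) {
        return NO;
    }
    NSString *tessdataPath = [cachesPath stringByAppendingPathComponent:@"foo/bar/tessdata"];
    NSFileManager *fileManager = [NSFileManager defaultManager];

    // Make sure the target folder exists (harmless if it already does)
    if (![fileManager createDirectoryAtPath:tessdataPath
                withIntermediateDirectories:YES
                                 attributes:nil
                                      error:error]) {
        return NO;
    }

    NSURL *destinationURL = [NSURL fileURLWithPath:[tessdataPath stringByAppendingPathComponent:fileName]];
    // Remove any previous copy so the move below does not fail on an existing file
    [fileManager removeItemAtURL:destinationURL error:NULL];
    return [fileManager moveItemAtURL:downloadedFileURL toURL:destinationURL error:error];
}

// Hypothetical follow-up: re-initialize with the same cachesRelatedDataPath,
// naming the newly installed language (here assumed to be "deu.traineddata").
static G8Tesseract *ReinitializeWithDownloadedLanguage(void) {
    return [[G8Tesseract alloc] initWithLanguage:@"deu"
                                configDictionary:nil
                                 configFileNames:nil
                           cachesRelatedDataPath:@"foo/bar"
                                      engineMode:G8OCREngineModeTesseractOnly];
}
```

Keeping the relative path ("foo/bar") identical in both places matters: the file is installed under the same "tessdata" folder that the initializer will search.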