Multi-page document recognition settings

Aspose.OCR for Java lets you customize recognition accuracy, performance, and other behavior by configuring the properties of the `DocumentRecognitionSettings` object.

These settings are applicable when extracting text from multi-page TIFF images and scanned PDF documents.

Method	Parameter	Default state	Description
`setAllowedCharacters`	Case-sensitive string of characters or one of the predefined character sets: `CharactersAllowedType.ALL` - try to recognize all characters; `CharactersAllowedType.LATIN_ALPHABET` - recognize only unaccented Latin / English letters in either case (`A` to `Z` and `a` to `z`); `CharactersAllowedType.DIGITS` - recognize only binary, octal, decimal, or hexadecimal numbers (`0`-`9` and `A` to `F`).	All characters from the selected recognition language.	The whitelist of characters the Aspose.OCR engine will look for.
`setAutoContrast`	`true` - enable; `false` - disable	Disabled	Automatically increase the contrast of images before proceeding to recognition.
`setAutoDenoising`	`true` - enable; `false` - disable	Disabled	Automatically remove noise from images before proceeding to recognition.
`setAutoSkew`	`true` - enable; `false` - disable	Enabled	Automatically correct image tilt (deskew) before proceeding to recognition.
`setDetectAreas`	`true` - enable; `false` - disable	Enabled	Automatically select the optimal areas detection algorithm that suits the most common use cases.
`setDetectAreasMode`	`DetectAreasMode`	Automatic	Manually override the default document areas detection method.
`setIgnoredCharacters`	Case-sensitive string of characters	All characters are recognized	A blacklist of characters that are ignored during recognition.
`setLanguage`	Recognition language	Extended Latin characters, including diacritics	Specify a language for recognition.
`setLinesFiltration`	`true` - enable; `false` - disable	Enabled	Set to `true` to recognize text in tables. Set to `false` to improve performance by ignoring table structures and treating tables as plain text.
`setPagesNumber`	Number of pages, `int`	1	The number of pages to be recognized in a multi-page file.
`setSkew`	Skew angle, `double`	0	Manually rotate the image by the specified degree.
`setStartPage`	Page number, `int`	First page	The page number from which to start recognition of the multi-page file. The first page number is `0`.
`setThreadsCount`	Number of threads, `int`	Automatic	The number of CPU threads used for recognition.
`setThresholdValue`	Binarization threshold, `int`	Automatic	Override the automatic binarization settings.
`setUpscaleSmallFont`	`true` - enable; `false` - disable	Disabled	Improve small font recognition and detection of dense lines.
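
The difference between a whitelist (`setAllowedCharacters`) and a blacklist (`setIgnoredCharacters`), and the fact that both are case-sensitive, can be pictured with a small stand-alone sketch. The `CharacterFilter` class below is hypothetical and is not part of Aspose.OCR; it filters an already recognized string, whereas the real engine applies these settings during recognition and may behave differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration, not part of Aspose.OCR: a conceptual model of
// case-sensitive character whitelists and blacklists applied to a string.
final class CharacterFilter {
    private CharacterFilter() {}

    /** Converts a string into the set of its characters (case-sensitive). */
    private static Set<Character> toSet(String chars) {
        Set<Character> set = new HashSet<>();
        for (char c : chars.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    /**
     * Keeps a character only if it is on the whitelist (when one is given)
     * and not on the blacklist. Pass null as allowed to accept every character.
     */
    static String apply(String text, String allowed, String ignored) {
        Set<Character> allowedSet = allowed == null ? null : toSet(allowed);
        Set<Character> ignoredSet = toSet(ignored == null ? "" : ignored);
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            boolean isAllowed = allowedSet == null || allowedSet.contains(c);
            if (isAllowed && !ignoredSet.contains(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Whitelist: keep only digits and the decimal point.
        System.out.println(apply("Total: 12.50 EUR", "0123456789.", null));
        // Blacklist: drop the upper-case letters E, U and R.
        System.out.println(apply("Total: 12.50 EUR", null, "EUR"));
        // Case sensitivity: a lower-case whitelist does not match upper-case letters.
        System.out.println(apply("AaBb", "ab", null));
    }
}
```

A whitelist is usually the better fit when the expected content is narrow (for example, numeric fields), while a blacklist suits documents where only a few known characters cause trouble.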

Example

The following code example shows how to fine-tune recognition:

import com.aspose.ocr.*;
import java.util.ArrayList;

// Create an instance of the OCR API
AsposeOCRPdf api = new AsposeOCRPdf();
// Specify recognition settings
DocumentRecognitionSettings recognitionSettings = new DocumentRecognitionSettings();
// Start from the fourth page (page numbers are zero-based)
recognitionSettings.setStartPage(3);
// Recognize 10 pages
recognitionSettings.setPagesNumber(10);
// Recognize French text
recognitionSettings.setLanguage(Language.Fra);
// Extract text from the PDF document
ArrayList<RecognitionResult> results = api.RecognizePdf("source.pdf", recognitionSettings);
// Save results to a Microsoft Word document
AsposeOCR ocr = new AsposeOCR();
ocr.SaveMultipageDocument("result.docx", Format.Docx, results);
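
Because `setStartPage` is zero-based while `setPagesNumber` is a count, it is easy to be off by one when choosing values. The stand-alone sketch below shows how the two settings translate into page indexes. The `PageRange` helper is hypothetical and is not part of Aspose.OCR, and clamping at the end of the document is an assumption of this sketch rather than documented engine behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Aspose.OCR: it only illustrates how a
// zero-based start page and a page count translate into page indexes.
final class PageRange {
    private PageRange() {}

    /**
     * Returns the zero-based indexes of the pages that would be processed.
     * Clamping at the end of the document is an assumption of this sketch.
     */
    static List<Integer> select(int startPage, int pagesNumber, int totalPages) {
        if (startPage < 0 || pagesNumber < 0 || totalPages < 0) {
            throw new IllegalArgumentException("Arguments must be non-negative");
        }
        List<Integer> pages = new ArrayList<>();
        // Stop at whichever comes first: the requested count or the last page.
        int end = (int) Math.min((long) startPage + pagesNumber, totalPages);
        for (int page = startPage; page < end; page++) {
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        // Same values as setStartPage(3) and setPagesNumber(10), for a 20-page file.
        List<Integer> pages = select(3, 10, 20);
        System.out.println("Zero-based indexes: " + pages);
        System.out.println("Human-readable pages: " + (pages.get(0) + 1)
                + " to " + (pages.get(pages.size() - 1) + 1));
    }
}
```

With a start page of `3` and a page count of `10`, the sketch selects indexes 3 through 12, which a PDF viewer would show as pages 4 through 13.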
