OCR

When you extract document data with Sensible, Sensible automatically performs optical character recognition (OCR) on the document for you, except in advanced cases. If the document doesn't require OCR, Sensible automatically extracts embedded text directly from the document to optimize performance.

For advanced cases, you can configure how Sensible OCRs documents using the following options:

option	configurable for	notes
OCR Level parameter	document types	Use this option to configure the criteria by which Sensible determines if a whole document requires OCR.
OCR preprocessor	configs	Use this option to OCR specified pages or page ranges in a document.
OCR Engine parameter	document types	Use this option to choose your OCR provider, for example, Amazon, Google, or Microsoft.

For an overview of how Sensible handles OCR, see the following steps:

1. Sensible converts non-image file types into PDFs or extracts the text directly, depending on the file type. If Sensible extracts text directly in this step, it skips the following steps.
2. Sensible transforms the bytes of the document into raw text and determines whether the document needs OCR:
   - If the file type is an image (for example, PNG), Sensible runs OCR for the whole document, using the engine specified by the document type's OCR Engine parameter.
   - (Configurable) If the file is a PDF, Sensible processes the file using heuristics to determine if the whole document needs OCR. Configure this step using the document type's OCR Level and OCR Engine parameters.
3. (Configurable) After additional intervening steps, Sensible applies your configured preprocessors, including the OCR preprocessor. This preprocessor runs for documents that don't trigger whole-document OCR in a previous step.
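
The steps above can be sketched as a small decision function. This is an illustrative model only, not Sensible's implementation or API: the names `Document`, `OcrDecision`, `decide_ocr`, and the two file-type sets are hypothetical, and the `needs_ocr_heuristic` flag stands in for whatever heuristics the OCR Level parameter controls.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class OcrDecision(Enum):
    DIRECT_TEXT = "direct_text"        # text extracted directly; later steps skipped
    WHOLE_DOCUMENT = "whole_document"  # OCR runs on every page
    SELECTED_PAGES = "selected_pages"  # OCR runs only on pages the OCR preprocessor names
    EMBEDDED_TEXT = "embedded_text"    # embedded PDF text is used as-is


# Assumed, illustrative file-type sets; the source only names PNG as an image example.
IMAGE_TYPES = {"png"}
DIRECT_TEXT_TYPES = {"txt"}


@dataclass
class Document:
    file_type: str
    # Hypothetical stand-in for the result of the PDF heuristics (OCR Level).
    needs_ocr_heuristic: bool = False
    # Hypothetical stand-in for pages listed in an OCR preprocessor.
    ocr_preprocessor_pages: Tuple[int, ...] = ()


def decide_ocr(doc: Document) -> OcrDecision:
    file_type = doc.file_type.lower()

    # First step: some non-image types yield text directly, skipping everything else.
    if file_type in DIRECT_TEXT_TYPES:
        return OcrDecision.DIRECT_TEXT

    # Second step: images always get whole-document OCR.
    if file_type in IMAGE_TYPES:
        return OcrDecision.WHOLE_DOCUMENT

    # Second step, PDFs (native or converted): heuristics decide on whole-document OCR.
    if doc.needs_ocr_heuristic:
        return OcrDecision.WHOLE_DOCUMENT

    # Third step: the OCR preprocessor applies only if whole-document OCR didn't run.
    if doc.ocr_preprocessor_pages:
        return OcrDecision.SELECTED_PAGES

    return OcrDecision.EMBEDDED_TEXT
```

In this model, a PDF that both trips the heuristic and lists preprocessor pages still resolves to whole-document OCR, which mirrors the ordering described above: the OCR preprocessor only matters for documents that didn't already trigger OCR for every page.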

Notes

For more information about OCR versus embedded text extraction, see Solving direct text extraction from PDFs.
For information about extracting data from non-text images, such as photographs, charts, or illustrations, see the Query Group method's Multimodal Engine parameter. You can also use the Multimodal Engine parameter as an alternative to OCR to extract data from poor-quality images of text, such as handwriting.