Feb 2023

In the last month, we introduced the ability to convert document extractions to CSV and the ability to validate the quality of extracted OCR text using confidence scores.

New feature: Extract document data to CSV

You can now download any document extraction as a comma-separated values (CSV) file in addition to downloading it in Excel format. This feature lets you easily convert tables, rows, labels, checkboxes, and other document primitives into spreadsheet-style layouts using the same rules as for SenseML to spreadsheet conversion. You can also use the API to compile multiple extractions into one CSV file.

Improvement: OCR confidence scores at high verbosity

If you configure high output verbosity, you can now view confidence scores for anchor text and extracted text that was OCR'd. These scores measure the quality of the source text images on a scale of zero to one. For example, illegible handwriting or blurry scanned text receives a score closer to zero. Sensible returns a null confidence score for text that wasn't OCR'd.

You can write validations to test that the data you extract meets a minimum threshold for OCR quality. For example, if you wanted to check that the source text for a quoted rate value isn't too blurry, you could write a rule like the following:

   {"or":[{"not": {"exists":{"var":"quote_rate.valueConfidence"}}},
   {">=": [{"var":"quote_rate.valueConfidence"},"0.90"]}]}
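To make the logic of the rule concrete, here is a minimal Python sketch of the same check run on the client side. It is an illustration only, not Sensible's implementation: the shape of the `sample` dictionary, the field names other than `quote_rate`, and the helper `passes_ocr_check` are all hypothetical, and it assumes that a null or missing `valueConfidence` means the text wasn't OCR'd and should pass.

```python
from typing import Any, Optional

MIN_CONFIDENCE = 0.90  # same threshold as the validation rule above


def passes_ocr_check(parsed: dict, field_id: str,
                     threshold: float = MIN_CONFIDENCE) -> bool:
    """Hypothetical helper: True if the field has no OCR score or the score meets the threshold."""
    field: dict[str, Any] = parsed.get(field_id) or {}
    confidence: Optional[float] = field.get("valueConfidence")
    if confidence is None:
        # Null or missing score: assume the text wasn't OCR'd, so there is nothing to check.
        return True
    return float(confidence) >= threshold


# Hypothetical extraction output; real output may be shaped differently.
sample = {
    "quote_rate": {"value": 4.25, "valueConfidence": 0.97},
    "policy_number": {"value": "A-1029", "valueConfidence": None},
    "broker_name": {"value": "J. Sm?th", "valueConfidence": 0.41},
}

results = {field_id: passes_ocr_check(sample, field_id) for field_id in sample}
print(results)
# {'quote_rate': True, 'policy_number': True, 'broker_name': False}
```

In this sketch a field fails only when it has a score and that score is below the threshold, which matches the rule's intent of ignoring text that wasn't OCR'd.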