Validating extractions
Quality control the data extractions in a document type by writing validations. For example, use validations to:
- Test an extracted field's value using Boolean, logic, numeric, array, string, and other operations. For example, test that the field
bank_account_balanceis non-null and greater than $100, or check fields' sums (e.g., initial balance + sum of credits - sum of debits = ending balance). - If Sensible extracted a field from OCR'd text, test the confidence score for the field's anchor and value as a measure of the quality of the text images. For example, test that text in a scanned document isn't blurry or illegible.
- Use validation output to automatically flag extractions for review in the Sensible app. For example, flag a document extraction for human review if 4 out of 5 validations fail.
A validation is a condition for a field or fields written in JsonLogic that outputs a Boolean and either passes (for true output) or fails (for false output). Validations have access to fields' values in the parsed_document object.
Create validations
Sensible app
To create validations in the Sensible app:
- Click the document type.
- Click Create validation.
- Enter the parameters for the validation.
- Click Create.
Parameters
A validation has the following parameters:
| id | value | notes |
|---|---|---|
| description (required) | string | A description of the test |
| severity (required) | error, warning, skipped | The severity of the failing test. |
| prerequisite fields | array | Use this parameter to generate skipped error messages when optional extracted fields are null. For example, if a missing broker's email address doesn't greatly affect the quality of your extraction, then write a condition to verify broker.email is properly formatted, but specify ["broker\\.email"] in this parameter to skip the verification if the email is null. For an example, see Validation 3 in the Examples section. |
| condition (required) | JsonLogic object | Tests an extracted field or fields with a condition that evaluates to true or false. Write the condition using Boolean, logic, numeric, array, string, and other JsonLogic operations. |
| scope | array of section or subsection IDs | To create a validation for a field contained in sections, specify the field's parent section ID in this parameter. For example, ["section_1"]. To validate a field in nested sections, specify the scope in descending order. For example, to validate field_a in child_section_2 that has parent parent_section_1, specify ["parent_section_1", "child_section_2"] in this parameter. Sensible applies the validation to each instance of field_a in each instance of child_section_2.For an example, see Validations 6. |
Examples
Say that you have a document type for scanned sales quotes, called "sales_quotes", with configs for
- company_A
- company_B
- company_C
You test sales quote extractions from all the companies with the following validations:
Validation 1
- Description: If OCR'd, the source text for quoted rate value is a high-quality, unblurred image.
- Severity: warning
- Condition:
{
"or": [
{
"!": [
{
"exists": [
{
"var": "quote_rate.valueConfidence"
}
]
}
]
},
{
"and": [
{
">=": [
{
"var": "quote_rate.valueConfidence"
},
"0.90"
]
},
{
">=": [
{
"var": "quote_rate.anchorConfidence"
},
"0.90"
]
}
]
}
]
}Notes: Since some sales quotes for company_A are scanned documents, check if the field came from OCR'd text. If it was OCR'd (confidence score is not null), then test that it has a high OCR confidence score for both the anchor text and the extracted value text. This validation requires that you set a high verbosity setting in the SenseML configuration.
Validation 2
-
Description: The quoted rate value isn't null
-
Severity: error
-
Condition:
{"exists":[{"var":"quote_rate.value"}]}
Notes: Tests that a field (quote_rate) isn't null using the Sensible exists operation.
Validation 3
- Description: The quote duration is a round number
- Severity: warning
- Condition:
{"==":[{"%":[{"var":"quote_duration.value"},2]},0]}
Notes: Retrieves the value of an extracted quote_duration field using the JsonLogic var operation, then uses the JsonLogic modulo operation (%) to divide the rate by 2 and passes the test if the remainder equals ("==") 0.
Validation 4
- Description: Broker's email is in string@string format
- Severity: warning
- Prerequisite fields:
["broker\\.email"] - Condition:
{"match":[{"var":"broker\\.email.value"},"^\\S+\\@\\S+$"]}
Notes: If broker.email isn't null, then uses a Sensible operation (match) to test that the email matches a regular expression. If broker.email is null, skips this condition.
Validation 5
-
Description: The zip code is valid for USA or CA
-
Severity: warning
-
Condition:
{"or":[
{"and":[
{"==":[{"var":"country.value"},"US"]},
{"match":[{ "var":"zip_code.value" },"^[0-9]{5}$"]}]},
{"and":[
{"==":[{"var":"country.value"},"CA"]},
{"match":[{"var":"zip_code.value"},"^[A-Z][0-9][A-Z] [0-9][A-Z][0-9]$"]}]}
]} Notes: Tests that the zip_code is a 5-digit number if the country field equals USA, or 6 alphanumeric characters if the country field equals Canada. Uses a Sensible operation (match) to test regular expressions.
Validations output
For example output of the preceding conditions, see the following extraction excerpt and validation output:
Extraction excerpt
{
"parsed_document": {
"quote_rate": {
"source": "$800",
"value": 800,
"unit": "$",
"type": "currency"
},
"quote_duration": {
"type": "number",
"value": "6"
},
"broker.email": null,
"country": {
"type": "string",
"value": "USA"
},
"zip_code": {
"type": "number",
"value": "12345678901234456"
}
}Validations output
For the preceding extraction excerpt, Sensible outputs the following validations:
-
Validation 4: Sensible skips the broker email because the prerequisite field
broker.emailis null -
Validation 5: fails because
zip_codeis 17 digits
{
"validations": [{
"description": "The zip code is valid for USA or CA",
"severity": "warning"
}, {
"description": "Broker's email is in string@string format",
"severity": "skipped",
"message": "Missing prerequisites: broker.email"
}],
"validation_summary": {
"fields": 5,
"fields_present": 4,
"errors": 0,
"warnings": 1,
"skipped": 1
}
}
}Validation 6
The following example shows how to validate a field in sections.
- Description: engine spec exists.
- Severity: warning
- Condition:
{"exists":[{"var":"engine"}]} - Scope
["car_models","trim_specs"]
Config
The following example uses the config from Table grid example.
Example document
| Example document | Download link |
|---|
Output
The output for this example returns nulls for the engine field for the SV trim and the EX trim:
The two nulls trigger two validation warnings. Each warning reports the location of the null output:
For example, in the first failed validation, the "car_models",0... "trim_specs, 1" indices indicate null output for the engine field in the car_model[0] (Nissan) section's trim_spec[1] (SV trim) subsection. The second instance, "car_models", 1..."trim_specs", 0 indicates a null engine output for the EX trim subsection.
Notes
Alternative for dynamic validations
Use validations for static checks against fixed conditions. To compare extracted data against values that vary per request, for example, figures from an upstream system or context from a prior pipeline step, use the Extra Data method as an alternative:
-
Pass the expected values into the API request body using the
extra_datarequest body parameter. -
At extraction time, use the Extra Data computed field method to dynamically make the values available to your config.
-
Then compare or transform your extracted data to the expected data using logic-based or LLM-based computed field methods.
For more information, see the Extra Data computed field method.
Updated 10 days ago