Handling document variations

Overview

When you extract data from a set of similar documents (a document type), you encounter variations in layout, formatting, and content. Typically, you want to extract data across all these variations and output it in a unified data schema. To achieve this goal, you can take the following steps:

1. You author data-extraction fields using various SenseML methods to extract the same target data from differing sources.
2. You conditionally execute the fields you author depending on the variations you encounter in the source documents.

This topic covers conditional execution.

Example

You extract data from bank statements. The statements all convey the same basic information, but vary in large and small ways:

- Each major bank has its own distinctive layout and formatting for its statements. Some combine checking and savings into one statement, and others separate them.
- Some affiliate banks have slightly different layouts but share many similarities.
- You want to extract from a long tail of small regional credit unions, and it would be overwhelming to account for all of their minor variations.

To create a unified output schema for these banks, you can conditionally execute data-extraction fields based on which bank issued the statement, whether a SenseML field returns null, and other factors.

From coarsest to finest granularity, here are the options for conditionally executing SenseML:

option	granularity	how it works	example use case	full example
configs	config	Sensible determines the best-fitting config for a document, based on either Sensible's default scoring or configurable "fingerprints" (characteristic text in the document).	In a `bank_statements` document type, you extract bank statements from Chase, Wells Fargo, Bank of America, and a long tail of small regional banks. 1. For each major bank, you author a config (a collection of data-extraction fields). The `wells_fargo_config` extracts data if the document contains the text "Wells Fargo", the `boa_config` extracts data if the document contains the text "Bank of America", and so on. 2. For the long-tail regional banks, you author a generalized, LLM-based fallback config that runs if the names of the major banks are absent from the document.	Import Sensible's out-of-the-box support for common forms, then browse the configs in each document type in the Sensible app
conditional execution	subset of fields in a config	Based on a pass/fail logical condition, Sensible executes alternate subsets of fields in a config.	You want to extract data from two affiliate banks' statements. The statements' layouts are so similar that you can reuse 90 percent of your SenseML fields to handle both. Rather than authoring two separate configs, you can handle the remaining 10 percent with conditional field execution.	Conditional execution
fallback fields	single field	If a field fails to extract data, Sensible falls back to another identically named field in a config. The fallback field uses an alternate extraction method.	You want to extract a `total_amount` field that appears in a table in document revision A and in a free-text paragraph in document revision B. You define two fields in one config with the same ID (`total_amount`), which use the Row method and the Query Group method, respectively.	Fallback fields
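
To make the three levels in the table concrete, here is a minimal conceptual sketch in Python. It is not SenseML and not Sensible's implementation: the function names (`select_config`, `run_fields`, `run_conditional`), the regular expressions, and the sample documents are all hypothetical and exist only for illustration. It models config selection by fingerprint text, conditional execution of alternate field subsets, and a fallback between two fields that share the `total_amount` ID.

```python
import re
from typing import Callable, Optional

# Hypothetical types for illustration only; these are not part of any Sensible API.
Extractor = Callable[[str], Optional[str]]
Field = tuple[str, Extractor]  # (field ID, extraction function)


def select_config(text: str, fingerprints: dict[str, list[str]], fallback: str) -> str:
    """Config level: pick the first config whose fingerprint strings all appear in the text."""
    lowered = text.lower()
    for name, strings in fingerprints.items():
        if all(s.lower() in lowered for s in strings):
            return name
    return fallback


def run_fields(text: str, fields: list[Field]) -> dict[str, Optional[str]]:
    """Field level: a later field with the same ID runs only if earlier ones returned None."""
    output: dict[str, Optional[str]] = {}
    for field_id, extract in fields:
        if output.get(field_id) is None:
            output[field_id] = extract(text)
    return output


def run_conditional(text: str, condition: Callable[[str], bool],
                    on_pass: list[Field], on_fail: list[Field]) -> dict[str, Optional[str]]:
    """Subset level: run one of two alternate field lists based on a pass/fail condition."""
    return run_fields(text, on_pass if condition(text) else on_fail)


def total_from_table(text: str) -> Optional[str]:
    # Assumed layout for revision A: a table row such as "Total | $1,250.00".
    m = re.search(r"^Total\s*\|\s*\$?([\d,.]+)", text, re.MULTILINE)
    return m.group(1) if m else None


def total_from_paragraph(text: str) -> Optional[str]:
    # Assumed layout for revision B: free text such as "total amount is $87.10".
    m = re.search(r"total amount (?:due )?is \$?([\d,]+\.\d{2})", text, re.IGNORECASE)
    return m.group(1) if m else None


FINGERPRINTS = {
    "wells_fargo_config": ["Wells Fargo"],
    "boa_config": ["Bank of America"],
}
# Two fields with the same ID: the second acts as the fallback for the first.
TOTAL_FIELDS: list[Field] = [
    ("total_amount", total_from_table),
    ("total_amount", total_from_paragraph),
]

rev_a = "Wells Fargo\nTotal | $1,250.00"
rev_b = "Acme Credit Union\nYour total amount is $87.10 this month."

for doc in (rev_a, rev_b):
    config = select_config(doc, FINGERPRINTS, fallback="regional_llm_config")
    print(config, run_fields(doc, TOTAL_FIELDS))
```

In this sketch, the order of the fields matters: the table-based extractor is tried first, and the paragraph-based extractor runs only when the first returns nothing. Both revisions therefore produce the same output key.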