Fingerprint
Fingerprints test for matching text in a document to determine whether it's a good fit for a config or not. There are two types of fingerprints:
- one for optimizing extraction performance for standalone documents
- one for segmenting PDF portfolios into separate documents.
If you use a config for both portfolio and standalone versions of the same document, Sensible automatically converts between the two and uses the appropriate fingerprint.
fingerprints for: | notes |
---|---|
standalone documents | Improve performance by testing for matching text in a document before running or skipping a config in a given document type. By skipping configs that fail a fingerprint, you can save processing time. This is relevant if a config contains computationally expensive operations like selective OCR, table recognition, or box recognition methods. |
portfolios | Segment PDF portfolios (multiple documents combined into one PDF) into standalone documents by testing for text that characterizes specified pages for documents in the portfolio. For more information, see Document portfolios. |
Standalone documents
Parameters
A fingerprint consists of an array of tests, where each test is a string, a Match object, or an array of Match objects. For more information, see Match object.
Behind the scenes, Sensible automatically expands this simple syntax to the syntax for portfolio fingerprints using "page": "any".
Examples
The following fingerprint tests a vendor-specific config "anyco_life_insurance_quote" in a document type "life insurance quotes". This fingerprint tests that a document is a life insurance quote from Anyco by looking for three known key phrases.
{
  "fingerprint": {
    "tests": [
      {
        "type": "startsWith",
        "text": "anyco"
      },
      "[email protected]",
      "life insurance quote"
    ]
  },
  "fields": []
}
The config preferentially runs if the fingerprint finds the phrases.
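As the Parameters section notes, Sensible expands this simple syntax to portfolio syntax using "page": "any". The following is a rough sketch of what an expanded equivalent could look like for the first and last tests in the preceding fingerprint. It assumes each simple test becomes the match value of a portfolio-style test. The exact form Sensible produces internally isn't shown in this topic.

```json
{
  "fingerprint": {
    "tests": [
      {
        "page": "any",
        "match": {
          "type": "startsWith",
          "text": "anyco"
        }
      },
      {
        "page": "any",
        "match": "life insurance quote"
      }
    ]
  },
  "fields": []
}
```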
Notes
A fingerprint for standalone documents changes Sensible's default behavior of running all the configs in a single document type. For example, if you extract company A and company B quotes, by default Sensible runs both the company A and the company B configs for a given document, then returns the extraction with the highest score.
The following table shows how this default behavior changes when you configure the following levels of strictness for a document type's fingerprints. You can configure strictness in the Sensible app in the document type settings:
Strictness level | Description | If more than one config passes over 50% of its tests | If no config passes over 50% of its tests, or if no configs contain a fingerprint |
---|---|---|---|
standard | If any of the configs in the document type contain a fingerprint, then Sensible runs extractions using any configs that pass over 50% of the fingerprint tests. | Sensible chooses the output from the passing config with the highest score (highest number of non-null fields minus penalties for validation errors or warnings). | Sensible falls back to the default behavior of running extractions for the document using all configurations, and returns the one that has the highest score. |
strict | The doc type must have at least one config containing a fingerprint. | Sensible chooses the output from the passing config that has the highest score. | Sensible returns a 400 error. |
Portfolios
Parameters
A fingerprint consists of an array of tests. The following table shows parameters for each test:
key | value | description |
---|---|---|
match (required) | a string, a Match object, or an array of Match objects | Specifies the text to match for the test. |
offset | integer | Specifies where to start or end the document segment, offset in pages relative to the first or last page defined by the Match parameter. For example, suppose you specify that the page that contains the phrase "A summary of your rights" is the first page of a segment, and Sensible finds a match for the first page on the zero-indexed page 3 of a portfolio. In that case, "offset": -1 starts the document segment on page 2 of the portfolio, and "offset": 1 starts the document segment on page 4 of the portfolio. |
page | first, last, every, any | For PDF portfolios (multiple documents combined into one PDF, such as an invoice, a contract, and a tax form), tests for document starts and ends to segment the portfolio into documents. Sensible discards orphaned last matches. In other words, if you specify last, then Sensible must find at least one other fingerprint of a different page type preceding the last match in order to recognize the document. For more information, see Document portfolios. If you reuse the same config between portfolios and standalone documents, then for standalone document extractions, Sensible ignores the configured value of this parameter and treats it as "page": "any". This way, Sensible avoids strictly matching to extraneous front or back matter (for example, a fax cover page) in single documents. |
Examples
For an example of using fingerprints to extract multiple documents combined into one PDF portfolio, see Document portfolios.
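As a minimal illustration of the parameters in the preceding table, the following sketch marks the page containing "A summary of your rights" as the start of a segment, shifted back one page by the offset, and marks a closing page with a second test. The closing phrase is a hypothetical placeholder, and the pairing of tests is an assumption for illustration rather than a recommended configuration.

```json
{
  "fingerprint": {
    "tests": [
      {
        "page": "first",
        "match": "A summary of your rights",
        "offset": -1
      },
      {
        "page": "last",
        "match": "hypothetical closing phrase"
      }
    ]
  },
  "fields": []
}
```

With these tests, if the first-page phrase matches on zero-indexed page 3 of a portfolio, the segment would start on page 2, following the offset description in the table.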