Fingerprint

Fingerprints test for matching text in a document to determine whether the document is a good fit for a config. There are two types of fingerprints:

  • one for optimizing extraction performance for standalone documents
  • one for segmenting portfolio files into separate documents.

If you use a config for both portfolio and standalone versions of the same document, Sensible automatically converts between the two and uses the appropriate fingerprint.

| fingerprints for | notes |
| --- | --- |
| standalone documents | Improve performance by testing for matching text in a document before running or skipping a config in a given document type. By skipping configs that fail a fingerprint, you can save processing time. This is relevant if a config contains computationally expensive operations like selective OCR, table recognition, or box recognition methods. |
| | To test for matching text at the field level instead of the document type level, specify field fallbacks. For more information, see Field query object. |
| portfolios | A portfolio contains multiple documents combined into one file, such as an invoice, a contract, and a tax form. Sensible uses fingerprints to segment a portfolio into documents. Fingerprints test for matching text that characterizes first, last, or other pages for documents in the portfolio. For more information, see Multi-document extraction. |

Standalone documents

Parameters

A fingerprint consists of an array of tests, where each test is a string, a Match object, or an array of Match objects. For more information, see Match object.

Behind the scenes, Sensible automatically expands this simple syntax into the syntax for portfolio fingerprints, using "page": "any".
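As a rough illustration of that expansion, consider a standalone fingerprint with a single string test. The phrase "account summary" is a hypothetical placeholder:

```json
{
  "fingerprint": {
    "tests": [
      "account summary"
    ]
  }
}
```

Based on the portfolio parameters described later on this page, the equivalent portfolio-style test would plausibly look like the following. This is a sketch of the idea, not a documented representation of what Sensible generates internally:

```json
{
  "fingerprint": {
    "tests": [
      {
        "page": "any",
        "match": "account summary"
      }
    ]
  }
}
```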

Examples

The following fingerprint tests a vendor-specific config "anyco_life_insurance_quote" in a document type "life insurance quotes". This fingerprint tests that a document is a life insurance quote from Anyco by looking for three known key phrases.

{
  "fingerprint": {
    "tests": [
      {
        "type": "startsWith",
        "text": "anyco"
      },
      "[email protected]",
      "life insurance quote"
    ]
  },
  "fields": []
}

The config preferentially runs if the fingerprint finds the phrases.

Portfolios

Parameters

A fingerprint consists of an array of tests, where each test contains a Page parameter and a Match parameter:

{
  "fingerprint": {
    "tests": [
      {
        "page": "every",
        "match": [
          {
            "text": "this text always shows up on every page of the document",
            "type": "includes"
          }
        ]
      },
      {
        "page": "last",
        "match": [
          {
            "text": "this text always shows up on the last page of the document",
            "type": "startsWith"
          }
        ]
      }
    ]
  }
}

The following table shows parameters for each test:

| key | value | description |
| --- | --- | --- |
| match (required) | a string, a Match object, or an array of Match objects | Specifies the text to match for the test. |
| | | If you specify a Match array in this parameter, then Sensible must find all the matches in the array on the same page for the test to pass. |
| | | If you want to specify fallback matches for the same page type, specify the matches in separate tests. For example, a form has revisions 1 and 2 that have slightly different wordings on the first page. Specify one test with a first page type and wording A, and specify a second test with a first page type and wording B. |
| offset | integer | Specifies where to start or end the document segment, offset in pages relative to the first or last page defined by the Match parameter. For example, if you specify that the page that contains the phrase "A summary of your rights" is the first page of a segment, and Sensible finds a match for the first page on the zero-indexed page 3 of a portfolio: |
| | | - specifying "offset": -1 starts the document segment on page 2 of the portfolio. |
| | | - specifying "offset": 1 starts the document segment on page 4 of the portfolio. |
| page | first, last, every, any | Configure with the following enums: |
| | | first - The first page of a document segment must meet the match criteria. |
| | | last - The last page of a document segment must meet the match criteria. If you specify last, you must pair it with a different page type, such as every. |
| | | every - Every page in the document segment must meet the match criteria. If you define this page type, you must pair it with a different page type, such as last. |
| | | any - Any page in the document segment can meet the criteria. |
| | | Notes: |
| | | - For an example, see Multi-document extraction. |
| | | - If you reuse the same config between portfolios and standalone documents, then for standalone document extractions, Sensible ignores the configured value of this parameter. |
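To make the fallback behavior described for the Match parameter concrete, the following sketch defines two separate first page tests, one per form revision. The two wordings are hypothetical placeholders for "wording A" and "wording B":

```json
{
  "fingerprint": {
    "tests": [
      {
        "page": "first",
        "match": [
          {
            "text": "application for coverage, revision 1",
            "type": "includes"
          }
        ]
      },
      {
        "page": "first",
        "match": [
          {
            "text": "coverage application form, revision 2",
            "type": "includes"
          }
        ]
      }
    ]
  }
}
```

Similarly, the Offset example can be sketched as a first page test that matches the phrase "A summary of your rights" and starts the segment one page earlier. The placement of "offset" alongside "page" and "match" follows the parameter table; the lowercase phrase and the "includes" match type are assumptions for illustration:

```json
{
  "fingerprint": {
    "tests": [
      {
        "page": "first",
        "match": [
          {
            "text": "a summary of your rights",
            "type": "includes"
          }
        ],
        "offset": -1
      }
    ]
  }
}
```

With this test, a match on the zero-indexed page 3 of a portfolio would start the document segment on page 2.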

Tips

Use the following tips when you define fingerprints for portfolios:

  • If the first page contains unique text, Sensible recommends specifying solely a first page test.

  • If the first page doesn't contain unique text and the last page does, Sensible recommends specifying a last page test and an every page test.

  • Avoid specifying an any page test unless other page types fail to segment the document.
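As a minimal sketch of the first tip, a fingerprint for a document whose first page contains unique text could consist of a single first page test. The phrase "statement of account" is a hypothetical placeholder:

```json
{
  "fingerprint": {
    "tests": [
      {
        "page": "first",
        "match": "statement of account"
      }
    ]
  }
}
```

For the second tip, the example in the preceding Parameters section already shows the shape to follow: one every page test paired with one last page test.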

Examples

For an example of using fingerprints to extract multiple documents from a portfolio file, see Multi-document extraction.

Notes

For information about configuring fingerprint strictness for standalone documents, see Fingerprint mode.