Anchor object

An anchor is a string, Match object, or array of Match objects. Sensible searches first for a text "anchor" because it's a computationally quick and inexpensive way to narrow down the location in the document where you want to extract data. Then, Sensible can take action based on that location, such as using a "method" to expand out from the anchor and grab the data you want.

If you want to be syntactically concise, you can define a simple string anchor like:

{
  "fields": [
    {
      "id": "simple_label",
      "anchor": "this is a string to anchor on",
      "method": {
        "id": "label",
        "position": "below"
      }
    }
  ]
} 

String anchors are expanded behind the scenes to case-insensitive includes matches. The preceding example is expanded automatically as:

      "anchor": {
        "match": {
          "type": "includes",
          "text": "this is a string to anchor on"
        }
      },
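
The expansion rule above can be sketched in a few lines of Python. This only illustrates the documented behavior and is not Sensible's implementation; the helper name `expand_anchor` is hypothetical.

```python
def expand_anchor(anchor):
    """Return the expanded form of an anchor (hypothetical helper, not part of Sensible)."""
    if isinstance(anchor, str):
        # A bare string is shorthand for a case-insensitive "includes" match.
        return {"match": {"type": "includes", "text": anchor}}
    # Assumption: anything that is not a string is already an expanded Anchor object.
    return anchor


expanded = expand_anchor("this is a string to anchor on")
print(expanded)
```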

These are the top-level components of an expanded Anchor object:

Anchor parameters

key | values | description
match (required) | Match object or array of Match objects | See Match object.
start | string, Match, or Match array | Defines a point in the document at which to start searching for the match you define for the anchor. By default, the start line is not included in the anchor output. This parameter is useful if you want to limit your search to a specific section of a document. For example, you can define a start that matches a section heading. Note that the lines that most reliably "follow" the start line are those positioned below it. Lines to the left are not included, and lines to the right count as "following" only if they are at exactly the same height as the start line on the page. In other words, a line qualifies as "following" the start line first by its y-axis position and then by its x-axis position.
end | string, Match, or Match array | Defines a point in the document at which to stop searching for the match you define for the anchor. By default, the end line is not included in the anchor output. If unspecified, Sensible searches to the end of the document. Note that the lines that most reliably "precede" the end line are those positioned above it. Lines to the right are not included, and lines to the left count as "preceding" only if they are at exactly the same height as the end line on the page. In other words, a line qualifies as "preceding" the end line first by its y-axis position and then by its x-axis position.
includeEnd | boolean | Whether to include the text of the matching end line in the anchor output.
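
The ordering rule for start and end lines (y-axis first, then x-axis) can be sketched as a pair of predicates. This is a simplified illustration, not Sensible's implementation: the `Line` type, the `follows` and `precedes` names, the single-page scope, and the convention that `y` grows downward are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """A line of text with a position (hypothetical type for illustration)."""
    text: str
    x: float  # left edge of the line on the page
    y: float  # vertical position; assumed to grow downward


def follows(line, start):
    """True if `line` comes after `start`: below it, or level with it and to its right."""
    if line.y != start.y:
        return line.y > start.y
    # Exact equality mirrors "exactly the same height" in the description.
    return line.x > start.x


def precedes(line, end):
    """True if `line` comes before `end`: above it, or level with it and to its left."""
    if line.y != end.y:
        return line.y < end.y
    return line.x < end.x


heading = Line("Section heading", x=100.0, y=200.0)
print(follows(Line("Same row, to the right", x=300.0, y=200.0), heading))
print(follows(Line("Same row, to the left", x=10.0, y=200.0), heading))
```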

Examples

Here's an example of an Anchor object that uses all these parameters:

{
  "fields": [
    {
      "id": "simple_label",
      "anchor": {
        "start": "My section heading to start matching on",
        "end": "My footer text to stop matching on",
        "includeEnd": true,
        "match": [
          {
            "type": "includes",
            "text": "Only finds anchor if you match this string in a line that is between the start and end lines (best to ensure start is above and end is below the text)."
          }
        ]
      },
      "method": {
        "id": "label",
        "position": "below"
      }
    }
  ]
}
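
As a rough mental model, start, end, and includeEnd narrow the set of lines in which the match is searched. The sketch below makes several simplifying assumptions that are not from Sensible's documentation: lines are plain strings already in reading order, start and end are case-insensitive substring matches, a missing start line yields no results, and a missing end line means the window runs to the end of the document. The function name `search_window` is hypothetical.

```python
def search_window(lines, start=None, end=None, include_end=False):
    """Return the lines in which an anchor's match would be searched (illustrative only)."""

    def find(text, begin):
        # Index of the first line at or after `begin` that contains `text`, ignoring case.
        needle = text.lower()
        for i in range(begin, len(lines)):
            if needle in lines[i].lower():
                return i
        return None

    first = 0
    if start is not None:
        start_idx = find(start, 0)
        if start_idx is None:
            return []  # assumption: without a start line there is nothing to search
        first = start_idx + 1  # the start line itself is excluded

    last = len(lines)
    if end is not None:
        end_idx = find(end, first)
        if end_idx is not None:
            # The end line is kept only when include_end is set.
            last = end_idx + 1 if include_end else end_idx
    return lines[first:last]


doc = ["Intro", "My section heading", "Name: Ada", "Total", "My footer text", "Appendix"]
print(search_window(doc, start="section heading", end="footer text"))
```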

Match object

Matches are instructions for matching lines of text in a document. They are valid elements in anchors. There are three types of Match objects: simple, regex, and first. See the following sections:

Simple Match

Match using strings.

Parameters

key | values | description
text | string | The string to match.
type | equals, startsWith, endsWith, includes | equals: the line must match the string exactly; startsWith: match at the beginning of the line; endsWith: match at the end of the line; includes: match anywhere in the line.

Examples

{
  "fields": [
    {
      "id": "simple_label",
      "anchor": {
        "match": {
          "type": "startsWith",
          "text": "The line must start with this text"
        }
      },
      "method": {
        "id": "label",
        "position": "below"
      }
    }
  ]
} 
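
The four simple match types can be sketched as a small Python predicate. This is an illustration of the table above, not Sensible's code: the function name `simple_match` is hypothetical, "equals" is read as whole-line equality, and matching is assumed to be case-insensitive unless an `isCaseSensitive` flag is set.

```python
def simple_match(match, line):
    """Return True if `line` satisfies a simple Match object (illustrative only)."""
    text, candidate = match["text"], line
    # Assumption: case-insensitive by default, as with expanded string anchors.
    if not match.get("isCaseSensitive", False):
        text, candidate = text.lower(), candidate.lower()

    kind = match["type"]
    if kind == "equals":
        return candidate == text
    if kind == "startsWith":
        return candidate.startswith(text)
    if kind == "endsWith":
        return candidate.endswith(text)
    if kind == "includes":
        return text in candidate
    raise ValueError(f"unsupported match type: {kind}")


rule = {"type": "startsWith", "text": "The line must start with this text"}
print(simple_match(rule, "The line must start with this text, then anything"))
```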


Regex Match

Match using a regular expression.

Parameters

key | values | description
pattern (required) | valid JS regex | JavaScript-flavored regular expression. Capturing groups are not supported (see the Regex method instead). Note that you must double-escape backslashes, since the regex is contained in a JSON object (for example, \\s not \s to represent a whitespace character).
flags | JS-flavored regex flags | Flags to apply to the regex. For example: "i" for case-insensitive, "g", "m", etc.
type (required) | regex |

Examples

For an example, see the Passthrough method example.
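
To see why the backslash must be doubled, the following sketch parses a regex Match object from JSON text and applies it to a line. Python's `re` module stands in for a JavaScript regex engine here, which is an assumption that holds for this simple pattern but not for every JS regex; the sample pattern and line are made up for illustration, and only the "i" flag is handled.

```python
import json
import re

# The config as it appears in a JSON file: the backslash is doubled.
config_text = r'{"type": "regex", "pattern": "Policy\\s+number", "flags": "i"}'
match = json.loads(config_text)

# JSON decoding turns the doubled backslash into a single one,
# so the regex engine receives \s, the whitespace character class.
print(match["pattern"])

# Map the "i" flag to its Python equivalent (other flags are ignored in this sketch).
flags = re.IGNORECASE if "i" in match.get("flags", "") else 0
found = re.search(match["pattern"], "POLICY   NUMBER: 12345", flags)
print(found is not None)
```

A single backslash, as in "Policy\s+number", is not a valid JSON escape sequence, so a strict JSON parser rejects the config before the regex is ever evaluated.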

First match

This is a convenience match that matches the first line encountered. It is useful in conjunction with the pageFilter preprocessor.

Parameters

key | values | description
type | first | Matches the first line encountered, usually on a specified page.

Examples

This example grabs all the lines in the document after the first line:

{
  "fields": [
    {
      "id": "all_lines_in_doc",
      "anchor": {
        "match": {
          "type": "first"
        }
      },
      "method": {
        "id": "documentRange"
      }
    }
  ]
}

This example performs OCR only on a known page, which is useful since OCR is slow and computationally expensive. For example, some document types may have inserted scans on specific known pages, so you need to OCR only those pages and let Sensible perform direct text extraction on all other pages:

{
  "preprocessors": [
    {
      "type": "ocr",
      "match": {
        "type": "first"
      },
      "pageOffset": 7
    }
  ],
  "fields": [
    {
      "id": "some_key",
      "anchor": "some string",
      "method": {
        "id": "label",
        "position": "above"
      }
    }
  ]
}

Match arrays

For an array of Match objects, all matches must be found to successfully create an anchor. Each Match object must target a separate successive line. In other words, the second match starts its search in the line after the line matched by the first Match object in the array, and so on.

{
  "fields": [
    {
      "id": "simple_label",
      "anchor": {
        "start": "My section heading to start matching on",
        "end": "My footer text to stop matching on",
        "includeEnd": true,
        "match": [
          {
            "type": "includes",
            "text": "Only finds anchor if you match this string in a line..."
          },
          {
            "type": "startsWith",
            "text": "...Followed by the 1st occurrence of this string in another line"
          }
        ]
      },
      "method": {
        "id": "label",
        "position": "below"
      }
    }
  ]
}
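
The successive-line rule for Match arrays can be sketched as follows. This is a minimal illustration rather than Sensible's implementation: lines are plain strings in reading order, each Match object is represented by a Python predicate, the search is greedy with no backtracking, and the name `match_array` is hypothetical.

```python
def match_array(lines, matchers):
    """Return the index of the line matched by each matcher, or None if any matcher fails."""
    indices = []
    begin = 0
    for matcher in matchers:
        # Take the first line at or after `begin` that satisfies this matcher.
        found = next((i for i in range(begin, len(lines)) if matcher(lines[i])), None)
        if found is None:
            return None  # every match in the array must be found
        indices.append(found)
        begin = found + 1  # the next matcher starts on the line after this one
    return indices


doc = ["Summary", "Languages known", "Other notes", "Python: 5 years", "Python: hobby"]
matchers = [
    lambda line: "languages" in line.lower(),        # like an "includes" match
    lambda line: line.lower().startswith("python"),  # like a "startsWith" match
]
print(match_array(doc, matchers))  # [1, 3]
```

Because each search begins on the line after the previous match, two Match objects in the array can never be satisfied by the same line.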

Methods influence matches

In addition to the conditions you set in the Match object itself (such as isCaseSensitive), the Method type you select in the id parameter influences whether text qualifies for the anchor match.

In other words, if you set a Label method, then only text that qualifies as a label matches in the anchor. If the text is too far away from any other lines to be used as a label, it won't match, even if all the conditions you set in the Match object itself are otherwise met.

For example, suppose a PDF contains two instances of the string "Python". Even though we set "match": "last" in the config, the config only matches the first instance of "Python". Why? We set a label method, and only the first instance of "Python" is close enough to the text below it to qualify as a label for that text ("position": "below").

On the other hand, if we set the method to row, then both instances of "Python" qualify, and we successfully match the last instance.


