Sensible

The Sensible Developer Hub

Welcome to the Sensible developer hub. You'll find comprehensive guides and documentation to help you start working with Sensible as quickly as possible, as well as support if you get stuck. Let's jump right in!


Quickstart

Get structured data from an auto insurance quote

Let's get started with Sensible! In this quickstart you'll learn SenseML, a JSON-formatted query language for extracting information from PDFs. SenseML is powered by a mix of techniques, including machine learning, heuristics, and rules.

If you can write basic SQL queries, you can write SenseML queries! SenseML shields you from the underlying complexities of PDFs, so you can write queries that are visually and logically clear to a human programmer.

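In this quickstart, you'll get an account, create a config that extracts data from an example PDF, sanity test the config against a second PDF, and integrate with your application using the Sensible API.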

Get an account

  1. You'll need an account. Get one at Sensible.so. If you don't have an account, you can still read along to get a rough idea of how things work.

  2. Log into the Sensible app at app.sensible.so using your API key.

Create a config

  1. Click Create document type and name it "auto_insurance_quote". Leave the defaults and click Create.

  2. Download the following PDF document:

    auto_insurance_anyco_golden (download link)
  3. Click Upload document and choose the generic car insurance quote you just downloaded.

  4. Click Create configuration, name it "anyco" (for the fictional company providing the quote), and click Create.

  5. Click the configuration name to edit the configuration:

When the configuration opens, you see an empty config pane on the left, the PDF in the middle, and an empty output pane on the right:

  1. For this quickstart, let's extract only a few pieces of information:

    • the policy number
    • the policy period
    • the premium for comprehensive insurance
  2. Paste this config into the left pane of the editor to extract the data:

    {
    "fields": [
      {
        "id": "policy_period",
        "anchor": "policy period",
        "method": {
          "id": "label",
          "position": "right"
        }
      },
      {
        "id": "comprehensive_premium",
        "anchor": "comprehensive",
        "type": "currency",
        "method": {
          "id": "row",
          "tiebreaker": "second"
        }
      },
      {
        "id": "policy_number",
        "anchor": {
          "match": [
            {
              "text": "policy number",
              "type": "startsWith"
            }
          ]
        },
        "method": {
          "id": "box"
        }
      }
    ]
    }
    
  3. Click Publish to publish the config.
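
If you keep configs in your own files before pasting them into the editor, a quick local check can catch JSON slips such as trailing commas. The following is a minimal sketch, not part of Sensible: the `check_config` helper and the `REQUIRED_KEYS` list are hypothetical, and the list only reflects the keys that every field in this quickstart's config happens to have.

```python
import json

# Assumption: the keys that every field in this quickstart's config has.
# Other SenseML fields may be shaped differently.
REQUIRED_KEYS = ("id", "anchor", "method")

def check_config(config_text: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    problems = []
    for index, field in enumerate(config.get("fields", [])):
        for key in REQUIRED_KEYS:
            if key not in field:
                problems.append(f"field {index} is missing '{key}'")
    return problems

good = '{"fields": [{"id": "policy_number", "anchor": "policy number", "method": {"id": "box"}}]}'
bad = '{"fields": [{"id": "policy_number", "anchor": "policy number", "method": {"id": "box",}}]}'

print(check_config(good))  # an empty list: no problems found
print(check_config(bad))   # one problem: the trailing comma makes the JSON invalid
```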

The following image shows this example in the Sensible app:

You should see the following extracted data in the right pane:

{
  "policy_number": {
    "type": "string",
    "value": "123456789"
  },
  "policy_period": {
    "type": "string",
    "value": " April 14, 2021 - Oct 14, 2021"
  },
  "comprehensive_premium": {
    "source": "$150",
    "value": 150,
    "unit": "$",
    "type": "currency"
  }
}

Congratulations! You've just created your first config! If you want to process car insurance quotes generated by a different company, you can create a new config and upload a new "golden" PDF.
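
Once you have this output, your application can read it like any other JSON. Here's a minimal sketch in Python, using the output shown above. The `simplify` helper is a hypothetical name for illustration, and trimming the leading space from the policy period is our own choice, not something Sensible does for you.

```python
import json

# Output copied from the right pane of the Sensible app in this quickstart.
extracted_json = """
{
  "policy_number": {"type": "string", "value": "123456789"},
  "policy_period": {"type": "string", "value": " April 14, 2021 - Oct 14, 2021"},
  "comprehensive_premium": {"source": "$150", "value": 150, "unit": "$", "type": "currency"}
}
"""

def simplify(extracted: dict) -> dict:
    """Map each field id to its bare value, trimming whitespace on strings."""
    simple = {}
    for field_id, field in extracted.items():
        value = field["value"]
        if isinstance(value, str):
            value = value.strip()
        simple[field_id] = value
    return simple

result = simplify(json.loads(extracted_json))
print(result)
```

Notice that the currency field already carries a numeric value (150) alongside the original source text ("$150"), so you don't need to parse the dollar sign yourself.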

How it works

  • Each "field" is a basic query unit in Sensible, and the field id is used as the key in the key/value JSON output. For more information, see Field.

  • Sensible searches first for a text "anchor" because it's a computationally quick and inexpensive way to narrow down the location in the document where you want to extract data. Then, Sensible uses a "method" to expand out from the anchor and grab the data you want. For more information about defining complex anchors, see Anchor. This config uses three types of methods: Label, Row, and Box.

How it works: label method

To grab the policy period from this text:

The config uses the Label method:

      {
        "id": "policy_period",
        "anchor": "policy period",
        "method": {
          "id": "label",
          "position": "right"
        }
      }

This tells SenseML that:

  • The anchor ("policy period") is text that is pretty close to the text we want to grab, so it can serve as a "label" for that text ("id": "label").
  • SenseML should grab the text to the right of the label ("position": "right").

This config returns:

    "policy_period": {
      "type": "string",
      "value": " April 14, 2021 - Oct 14, 2021"
     }   

You can grab text to the right, left, above, or below a label. For example, how would you use a label to grab the driver's name? Try it out.

Key concept: lines

See those gray boxes around the text in the following image?

Each gray box shows the boundaries of a "line." Sensible recognizes lines using whitespace (and other factors), so multiple "lines" can occupy the same x-axis.

The Label method can operate within a single line, or across multiple lines. Given that, let's ask this question: could we use the Label method to grab a line in the following image?

For example, could we use "Bodily injury" as the anchor text and return "$25,000 each" for an insurance limit? Would something like the following config work?

{
    "id": "doesnt_work_returns_null",
    "anchor": "bodily injury",
    "method": {
        "id": "label",
        "position": "right"
    }
}

Try it, and you'll see it doesn't work.

This is because the Label method works only for closely proximate lines (sensitive to spacing and font size), or for text in the same line. In this case, the problem is that the gap between the two lines of text is too big:

Let's take a look at a purpose-built Row method instead to grab text in a table.

How it works: row method

To grab this comprehensive premium of $150:

This config uses the Row method:

      {
          "id": "comprehensive_premium",
          "anchor": "comprehensive",
          "type": "currency",
          "method": {
              "id": "row",
              "tiebreaker": "second"
          }
      }

This tells Sensible that:

  • The anchor text ("comprehensive") is part of a row in a table ("id": "row").
  • The returned value is a currency ("type": "currency"). For other data types you can define, see Field query object.
  • Sensible should grab the second line in the row after the anchor ("tiebreaker": "second"). The tiebreaker lets you select which line in the row you want and can include maximums and minimums (< and >).
  • It's not shown, but the default behavior is to grab lines to the right of the anchor in the row ("position":"right").

This returns:

    "comprehensive_premium": {
      "source": "$150",
      "value": 150,
      "unit": "$",
      "type": "currency"
    }

But wait! Why didn't "tiebreaker": "second" select $250 instead of $150, since $250 is the second line after the anchor (the first line is just a bunch of dots, "............")?

The reason is that "tiebreaker": "second" evaluates after the datatype we set in the field, "type": "currency". So, instead of looking for the second line after the anchor in general, it looks for the second line that contains a currency. Convenient, right? There are two such lines, $250 and $150, and $150 is the second one.
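
To make the ordering concrete, here's a toy model in Python of "filter by type first, then apply the tiebreaker." It's only an illustration of the behavior described above, not Sensible's implementation: the `CURRENCY` pattern and the `second_currency_line` helper are simplified stand-ins we made up.

```python
import re

# Simplified stand-in for the currency type: values such as "$150" or "$1,250.50".
CURRENCY = re.compile(r"^\$\d[\d,]*(\.\d+)?$")

def second_currency_line(lines_after_anchor):
    """Keep only currency lines, then pick the second one (or None if there isn't one)."""
    currency_lines = [line for line in lines_after_anchor if CURRENCY.match(line)]
    return currency_lines[1] if len(currency_lines) >= 2 else None

# Lines to the right of "Comprehensive" in the row, as described above.
row = ["............", "$250", "$150"]

print(row[1])                     # second line overall: $250
print(second_currency_line(row))  # second line that is a currency: $150
```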

Key concept: visualize anchors and matches

In the app, you can visually inspect anchors and methods by looking at their color coding:

  • Orange boxes show lines matched by the Anchor object.
  • Blue boxes show lines matched by the Method object.

To continue the Row method example from the previous section, in the following image the orange box shows that "Comprehensive" is the anchor line:

Why are all the lines after "Comprehensive" colored blue, when the previous example output only included one line, $150?

The answer is that the Sensible app shows you the entire scope of the method match, not just what the method outputs. So the Row method matches all the elements in the row after the anchor, but then narrows down the actual output to $150 using "tiebreaker": "second". Seeing the entire method match in the app can help you troubleshoot unexpected output.

How it works: box method

To grab the policy number from this box:

The config uses the Box method:

      {
        "id": "policy_number",
        "anchor": {
          "match": [
            {
              "text": "policy number",
              "type": "startsWith"
            }
          ]
        },
        "method": {
          "id": "box"
        }
      }

This tells Sensible the following:

  • The anchor is inside a box ("id": "box").

  • The anchor text is "policy number". The anchor is a little more complex than in previous examples, because it also defines a match type ("type": "startsWith"). You can write a simpler anchor as "anchor":"policy number", or you can expand to complex anchors. For more information, see Anchor object.

This returns:

  {
      "policy_number": {
          "type": "string",
          "value": "123456789"
      }
  }

Note: Sensible grabs the box contents, but not the anchor itself. In general, Sensible returns method results, not anchor results (unless you define a Passthrough method). Similarly, most Sensible methods ignore the anchor line (the line containing the anchor text) and do not include it in the output.

Advanced queries

You can get more advanced with this auto insurance config. For example:

  • You can use a Column method to return all the listed premiums ($90, $15, $130).
  • The limits listed in the table (for example, "$25,000 each person/$50,000 per accident") are tricky for the Row method to capture since they can span a variable number of lines. Row methods depend on strict x-axis alignment of lines, so you'd only be able to grab the first line. Instead, you can use the Table method to more reliably capture the data in each cell of the whole table. Or, use an xRangeFilter parameter in the Document Range method to capture the limits.
  • What if the document listed multiple emails, and you just wanted to capture all those emails? You could use a regular expression (regex) in your anchor coupled with a Passthrough method, or the Regex method.

We'll save these and other techniques for a later tutorial! To check out other methods, see Methods.

Sanity test the config

Before integrating the config with an application and writing tests against it, let's sanity test the config by uploading another quote.

  1. Repeat the steps in the previous section to upload a second generic car insurance quote:

    auto_insurance_anyco_golden_2 (download link)
  2. Click the anyco config, select the "auto_insurance_anyco_golden_2" PDF, and look at the output:

    Uh oh! It looks like this policy period spills over onto the next line, so Sensible misses the end year (2021). That seems like sloppy PDF formatting, but let's work with it.

How can you capture the policy period reliably?

Document Range method

As you become more familiar with Sensible, your first impulse might be to use the Document Range method, which grabs multiple lines of text, like paragraphs, after an anchor. In this case, the PDF doesn't fit neatly into the Document Range method, because the first line we want is also part of the anchor (the orange box). As a result, the Document Range method leaves out the first line of the period and only grabs the year in the method match (the blue box):

There's a workaround: tell the Document Range method to include the anchor line, then filter the unwanted anchor text (the words "Policy period") out of the result. Try it out by replacing your existing policy_period field with this example:

    {
      "id": "policy_period",
      "anchor": "policy period",
      "method": {
        "id": "documentRange",
        "includeAnchor": true,
        "wordFilters": [
          "policy period"
        ],
        "stop": {
          "text": "for customer",
          "type": "startsWith"
        }
      }
    }
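
If it helps to see the idea outside the app, here's a toy model in Python of what this field asks for: collect lines from the anchor until the stop text, optionally keep the anchor line, then filter out unwanted words. This is a sketch for intuition only, not how Sensible works internally. The `document_range` function, its parameters, and the three example lines are hypothetical, and case-insensitive matching is an assumption based on the lowercase anchor matching "Policy period" in this quickstart.

```python
import re

def document_range(lines, anchor, stop, include_anchor=False, word_filters=()):
    """Toy model: join the lines after the anchor line, up to a line starting with `stop`."""
    lowered = [line.lower() for line in lines]
    # Find the first line containing the anchor text (raises StopIteration if absent).
    start = next(i for i, text in enumerate(lowered) if anchor in text)
    collected = [lines[start]] if include_anchor else []
    for line, text in zip(lines[start + 1:], lowered[start + 1:]):
        if text.startswith(stop):
            break
        collected.append(line)
    joined = " ".join(collected)
    # Remove each filtered phrase, ignoring case.
    for phrase in word_filters:
        joined = re.sub(re.escape(phrase), "", joined, flags=re.IGNORECASE)
    return joined.strip()

# Hypothetical lines standing in for the second PDF, where the period wraps to a new line.
lines = ["Policy period April 14, 2021 - Oct 14,", "2021", "For customer service"]

print(document_range(lines, "policy period", "for customer"))
print(document_range(lines, "policy period", "for customer",
                     include_anchor=True, word_filters=["policy period"]))
```

The first call leaves out the anchor line and returns only the year, which mirrors the problem described above. The second call keeps the anchor line and filters out the label text.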

Region method

There are multiple ways to capture the policy period. Let's explore a completely new approach: a Region method. A region is a rectangular space defined by coordinates relative to the anchor.

Replace your existing policy_period field with the following field in the editor:

    {
      "id": "policy_period",
      "anchor": {
        "match": [
          {
            "text": "policy period",
            "type": "startsWith"
          }
        ]
      },
      "method": {
        "id": "region",
        "offsetX": -0.1,
        "offsetY": -0.1,
        "width": 3.5,
        "height": 0.45,
        "start": "left",
        "wordFilters": [
          "policy period"
        ]
      }
    }

This field defines a region in inches relative to the anchor. Since the region overlaps the anchor, it uses wordFilters to remove the anchor text in the output. See the green box in the editor? This box dynamically resizes as you adjust the region parameters (such as the Height and Start parameters), so you can visually tweak the region till you're satisfied.

Let's double check that this region also works with our first PDF:

Yes, it does. If you're feeling picky, try resizing the region using the green box for visual feedback, until the lower edge of the box doesn't overlap the customer service line in the first PDF (auto_insurance_anyco_golden.pdf). But even if you don't fiddle with the region size, you can rest easy that you won't accidentally capture the customer service line. This is because the Region method only captures lines that are completely contained in the region.
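
The "completely contained" rule is easy to picture as a bounding-box check. Here's a small Python sketch of that idea. Only the region's width (3.5) and height (0.45) come from the config above. The `Box` class, the `contains` helper, and all the positions are made-up values in inches, measured from the top-left corner of the page, chosen purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """A rectangle in inches; y grows downward from the top of the page."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height

def contains(region: Box, line: Box) -> bool:
    """True only if the line lies completely inside the region."""
    return (region.left <= line.left and region.top <= line.top
            and line.right <= region.right and line.bottom <= region.bottom)

# Hypothetical page positions for the region and three lines of text.
region = Box(left=0.9, top=1.9, width=3.5, height=0.45)
page_lines = {
    "April 14, 2021 - Oct 14,": Box(1.0, 1.95, 3.0, 0.15),
    "2021": Box(1.0, 2.15, 0.4, 0.15),
    "For customer service": Box(1.0, 2.30, 3.0, 0.15),  # overlaps the region's lower edge
}

captured = [text for text, box in page_lines.items() if contains(region, box)]
print(captured)  # the customer service line is left out
```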

  1. Click Publish to save your changes to the config.

In a production scenario, continue testing PDFs until you're confident your configs will work with the PDF document type you've defined.

Integrate with your application

When you're satisfied with your config, use the Sensible API to integrate with your application.

Test the extraction API

First, make sure you can extract data using the config and PDF you created in previous steps:

  1. In your local file system, locate the example generic car insurance quote you downloaded in a previous step. Then use a request like the following to extract data:
curl --request POST \
  --url https://api.sensible.so/v0/extract/auto_insurance_quote \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/pdf' \
  --data-binary '@/PATH_TO_DOWNLOADED_PDF/auto_insurance_anyco_golden.pdf'
  2. Important! Remember to click Publish in the Sensible app to publish your config, or this request won't work.

  3. For an easy way to run this cURL request, download the Postman desktop app.

  4. Copy the previous code sample. Replace YOUR_API_KEY with your API key.

  5. In the Postman desktop app, click Import, select Raw text, and paste in the code sample.

  6. Correct the path to your downloaded PDF:

    • If you're in the command line: Replace PATH_TO_DOWNLOADED_PDF with the local directory path to the PDF.
    • If you're in Postman: In the request, click the Body tab, select binary, then click Select file and select your PDF.

  7. Click Send, and you should see a response like this:
{
    "id": "b0ac180d-55d2-4946-80c0-a87243319746",
    "created": "2021-05-20T18:02:37.019Z",
    "status": "COMPLETE",
    "type": "auto_insurance_quote",
    "parsed_document": {
        "policy_number": {
            "type": "string",
            "value": "123456789"
        },
        "policy_period": {
            "type": "string",
            "value": " April 14, 2021 - Oct 14, 2021"
        },
        "comprehensive_premium": {
            "source": "$150",
            "value": 150,
            "unit": "$",
            "type": "currency"
        }
    }
}

Note: You don't have to specify the config you created (anyco) in this call. Sensible looks at all the configs for the document type you made in this quickstart (auto_insurance_quote), and automatically chooses the one that fits best!
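
If you'd rather call the API from code than from cURL or Postman, the same request can be sketched with Python's standard library. The URL and headers below are copied from the cURL sample above. The function names (`build_extract_request`, `parsed_values`, `extract`) are hypothetical, and treating any status other than "COMPLETE" as an error is an assumption based on the example response, not a documented rule.

```python
import json
import urllib.request

# Same endpoint as the cURL sample in this quickstart.
EXTRACT_URL = "https://api.sensible.so/v0/extract/auto_insurance_quote"

def build_extract_request(api_key: str, pdf_bytes: bytes) -> urllib.request.Request:
    """Build the POST request with the same headers as the cURL sample."""
    return urllib.request.Request(
        EXTRACT_URL,
        data=pdf_bytes,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/pdf",
        },
    )

def parsed_values(response_text: str) -> dict:
    """Map each field id in parsed_document to its value."""
    body = json.loads(response_text)
    # Assumption: a finished extraction reports "COMPLETE", as in the example response.
    if body.get("status") != "COMPLETE":
        raise RuntimeError(f"extraction not complete: {body.get('status')}")
    return {field_id: field["value"] for field_id, field in body["parsed_document"].items()}

def extract(api_key: str, pdf_path: str) -> dict:
    """Send the PDF to the extraction endpoint and return the parsed values."""
    with open(pdf_path, "rb") as pdf_file:
        request = build_extract_request(api_key, pdf_file.read())
    with urllib.request.urlopen(request) as response:
        return parsed_values(response.read().decode("utf-8"))
```

To try it, call extract("YOUR_API_KEY", "/PATH_TO_DOWNLOADED_PDF/auto_insurance_anyco_golden.pdf") with your own key and path. As with the cURL request, it only works after you publish your config.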

Now you can use the Sensible API to generate upload and download URLs for multiple car insurance quote PDFs, retrieve results, and integrate with your application!

Updated 4 days ago

