Getting started

Get structured data from an auto insurance quote

Let's get started with Sensible! In this tutorial, you'll learn SenseML, a JSON-formatted query language for extracting structured data from PDFs. SenseML uses a mix of techniques, including machine learning, heuristics, and rules.

If you can write basic SQL queries, you can write SenseML queries. SenseML shields you from the underlying complexities of PDFs, so you can write queries that are visually and logically clear to a human programmer.

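In this tutorial, you'll get an account, create a config, extract data, test the config, integrate with your application, and validate extractions in production.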

Get an account

  1. Get an account at sensible.so. If you don't have an account, you can still read along to get a rough idea of how things work.

  2. Log into the Sensible app at app.sensible.so using your API key.

Create a config

  1. Click Create document type and name it "auto_insurance_quote." Leave the defaults and click Create.


  2. Download the following PDF document:

    auto_insurance_anyco (download link)
  3. Click Upload document and choose the generic auto_insurance_anyco car insurance quote you just downloaded.

  4. Click Create configuration, name it "anyco" (for the fictional company providing the quote), and click Create.

  5. Click the anyco configuration to edit it. When the configuration opens, you see an empty config pane on the left, the PDF in the middle, and an empty output pane on the right:

[Image]

Extract data

For this tutorial, you'll extract these fields:

  • the policy number
  • the policy period
  • a couple of premiums
  1. Paste this config into the left pane in the editor to extract the data:
{
  "fields": [
    {
      "id": "policy_period",
      "anchor": "policy period",
      "method": {
        "id": "label",
        "position": "right"
      }
    },
    {
      "id": "comprehensive_premium",
      "anchor": "comprehensive",
      "type": "currency",
      "method": {
        "id": "row",
        "tiebreaker": "second"
      }
    },
    {
      "id": "property_liability_premium",
      "anchor": "property",
      "type": "currency",
      "method": {
        "id": "row",
        "tiebreaker": "second"
      }
    },
    {
      "id": "policy_number",
      "type": "string",
      "anchor": {
        "match":
          {
            "text": "policy number",
            "type": "startsWith"
          }
      },
      "method": {
        "id": "box"
      }
    }
  ]
}
  2. Click Publish and choose Production to publish the config.

The following image shows this example in the Sensible app:

[Image]

You should see the following extracted data in the right pane:

{
  "policy_period": {
    "type": "string",
    "value": "April 14, 2021 - Oct 14, 2021"
  },
  "comprehensive_premium": {
    "source": "$150",
    "value": 150,
    "unit": "$",
    "type": "currency"
  },
  "property_liability_premium": {
    "source": "$10",
    "value": 10,
    "unit": "$",
    "type": "currency"
  },
  "policy_number": {
    "value": "123456789",
    "type": "string"
  }
}

Congratulations! You created your first config. If you want to process car insurance quotes generated by a different company, you can create a new config and upload a new reference PDF.

How it works

  • Each "field" is a basic query unit in Sensible. Sensible uses the field id as the key in the key/value JSON output. For more information, see Field.
  • Sensible searches first for a text "anchor" because it's a computationally quick and inexpensive way to narrow down the location of the target data to extract. For more information about defining complex anchors, see Anchor.
  • Then, Sensible uses a "method" to expand out from the anchor and extract the data you want. For more information about methods, see Methods.
  • This config uses three types of methods: Label, Row, and Box. The following sections describe each one.

How it works: Label method

To extract the policy period from the document:
[Image]

The config uses the Label method:

      {
        "id": "policy_period",
        "anchor": "policy period",
        "method": {
          "id": "label",
          "position": "right"
        }
      }

This describes the data to extract:

  • The anchor ("policy period") is text that's pretty close to the text to extract, so it can serve as a "label" for that text ("id": "label").
  • The text to extract is to the right of the label ("position": "right").

This config returns:

    "policy_period": {
      "type": "string",
      "value": "April 14, 2021 - Oct 14, 2021"
    }

You can extract text to the right, left, above, or below a label. For example, how would you use a label to extract the driver's name? Try it out.

Key concept: lines

See those gray boxes around the text in the following image?

[Image]

Each gray box shows the boundaries for a "line." Sensible recognizes lines using whitespace and other factors, so several "lines" can occupy the same height on the page.

The Label method can operate within a single line or across consecutive lines. Here's a question: for the preceding image, can you use the Label method to anchor on "Bodily injury" and return "$25,000 each"? Try it out:

{
    "id": "doesnt_work_returns_null",
    "anchor": "bodily injury",
    "method": {
        "id": "label",
        "position": "right"
    }
}

This returns null, because the Label method works for text in the same line or in proximate lines, sensitive to spacing and font size. In this case, the problem is that the gap between the two lines of text is too big:

[Image]

Instead, take a look at the purpose-built Row method to extract text in a table.

How it works: Row method

To extract the comprehensive premium of $150:

[Image]

The config uses the Row method:

      {
          "id": "comprehensive_premium",
          "anchor": "comprehensive",
          "type": "currency",
          "method": {
              "id": "row",
              "tiebreaker": "second"
          }
      }

This describes the data to extract:

  • The anchor text ("comprehensive") is part of a row in a table ("id": "row").
  • The returned value is a currency ("type": "currency"). For other data types you can define, see Field query object.
  • The text to extract is the second line in the row after the anchor ("tiebreaker": "second"). Use tiebreakers to select lines in rows, for example the maximum and minimum values (> and <).
  • By default, the Row method extracts values to the right of the anchor. You can override the default by specifying ("position":"left").

This returns:

    "comprehensive_premium": {
      "source": "$150",
      "value": 150,
      "unit": "$",
      "type": "currency"
    }

But wait! Why didn't "tiebreaker": "second" select $250 instead of $150, since $250 is the second line after the anchor (the first line is just a bunch of dots, "............")?

The reason is that "tiebreaker": "second" evaluates after the data type specified in the field, "type": "currency". Instead of looking for the second line after the anchor in general, Sensible looks for the second line that contains a currency. Convenient, right?
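
As a rough illustration of that ordering, the following Python sketch filters a row's lines by type first and only then applies the tiebreaker. It models the idea only and is not Sensible's implementation; the `row_lines` values, the simplified currency regex, and the function name are assumptions made up for this example.

```python
import re

# Hypothetical lines to the right of the "Comprehensive" anchor, in reading order.
row_lines = ["............", "$250", "$150"]

# Simplified stand-in for the currency type: a dollar sign followed by digits.
CURRENCY = re.compile(r"^\$\d[\d,]*(\.\d+)?$")

def second_currency(lines):
    """Keep only the lines that look like currency, then return the second one."""
    currencies = [line for line in lines if CURRENCY.match(line)]
    return currencies[1] if len(currencies) > 1 else None

print(second_currency(row_lines))  # $150
```

The row of dots never reaches the tiebreaker because the type filter drops it first, which is why "second" lands on $150 rather than $250.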

Key concept: visualize anchors and matches

In the app, you can visually inspect anchors and methods by looking at their color coding:

  • Orange boxes show lines matched by the Anchor object.
  • Blue boxes show lines matched by the Method object.
  • Pale blue boxes show lines discarded by the Method object. Seeing the entire method match in the app can help you troubleshoot unexpected output.

To continue the Row method example from the previous section, in the following image the orange box shows that "Comprehensive" is the anchor line:

[Image]

The dark and pale blue boxes show you that the Row method matches all the lines in the row after the anchor, but then narrows down the actual output to $150 using "tiebreaker": "second".

How it works: Box method

To extract the policy number from this document:

[Image]

The config uses the Box method:

    {
      "id": "policy_number",
      "type": "string",
      "anchor": {
        "match": {
          "text": "policy number",
          "type": "startsWith"
        }
      },
      "method": {
        "id": "box"
      }
    }

This describes the data to extract:

  • The anchor is inside a box ("id": "box").
  • The anchor text is "policy number".
  • The anchor line is a little more complex than previous examples, because it also defines a match type ("type": "startsWith"). You can write a simpler string anchor as "anchor":"policy number", or you can expand to complex anchors. For more information, see Anchor object.

This returns:

  "policy_number": {
    "value": "123456789",
    "type": "string"
  }

Note: Sensible extracts the box contents, but not the anchor itself. By default, Sensible returns method results, not anchor results.

Advanced queries

You can get more advanced with this auto insurance config. For example:

  • You can use a Column method to return all the listed premiums ($90, $15, $130).
  • The limits listed in the table are tricky for the Row method to capture since they can be a variable number of lines. Row methods depend on strict horizontal alignment of lines, so Sensible extracts the first line. Instead, use the Table method to more reliably capture the data in each cell of the whole table. Or, use an xRangeFilter parameter in the Document Range method to capture the limits.
  • What if the document listed emails, and you just wanted to capture all those emails? You could use a regular expression (regex) in a "match":"all" anchor coupled with a Passthrough method, or the Regex method.
  • You can split the policy period into two dates, either by using the Split computed field method, or by setting the Date type on the field and using a tiebreaker.
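
If you would rather split the policy period in your own application code after extraction, a minimal Python sketch might look like the following. This is a client-side alternative, not the Split method or the Date type mentioned above; the function names are hypothetical, and the parsing assumes English month names and the " - " separator seen in this tutorial's output.

```python
from datetime import date, datetime
from typing import Tuple

def parse_date(text: str) -> date:
    """Parse 'April 14, 2021' or 'Oct 14, 2021' (full or abbreviated month)."""
    for fmt in ("%B %d, %Y", "%b %d, %Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def split_policy_period(value: str) -> Tuple[date, date]:
    """Split the extracted policy period string into start and end dates."""
    start, end = value.split(" - ")
    return parse_date(start), parse_date(end)

start, end = split_policy_period("April 14, 2021 - Oct 14, 2021")
print(start.isoformat(), end.isoformat())  # 2021-04-14 2021-10-14
```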

To check out other methods, see Methods.

Test the config

Before integrating the config with an application and writing validation tests against it, double check the config by uploading another quote.

  1. Repeat the steps in the previous section to upload a second generic car insurance quote:

    auto_insurance_anyco_2 (download link)
  2. Click the anyco config, select the "auto_insurance_anyco_2" PDF, and look at the output. Unlike the first document, the policy period takes up two lines, so Sensible misses the end year (2021):

    {
      "policy_period": {
        "type": "string",
        "value": "May 20, 2021 - Nov 20,"
      }
    }


That seems like sloppy PDF formatting, but let's work with it. There are several options for capturing the policy period reliably, including:

  • Document Range method
  • Region method

Alternative 1: Document Range method

You can use the Document Range method to extract the policy period. This method extracts succeeding lines of text after an anchor. You need to configure some optional parameters, because the Document Range method by default discards anchor lines. Since the date range is part of the anchor line (the line containing "policy period"), you need to:

  • include the anchor with "includeAnchor": true
  • filter out unwanted text in the anchor (the words "Policy period") with a Word Filters parameter.

Try it out by replacing your existing policy_period field with this example:

    {
      "id": "policy_period",
      "anchor": "policy period",
      "method": {
        "id": "documentRange",
        "includeAnchor": true,
        "wordFilters": [
          "policy period"
        ],
        "stop": {
          "text": "for customer",
          "type": "startsWith"
        }
      }
    }
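
To make the effect of these parameters concrete, here is a small Python sketch that mimics the same steps on plain strings: start at the anchor line, keep it, collect lines until the stop text, then filter out the unwanted words. It is only a model of the behavior described above, not how Sensible works internally. The sample `lines` (including the wording of the customer service line) and the function name are assumptions, and trimming the leftover colon is an extra step added for this example.

```python
import re

# Hypothetical lines from the second quote, in reading order.
lines = [
    "Policy period: May 20, 2021 - Nov 20,",
    "2021",
    "For customer service, contact Anyco",
]

def document_range(lines, anchor, stop, word_filters, include_anchor=True):
    """Collect text from the anchor line up to (not including) the stop line."""
    lowered = [line.lower() for line in lines]
    # Find the first line that contains the anchor text, ignoring case.
    start = next(i for i, line in enumerate(lowered) if anchor in line)
    first = start if include_anchor else start + 1
    collected = []
    for i in range(first, len(lines)):
        if lowered[i].startswith(stop):
            break
        collected.append(lines[i])
    text = " ".join(collected)
    # Remove the filtered words, ignoring case.
    for word in word_filters:
        text = re.sub(re.escape(word), "", text, flags=re.IGNORECASE)
    # Assumption: also trim the colon and spaces left behind by the label.
    return text.strip(" :")

print(document_range(lines, "policy period", "for customer", ["policy period"]))
# May 20, 2021 - Nov 20, 2021
```

Calling the function with `include_anchor=False` returns only "2021", which shows why the anchor line has to be included for this document.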

Alternative 2: Region method

You can use the Region method to extract the policy period. A region is a rectangular space defined by coordinates relative to the anchor.

Replace the existing policy_period field with the following field in the Sensible app:

    {
      "id": "policy_period",
      "anchor": {
        "match": [
          {
            "text": "policy period",
            "type": "startsWith"
          }
        ]
      },
      "method": {
        "id": "region",
        "offsetX": -0.2,
        "offsetY": -0.1,
        "width": 3.6,
        "height": 0.45,
        "start": "left",
        "wordFilters": [
          "policy period"
        ]
      }
    }

This field defines a region in inches relative to the anchor. Since the region overlaps the anchor, specify a Word Filters parameter to remove the anchor text from the output. See the green box representing the region in the editor? This box dynamically resizes as you adjust the region parameters (such as the Height and Start parameters), so you can visually tweak the region until you're satisfied.

[Image]

Let's double check that this region also works with the first PDF:

[Image]

Yes, it works too.

If you're feeling picky, try resizing the region using the green box for visual feedback, until the lower edge of the box doesn't overlap the customer service line in the first PDF (auto_insurance_anyco_1.pdf). But even if you don't fine-tune the region size, you can rest easy that you won't accidentally capture the customer service line. This is because the Region method excludes text that partly lies outside a region.
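
The exclusion rule can be pictured as a simple containment test on bounding boxes. The Python sketch below is a geometric illustration under stated assumptions, not Sensible's code: the `Box` class, the page coordinates in inches measured from the top-left corner, and the positions of the two sample lines are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Box:
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height

def fully_inside(line: Box, region: Box) -> bool:
    """True only if every edge of the line lies within the region."""
    return (
        line.left >= region.left
        and line.right <= region.right
        and line.top >= region.top
        and line.bottom <= region.bottom
    )

# Hypothetical region using the width and height from the config above.
region = Box(left=1.0, top=2.0, width=3.6, height=0.45)
period_line = Box(left=1.2, top=2.1, width=3.0, height=0.15)
# This line's lower edge (2.5) falls below the region's lower edge (2.45).
service_line = Box(left=1.2, top=2.35, width=3.0, height=0.15)

print(fully_inside(period_line, region))   # True
print(fully_inside(service_line, region))  # False
```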

  3. Click Publish and choose Production to save your changes to the config.

In a production scenario, continue testing PDFs until you have confidence your configs work with the PDF document type you've defined. Then, write tests to validate the extractions in production.

Integrate with your application

When you're satisfied with your config, use the Sensible API to integrate with your application. If you're new to APIs, then see Try asynchronous extraction from your URL for a tutorial.

Validate extractions in production

In a previous section, you tested a couple of PDFs manually. Now it's time to scale up and quality control the extractions by writing tests that run for all API extractions in a doc type.

Use JsonLogic to validate that the extracted information makes sense for the car insurance document:

  • Test that the property damage liability premium is cheaper than the comprehensive premium:
    • {"<":[{"var":"property_liability_premium.value"},{"var":"comprehensive_premium.value"}]}
  • Test that the policy number is a nine-digit number:
    • {"match":[{"var":"policy_number.value"},"\\d{9}"]}
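
To see how these two rules behave before adding them in the app, you can model them locally. The following Python sketch evaluates only the three operators used above ("var", "<", and "match"). It is a simplified stand-in, not a complete JsonLogic implementation and not Sensible's validator; in particular, treating a missing (null) value as a failed test is an assumption of this sketch.

```python
import re

def get_var(data, path):
    """Resolve a dotted path such as 'policy_number.value'; null parts yield None."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current

def evaluate(rule, data):
    """Evaluate the small subset of rules used in this tutorial."""
    if not isinstance(rule, dict):
        return rule  # a literal, such as the regex string
    op, args = next(iter(rule.items()))
    if op == "var":
        return get_var(data, args)
    left, right = (evaluate(arg, data) for arg in args)
    if op == "<":
        return left is not None and right is not None and left < right
    if op == "match":
        return isinstance(left, str) and re.search(right, left) is not None
    raise ValueError(f"unsupported operator: {op}")

premium_rule = {"<": [{"var": "property_liability_premium.value"},
                      {"var": "comprehensive_premium.value"}]}
number_rule = {"match": [{"var": "policy_number.value"}, "\\d{9}"]}

# Values from the first quote's extraction output.
first_quote = {
    "comprehensive_premium": {"value": 150},
    "property_liability_premium": {"value": 10},
    "policy_number": {"value": "123456789"},
}
print(evaluate(premium_rule, first_quote))  # True
print(evaluate(number_rule, first_quote))   # True
```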

To add these tests:

  1. In the auto_insurance_quote document type, click Create validation. Add the following input to the dialog:
    • Set the Severity to Warning
    • Set the Description to "prop. damage less than comprehensive"
    • Set the Condition to:
{"<":
 [
     {"var":"property_liability_premium.value"},
     {"var":"comprehensive_premium.value"}
 ]
}


  2. Click Create.
  3. Repeat the previous steps to create another validation with the following settings:
    • Set the Severity to Error
    • Set the Description to "policy number is a nine-digit number"
    • Set the Condition to:
{"match":
  [
      {"var":"policy_number.value"},"\\d{9}"
  ]
}
  4. To test the validations with a PDF that's missing information, try out an API call with the following example PDF that has these errors:

    • the policy number is missing
    • the property damage liability premium is $200 more than the comprehensive premium

    auto_insurance_anyco_3 (download link)

You should receive a response with errors and warnings in the Validations array:

{
    "id": "11404335-1ea4-4414-a5ca-1ccef568ebec",
    "created": "2021-09-21T17:36:56.339Z",
    "status": "COMPLETE",
    "type": "auto_insurance_quote",
    "configuration": "anyco",
    "parsed_document": {
        "policy_period": null,
        "comprehensive_premium": {
            "source": "$100",
            "value": 100,
            "unit": "$",
            "type": "currency"
        },
        "property_liability_premium": {
            "source": "$300",
            "value": 300,
            "unit": "$",
            "type": "currency"
        },
        "policy_number": null
    },
    "validations": [
        {
            "description": "prop. damage less than comprehensive",
            "severity": "warning"
        },
        {
            "description": "policy number is a nine-digit number",
            "severity": "error"
        }
    ],
    "validation_summary": {
        "fields": 4,
        "fields_present": 2,
        "errors": 1,
        "warnings": 1,
        "skipped": 0
    }
}
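
In your application, you might inspect these fields before trusting the extracted data. The following Python sketch works on an abbreviated copy of the response above. The helper names and the policy of sending any extraction with errors to manual review are assumptions for illustration, not part of the Sensible API.

```python
# Abbreviated copy of the response shown above.
response = {
    "parsed_document": {"policy_period": None, "policy_number": None},
    "validations": [
        {"description": "prop. damage less than comprehensive", "severity": "warning"},
        {"description": "policy number is a nine-digit number", "severity": "error"},
    ],
    "validation_summary": {
        "fields": 4,
        "fields_present": 2,
        "errors": 1,
        "warnings": 1,
        "skipped": 0,
    },
}

def descriptions(response, severity):
    """List the descriptions of failed validations with the given severity."""
    return [v["description"] for v in response["validations"] if v["severity"] == severity]

def needs_review(response):
    """Hypothetical policy: route any extraction with errors to manual review."""
    return response["validation_summary"]["errors"] > 0

print(descriptions(response, "error"))  # ['policy number is a nine-digit number']
print(needs_review(response))           # True
```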
