Summarizer

Transforms short snippets of free text using an LLM (GPT-3.5 Turbo Instruct). The Summarizer computed field method takes another extracted field's output as input and transforms the text based on prompts or short samples of extracted values that you provide. Use this method to transform another method's output when you can't use types or other computed field methods. For example, if you use the Row method to return an inconsistently formatted ranking (first, 1st, 1), you can use this method to format the ranking consistently, with instructions like: reformat each inconsistently formatted instance of rank as a word (i.e. "first" not "1st" or "1").

Parameters

The following parameters are in the computed field's Method parameter:

id (required)
  value: summarizer
  description: The Anchor parameter is optional for fields that use this method. If you omit an anchor, Sensible searches the entire document for the data you want to extract.

source_id (required)
  value: field ID
  description: Specifies a field whose output is a snippet of text with the information you want to transform.

fields (required)
  value: string array
  description: IDs of the fields you want to extract. These IDs have an impact on the free-text extraction, so choose names that have a meaningful relationship to the target data to extract. For example, for a dollar amount of rent to extract, rent, rents, and rent_in_dollars are good naming choices.

instructions
  value: string
  description: Prompt for the LLM, describing how to transform information from the text in the Source ID parameter. For more information about how to write prompts, see Query Group tips. For an example of using this parameter, see the Examples section.

samples
  value: object array
  description: Short snippets of text similar to the text in the Source ID parameter, with examples of the information to extract. Use in addition to the Instructions parameter to increase the LLM's accuracy. Contains these parameters:
  • prompt (string): An example of the sort of free text that you want to transform.
  • values (string array): The target information to transform from this prompt. This array is a parallel array to the Fields parameter's array. Parallel arrays are the same length and same sequence. If the LLM can't find the target information in the Source ID parameter, it can generate an arbitrary value. To override this behavior, specify a Sample parameter whose Prompt parameter has a text snippet that's missing the target data, and whose Values array indicates the data is missing (for example, "N/A" or "not found").
  For an example of using this parameter, see the Examples section.
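
The missing-data override described for the Samples parameter can be sketched as the following config fragment. It is an illustration under stated assumptions, not a tested config: the field ID rent_reformatted, the source field lease_snippet, the instruction wording, and both sample prompts are hypothetical. Values is written here as a flat string array, following the parameter description, whereas the config in the Examples section nests it one level deeper.

```json
{
  /* hypothetical computed field; lease_snippet is an assumed field ID */
  "id": "rent_reformatted",
  "method": {
    "id": "summarizer",
    "source_id": "lease_snippet",
    "instructions": "return the monthly rent as a dollar amount",
    "fields": [
      "rent_in_dollars"
    ],
    "samples": [
      {
        /* hypothetical snippet that contains the target data */
        "prompt": "Tenant shall pay rent of $1,500 on the first of each month",
        "values": [
          "$1,500"
        ]
      },
      {
        /* hypothetical snippet that is missing the target data */
        "prompt": "Tenant shall keep the premises clean",
        "values": [
          "N/A"
        ]
      }
    ]
  }
}
```

Each Values array has one entry per ID in the Fields array, in the same order, so the second sample pairs rent_in_dollars with "N/A" for a snippet that doesn't mention rent.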

Examples

The following example shows using the Summarizer method to:

  • Enforce consistent formatting for extracted class rankings.
  • Parse an extracted address into its constituent parts.

Config

{
  "fields": [
    {
      /* get historical data on class ranks
         for student Sanchez */
      "anchor": "sanchez",
      "id": "sanchez_ranks",
      "method": {
        "id": "row",
        "position": "right"
      }
    },
    {
      /* enforce consistent formatting for the extracted ranks */
      "id": "ranks_reformatted",
      "method": {
        "id": "summarizer",
        "source_id": "sanchez_ranks",
        "instructions": "reformat each inconsistently formatted instance of rank as a word (i.e. \"first\" not \"1st\" or \"1\") then return the ranks",
        "fields": [
          "rank"
        ],
        "samples": []
      }
    },
    {
      "method": {
        /* get the high school's info */
        "id": "queryGroup",
        "queries": [
          {
            "id": "high_school_name",
            "description": "high school's name",
            "type": "string"
          },
          {
            "id": "high_school_addr",
            "description": "high school's address",
            "type": "string"
          }
        ]
      }
    },
    {
      "id": "address_reformatted",
      "method": {
        "id": "summarizer",
        "source_id": "high_school_addr",
        /* break the address into its separate parts */
        "instructions": "Parse the provided address into its constituent parts: street, city, state, postal code, and country. Convert state abbreviations to full state names, and infer the country if not explicitly provided.",
        "fields": [
          "street",
          "city",
          "state",
          "postalCode",
          "country"
        ],
        "samples": [
          {
            /* provide an example of how to handle missing address components */
            "prompt": "145 Park Lane, apt B Florida 33101",
            "values": [
              [
                "145 Park Lane, apt B",
                "No City",
                "Florida",
                "33101",
                "United States"
              ]
            ]
          }
        ]
      }
    }
  ]
}

Example document
The following image shows the example document used with this example config:

Example document: Download link

Output

{
  "sanchez_ranks": {
    "type": "string",
    "value": "3rd fourth 3"
  },
  "ranks_reformatted": [
    {
      "rank": "third fourth third"
    }
  ],
  "high_school_name": {
    "value": "SF High School",
    "type": "string",
    "confidenceSignal": "confident_answer"
  },
  "high_school_addr": {
    "value": "123 Main Street, San Francisco, CA 94000",
    "type": "string",
    "confidenceSignal": "confident_answer"
  },
  "address_reformatted": [
    {
      "street": "123 Main Street",
      "city": "San Francisco",
      "state": "California",
      "postalCode": "94000",
      "country": "United States"
    }
  ]
}