Field query object

A field is the basic SenseML query unit for extracting a piece of document data. SenseML is a query language that lets you extract structured data from documents, for example, from PDFs. The output of a field is a JSON key-value pair that structures the extracted data.

A field uses a method to extract data. SenseML contains layout-based and large language model (LLM)-based methods.

For more information, see the Examples section.

Parameters
Examples
Notes

Parameters

The Field object has the following top-level parameters:

Parameter | Value | Description for layout-based methods | Description for LLM-based methods
id (required)
Value: string
Layout-based methods: Sensible uses the ID as the key in the structured key/value output. In the API response, this output is in the parsed_document object.
If a field fails and returns null, you can specify a backup, or fallback, field to target the same data with a different method. To specify fallbacks between fields, define consecutive fields that use the same ID.
For example, to capture differences in wording between document revisions, define two fields with the same ID that anchor on synonymous text that's present or absent in different document revisions. For more examples, see Using fallbacks.
Fallback fields can be of any kind. For example, you can fall back from a field, to a computed field, to a section group.
Limitations:
- Fallbacks don't work across nested structures. For example, you can't fall back from a parent section group's field to a child section group's field.
- Fallbacks don't work within a Query Group method. To specify fallbacks, define them in separate query groups.
LLM-based methods: Same as for layout-based methods.
anchor
Value: string, Match object, or array of Match objects
Layout-based methods: Required. Matched text that narrows down the location of the target data to extract. For more information, see Anchor object.
LLM-based methods: Optional. If the matched text is present anywhere in the document, Sensible runs the method on the whole document; otherwise, it returns null. For more information, see Anchor object.
method (required)
Value: object
Layout-based methods: Defines how to spatially expand out from the anchor and extract the target data. Use for documents that have a relatively consistent spatial layout. For example, 1040 forms have a relatively consistent layout. For more information, see Methods.
LLM-based methods: Describes the contents of the target data to extract in natural-language prompts for an LLM. Use for documents that have a relatively inconsistent spatial layout, for example, legal contracts. For more information, see LLM-based methods.
type
Value: see Types
Layout-based methods: The data type to extract, for example, a currency, an address, or a custom type you define. The structured output includes the type information. If the field captures other data in addition to the data matching the type, Sensible suppresses the additional data from the output. For more information, see Types.
LLM-based methods: Same as for layout-based methods.
match
Value: first, last, all, allWithNull, mostFrequent
Layout-based methods: If there are multiple anchors, specifies which one to use to extract output for layout-based methods.

- first specifies the first anchor in the document that returns non-null output.

- last specifies the last anchor in the document that returns non-null output.

- all matches all anchors and returns non-null extracted output under a single key. For example:
{
  "name_of_output_key": [
    {
      "type": "string",
      "value": "extracted data for first anchor match"
    },
    {
      "type": "string",
      "value": "extracted data for second anchor match"
    }
  ]
}


- allWithNull matches all anchors and returns extracted output, including null output, under a single key. For example, use this option if you're using the Zip computed field method to zip together parallel arrays, where array elements can be nulls. For an example, see Zip.

- mostFrequent matches all anchors, extracts the corresponding output, then returns the most frequently occurring non-null output. This is useful for OCR text, like poor-quality scans or photographs. For example, a scanned document repeats a box titled 1 Wages four times with the same dollar value, 21850.20. Due to OCR errors, the extracted outputs are 21050.20, 21850.20, 21850.20, and 21850.58. This option returns the most frequent, and therefore the most likely correct, output: 21850.20.
LLM-based methods: Not applicable.
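
The fallback behavior described for the id parameter can be sketched as follows. This is a minimal illustration, not a tested config: the field ID and both anchor strings are hypothetical, and the sketch assumes the two synonymous labels each appear in only one revision of the document.

```json
{
  "fields": [
    {
      /* first attempt: anchor on the wording used in one document revision
         (hypothetical ID and anchor text) */
      "id": "policy_period",
      "anchor": "policy period",
      "method": {
        "id": "label",
        "position": "below"
      }
    },
    {
      /* fallback: consecutive field with the same ID. If the field above
         returns null, this field targets the same data using the
         synonymous wording (hypothetical anchor text) */
      "id": "policy_period",
      "anchor": "coverage dates",
      "method": {
        "id": "label",
        "position": "below"
      }
    }
  ]
}
```

Because both fields share one ID, the output contains a single policy_period key regardless of which field produced the value.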

Examples

Example 1

The following example shows a layout-based field and an LLM-based field.

Config

{
  "fields": [
    {
      /* LAYOUT-BASED EXAMPLE */
      "id": "name_of_output_key",
      /* an anchor is some text to match. define complex anchors using match
         arrays 
      */
      "anchor": "here's an anchor",
      /* this method uses spatial information 
         to locate the target data relative to the anchor 
      */
      "method": {
        "id": "label",
        /* target data is below the anchor */
        "position": "below"
      }
    },
    /* LLM-BASED EXAMPLE */
    {
      "id": "overview_table",
      "method": {
        /* this method uses LLMs to search
           for your target data based on your prompts ("descriptions") 
        */
        "id": "nlpTable",
        "description": "table describing SenseML",
        "columns": [
          {
            "id": "attribute",
            "description": "attribute of SenseML"
          },
          {
            "id": "description",
            "description": "description"
          }
        ]
      }
    }
  ]
}

Example document
The following image shows the example document used with this example config:

Example document: Download link

Output

{
  "name_of_output_key": {
    "type": "string",
    "value": "Below the matching anchor, this is the data to extract. The anchor is a label for this data."
  },
  "overview_table": {
    "columns": [
      {
        "id": "attribute",
        "values": [
          {
            "value": "key concepts",
            "type": "string"
          },
          {
            "value": "key method categories",
            "type": "string"
          }
        ]
      },
      {
        "id": "description",
        "values": [
          {
            "value": "Fields, anchors, and methods",
            "type": "string"
          },
          {
            "value": "LLM-based, layout-based, and computed",
            "type": "string"
          }
        ]
      }
    ],
    "title": {
      "type": "string",
      "value": "SenseML overview"
    }
  }
}
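
As the Parameters section notes, the anchor is optional for LLM-based methods and acts as a gate: if the anchor text is present anywhere in the document, the method runs on the whole document, and otherwise the field returns null. The following sketch adds such an anchor to the LLM-based field from the config above. The anchor text is an assumption chosen for illustration.

```json
{
  "fields": [
    {
      "id": "overview_table",
      /* assumption: hypothetical anchor text. If this text is missing
         from the document, the field returns null without running
         the LLM-based method */
      "anchor": "SenseML overview",
      "method": {
        "id": "nlpTable",
        "description": "table describing SenseML",
        "columns": [
          {
            "id": "attribute",
            "description": "attribute of SenseML"
          },
          {
            "id": "description",
            "description": "description"
          }
        ]
      }
    }
  ]
}
```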

Example 2

The following example shows all the top-level parameters of the Field object:

{
  "fields": [
    {
      "id": "name_of_output_key",
      "anchor": "text to match",
      "type": "accountingCurrency",
      "match": "last",
      "method": {
        "id": "row",
        "position": "right"
      }
    }
  ]
}
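
As a variation on the match parameter, the following sketch applies the mostFrequent option to the scanned 1 Wages scenario described in the Parameters section. It rests on several assumptions: the field ID is hypothetical, the dollar value is assumed to sit directly below each 1 Wages box title, and currency is assumed to be the type identifier for currency values (see Types to confirm).

```json
{
  "fields": [
    {
      /* hypothetical field ID */
      "id": "wages",
      /* the box title that repeats four times in the scanned document */
      "anchor": "1 Wages",
      /* assumption: type identifier for a currency value */
      "type": "currency",
      /* extract from every anchor match, then keep the most frequently
         occurring non-null value to smooth over OCR errors */
      "match": "mostFrequent",
      "method": {
        /* assumption: the value is below the box title */
        "id": "label",
        "position": "below"
      }
    }
  ]
}
```

With the four extracted values from that scenario (21050.20, 21850.20, 21850.20, and 21850.58), the option as described would return 21850.20.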

Next

The Field object contains: