Extract lines inside a box. This method works by default with boxes that have a light background and dark, continuous borders.

Parameters
Examples

Parameters

Note: For the full list of parameters available for this method, see Global parameters for methods. The following table shows parameters most relevant to or specific to this method.

keyvaluedescription
id (required)boxExtracts all lines in a box. If you define an anchor that's outside the box borders, then use offset parameters to define a point that's inside the box borders so that Sensible recognizes the box.
positionright, left, below, above. default: center of the anchor line's bounding boxUse this parameter to fine tune box recognition. Defines the starting point for the box recognition relative to the anchor. For example, right specifies starting at the midpoint of the anchor line's right boundary, and below specifies starting at the midpoint of the anchor line's bottom boundary. Sensible searches outward from this point until it finds dark pixels signifying the box border.
For an example of how to use this parameter, see the following Examples section.
offsetXnumber in inches default: 0Searches for a box starting at a point offset from the point defined by the Position parameter. Positive values offset to the right, negative values offset to the left. For an example of how to use this parameter, see the following Examples section.
offsetYnumber in inches default: 0Searches for a box starting at a point offset from the point defined by the Position parameter. Positive values offset down the page, negative values offset up the page. For an example of how to use this parameter, see the following Examples section.
offsetBoxesobject. default: noneRecognize a box offset from the point defined in the Position parameter by a number of contiguous boxes that share borders. For example, use this parameter for tables or grids where borders surround every cell. Contains the following parameters:
- direction: The direction to search in (above, below, right, left, relative to the starting box.
- number: The number of boxes to offset by.
For an example of how to use this parameter, see the following Examples section.
darknessThresholdnumber between 0 and 1. default: 0.9The brightness threshold below which to consider a pixel a box boundary. White is 1.0. Configure this parameter for checkboxes with dark backgrounds relative to the surrounding background.
If the document has a white background, the default value is 0.9.
If the document has dark or mottled background, for example as the result of a scan, then Sensible automatically chooses a default value based on the amount of contrast in the document. For an example of how to use this parameter, see the following Examples section.

Examples

Simple box

The following example shows extracting a dollar amount from a box in a 1099 form, based on anchor text matching in the box.

Config

{
 "fields": [
   {
     "id": "rents_income",
     "type": "currency",
     "method": {
       "id": "box",
     },
     "anchor": "rents"
   }
 ]
}

Example document
The following image shows the example PDF used with this example config:

Click to enlargeClick to enlarge

Example PDFDownload link

Output

{
  "rents_income": {
    "source": "4,200",
    "value": 4200,
    "unit": "$",
    "type": "currency"
  }
}

Dark box

The following example shows extracting text from a box with a dark background and light text using the darknessThreshold parameter.

Config

{
  "fields": [
    {
      "id": "dark_box",
      "method": {
        "id": "box",
        "darknessThreshold": 0.8
      },
      "anchor": "dark box with light text",
    }
  ]
}

Example document
The following image shows the example PDF used with this example config:

Click to enlargeClick to enlarge

Example PDFDownload link

Output

{
  "dark_box": {
    "type": "string",
    "value": "Here’s some more text"
  }
}

Offset boxes

The following example shows recognizing boxes relative to other boxes using the Box Offset parameter.

Config

{
  "fields": [
    {
      "id": "auto_limit_in_policy_1",
      "anchor": "auto only",
      "match": "first",
      "method": {
        "id": "box",
        "offsetBoxes": {
          "direction": "right",
          "number": 1
        }
      }
    },
    {
      "id": "injury_limit_in_policy_2",
      "anchor": "dollar amount",
      "match": "last",
      "method": {
        "id": "box",
        "offsetBoxes": {
          "direction": "below",
          "number": 2
        }
      }
    },
    {
      "id": "offset_boxes",
      "anchor": "spanning multiple",
      "method": {
        "id": "box",
        "offsetBoxes": {
          "direction": "below",
          "number": 3
        }
      }
    },
  ]
}

Example document

The following image shows the example PDF used with this example config:

Click to enlargeClick to enlarge

Example PDFDownload link

Output

{
  "auto_limit_in_policy_1": {
    "type": "string",
    "value": "$2,000"
  },
  "injury_limit_in_policy_2": {
    "type": "string",
    "value": "$4,000"
  },
  "offset_boxes": {
    "type": "string",
    "value": "Third offset box"
  }
}

Notes

The following image illustrates how Sensible recognizes offset boxes after the first box:

Click to enlargeClick to enlarge

  1. Recognize the starting box by searching for the dark borders of the box. The search starts at the green dot defined by the Position parameter. The search expansion is in all directions, not just the cardinal directions shown by the red arrows in the image.
  2. Find a bottom border ("direction": "below") that's shared with the next box. Choose a point ( represented as the second green dot) on the bottom border that's in the middle of the starting box's border and that's just inside the next box's borders.
  3. Search from that point to recognize the next box's borders.
  4. Repeat steps 2 and 3 for the next box.

When boxes are complex (inconsistently sized, spanned, or aligned, as in the preceding image), Sensible's methods for recognizing boxes can be correspondingly complex. In such cases, use the Sensible app to visually examine and understand the extraction. Or, see the following example for an alternative approach.

Box coordinates

You can use the Offset X and Offset Y parameters:

  • to anchor on text outside the target box, for example, if all the box's contents are variable.
  • as an alternative to the Offset Boxes parameter. Offsets provide faster performance, but are more sensitive to inconsistent box positioning across PDFs and require more configuration.

The following example shows the same PDF as the Offset Boxes example, but uses distances in inches rather than boxes to define the offsets.

Config

{
  "fields": [
    {
      "id": "auto_limit_in_policy_1",
      "anchor": "auto only",
      "match": "first",
      "method": {
        "id": "box",
        "offsetX": 1.5,
        "offsetY": 0.0
      }
    },
    {
      "id": "injury_limit_in_policy_2",
      "anchor": "dollar amount",
      "match": "last",
      "method": {
        "id": "box",
        "offsetX": 0.0,
        "offsetY": 1.0
      }
    },
    {
      "id": "oddly_formatted_boxes",
      "anchor": "spanning multiple",
      "method": {
        "id": "box",
        "offsetX": 1.0,
        "offsetY": 2.5
      }
    },
  ]
}

Example document

The following image shows the example PDF used with this example config:

Click to enlargeClick to enlarge

Example PDFDownload link

The red arrows in the preceding image show the offsets in inches from the point defined by the Position parameter. The green dots move as you adjust the inches coordinates, so you can visually tweak your measurements in the Sensible app.

Output

{
  "auto_limit_in_policy_1": {
    "type": "string",
    "value": "$2,000"
  },
  "injury_limit_in_policy_2": {
    "type": "string",
    "value": "$6,000"
  },
  "oddly_formatted_boxes": {
    "type": "string",
    "value": "Third offset box"
  }
}

Troubleshoot box recognition

Use the Position parameter to fine tune box recognition.

PROBLEM

In the following error example, Sensible searches for borders starting from the middle point of the anchor line's left boundary ("position": "left"). Since the point (the green dot) overlaps the box border, Sensible can't recognize the box.

Config

{
  "fields": [
    {
      "id": "box_test",
      "anchor": "big anchor text",
      "method": {
        "id": "box",
        "position": "left",
        "wordFilters": [
          "cramped"
        ]
      }
    }
  ]
}

PDF

The following image shows the example PDF used with this example config:

Click to enlargeClick to enlarge

Example PDFDownload link

SOLUTION

If you specify "position": "right", the green dot is far enough inside the borders that Sensible recognizes the box:

Config

{
  "fields": [
    {
      "id": "box_test",
      "anchor": "big anchor text",
      "method": {
        "id": "box",
        "position": "right",
        "wordFilters": [
          "cramped"
        ]
      }
    }
  ]
}

PDF

Click to enlargeClick to enlarge

Output

{
  "box_test": {
    "type": "string",
    "value": " with some text to extract"
  }
}

Notes

The Box method is an alternative to the Region method that requires less configuration and is slightly slower. Use the Region method instead of the Box method for faster performance, or if the borders of a box are incomplete or discontinuous.


Did this page help you?