Merge lines

Merges lines distributed along a horizontal axis more aggressively than the built-in line merger. This preprocessor solves line-recognition problems caused by poor-quality PDF scans, handwritten text, and other PDF formatting issues. For example, it corrects:

  • oversplit lines
  • lines overlapping on the x-axis
  • "jittery" lines misaligned on the y-axis

There are limitations to the combinations of parameter values you can set. For more information, see the Notes section.

Parameters
Examples
Notes

Parameters

key: type (required)
value: mergeLines
description: Merges lines distributed along a horizontal axis.

key: directlyAdjacentThreshold (required)
value: number >= 0.16
description: Usually, it's recommended to leave the default for this parameter (0.16).
Sensible uses the default setting for this parameter to transform separate tokens output from Google OCR into lines.
This parameter specifies the fraction of line height under which to merge two adjacent lines distributed along an x-axis without a space. For example, at 0.16, this preprocessor merges two lines separated by a small gap whose width is less than 16% of the line height. Choosing a larger number merges more aggressively.

key: adjacentThreshold (required)
value: number >= 0.6
description: Corrects oversplit lines.
Specifies the fraction of line height under which to merge two adjacent lines distributed along an x-axis with a space. The built-in merger uses 0.6, so choosing a larger number merges more aggressively.
For an example, see the Examples section.

key: yOverlapThreshold
value: number between 0 and 1.0. default: 1.0
description: Merges lines that aren't perfectly aligned at the same height on the page.
Specifies the y overlap above which the Merge Lines preprocessor merges two adjacent lines. Y overlap is the section of the joint y-axis range of two lines that's occupied by both lines. For example, if two lines share the same minimum and maximum y-axis values, their overlap is 1. If one line's extent is from 0 to 10 and the other line's extent is from 2 to 12 on the y-axis, their overlap is 0.667 (8 / 12).
For an example, see the Examples section.

key: minXGapThreshold
value: number in inches
description: Configure this parameter if two lines overlap on an x-axis. The default behavior is to merge these overlapping lines into one line. To split them instead, set a cap on the amount of allowable overlap. For example:
  • 0: splits lines if their line boundaries are touching but not overlapping.
  • 0.1: splits lines if their boundaries overlap a little, up to 0.1 inches.
  • 2.0: splits lines even when they overlap a lot, up to 2.0 inches.
For an example, see the Examples section.
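
The two gap thresholds described above can be read as a simple comparison between the horizontal gap and the line height. The following Python sketch illustrates that reading only. The function name `merge_decision`, its arguments, and the order of the checks are assumptions made for illustration; this is not Sensible's implementation, and it ignores the Y overlap and x-overlap parameters.

```python
def merge_decision(gap, line_height,
                   directly_adjacent_threshold=0.16,
                   adjacent_threshold=0.6):
    """Classify the horizontal gap between two neighboring lines.

    Hypothetical helper: gap and line_height must use the same unit.
    The thresholds are fractions of the line height, as in the
    parameter descriptions.
    """
    if line_height <= 0:
        raise ValueError("line_height must be positive")
    ratio = gap / line_height
    # Small gap: treat the two lines as one run of text with no space.
    if ratio < directly_adjacent_threshold:
        return "merge without space"
    # Larger gap that is still under the adjacent threshold: join with a space.
    if ratio < adjacent_threshold:
        return "merge with space"
    return "keep separate"


# A gap of 70% of the line height stays split at 0.6 but merges at 0.8.
print(merge_decision(0.7, 1.0))
print(merge_decision(0.7, 1.0, adjacent_threshold=0.8))
```

Raising either threshold widens the range of gaps that count as mergeable, which is why larger values merge more aggressively.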

Examples

Handwriting OCR

Use the Merge Lines preprocessor to clean up OCRed handwriting text. This preprocessor is useful for Google OCR, which by default groups text into words rather than lines.

PROBLEM

Without the Merge Lines preprocessor, the placeholder handwritten data in an example PDF is oversplit by Google OCR:


For example, the phrase Name (First, Middle, Last, Suffix, Trust or Custodian) isn't one line, but is instead split on words.

SOLUTION

CONFIG

{
  "preprocessors": [
    {
      "type": "mergeLines",
      "directlyAdjacentThreshold": 0.15,
      "adjacentThreshold": 0.8,
      "yOverlapThreshold": 0.8,
      "minXGapThreshold": 0.1
    }
  ],
  "fields": [
    {
      "id": "name_line",
      "anchor": "Name",
      "method": {
        "id": "label",
        "position": "right",
        "includeAnchor": true
      }
    }
  ]
}

PDF

The following image shows the example PDF used with this example config:


Example PDF: Download link

To run this example, verify that the document type uses Google OCR (click the gear icon for the Document Type and select Google):


OUTPUT

{
  "name_line": {
    "type": "string",
    "value": "Name (First, Middle, Last, Suffix, Trust or Custodian)"
  }
}

Modify this example to observe the effects of the different parameters on the output. For example:

  • set "adjacentThreshold": 0.1 to see oversplit lines.

  • set "adjacentThreshold": 2.0 to see aggressively merged lines.

  • revert "adjacentThreshold" to the original setting, then set "yOverlapThreshold": 0.2 to observe how lines with misaligned heights (like the email address) merge more aggressively.

Oversplit lines

PROBLEM

The following image shows oversplit lines. For example, Sensible splits the phrase "premier driver discount" into three lines even though the human eye perceives it as one phrase:


SOLUTION

The following example uses the Merge Lines preprocessor to fix the oversplit lines and find a discount amount for a specific vehicle.

CONFIG

{
  "preprocessors": [
    {
      "type": "mergeLines",
      "directlyAdjacentThreshold": 0.16,
      "adjacentThreshold": 1
    }
  ],
  "fields": [
    {
      "id": "premier_driver_discount",
      "type": "currency",
      "method": {
        "id": "row"
      },
      "anchor": {
        "match": {
          "type": "includes",
          "text": "premier driver discount"
        },
        "end": {
          "type": "includes",
          "text": "vehicle 06"
        }
      }
    }
  ]
}

PDF

The following image shows the example PDF used with this example config:


OUTPUT

{
  "premier_driver_discount": {
    "source": "113.00",
    "value": 113,
    "unit": "$",
    "type": "currency"
  }
}

Jittery lines on a y-axis

The following example uses the Y Overlap Threshold parameter to correct vertical misalignment, or "jitter", in lines (for example, as the result of a low-quality scan or because of handwriting).

Config

{
  "preprocessors": [
    {
      "type": "mergeLines",
      "directlyAdjacentThreshold": 0.16,
      "adjacentThreshold": 1.5,
      "yOverlapThreshold": 0.1
    }
  ],
  "fields": [
    {
      "id": "merged_line",
      "method": {
        "id": "label",
        "position": "right",
        "includeAnchor": true
      },
      "anchor": "these two"
    }
  ]
}

Example document

The following image shows the example PDF used with this example config:


Example PDF: Download link

Output

{
  "merged_line": {
    "type": "string",
    "value": "These two lines are imperfectly aligned They have a y overlap less than 1"
  }
}
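
The Y overlap definition from the Parameters section (the shared part of two lines' y-axis extents divided by their joint extent) can be checked with a few lines of Python. This is a minimal sketch of that arithmetic only. The function name `y_overlap` and its arguments are hypothetical, and the `>=` comparison against the threshold is an assumption chosen so that perfectly aligned lines satisfy the default of 1.0.

```python
def y_overlap(a_min, a_max, b_min, b_max):
    """Fraction of the joint y-axis range that both lines occupy."""
    joint = max(a_max, b_max) - min(a_min, b_min)
    if joint <= 0:
        return 0.0  # degenerate lines with no height
    shared = min(a_max, b_max) - max(a_min, b_min)
    # Lines that don't overlap at all yield a negative shared extent.
    return max(shared, 0.0) / joint


# Worked example from the Parameters section: extents 0-10 and 2-12.
overlap = y_overlap(0, 10, 2, 12)
print(round(overlap, 3))   # 8 / 12
print(overlap >= 0.1)      # compared with "yOverlapThreshold": 0.1
print(overlap >= 1.0)      # compared with the default threshold
```

Under this reading, the two lines in the worked example qualify for merging at a threshold of 0.1 but not at the default of 1.0.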

Overlapping lines on an x-axis

The following example uses the Min X Gap Threshold parameter to extract overlapping text in a poorly formatted PDF. In this example, the built-in behavior without a Min X Gap Threshold is to merge the overlapping lines into one line (Supplementary underinsured/uninsured motorist coverage500,000 USD Combined single limit incl. umbl).

The Min X Gap Threshold preserves the intended PDF formatting, which is a two-column table. By preserving this format, you can consistently use the Row method on the table in this document, as well as in other examples of this table in documents in which the lines don't overlap.

Config

{
  "preprocessors": [
    {
      "type": "mergeLines",
      "directlyAdjacentThreshold": 0.16,
      "adjacentThreshold": 0.6,
      "minXGapThreshold": 1.0
    }
  ],
  "fields": [
    {
      "id": "underinsured_limit",
      "method": {
        "id": "row"
      },
      "anchor": "supplementary"
    }
  ]
}

Example document

The following image shows the example PDF used with this example config:


Example PDF: Download link

Output

{
  "underinsured_limit": {
    "type": "string",
    "value": "500,000 USD Combined single limit incl. umbl"
  }
}
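
One way to read the Min X Gap Threshold is as a cap on how much horizontal overlap two lines can have and still be kept as separate lines. The following Python sketch encodes that reading. It is an interpretation of the parameter description, not Sensible's implementation: the function name `splits_overlapping_lines`, the overlap measurement in inches, and the inclusive comparison are all assumptions.

```python
from typing import Optional


def splits_overlapping_lines(x_overlap_inches: float,
                             min_x_gap_threshold: Optional[float] = None) -> bool:
    """Return True if two x-overlapping lines stay split (hypothetical helper).

    x_overlap_inches is how far the first line's right edge extends past
    the second line's left edge; 0 means the boundaries just touch.
    """
    if min_x_gap_threshold is None:
        # Default behavior described in the docs: overlapping lines merge.
        return False
    # Assumption: overlaps up to and including the cap are split.
    return x_overlap_inches <= min_x_gap_threshold


# With "minXGapThreshold": 1.0, a half-inch overlap keeps the two columns apart.
print(splits_overlapping_lines(0.5, 1.0))
# Without the parameter, the same lines merge into one.
print(splits_overlapping_lines(0.5))
```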

Notes

Because the Merge Lines preprocessor runs after the built-in line merger, there are limitations to the combinations of parameter values you can set:

yOverlapThreshold

In general, when you set "yOverlapThreshold": 1.0 or leave it unspecified, set "adjacentThreshold" to 0.6 or higher.

In this situation, "directlyAdjacentThreshold" and "adjacentThreshold" have no effect if both values are less than 0.6, because the built-in merger has already merged lines at its own threshold of 0.6. For example, the following configuration has no effect:

    {
      "type": "mergeLines",
      "directlyAdjacentThreshold": 0.5,
      "adjacentThreshold": 0.5,
      "yOverlapThreshold": 1
    }
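
The rule in this note can be turned into a quick check to run on a config before uploading it. The following Python sketch is a hypothetical helper, not part of any Sensible SDK. It encodes only the combination described in this note and assumes the preprocessor is passed as a plain dictionary.

```python
BUILT_IN_ADJACENT_THRESHOLD = 0.6  # value the built-in merger uses, per the Parameters section


def has_no_effect(preprocessor: dict) -> bool:
    """Return True if a mergeLines config matches the no-effect case in this note.

    A result of False only means the note's rule doesn't apply; it
    doesn't guarantee that the config changes the output.
    """
    # An unspecified yOverlapThreshold defaults to 1.0.
    if preprocessor.get("yOverlapThreshold", 1.0) != 1.0:
        return False
    return (
        preprocessor["directlyAdjacentThreshold"] < BUILT_IN_ADJACENT_THRESHOLD
        and preprocessor["adjacentThreshold"] < BUILT_IN_ADJACENT_THRESHOLD
    )


config = {
    "type": "mergeLines",
    "directlyAdjacentThreshold": 0.5,
    "adjacentThreshold": 0.5,
    "yOverlapThreshold": 1,
}
print(has_no_effect(config))
```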

