Corrects the alignment of text in PDF documents that are skewed, for example as a result of being photographed at an angle instead of straight on. ID cards and receipts are common examples of such documents. Sensible uses affine transformations to correct scaling, rotation, translation, and shear.

See the Notes section for when not to use this preprocessor and for alternatives.

Parameters
Examples
Notes

Parameters

keyvaluedescription
type (required)deskew
fixedPoints (required)objectDeskews the text in a skewed document by mapping the positions of three skewed points to their ideal positions in an unskewed document. Define the ideal Fixed Points using text anchors in an unskewed example of the document. Choose text anchors that form as large a triangle as possible, ideally at three corners of the document. Choosing the best points can take some trial and error.
Parameters:
match - the text to match for the Fixed Point. Choose "type": "startsWith" or "type": "endsWith" to avoid problems with lines oversplit by skew. See Match object for more details.
targetPosition - contains x and y parameters that define the coordinates of the Fixed Points in inches relative to the 0,0 origin at the top left corner of the page. For more information defining the positions, see the Examples section.
startleft , right. default: leftSpecifies whether the Fixed Point is at the upper-left corner of the anchor line's boundaries, or the upper-right corner.
With a Match parameter of "type": "startsWith", use left.
With a Match parameter of "type": "endsWith", use right.

Examples

The following image shows that without the Deskew preprocessor, extraction from a skewed PDF fails. The Region method returns null, because the targeted date range is in an unexpected position (to the left of the anchor, tenure) rather than in the expected position (below):

Click to enlargeClick to enlarge

To solve this problem:

  1. Define three widely spaced points in a well-aligned example of this document type: The following image shows using the displayed coordinates of a line to define the X and Y parameters for one of three Fixed Point parameters:

Click to enlargeClick to enlarge

For more information about choosing points, see Best Practices.

  1. Check the Deskew preprocessor against the skewed example to reveal a remaining problem:

Click to enlargeClick to enlarge

The remaining problem is that the Deskew Preprocessor doesn't merge lines that were split by the skew. As a result, the region starts at the middle of the word "tenure" instead of the middle of the complete line "White house tenure". Since the new start shifts the region to the right, the region misses the first year of the date range and returns -1963.

  1. To fix this problem, apply a Merge Lines preprocessor after the Deskew preprocessor:

Click to enlargeClick to enlarge

Now Sensible captures the full date range, because the region starts at the middle of the complete line "White house tenure".

Try out this example in the Sensible app using the following PDFs and config:

Example aligned PDFDownload link
Example skewed PDFDownload link

This example uses the following config:

{
  "preprocessors": [
    {
      "type": "deskew",
      "fixedPoints": [
        {
          "match": {
            "type": "equals",
            "text": "first"
          },
          "targetPosition": {
            "x": 2.76,
            "y": 1.6
          },
        },
        {
          "match": {
            "type": "startsWith",
            "text": "Owner"
          },
          "targetPosition": {
            "x": 2.76,
            "y": 3.74
          },
        },
        {
          "match": {
            "type": "endsWith",
            "text": "tenure:"
          },
          "targetPosition": {
            "x": 7.29,
            "y": 3.61
          },
          "start": "right"
        }
      ]
    },    
    {
      "type": "mergeLines",
      "directlyAdjacentThreshold": 0.15,
      "adjacentThreshold": 0.8,
      "yOverlapThreshold": 0.1
    },
  ],
  "fields": [
    {
      "id": "tenure",
      "anchor": {
        "match": {
          "type": "endsWith",
          "text": "tenure:"
        }
      },
      "method": {
        "id": "region",
        "start": "below",
        "width": 1.6,
        "height": 0.5,
        "offsetX": -1,
        "offsetY": 0
      }
    }
  ]
}

Notes

Tips for viewing skewed documents

To view the deskewed document in the Sensible app, click the document, then click the eye (👁) icon:

Click to enlargeClick to enlarge

This visual representation of how the Sensible app deskews documents can be useful for authoring or troubleshooting deskew preprocessors and configs.

Alternatives to Deskew

To choose when to configure the Scale or Deskew preprocessors, use the following tips:

  • If a document contains pages that are rotated but otherwise untransformed, you don't need a preprocessor. Sensible's default OCR engine (Microsoft) corrects rotation automatically.
  • If pages are affected by scale, rotation, or both, but are otherwise untransformed, use the Scale preprocessor as an easier-to-configure and more robust alternative to the Deskew preprocessor.
  • If pages are affected by translation, shear, or other affine transformations in addition to or instead of rotation and scale, use the Deskew preprocessor.

Fixed Points tips

  • Click on a line in the document pane in the Sensible app to view line coordinates for defining the Fixed Points.

  • Choose text anchors for Fixed Points that form as large a triangle as possible, ideally at three corners of the document. Choosing the best points can take trial and error.

  • For the Match parameter, choose "type": "startsWith" or "type": "endsWith" to avoid problems with lines split by skew. If you choose "endsWith", then also define "start:right". You can also define a Merge Lines preprocessor to clean up oversplit lines.

  • For the aligned reference PDF, choose a slightly enlarged version of the document so that the Fixed Points triangle is large. The Deskew preprocessor corrects scaling for smaller skewed images.


Did this page help you?