Scale
Corrects the size of text in PDF documents whose scale varies, for example because they were photographed at different distances. ID cards and receipts are common examples of such documents. This preprocessor enables coordinate-based methods, such as the Region or Text Table methods, to work with such unpredictably scaled documents.
Parameters
key | value | description |
---|---|---|
type (required) | scale | |
samples | array of objects | Array of example objects containing font heights for text matches in documents at 100% scale. Sensible compares the actual height of each match against the examples, then takes the average of the ratios and uses it to rescale the whole document. Sensible recommends the following practices: - Choose samples whose font height doesn't vary relative to other font heights in the document. For example, don't create a sample that can match both a heading 1 and a heading 4 style. - Choose samples that appear on each page, such as headers or footers. Each example object has the following parameters: - match: a Match object. - targetHeight: the height in inches of the match at 100% scale. |
perPage | boolean | If true, Sensible rescales each page individually against the Target Height parameter, averaging the heights of the matches on that page rather than in the whole document. For example, if a tax return contains multiple W-2 forms, and each W-2 can be scanned at an unpredictable scale, then you can set this parameter to true and match on text that appears in every W-2 form, such as "Wage and Tax" and the W-2 title. |
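
The averaging described for the Samples and Per Page parameters can be sketched in Python. This is an illustrative sketch only, not Sensible's implementation. The function names, the `(actual_height, target_height)` input shape, the direction of the ratio (target height divided by measured height), and the choice to leave a page with no matches unscaled are all assumptions made for the example.

```python
from statistics import mean

def scale_factor(matches):
    """Average the target/actual height ratios for a list of matches.

    Each match is a hypothetical (actual_height, target_height) pair in inches.
    Assumption: with no usable matches, return 1.0 (leave the text unscaled).
    """
    ratios = [target / actual for actual, target in matches if actual > 0]
    return mean(ratios) if ratios else 1.0

def scale_factors(pages, per_page=False):
    """Return one scale factor per page.

    pages: list of pages, each a list of (actual_height, target_height) pairs.
    per_page=True averages the ratios on each page separately;
    per_page=False averages over every match in the document.
    """
    if per_page:
        return [scale_factor(page) for page in pages]
    whole_document = scale_factor([m for page in pages for m in page])
    return [whole_document] * len(pages)

# Two pages matching the same sample (targetHeight 0.22 inches):
pages = [
    [(0.11, 0.22)],  # page 1: text measured at half the target height
    [(0.22, 0.22)],  # page 2: text already at the target height
]
print(scale_factors(pages, per_page=True))
print(scale_factors(pages, per_page=False))
```

In this sketch, per-page scaling enlarges only the first page, while whole-document scaling applies one compromise factor to both pages. That difference is why the Per Page parameter suits documents whose pages were captured at different scales.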
Examples
The following example uses the Per Page parameter to scale an ID card that appears at a different size on each page. The second page shows the card at the target size to standardize on.
Config
{
  "preprocessors": [
    {
      "type": "scale",
      "perPage": true,
      "samples": [
        {
          "match": {
            "type": "includes",
            "text": "First",
            "isCaseSensitive": true
          },
          "targetHeight": 0.22
        }
      ]
    }
  ],
  "fields": [
    {
      "id": "white_house_tenure",
      "anchor": "tenure",
      "match": "all",
      "method": {
        "id": "region",
        "start": "below",
        "offsetX": -1.7,
        "offsetY": 0,
        "width": 1.5,
        "height": 0.6
      }
    }
  ]
}
Example document
The following image shows the example document used with this example config:
Example PDF | Download link |
---|---|
Output
{
  "white_house_tenure": [
    {
      "type": "string",
      "value": "1940-1945"
    },
    {
      "type": "string",
      "value": "1940-1945"
    },
    {
      "type": "string",
      "value": "1940-1945"
    }
  ]
}
Notes
Alternatives to Scale preprocessor
To decide whether to configure the Scale preprocessor or the Deskew preprocessor, use the following tips:
- If a document contains pages that are rotated but otherwise untransformed, you don't need a preprocessor. Sensible's default OCR engine (Microsoft) corrects rotation automatically.
- If pages are affected by scale, rotation, or both, but are otherwise untransformed, use the Scale preprocessor as an easier-to-configure and more robust alternative to the Deskew preprocessor.
- If pages are affected by translation, shear, or other affine transformations in addition to or instead of rotation and scale, use the Deskew preprocessor.