Preprocessors
Use the following preprocessors to clean up your documents before extracting structured data. Preprocessors execute in the order you define them in an array.
Preprocessor | Image | Notes |
---|---|---|
Deskew | ![]() | Corrects the alignment of PDF documents that are skewed, for example as a result of being photographed at an angle instead of straight on. |
Ligature | ![]() | Intelligently replaces Unicode ligatures in a PDF text extraction. |
Merge Lines | ![]() | Corrects oversplit lines. |
OCR | ![]() | Selectively OCRs pages in PDFs containing a mix of digitally generated text and text images (such as scanned text). If the whole PDF is a scan, you don't need to configure this preprocessor. |
Page Filter | | Filters out low-scoring pages given a bag of target terms and stop terms. |
Page Range | | Ignores pages outside the start page and end page. |
Remove Header | ![]() | Removes repeating elements at the top of the page. Ignores header elements that overlap with the page's main body. |
Remove Footer | ![]() | Removes repeating elements at the bottom of the page. Ignores footer elements that overlap with the page's main body. |
Scale | ![]() | Corrects the size of text in PDF documents whose size varies, for example as a result of being scanned or photographed at different scales. |
Split Lines | ![]() | Corrects undersplit lines. |
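
Because preprocessors run in the order of the array, the same set of preprocessors can produce different results when reordered. The following Python sketch illustrates that idea only. It is not the product's implementation or configuration syntax: the function names, the option names (`start_page`, `end_page`, `target_terms`, `stop_terms`, `min_score`), the zero-based half-open page indexing, and the scoring rule (target-term hits minus stop-term hits) are all assumptions made for illustration.

```python
# Illustrative model only: every name and rule here is hypothetical.
# A document is modeled as a list of page texts.


def page_range(pages, start_page, end_page):
    """Keep pages whose index is in [start_page, end_page).

    Zero-based, half-open indexing is an assumption of this sketch.
    """
    return pages[start_page:end_page]


def page_filter(pages, target_terms, stop_terms, min_score):
    """Drop pages scoring below min_score.

    Assumed score: count of target-term words minus count of stop-term words.
    """
    def score(text):
        words = text.lower().split()
        hits = sum(word in target_terms for word in words)
        penalties = sum(word in stop_terms for word in words)
        return hits - penalties

    return [page for page in pages if score(page) >= min_score]


# Maps a hypothetical "type" string to the function that implements it.
REGISTRY = {"pageRange": page_range, "pageFilter": page_filter}


def run_preprocessors(pages, config):
    """Apply each preprocessor in array order, feeding each output to the next."""
    for step in config:
        options = {key: value for key, value in step.items() if key != "type"}
        pages = REGISTRY[step["type"]](pages, **options)
    return pages


pages = ["cover letter", "invoice total due", "invoice terms draft", "appendix"]

range_step = {"type": "pageRange", "start_page": 1, "end_page": 4}
filter_step = {
    "type": "pageFilter",
    "target_terms": {"invoice", "total"},
    "stop_terms": {"draft"},
    "min_score": 1,
}

range_then_filter = run_preprocessors(pages, [range_step, filter_step])
filter_then_range = run_preprocessors(pages, [filter_step, range_step])
```

In this sketch, running the page range first keeps the single matching invoice page. Running the filter first shrinks the document to one page, so the later page range refers to indexes that no longer exist and nothing is left. When a preprocessor changes page numbering or line layout, place it with the downstream steps in mind.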