Preprocessors

Use the following preprocessors to clean up your documents before extracting structured data. Preprocessors execute in the order you define them in an array.
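
As a rough illustration of why ordering matters, the sketch below models preprocessors as plain Python functions applied in array order. The function names (`page_range`, `drop_first_line`, `run_preprocessors`) and the page representation (a list of strings) are hypothetical and are not the product's configuration format. The end-exclusive page range is also an assumption.

```python
from typing import Callable, List

Pages = List[str]
Preprocessor = Callable[[Pages], Pages]


def page_range(start: int, end: int) -> Preprocessor:
    """Hypothetical preprocessor: keep pages with index in [start, end)."""
    def apply(pages: Pages) -> Pages:
        return pages[start:end]
    return apply


def drop_first_line() -> Preprocessor:
    """Hypothetical stand-in for header removal: drop each page's first line."""
    def apply(pages: Pages) -> Pages:
        return ["\n".join(page.split("\n")[1:]) for page in pages]
    return apply


def run_preprocessors(pages: Pages, preprocessors: List[Preprocessor]) -> Pages:
    """Apply each preprocessor in the order it appears in the list."""
    for preprocess in preprocessors:
        pages = preprocess(pages)
    return pages


pages = ["ACME Corp\nInvoice 1", "ACME Corp\nInvoice 2", "ACME Corp\nTerms"]
# The page range runs first, so the header step only sees the first two pages.
result = run_preprocessors(pages, [page_range(0, 2), drop_first_line()])
print(result)
```

Each step receives the output of the previous one, so a step that discards pages early reduces the work done by every step after it.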

| Preprocessor | Notes |
| --- | --- |
| Deskew | Corrects the alignment of PDF documents that are skewed, for example as a result of being photographed at an angle instead of straight on. |
| Ligature | Intelligently replaces Unicode ligatures in a PDF text extraction. |
| Merge Lines | Corrects oversplit lines. |
| OCR | Selectively OCRs pages in PDFs containing a mix of digitally generated text and text images (such as scanned text). If the whole PDF is a scan, you don't need to configure this preprocessor. |
| Page Filter | Filters out low-scoring pages given a bag of target terms and stop terms. |
| Page Range | Ignores pages outside the start page and end page. |
| Remove Header | Removes repeating elements at the top of the page. Ignores header elements that overlap with the page's main body. |
| Remove Footer | Removes repeating elements at the bottom of the page. Ignores footer elements that overlap with the page's main body. |
| Scale | Corrects the size of text in PDF documents whose size varies, for example as a result of being scanned or photographed at different scales. |
| Split Lines | Corrects undersplit lines. |
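
To make the Page Filter idea concrete, here is a minimal sketch of scoring pages against a bag of target terms and stop terms. The scoring rule (plus one per target-term occurrence, minus one per stop-term occurrence), the `min_score` threshold, and the function names are all assumptions for illustration. The actual preprocessor may score pages differently.

```python
import re
from typing import List


def score_page(text: str, target_terms: List[str], stop_terms: List[str]) -> int:
    """Assumed scoring: +1 per target-term occurrence, -1 per stop-term occurrence."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    targets = {term.lower() for term in target_terms}
    stops = {term.lower() for term in stop_terms}
    return sum(w in targets for w in words) - sum(w in stops for w in words)


def filter_pages(
    pages: List[str],
    target_terms: List[str],
    stop_terms: List[str],
    min_score: int = 1,  # hypothetical threshold, not a documented parameter
) -> List[str]:
    """Keep only pages whose score reaches the threshold."""
    return [p for p in pages if score_page(p, target_terms, stop_terms) >= min_score]


pages = [
    "Invoice total due: 120.00",
    "Terms and conditions apply",
    "Invoice appendix: terms summary",
]
kept = filter_pages(pages, ["invoice", "total"], ["terms", "appendix"])
print(kept)
```

Under these assumptions, the third page is dropped even though it mentions "invoice", because its two stop terms outweigh its single target term.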
