Remove header

Ignores repeating elements at the tops of pages. These elements are removed from the direct-text extraction of the document.

To recognize a header, this preprocessor starts at the top of the page and moves down the page, stopping as soon as it finds a nonrepeating element.

Sensible recognizes these elements as "repeating":

  • Elements whose y-extent doesn't overlap with any variable element
  • Positively incrementing page numbers

These elements aren't recognized as "repeating":

  • Elements that change their alignment on alternate pages (for example, page numbers aligned alternately left and right, as in a book)
  • A repeating element that's missing from even one page (for example, from an intentionally blank page)
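
The top-down scan described above can be approximated in a few lines of Python. This is only an illustrative sketch, not Sensible's implementation. It assumes each page is a plain list of text lines in top-to-bottom order, ignores geometry (y-extents and alignment), and treats a line as a page number when its only number strictly increases from page to page. The names `remove_header` and `_is_repeating` are hypothetical.

```python
import re


def _is_repeating(row):
    """row holds the line found at the same position on each checked page."""
    if all(line == row[0] for line in row):
        return True
    # Assumption: a page-number line has identical non-digit text on every
    # page and exactly one number, which strictly increases page to page.
    templates = {re.sub(r"\d+", "#", line) for line in row}
    numbers = [re.findall(r"\d+", line) for line in row]
    if len(templates) != 1 or any(len(n) != 1 for n in numbers):
        return False
    values = [int(n[0]) for n in numbers]
    return all(b > a for a, b in zip(values, values[1:]))


def remove_header(pages, starts_on_page=1):
    """Drop leading lines that repeat on every page from starts_on_page on."""
    checked = pages[starts_on_page - 1:]  # page number, not zero-based index
    if len(checked) < 2:
        return [list(p) for p in pages]
    header_len = 0
    # zip stops at the shortest page, so a blank page yields no header at all.
    for row in zip(*checked):
        if not _is_repeating(row):
            break  # stop at the first nonrepeating line
        header_len += 1
    untouched = [list(p) for p in pages[:starts_on_page - 1]]
    return untouched + [list(p[header_len:]) for p in checked]


pages = [
    ["Acme Corp", "Page 1", "Body A", "more text"],
    ["Acme Corp", "Page 2", "Body B"],
]
stripped = remove_header(pages)
with_blank_page = remove_header(pages + [[]])
```

In this model, `stripped` keeps only the body lines of each page, while `with_blank_page` is unchanged because the header is missing from the blank page, which mirrors the last bullet above.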

Parameters

| key | value | description |
| --- | --- | --- |
| type (required) | removeHeader | For an example, see the Examples section. |
| startsOnPage | integer. default: 1 | The first page number on which to start checking for repeated elements. Note this is the page number, not the page's zero-based index in the pages array. To filter out end pages that lack a repeating element, use the Page Range preprocessor to define an End Page parameter. |

Examples

The following example shows:

  • A repeating header with an incrementing page number. Sensible removes this from the direct text extraction.
  • A repeating sidebar that overlaps the y-extent of both repeating and variable elements:
    • Where it overlaps a repeating element, Sensible treats it as repeating and removes it from the direct text extraction.
    • Where it overlaps variable text, Sensible treats it as nonrepeating and includes it in the direct text extraction.
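
The sidebar behavior can be pictured as an interval-overlap test on y-extents. The sketch below is a simplified single-page model under stated assumptions, not Sensible's actual logic: the `Element` class, the `variable` flag (which stands in for the cross-page comparison), and the coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Element:
    text: str
    top: float      # distance from the top of the page
    bottom: float
    variable: bool  # assumed known: True if the text differs between pages


def y_overlaps(a, b):
    """True if the vertical extents of a and b intersect."""
    return a.top < b.bottom and b.top < a.bottom


def header_elements(elements):
    """Walk down the page and collect elements until one is nonrepeating."""
    variable = [e for e in elements if e.variable]
    header = []
    for e in sorted(elements, key=lambda e: e.top):
        # An element counts as nonrepeating if it is variable itself or if
        # its y-extent overlaps any variable element.
        if e.variable or any(y_overlaps(e, v) for v in variable):
            break
        header.append(e.text)
    return header


page = [
    Element("Acme Report, page 1", 0, 10, variable=False),  # header line
    Element("Sidebar line 1", 0, 10, variable=False),       # beside the header
    Element("Sidebar line 2", 25, 45, variable=False),      # beside the body
    Element("Body text for this page", 30, 60, variable=True),
]
removed = header_elements(page)
```

Here `removed` contains the header line and the first sidebar line. The second sidebar line stays in the text because its y-extent overlaps the variable body.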

Config

{
  "preprocessors": [
    {
      "type": "pageRange",
      "endPage": 2
    },
    {
      "type": "removeHeader",
      "startsOnPage": 1
    }
  ],
  "fields": [
    {
      "id": "all_lines_minus_repeating_top_elements",
      "method": {
        "id": "documentRange",
        "includeAnchor": true
      },
      "anchor": {
        "match": {
          "type": "first"
        }
      }
    }
  ]
}

Example document

The following images show the example PDF used with this example config:


Example PDF: Download link

Output

{
  "all_lines_minus_repeating_top_elements": {
    "type": "string",
    "value": "This is the body for the 1st page Header on the It differs from page to page. . do eiusmod tempor incididunt y-axis consectetur adipiscing elit do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco ullamco laboris nisi ut aliquip ex ea commodo consequat uis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. This is the page 2 body. Header on the It varies from page to page. Lorem ipsum dolor sit amet. y-axis consectetur adipiscing elit. Excepteur sint occaecat cupidatat non proident. sunt in culpa qui officia deserunt mollit anim id est laborum. Sed ut perspiciatis unde omnis iste natus error sit voluptatem Accusantium. doloremque laudantium, totam rem aperiam, eaque ipsa quae ab llo inventore veritatis et quasi quasi architecto beatae vitae. dicta sunt explicabo."
  }
}
