Classifying documents by type
Sensible "classifies" documents in the following senses:
- You can use the Sensible API to classify a document by its similarity to high-level document types you define in your Sensible account, without making an extraction request. For example, classify a document as a 1040s document type or a pay_stubs document type. For more information, see the following sections.
- When you extract data from a single document, you classify the document by specifying one of the high-level document types you defined in your account, for example 1040s or pay_stubs. Sensible then automatically classifies the document by its subtype, or "config", in the document type, for example, the 1040s_2018 version or the 1040_20 version of a 1040s document. For more information, see DevOps platform.
- When you extract data from multiple documents in a single request, you specify a list of possible document types, and Sensible classifies both high-level document types and subtypes automatically. For example:
  - Sensible classifies, or "segments", each document in a multi-document file, or "portfolio". For example, for a loan_application_bundle.pdf document containing a pay_stubs document, a 1040 document, and a bank_statements document, you can segment each document by its page range in the file, and return its extracted data separately. You can configure LLM- or fingerprint-based segmentation. For more information, see Multi-document extractions.
  - Classify each attached document in an email by document type, then return each document's extracted data separately. For more information, see Getting started with email extraction.
flowchart TD
A([Classification]) --> D["Classify-only API"]
A --> B["Single-doc extraction"]
A --> C["Multi-doc request"]
D --> D1["No extraction. Sensible scores doc similarity against your defined types"]
D1 --> D2["Returns document type (e.g. 1040s or pay_stubs)"]
B --> B1["You specify document type (e.g. 1040s)"]
B1 --> B2["Sensible auto-classifies subtype / config (e.g. 1040s_2018)"]
B2 --> B3[Extract data]
C --> C1["You specify list of possible document types"]
C1 --> C2["Portfolio (multi-doc file)"]
C1 --> C3["Email attachments"]
C2 --> C2a["Sensible segments each doc by page range, classifies type + subtype, returns each doc's data"]
C3 --> C3a["Sensible classifies each attachment by type + subtype, returns each doc's data"]
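The portfolio branch of the flowchart, in which each document is segmented by its page range, can be illustrated with a small sketch. This is not Sensible's implementation or response format: the `Segment` class, the `split_portfolio` helper, and the page ranges are all hypothetical, chosen only to show what "segment by page range, then handle each document separately" means.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical description of one document inside a portfolio."""
    document_type: str  # high-level type, for example "pay_stubs"
    start_page: int     # zero-based, inclusive
    end_page: int       # zero-based, inclusive


def split_portfolio(pages, segments):
    """Return (document_type, pages) pairs, one per segment, in file order."""
    return [
        (seg.document_type, pages[seg.start_page : seg.end_page + 1])
        for seg in segments
    ]


# Assumed page ranges for a six-page loan_application_bundle.pdf
pages = [f"page {i}" for i in range(6)]
segments = [
    Segment("pay_stubs", 0, 1),
    Segment("1040s", 2, 4),
    Segment("bank_statements", 5, 5),
]
documents = split_portfolio(pages, segments)
```

Returning a list of pairs rather than a dictionary keeps two documents of the same type, such as two pay stubs, from overwriting each other.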
The following sections cover using the Sensible API's Classify endpoints to return a document's high-level type.
Classify endpoints
When you call Sensible's Classify API endpoints, Sensible classifies a document by comparing it to the types you define in your account. For example, you can classify 1040 forms and bank statements if you define the following types in your account:
- a bank statements type
- a 1040s type
Sensible uses a document type's name and its description for LLM-based classification:
- Sensible can classify documents into your document types even if the document type is empty (lacks a config or reference document). For example, if you lack a citibank config or reference document in your bank_statements type, Sensible can still classify a 2023-1-1_citbank_statement_jon_doe.pdf document as a bank statement.
- If Sensible doesn't find an existing document type to which to match your document in your account, it returns an error.
To optionally improve classification results, describe each document type in your account in its Settings tab. For examples of descriptions, see Document type descriptions. By default, Sensible classifies a document using all the types you define in your account. You can optionally define a subset of document types for classifying a document.
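As a minimal sketch of restricting classification to a subset of types, the following builds a classification request without sending it. The endpoint URL, the `document_types` query parameter name, and the bearer-token header are assumptions made for illustration, and `build_classify_request` is a hypothetical helper; confirm the actual values in the Sensible API reference before use.

```python
import urllib.parse
import urllib.request

# Assumption: placeholder endpoint; check the API reference for the real path.
CLASSIFY_URL = "https://api.sensible.so/v0/classify/sync"


def build_classify_request(pdf_bytes, api_key, document_types=None):
    """Build (but don't send) a POST request that uploads a PDF for classification."""
    url = CLASSIFY_URL
    if document_types:
        # Assumption: the subset of types is passed as a comma-separated query parameter.
        query = urllib.parse.urlencode({"document_types": ",".join(document_types)})
        url = f"{url}?{query}"
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/pdf",
        },
    )


request = build_classify_request(
    b"%PDF-1.7 placeholder bytes", "YOUR_API_KEY", ["1040s", "bank_statements"]
)
```

Omitting `document_types` leaves the URL without a query string, which corresponds to the default of classifying against all the types in your account. To send the request, you could pass it to `urllib.request.urlopen`.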
Use cases for classification endpoints
Use cases for the Classify endpoints include the following examples:
- Prior to an extraction workflow. Determine which documents to extract before calling a Sensible extraction endpoint.
- Independent from an extraction workflow. Determine where to route each document or to label each document in a system of record.
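The routing use case above might look like the following sketch. The response shape (`document_type` with `name` and `score` keys), the queue names, and the `min_score` threshold are all hypothetical and stand in for whatever your Classify response and system of record actually use.

```python
# Hypothetical mapping from document type name to a downstream queue.
ROUTES = {
    "1040s": "tax_queue",
    "bank_statements": "banking_queue",
    "pay_stubs": "payroll_queue",
}


def route_document(classification, min_score=0.5):
    """Pick a queue from a classification result, falling back to manual review."""
    # Assumed shape: {"document_type": {"name": "...", "score": 0.0-1.0}}
    doc_type = classification.get("document_type") or {}
    name = doc_type.get("name")
    score = doc_type.get("score", 0.0)
    if name in ROUTES and score >= min_score:
        return ROUTES[name]
    return "manual_review"


queue = route_document({"document_type": {"name": "1040s", "score": 0.9}})
```

The same decision can gate an extraction workflow: only documents routed to a known queue would be passed on to an extraction endpoint.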