Features
A complete breakdown of what Docuoria can do — from automatically recognising a document type to delivering structured output.
Match Rules
Seven built-in rule types for classifying PDFs and selecting the right template automatically.
| Rule Type | What It Matches On / Typical Use |
|---|---|
| FileNameMatchRule | Identifies a PDF by its file name. Ideal for files that follow a consistent naming convention, like vendor invoices. |
| MetadataMatchRule | Identifies a PDF by its built-in properties — author, title, subject, or creator. |
| TextPatternMatchRule | Identifies a PDF by the presence of a specific word, phrase, or text pattern anywhere in the document. |
| TextAnchorMatchRule | Identifies a PDF by a label or heading that always appears at a fixed position on the page. |
| PageGeometryMatchRule | Identifies a PDF by page count, size, or orientation. |
| TableMatchRule | Identifies a PDF by a table with specific column headings or row patterns. |
| CompositeMatchRule | Combines multiple rules with AND/OR logic for documents that need more than one identifier to classify reliably. |
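To make the AND/OR combination concrete, here is a minimal Python sketch of how a file-name check and a text-pattern check might be combined into one classifier. Everything in it is hypothetical: the function names, the `Rule` type, and the document dictionary are illustrations only and are not Docuoria's API.

```python
import re
from fnmatch import fnmatch
from typing import Callable

# A rule is any function that takes a document summary and returns True or False.
# The dict layout ("file_name", "text") is an assumption made for this sketch.
Rule = Callable[[dict], bool]

def file_name_rule(pattern: str) -> Rule:
    """Match on a glob-style file name pattern, ignoring case."""
    return lambda doc: fnmatch(doc["file_name"].lower(), pattern.lower())

def text_pattern_rule(regex: str) -> Rule:
    """Match when the regular expression occurs anywhere in the text."""
    compiled = re.compile(regex, re.IGNORECASE)
    return lambda doc: compiled.search(doc["text"]) is not None

def all_of(*rules: Rule) -> Rule:
    """AND: every child rule must match."""
    return lambda doc: all(rule(doc) for rule in rules)

def any_of(*rules: Rule) -> Rule:
    """OR: at least one child rule must match."""
    return lambda doc: any(rule(doc) for rule in rules)

# File name must look like an Acme file AND the text must mention
# either "invoice" or "bill to".
is_acme_invoice = all_of(
    file_name_rule("acme_*.pdf"),
    any_of(text_pattern_rule(r"\binvoice\b"), text_pattern_rule(r"\bbill to\b")),
)

invoice = {"file_name": "ACME_2024_03.pdf", "text": "Invoice No. 1042\nTotal due: 99.00"}
statement = {"file_name": "ACME_2024_03.pdf", "text": "Account statement for March"}

print(is_acme_invoice(invoice))
print(is_acme_invoice(statement))
```

Both sample documents share a file name, so only the combined rule tells them apart. That is the situation a composite rule is meant for.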
Extraction Sources
Six source types for pulling data from different parts of a PDF, with fallback chaining.
| Source Type | Capability |
|---|---|
| TextPatternExtractionSource | Extracts text that matches a word, phrase, or pattern from the document body, returning only the matched portion. |
| TextAnchorExtractionSource | Extracts a value located near a known label or heading on the page — great for forms with labeled fields. |
| MetadataFieldExtractionSource | Reads a named property directly from the PDF's built-in document information. |
| TableCellExtractionSource | Extracts a value from a specific cell in a table, identified by its column heading and row. |
| TableRowsExtractionSource | Reads every row in a table and extracts repeating data — line items, transactions, entries — as an ordered list. |
| FallbackExtractionSource | Tries multiple extraction approaches in sequence and returns the first one that finds a value — handles layouts that vary between documents. |
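The fallback idea can be sketched in a few lines of Python: try each extraction approach in order and return the first value found. The helper names (`pattern_source`, `first_match`) and the sample layouts are hypothetical and do not reflect how Docuoria implements its sources.

```python
import re
from typing import Callable, Optional

# A source takes the document text and returns a value, or None if nothing was found.
Source = Callable[[str], Optional[str]]

def pattern_source(regex: str) -> Source:
    """Return the first capture group of the pattern, if it matches."""
    compiled = re.compile(regex, re.IGNORECASE)

    def extract(text: str) -> Optional[str]:
        match = compiled.search(text)
        return match.group(1).strip() if match else None

    return extract

def first_match(*sources: Source) -> Source:
    """Try each source in sequence and return the first non-empty value."""
    def extract(text: str) -> Optional[str]:
        for source in sources:
            value = source(text)
            if value:
                return value
        return None

    return extract

# Two layouts label the same field differently.
invoice_number = first_match(
    pattern_source(r"Invoice\s+No\.?\s*:?\s*(\S+)"),
    pattern_source(r"Reference\s*:\s*(\S+)"),
)

layout_a = "Invoice No: INV-1042\nDate: 2024-03-01"
layout_b = "Reference: R-77\nDate: 2024-03-02"

print(invoice_number(layout_a))
print(invoice_number(layout_b))
```

Order matters in a chain like this: put the most specific or most reliable approach first, and keep the looser ones as later fallbacks.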
Pipeline Steps
Five composable step types that form the execution pipeline inside a template.
ExtractionStep
Pulls field values out of the PDF. This is the core step that reads data from the document.
TransformationStep
Cleans and shapes extracted values — trims whitespace, converts types, applies formatting, and renames fields.
RetrievalStep
Fetches additional information from outside the document to enrich the extracted data.
PythonStep
Runs a Python script to process or calculate values. Requires Python 3.12 on the machine running the extraction.
PublishStep
Sends intermediate results to another system during the pipeline — useful for notifications or live previews.
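As a rough mental model, the five step types can be pictured as functions that each take the current results and hand them on to the next step. The Python sketch below is an assumption-laden illustration: the step functions, the `VENDORS` lookup table, and the context dictionary are all hypothetical stand-ins, and the real steps are configured inside a template rather than written this way.

```python
from typing import Callable

Context = dict
Step = Callable[[Context], Context]

def extraction_step(ctx: Context) -> Context:
    # Stand-in for reading values from the PDF; the raw strings are hard-coded here.
    fields = {"vendor_id": " V-17 ", "net": "100.00", "tax": "20.00"}
    return {**ctx, "fields": fields}

def transformation_step(ctx: Context) -> Context:
    # Trim whitespace and convert the amounts to numbers.
    raw = ctx["fields"]
    cleaned = {
        "vendor_id": raw["vendor_id"].strip(),
        "net": float(raw["net"]),
        "tax": float(raw["tax"]),
    }
    return {**ctx, "fields": cleaned}

# Stand-in for an external system that a retrieval step might query.
VENDORS = {"V-17": "Acme Ltd"}

def retrieval_step(ctx: Context) -> Context:
    fields = dict(ctx["fields"])
    fields["vendor_name"] = VENDORS.get(fields["vendor_id"])
    return {**ctx, "fields": fields}

published: list[dict] = []

def publish_step(ctx: Context) -> Context:
    # Send a snapshot of the intermediate results elsewhere; here, an in-memory list.
    published.append(dict(ctx["fields"]))
    return ctx

def python_step(ctx: Context) -> Context:
    # Custom calculation of the kind a script step might perform.
    fields = dict(ctx["fields"])
    fields["total"] = round(fields["net"] + fields["tax"], 2)
    return {**ctx, "fields": fields}

def run_pipeline(steps: list[Step]) -> Context:
    ctx: Context = {}
    for step in steps:
        ctx = step(ctx)
    return ctx

result = run_pipeline(
    [extraction_step, transformation_step, retrieval_step, publish_step, python_step]
)
print(result["fields"])
```

Because the publish step runs before the calculation, the published snapshot contains the vendor name but not the total, which is what "intermediate results" means in practice.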
Field Transforms
Six transform types for shaping extracted values before they reach the output.
TrimTransform
Removes whitespace from the beginning and end of a value.
CastTransform
Converts a value from one type to another — for example, turning "42.00" into the number 42.
FormatTransform
Applies a format to a value — such as a date display format or a number with two decimal places.
RenameTransform
Changes what a field is called in the output without touching the value.
ComputeTransform
Calculates a new value from existing fields — totals, concatenations, conditionals.
CollectionElementTransform
Applies a transform to every item in a list — for example, cleaning up every line item description in one step.
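The six transforms map onto small, familiar operations. The Python sketch below shows one plausible equivalent for each. The function names and the sample record are hypothetical, and the real transforms are declared in a template rather than called as functions.

```python
from datetime import date
from typing import Any, Callable

def trim(value: str) -> str:
    """Trim: drop leading and trailing whitespace."""
    return value.strip()

def cast_number(value: str) -> float:
    """Cast: turn a numeric string into a number."""
    return float(value)

def format_amount(value: float) -> str:
    """Format: render a number with two decimal places."""
    return f"{value:.2f}"

def format_date(value: date) -> str:
    """Format: render a date as YYYY-MM-DD."""
    return f"{value:%Y-%m-%d}"

def rename(record: dict, old: str, new: str) -> dict:
    """Rename: change a field's name, keeping its value and position."""
    return {(new if key == old else key): value for key, value in record.items()}

def compute_total(record: dict) -> dict:
    """Compute: derive a new field from existing ones."""
    return {**record, "total": record["net"] + record["tax"]}

def for_each(transform: Callable[[Any], Any], items: list) -> list:
    """Collection element: apply one transform to every item in a list."""
    return [transform(item) for item in items]

raw = {"net": " 42.00 ", "tax": "8.40", "descriptions": [" Widget ", "Gadget  "]}

shaped = {
    "net": cast_number(trim(raw["net"])),
    "tax": cast_number(trim(raw["tax"])),
    "descriptions": for_each(trim, raw["descriptions"]),
}
shaped = compute_total(shaped)
shaped = rename(shaped, "descriptions", "line_descriptions")

print(shaped["line_descriptions"])
print(format_amount(shaped["total"]))
print(format_date(date(2024, 3, 1)))
```

Note the ordering: trimming comes before casting, and the computed total is formatted only at the very end, so calculations run on numbers rather than on display strings.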
Output Formats
Structured output in your preferred format — with more on the roadmap.
CSV
CsvOutputGenerator
Opens in: Excel, Google Sheets, Numbers
One row per list item. Nested data is expanded across columns — easy to sort, filter, and pivot.
JSON
JsonOutputGenerator
Opens in: Any app, API, or database
Lists stay as lists, records stay as records. Full structure preserved — ideal for feeding other systems.
XML
XmlOutputGenerator (planned)
Opens in: Legacy systems & enterprise integrations
Planned — each item will get its own element.
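The difference between the CSV and JSON shapes is easiest to see on one record. The following Python sketch uses only the standard library to produce both from the same data. The record, the field names, and the choice to repeat parent fields on every CSV row are assumptions for illustration; the exact column layout Docuoria writes is not specified here.

```python
import csv
import io
import json

record = {
    "invoice_number": "INV-1042",
    "line_items": [
        {"description": "Widget", "amount": 40.0},
        {"description": "Gadget", "amount": 2.0},
    ],
}

def to_json(rec: dict) -> str:
    """Keep the nesting as-is: lists stay lists, records stay records."""
    return json.dumps(rec, indent=2)

def to_csv(rec: dict, list_field: str) -> str:
    """Write one row per list item, repeating the parent fields on each row."""
    parent = {key: value for key, value in rec.items() if key != list_field}
    rows = [{**parent, **item} for item in rec[list_field]]
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

json_text = to_json(record)
csv_text = to_csv(record, "line_items")

print(json_text)
print(csv_text)
```

The JSON output round-trips back to the original structure, while the CSV output trades that structure for a flat grid that spreadsheet tools can sort and filter directly.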
Data Model
A structured field system with nesting, lists, and automatic validation.
Primitive Field Types
Nested Records
Group related fields together. Records nest to any depth — invoice header, line items, and summary all in one template.
Ordered Lists
Any field or group can be a list. Lists preserve order: JSON keeps them as lists, and CSV writes one row per item.
Validation
Mark fields as required. If a required field can't be found, the extraction stops cleanly before producing any output.
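The "stop before producing any output" behaviour can be sketched as a check that runs ahead of serialization. This Python example is a hypothetical illustration: the exception name, the function names, and the rule that `None` or an empty string counts as missing are all assumptions, not Docuoria's actual validation logic.

```python
import json

class MissingRequiredField(Exception):
    """Raised when one or more required fields have no value."""

def validate_required(fields: dict, required: list[str]) -> None:
    # Assumption: a field is "missing" if it is absent, None, or an empty string.
    missing = [name for name in required if fields.get(name) in (None, "")]
    if missing:
        raise MissingRequiredField(f"missing required fields: {', '.join(missing)}")

def produce_output(fields: dict, required: list[str]) -> str:
    # Validation runs first, so a failure means nothing is written at all.
    validate_required(fields, required)
    return json.dumps(fields)

complete = {"invoice_number": "INV-1042", "total": 42.0}
incomplete = {"invoice_number": "", "total": 42.0}

print(produce_output(complete, ["invoice_number", "total"]))

try:
    produce_output(incomplete, ["invoice_number", "total"])
except MissingRequiredField as error:
    print(f"extraction stopped: {error}")
```

Failing before serialization avoids half-written files: a downstream system either receives a complete record or receives nothing.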
What Your AI Agent Can Do
Your AI agent handles setup, testing, and troubleshooting. Just tell it what you need.
Preview Any PDF
Ask your agent to look inside a PDF before building a template — see every page, table, and piece of text the engine can see.
Try a Pattern
Test whether a word, phrase, or text pattern appears in a real document. Confirm before committing.
Check Template Matching
See which templates match a given document and how confident the engine is in each — useful for tuning classification.
Preview Extraction Results
Run a full extraction without saving output. Verify that every field value looks right before you commit to a real run.
Step-by-Step Diagnostics
Every run returns a detailed log — which step ran, what it found, and where anything went wrong.
Setup & Management Commands
One-command install, template import/export, and agent registration — everything you need to get running fast.
Ready to automate your first document?
Install the AI plugin and start extracting data in minutes — no coding required.