Docuoria

Features

A complete breakdown of what Docuoria can do — from automatically recognising a document type to delivering structured output.

Match Rules

Seven built-in rule types for classifying PDFs and selecting the right template automatically.

Rule Type: What It Scores / Typical Use
FileNameMatchRule: Identifies a PDF by its file name. Ideal for files that follow a consistent naming convention, like vendor invoices.
MetadataMatchRule: Identifies a PDF by its built-in properties: author, title, subject, or creator.
TextPatternMatchRule: Identifies a PDF by the presence of a specific word, phrase, or text pattern anywhere in the document.
TextAnchorMatchRule: Identifies a PDF by a label or heading that always appears at a fixed position on the page.
PageGeometryMatchRule: Identifies a PDF by page count, size, or orientation.
TableMatchRule: Identifies a PDF by a table with specific column headings or row patterns.
CompositeMatchRule: Combines multiple rules with AND/OR logic for documents that need more than one identifier to classify reliably.
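
The AND/OR combination can be pictured with a minimal sketch. Everything here is hypothetical and written only for illustration: the Doc class, the helper names, and the use of plain true/false predicates are assumptions, not Docuoria's actual API (the engine reports a confidence per template rather than a simple yes/no).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


# Hypothetical stand-in for a parsed PDF; not Docuoria's real document model.
@dataclass
class Doc:
    file_name: str
    text: str
    page_count: int


# In this sketch a rule is simply a predicate over a Doc.
Rule = Callable[[Doc], bool]


def file_name_contains(fragment: str) -> Rule:
    """Match when the file name contains the fragment (case-insensitive)."""
    return lambda doc: fragment.lower() in doc.file_name.lower()


def text_contains(phrase: str) -> Rule:
    """Match when the document text contains the phrase (case-insensitive)."""
    return lambda doc: phrase.lower() in doc.text.lower()


def all_of(rules: Sequence[Rule]) -> Rule:
    """AND: every child rule must match."""
    return lambda doc: all(rule(doc) for rule in rules)


def any_of(rules: Sequence[Rule]) -> Rule:
    """OR: at least one child rule must match."""
    return lambda doc: any(rule(doc) for rule in rules)


# File name must mention "invoice" AND the text must carry either label.
invoice_rule = all_of([
    file_name_contains("invoice"),
    any_of([text_contains("Invoice No"), text_contains("Invoice Number")]),
])

doc = Doc(file_name="ACME_Invoice_0042.pdf", text="Invoice Number: 0042", page_count=1)
print(invoice_rule(doc))  # True
```

Nesting any_of inside all_of mirrors the case described above, where no single identifier is reliable on its own.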

Extraction Sources

Six source types for pulling data from different parts of a PDF, with fallback chaining.

Source Type: Capability
TextPatternExtractionSource: Extracts text that matches a word, phrase, or pattern from the document body. Pulls exactly what you need, nothing more.
TextAnchorExtractionSource: Extracts a value located near a known label or heading on the page. Great for forms with labeled fields.
MetadataFieldExtractionSource: Reads a named property directly from the PDF's built-in document information.
TableCellExtractionSource: Extracts a value from a specific cell in a table, identified by its column heading and row.
TableRowsExtractionSource: Reads every row in a table and extracts repeating data (line items, transactions, entries) as an ordered list.
FallbackExtractionSource: Tries multiple extraction approaches in sequence and returns the first one that finds a value. Handles layouts that vary between documents.
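
Fallback chaining is easiest to see in a small sketch. The one below assumes each source is a function that returns a value or None. The names pattern_source and fallback and the regular expressions are invented for illustration and are not taken from Docuoria.

```python
import re
from typing import Callable, Optional, Sequence

# In this sketch a source maps document text to a value, or None if not found.
Source = Callable[[str], Optional[str]]


def pattern_source(pattern: str) -> Source:
    """Return the first capture group of `pattern`, or None when it is absent."""
    compiled = re.compile(pattern)

    def extract(text: str) -> Optional[str]:
        match = compiled.search(text)
        return match.group(1) if match else None

    return extract


def fallback(sources: Sequence[Source]) -> Source:
    """Try each source in order and return the first value that is found."""

    def extract(text: str) -> Optional[str]:
        for source in sources:
            value = source(text)
            if value is not None:
                return value
        return None

    return extract


# Three hypothetical layouts for the same field, tried in order.
invoice_number = fallback([
    pattern_source(r"Invoice No\.?\s*:\s*(\S+)"),
    pattern_source(r"Invoice Number\s*:\s*(\S+)"),
    pattern_source(r"Ref\s*#\s*(\S+)"),
])

print(invoice_number("Invoice Number: INV-0042"))  # INV-0042
print(invoice_number("Ref # 7781"))                # 7781
print(invoice_number("no identifiers here"))       # None
```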

Pipeline Steps

Five composable step types that form the execution pipeline inside a template.

ExtractionStep

Pulls field values out of the PDF. This is the core step that reads data from the document.

TransformationStep

Cleans and shapes extracted values — trims whitespace, converts types, applies formatting, and renames fields.

RetrievalStep

Fetches additional information from outside the document to enrich the extracted data.

PythonStep

Runs a Python script to process or calculate values. Requires Python 3.12 on the machine running the extraction.

PublishStep

Sends intermediate results to another system during the pipeline — useful for notifications or live previews.
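
As a rough mental model, composable steps can be thought of as functions that each take the current results and return an updated copy, run in order. The sketch below is an assumption made for illustration only: the context dictionary, the step functions, and run_pipeline are hypothetical and do not reflect how Docuoria implements its steps.

```python
from typing import Callable, List, Tuple

# Hypothetical shared state passed from step to step.
Context = dict
Step = Callable[[Context], Context]


def extraction_step(ctx: Context) -> Context:
    """Pretend extraction: take the raw text that follows the 'Total:' label."""
    raw_total = ctx["text"].split("Total:")[1]
    return {**ctx, "total": raw_total}


def transformation_step(ctx: Context) -> Context:
    """Trim the raw value and convert it to a number."""
    return {**ctx, "total": float(ctx["total"].strip())}


def run_pipeline(steps: List[Step], ctx: Context) -> Tuple[Context, List[str]]:
    """Run the steps in order and record which ones ran."""
    log = []
    for step in steps:
        ctx = step(ctx)
        log.append(step.__name__)
    return ctx, log


result, log = run_pipeline(
    [extraction_step, transformation_step],
    {"text": "Total: 42.00"},
)
print(result["total"])  # 42.0
print(log)              # ['extraction_step', 'transformation_step']
```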

Field Transforms

Six transform types for shaping extracted values before they reach the output.

TrimTransform

Removes extra spaces from the beginning and end of a value.

CastTransform

Converts a value from one type to another — for example, turning "42.00" into the number 42.

FormatTransform

Applies a format to a value — such as a date display format or a number with two decimal places.

RenameTransform

Changes what a field is called in the output without touching the value.

ComputeTransform

Calculates a new value from existing fields — totals, concatenations, conditionals.

CollectionElementTransform

Applies a transform to every item in a list — for example, cleaning up every line item description in one step.
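
To make the transforms concrete, here is a minimal sketch of trimming, casting, formatting, renaming, and applying a transform to every list item. The function names and the sample record are hypothetical. They illustrate the ideas and are not Docuoria's configuration syntax.

```python
from decimal import Decimal
from typing import Any, Callable, List

Transform = Callable[[Any], Any]


def trim(value: str) -> str:
    """Remove whitespace from both ends of a value."""
    return value.strip()


def cast_number(value: str) -> Decimal:
    """Convert numeric text such as '42.00' into a number."""
    return Decimal(value)


def format_two_decimals(value: Decimal) -> str:
    """Render a number with exactly two decimal places."""
    return f"{value:.2f}"


def for_each(transform: Transform) -> Callable[[List[Any]], List[Any]]:
    """Lift a single-value transform so it applies to every item in a list."""
    return lambda items: [transform(item) for item in items]


# Hypothetical raw extraction result.
record = {"amt": "  42.00 ", "items": ["  Widget ", "Gadget  "]}

amount = cast_number(trim(record["amt"]))
cleaned = {
    "amount": amount,                               # renamed from "amt"
    "amount_display": format_two_decimals(amount),  # formatted copy
    "items": for_each(trim)(record["items"]),       # trim every list item
}

print(cleaned["amount"] == 42)    # True
print(cleaned["amount_display"])  # 42.00
print(cleaned["items"])           # ['Widget', 'Gadget']
```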

Output Formats

Structured output in your preferred format — with more on the roadmap.

CSV

CsvOutputGenerator

Opens in: Excel, Google Sheets, Numbers

One row per list item. Nested data is expanded across columns — easy to sort, filter, and pivot.
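
The row-per-item layout can be sketched with Python's standard csv module. The dotted column names (such as vendor.name) and the flatten and to_csv helpers are assumptions made for this example. The column naming that CsvOutputGenerator actually uses may differ.

```python
import csv
import io
from typing import Dict, List


def flatten(record: dict, prefix: str = "") -> dict:
    """Expand nested dicts into dotted column names, e.g. vendor.name."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_csv(header: dict, items: List[Dict]) -> str:
    """Write one row per list item, repeating the header fields on each row."""
    rows = [{**flatten(header), **flatten(item)} for item in items]
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


invoice = {"number": "INV-0042", "vendor": {"name": "ACME"}}
line_items = [
    {"description": "Widget", "qty": 2},
    {"description": "Gadget", "qty": 1},
]

csv_text = to_csv(invoice, line_items)
print(csv_text)
# number,vendor.name,description,qty
# INV-0042,ACME,Widget,2
# INV-0042,ACME,Gadget,1
```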

JSON

JsonOutputGenerator

Opens in: Any app, API, or database

Lists stay as lists, records stay as records. Full structure preserved — ideal for feeding other systems.

XML

Planned

XmlOutputGenerator

Opens in: Legacy systems & enterprise integrations

Planned — each item will get its own element.

Data Model

A structured field system with nesting, lists, and automatic validation.

Primitive Field Types

String, Number, Integer, Boolean, Date, Timestamp

Nested Records

Group related fields together. Records nest to any depth — invoice header, line items, and summary all in one template.

Ordered Lists

Any field or group can be a list. Lists preserve order and output naturally in JSON or CSV.

Validation

Mark fields as required. If a required field can't be found, the extraction stops cleanly before producing any output.
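
The stop-before-output behavior can be illustrated with a short sketch. The exception class, the function names, and the in-memory outputs list are hypothetical and stand in for whatever Docuoria does internally.

```python
from typing import List


class MissingRequiredFieldError(Exception):
    """Raised when a required field has no extracted value."""


def validate_required(record: dict, required: List[str]) -> dict:
    """Return the record unchanged, or raise if any required field is empty."""
    missing = [name for name in required if record.get(name) in (None, "")]
    if missing:
        raise MissingRequiredFieldError(
            f"missing required fields: {', '.join(missing)}"
        )
    return record


def extract_and_write(record: dict, required: List[str], outputs: list) -> None:
    """Validate first, so nothing is written when a required field is absent."""
    validate_required(record, required)
    outputs.append(record)


outputs = []
required_fields = ["invoice_number", "total"]

extract_and_write({"invoice_number": "INV-0042", "total": "42.00"}, required_fields, outputs)

try:
    extract_and_write({"invoice_number": "INV-0043"}, required_fields, outputs)
except MissingRequiredFieldError as error:
    print(error)  # missing required fields: total

print(len(outputs))  # 1
```

Validating before the write step is what keeps a failed run from leaving partial output behind.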

What Your AI Agent Can Do

Your AI agent handles setup, testing, and troubleshooting. Just tell it what you need.

Preview Any PDF

Ask your agent to look inside a PDF before building a template — see every page, table, and piece of text the engine can see.

Try a Pattern

Test whether a word, phrase, or text pattern appears in a real document. Confirm before committing.

Check Template Matching

See which templates match a given document and how confident the engine is in each — useful for tuning classification.

Preview Extraction Results

Run a full extraction without saving output. Verify that every field value looks right before you run it for real.

Step-by-Step Diagnostics

Every run returns a detailed log — which step ran, what it found, and where anything went wrong.

Setup & Management Commands

One-command install, template import/export, and agent registration — everything you need to get running fast.

Ready to automate your first document?

Install the AI plugin and start extracting data in minutes — no coding required.