How Docuoria Works

A transparent, step-by-step walkthrough of the extraction pipeline — from PDF in to structured data out.

How It Fits Together

Three simple concepts, two simple actions — that's all you need to understand to use Docuoria.

Noun

Match Rules

Scoring functions that decide which template applies to a given PDF. Combine file name patterns, metadata, text anchors, and more.
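
To make the idea of combining rules concrete, here is a minimal Python sketch. It is not Docuoria's API: `PdfInfo`, the rule helpers, the weights, and the weighted-average formula are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class PdfInfo:
    """Hypothetical stand-in for what a match rule can see about a PDF."""
    file_name: str
    metadata: dict[str, str]
    text: str


# A rule maps a PDF to a score between 0.0 and 1.0 (an assumed convention).
Rule = Callable[[PdfInfo], float]


def file_name_rule(pattern: str) -> Rule:
    """Score 1.0 when the file name matches the regex, else 0.0."""
    return lambda pdf: 1.0 if re.search(pattern, pdf.file_name, re.IGNORECASE) else 0.0


def metadata_rule(key: str, expected: str) -> Rule:
    """Score 1.0 when a metadata entry equals the expected value, else 0.0."""
    return lambda pdf: 1.0 if pdf.metadata.get(key) == expected else 0.0


def text_anchor_rule(anchor: str) -> Rule:
    """Score 1.0 when the anchor text appears in the PDF text, else 0.0."""
    return lambda pdf: 1.0 if anchor in pdf.text else 0.0


def combined_score(pdf: PdfInfo, weighted_rules: list[tuple[Rule, float]]) -> float:
    """Weighted average of the individual rule scores."""
    total_weight = sum(weight for _, weight in weighted_rules)
    if total_weight == 0:
        return 0.0
    return sum(rule(pdf) * weight for rule, weight in weighted_rules) / total_weight


# Illustrative rules and weights for an invoice template.
invoice_rules = [
    (file_name_rule(r"^inv[-_]\d+\.pdf$"), 1.0),
    (metadata_rule("Producer", "AcmeBilling"), 1.0),
    (text_anchor_rule("Invoice Number"), 2.0),
]

pdf = PdfInfo("INV-1042.pdf", {"Producer": "OtherTool"}, "Invoice Number: 1042")
score = combined_score(pdf, invoice_rules)
print(score)
```

In this sketch the file name and text anchor match but the metadata does not, so the template scores below the maximum without being ruled out.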

Noun

Templates

Reusable extraction blueprints. A template defines the fields to extract, how to find them, and how to transform the values.
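
As a rough illustration of those three parts (fields, how to find them, how to transform them), a template could be modeled as below. The `FieldDef` and `Template` classes, the regex-based lookup, and the sample text are hypothetical; the real template format may look quite different.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FieldDef:
    name: str                           # the field to extract
    pattern: str                        # how to find it: a regex with one capture group
    transform: Callable[[str], object]  # how to shape the raw text value


@dataclass
class Template:
    name: str
    fields: list[FieldDef] = field(default_factory=list)

    def extract(self, text: str) -> dict[str, object]:
        """Apply each field definition to the text; missing fields become None."""
        record: dict[str, object] = {}
        for f in self.fields:
            match = re.search(f.pattern, text)
            record[f.name] = f.transform(match.group(1)) if match else None
        return record


invoice = Template("invoice", [
    FieldDef("invoice_number", r"Invoice Number:\s*(\d+)", int),
    FieldDef("total", r"Total:\s*\$([\d,.]+)", lambda s: float(s.replace(",", ""))),
])

record = invoice.extract("Invoice Number: 1042\nTotal: $1,250.50\n")
print(record)
```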

Noun

Output Generators

Render extracted data into a target format — CSV or JSON today, with XML planned.
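
A generator of this kind can be thought of as a function from extracted records to text in the target format. The sketch below uses only the Python standard library; the `render` function and the format registry are assumptions for illustration, not part of Docuoria.

```python
import csv
import io
import json


def to_json(records: list[dict]) -> str:
    """Render records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records: list[dict]) -> str:
    """Render records as CSV, using the first record's keys as the header."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical registry: supporting another format means adding one entry.
GENERATORS = {"csv": to_csv, "json": to_json}


def render(records: list[dict], fmt: str) -> str:
    """Look up the generator for the requested format and apply it."""
    if fmt not in GENERATORS:
        raise ValueError(f"unsupported output format: {fmt}")
    return GENERATORS[fmt](records)


records = [{"invoice_number": 1042, "total": 1250.5}]
csv_text = render(records, "csv")
json_text = render(records, "json")
print(csv_text)
print(json_text)
```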

Action

Evaluate

Run match rules against a PDF to determine which template applies. Returns a confidence score and the winning template.
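
Conceptually, Evaluate scores every registered template and keeps the best one. The sketch below assumes each template exposes a single scoring function and that a score under a threshold means no template applies; the `Evaluation` type, the `evaluate` signature, and the threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Evaluation:
    template_name: Optional[str]  # None when no template scores high enough
    confidence: float


def evaluate(
    pdf_text: str,
    scorers: dict[str, Callable[[str], float]],
    threshold: float = 0.5,  # assumed cut-off, not a documented Docuoria setting
) -> Evaluation:
    """Score the text with every template's scorer and return the best match."""
    if not scorers:
        return Evaluation(None, 0.0)
    name, score = max(
        ((n, scorer(pdf_text)) for n, scorer in scorers.items()),
        key=lambda pair: pair[1],
    )
    if score < threshold:
        return Evaluation(None, score)
    return Evaluation(name, score)


scorers = {
    "invoice": lambda text: 0.9 if "Invoice" in text else 0.1,
    "receipt": lambda text: 0.8 if "Receipt" in text else 0.2,
}

matched = evaluate("Invoice Number: 1042", scorers)
unmatched = evaluate("Meeting notes", scorers)
print(matched)
print(unmatched)
```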

Action

Execute

Run the selected template's pipeline against the PDF. Returns either all the data you asked for or an explanation of why the extraction did not complete.

The Seven Pipeline Steps

Every execution flows through this sequence — from classification to stored output.

Classify

Match rules evaluate the incoming PDF against registered templates. The highest-scoring template is selected for extraction.

Inspect

The engine reads the PDF structure — pages, text blocks, tables, metadata — and builds an internal representation for the pipeline.

Test

Extraction sources and patterns are validated against the PDF to confirm they produce the expected field values before the pipeline proceeds.

Build

The template's field definitions are resolved: sources are bound, transforms are wired, and the execution plan is assembled.

Dry-Run

The pipeline executes without writing output. Diagnostics are collected so you can validate results before a real run.

Execute

The full pipeline runs: extraction steps pull data, transformation steps shape it, retrieval and Python steps augment it, and publish steps emit the result.

Store

The output generator writes structured data — CSV or JSON — to the configured output target. The result includes diagnostics and metadata.
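
The general shape of such a sequence, with ordered steps, a shared context, diagnostics, and a mode that skips writing output, can be sketched in a few lines. This is a simplified illustration with only two made-up steps; the `run_steps` function, the context keys, and the `dry_run` flag are assumptions, not Docuoria's implementation.

```python
from typing import Callable

Context = dict[str, object]
Step = Callable[[Context], None]


def run_steps(steps: list[tuple[str, Step]], ctx: Context) -> list[str]:
    """Run steps in order, logging each one; stop at the first failure."""
    log: list[str] = []
    for name, step in steps:
        try:
            step(ctx)
        except Exception as exc:
            log.append(f"{name}: failed ({exc})")
            break
        log.append(f"{name}: ok")
    return log


def extract(ctx: Context) -> None:
    # Hypothetical extraction: take the value after the first colon.
    ctx["data"] = {"invoice_number": ctx["text"].split(":")[1].strip()}


def store(ctx: Context) -> None:
    # In a dry run, skip the write so only diagnostics are produced.
    if ctx.get("dry_run"):
        return
    ctx["sink"].append(ctx["data"])


steps = [("extract", extract), ("store", store)]
sink: list[dict] = []

dry_log = run_steps(steps, {"text": "Invoice Number: 1042", "sink": sink, "dry_run": True})
sink_after_dry_run = list(sink)

real_log = run_steps(steps, {"text": "Invoice Number: 1042", "sink": sink, "dry_run": False})
print(dry_log, sink_after_dry_run, real_log, sink)
```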

Layer Architecture

Docuoria is built as three layers — Layer 1 is shipped today; Layers 2 and 3 are on the roadmap.

Layer 1: Engine

Shipped

The core extraction engine — template evaluation, pipeline execution, and output generation. Available as a NuGet package and AI plugin today.

Layer 2: Service

Future

A hosted service layer for teams who want to use Docuoria without running it locally. Send a PDF, get structured data back — from any language or platform.

Layer 3: Transports & Clients

Future

Email automation, ERP connectors, and a SaaS template store. Process documents directly from your inbox or business systems.

Pipeline Results

Every extraction returns one of three clear outcomes.

Success

SucceededResult

Extraction completed. All fields were found and your structured output is ready to use.

Failed

FailedResult

A step could not complete during extraction. The result includes a detailed log so you know exactly where and why it failed.

Rejected

RejectedResult

Docuoria declined to run the extraction. Common reasons: no template matched the document, a required field was missing from the template, or the input was invalid.
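
Calling code typically branches on which of the three outcomes it received. The sketch below reuses the result names from this page, but the fields on each type (`data`, `log`, `reason`) and the `describe` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class SucceededResult:
    data: dict[str, object]  # assumed field: the extracted values


@dataclass
class FailedResult:
    log: list[str]  # assumed field: what happened at each step


@dataclass
class RejectedResult:
    reason: str  # assumed field: why the extraction never ran


PipelineResult = Union[SucceededResult, FailedResult, RejectedResult]


def describe(result: PipelineResult) -> str:
    """Turn any of the three outcomes into a one-line summary."""
    if isinstance(result, SucceededResult):
        return f"succeeded with {len(result.data)} fields"
    if isinstance(result, FailedResult):
        last_entry = result.log[-1] if result.log else "no log entries"
        return f"failed: {last_entry}"
    return f"rejected: {result.reason}"


summaries = [
    describe(SucceededResult({"invoice_number": 1042, "total": 1250.5})),
    describe(FailedResult(["extract: ok", "transform: could not parse total"])),
    describe(RejectedResult("no template matched the document")),
]
for line in summaries:
    print(line)
```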

Ready to try it?

Install the AI plugin in one command and start extracting data from your PDFs today.
