How Docuoria Works
A transparent, step-by-step walkthrough of the extraction pipeline — from PDF in to structured data out.
How It Fits Together
Three simple concepts, two simple actions — that's all you need to understand to use Docuoria.
Noun
Match Rules
Scoring functions that decide which template applies to a given PDF. Combine file name patterns, metadata, text anchors, and more.
Noun
Templates
Reusable extraction blueprints. A template defines the fields to extract, how to find them, and how to transform the values.
Noun
Output Generators
Render extracted data into a target format — CSV or JSON today, with XML planned.
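As a rough illustration of how a template and an output generator relate, here is a minimal Python sketch. It is not Docuoria's actual API or template schema: the Field and Template classes, the regex-based lookup, and the to_json and to_csv helpers are all hypothetical names chosen for this example.

```python
import csv
import io
import json
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Field:
    """One value to extract (hypothetical shape, not Docuoria's schema)."""
    name: str
    pattern: str  # how to find it: a regex with one capture group
    transform: Callable[[str], object] = str.strip  # how to shape the value


@dataclass
class Template:
    """A reusable blueprint: a named list of fields."""
    name: str
    fields: list[Field]

    def extract(self, text: str) -> dict:
        row = {}
        for f in self.fields:
            match = re.search(f.pattern, text)
            row[f.name] = f.transform(match.group(1)) if match else None
        return row


def to_json(row: dict) -> str:
    """Output generator (assumed): render one extracted row as JSON."""
    return json.dumps(row, sort_keys=True)


def to_csv(row: dict) -> str:
    """Output generator (assumed): render one extracted row as CSV."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(row), lineterminator="\n")
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue()


invoice = Template("invoice", [
    Field("number", r"Invoice\s+#(\w+)"),
    Field("total", r"Total:\s*\$([\d.,]+)", lambda s: float(s.replace(",", ""))),
])
row = invoice.extract("Invoice #A17\nTotal: $1,250.50")
print(to_json(row))
print(to_csv(row))
```

The same extracted row feeds either generator, which is the point of keeping templates and output formats separate.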
Action
Evaluate
Run match rules against a PDF to determine which template applies. Returns a confidence score and the winning template.
Action
Execute
Run the selected template's pipeline against the PDF. Returns either all the data you asked for or an explanation of why the extraction did not complete.
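The Evaluate action can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the Pdf stand-in, the two rule helpers, averaging rule scores into a confidence value, and the 0.5 threshold are not taken from Docuoria itself.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pdf:
    """Stand-in for a parsed PDF (hypothetical; only the parts rules look at)."""
    file_name: str
    text: str


# A match rule scores a PDF between 0.0 and 1.0.
MatchRule = Callable[[Pdf], float]


def file_name_matches(pattern: str) -> MatchRule:
    return lambda pdf: 1.0 if re.search(pattern, pdf.file_name, re.IGNORECASE) else 0.0


def text_contains(anchor: str) -> MatchRule:
    return lambda pdf: 1.0 if anchor.lower() in pdf.text.lower() else 0.0


def evaluate(
    pdf: Pdf,
    rules_by_template: dict[str, list[MatchRule]],
    threshold: float = 0.5,  # assumed cut-off below which nothing matches
) -> tuple[Optional[str], float]:
    """Return (winning template name or None, confidence score)."""
    best_name, best_score = None, 0.0
    for name, rules in rules_by_template.items():
        if not rules:
            continue
        # Assumption: confidence is the plain average of the rule scores.
        score = sum(rule(pdf) for rule in rules) / len(rules)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


rules = {
    "invoice": [file_name_matches(r"^inv"), text_contains("Invoice #")],
    "receipt": [file_name_matches(r"receipt"), text_contains("Thank you for your purchase")],
}
print(evaluate(Pdf("INV-2024-001.pdf", "Invoice #A17"), rules))
```

A real engine would likely weight rules differently; the sketch only shows the shape of the decision: score every template, keep the best, and report no match when confidence is too low.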
The Seven Pipeline Steps
Every execution flows through this sequence — from classification to stored output.
Classify
Match rules evaluate the incoming PDF against registered templates. The highest-scoring template is selected for extraction.
Inspect
The engine reads the PDF structure — pages, text blocks, tables, metadata — and builds an internal representation for the pipeline.
Test
Extraction sources and patterns are validated against the PDF to ensure they produce the expected field values before committing.
Build
The template's field definitions are resolved: sources are bound, transforms are wired, and the execution plan is assembled.
Dry-Run
The pipeline executes without writing output. Diagnostics are collected so you can validate results before a real run.
Execute
The full pipeline runs: extraction steps pull data, transformation steps shape it, retrieval and Python steps augment it, and publish steps emit the result.
Store
The output generator writes structured data — CSV or JSON — to the configured output target. The result includes diagnostics and metadata.
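The difference between a dry run and a stored run can be sketched in a few lines of Python. This is an illustrative model only: run_pipeline, its parameters, and the shape of the returned dictionary are hypothetical, and the real engine works on PDF structure, not a plain string.

```python
import json
import re
from typing import Callable


def run_pipeline(
    text: str,
    patterns: dict[str, str],
    write: Callable[[str], None],
    dry_run: bool = False,
) -> dict:
    """Extract fields, collect diagnostics, and write only on a real run."""
    diagnostics = []
    data = {}
    # Check every pattern against the document before committing anything.
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match is None:
            diagnostics.append(f"{name}: no match")
        else:
            data[name] = match.group(1)
            diagnostics.append(f"{name}: ok")
    complete = len(data) == len(patterns)
    stored = complete and not dry_run
    if stored:
        # Store: hand the rendered output to the configured target.
        write(json.dumps(data, sort_keys=True))
    return {"data": data, "diagnostics": diagnostics, "stored": stored}


sink: list[str] = []
patterns = {"number": r"#(\w+)"}

preview = run_pipeline("Invoice #A17", patterns, sink.append, dry_run=True)
print(preview["diagnostics"], sink)  # diagnostics collected, nothing written

real = run_pipeline("Invoice #A17", patterns, sink.append)
print(real["stored"], sink)
```

Passing the writer in as a callable keeps the dry run and the real run on the same code path, so what you validate is what later gets stored.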
Layer Architecture
Docuoria is designed as three layers. Layer 1 is shipped today; Layers 2 and 3 are on the roadmap.
Layer 1: Engine
Shipped. The core extraction engine: template evaluation, pipeline execution, and output generation. Available as a NuGet package and AI plugin today.
Layer 2: Service
Future. A hosted service layer for teams who want to use Docuoria without running it locally. Send a PDF and get structured data back, from any language or platform.
Layer 3: Transports & Clients
Future. Email automation, ERP connectors, and a SaaS template store. Process documents directly from your inbox or business systems.
Pipeline Results
Every extraction returns one of three clear outcomes.
SucceededResult
Extraction completed. All fields were found and your structured output is ready to use.
FailedResult
A step could not complete during extraction. The result includes a detailed log showing where and why it failed.
RejectedResult
Docuoria declined to run the extraction. Common reasons: no template matched the document, a required field was missing from the template, or the input was invalid.
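One way to picture handling the three outcomes is the Python sketch below. The result names come from the text above, but the fields on each class (data, failed_step, log, reason) and the describe helper are assumptions made for this example, not the package's real types.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class SucceededResult:
    data: dict  # assumed: the extracted field values


@dataclass
class FailedResult:
    failed_step: str  # assumed: name of the step that could not complete
    log: list[str] = field(default_factory=list)


@dataclass
class RejectedResult:
    reason: str  # assumed: why the extraction was never started


PipelineResult = Union[SucceededResult, FailedResult, RejectedResult]


def describe(result: PipelineResult) -> str:
    """Turn any of the three outcomes into a one-line summary."""
    if isinstance(result, SucceededResult):
        return f"ok: {len(result.data)} field(s)"
    if isinstance(result, FailedResult):
        last = result.log[-1] if result.log else "no log entries"
        return f"failed at {result.failed_step}: {last}"
    if isinstance(result, RejectedResult):
        return f"rejected: {result.reason}"
    raise TypeError(f"unexpected result type: {type(result).__name__}")


print(describe(SucceededResult({"number": "A17"})))
print(describe(FailedResult("extract", ["page 2: table not found"])))
print(describe(RejectedResult("no template matched the document")))
```

Keeping rejection separate from failure lets calling code distinguish "nothing was attempted" from "something was attempted and broke".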
Ready to try it?
Install the AI plugin in one command and start extracting data from your PDFs today.