AI document extraction system for OCR, field validation, correction workflows, and exports.

Context
Many administrative and operations workflows still involve copying information from documents into spreadsheets, forms, or databases. Receipts, invoices, IDs, certificates, application forms, and business documents may all contain structured data, but that structure is trapped inside files that people have to read manually.
This work becomes expensive when volume increases. Staff may need to extract names, dates, amounts, addresses, document numbers, reference codes, item lists, tax values, and approval details repeatedly. Even careful manual encoding can create delays, inconsistent formats, duplicates, and errors.
Oblaira was built around that operational problem. The project treats document AI as a review workflow, not a magic extraction button. The system should capture the original file, extract fields, flag uncertainty, allow correction, and export clean data that another system can use.
Problem
OCR alone does not solve document processing. A system may recognize text but still fail to understand which values matter, what document type is being handled, whether a field is missing, or whether the extracted value matches the required format.
If low-confidence values are exported without review, the workflow only moves errors faster. Administrative users need to see the original document beside extracted fields, correct mistakes, validate required values, and decide when a record is clean enough to send downstream.
The product problem was to connect automation with control. Oblaira needed to classify the document, apply the right extraction schema, validate fields, expose uncertainty, support manual correction, and preserve a clean export path.
Solution
Oblaira starts with document upload and OCR. The system identifies the document type, extracts structured fields based on the expected schema, and presents the output in a review interface where users can inspect values against the original file.
The validation layer flags missing fields, incorrect formats, duplicate references, unusual amounts, and low-confidence outputs. This gives reviewers a practical checklist instead of forcing them to manually compare every character from the document.
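The flag types listed above can be sketched as a single validation pass. This is an illustrative sketch, not Oblaira's actual code: the field names, the regex formats, the 0.80 confidence threshold, the amount ceiling, and the assumption that the extraction step supplies a per-field confidence are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class ExtractedField:
    name: str
    value: Optional[str]
    confidence: float  # assumed to be supplied by the extraction step (0.0 to 1.0)


# Hypothetical rules for an invoice-like document.
REQUIRED = ["invoice_number", "date", "total"]
FORMATS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "total": re.compile(r"\d+(\.\d{2})?"),
}
MIN_CONFIDENCE = 0.80            # assumed review threshold
MAX_EXPECTED_TOTAL = 100_000.00  # assumed ceiling for "unusual" amounts


def validate(fields: List[ExtractedField], seen_references: Set[str]) -> List[Tuple[str, str]]:
    """Return (field_name, flag) pairs for a reviewer to work through."""
    by_name = {f.name: f for f in fields}
    flags = []

    # Missing required fields.
    for name in REQUIRED:
        f = by_name.get(name)
        if f is None or not f.value:
            flags.append((name, "missing"))

    # Format and confidence checks on the values that are present.
    for f in fields:
        if not f.value:
            continue
        pattern = FORMATS.get(f.name)
        if pattern and not pattern.fullmatch(f.value):
            flags.append((f.name, "bad_format"))
        if f.confidence < MIN_CONFIDENCE:
            flags.append((f.name, "low_confidence"))

    # Duplicate reference check against previously stored references.
    ref = by_name.get("invoice_number")
    if ref and ref.value and ref.value in seen_references:
        flags.append(("invoice_number", "duplicate"))

    # Unusual amount check, only when the total parses cleanly.
    total = by_name.get("total")
    if total and total.value and FORMATS["total"].fullmatch(total.value):
        if float(total.value) > MAX_EXPECTED_TOTAL:
            flags.append(("total", "unusual_amount"))

    return flags
```

Returning flags as data, rather than raising on the first problem, lets the review interface show every issue on a record at once.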
After review, users can correct fields and export clean records to CSV, Excel-style tables, or downstream database structures. The product value is in the full loop: original file, extracted data, validation flags, manual correction, and usable export.
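A minimal sketch of the export step, assuming each record is a dictionary with a hypothetical "review_status" key and a "fields" mapping. Only approved records are written, so unreviewed values never reach the CSV.

```python
import csv
import io
from typing import Dict, List


def export_approved(records: List[Dict], columns: List[str]) -> str:
    """Write approved records to CSV text, one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",  # drop fields that are not export columns
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        # "review_status" and "fields" are assumed key names for this sketch.
        if record.get("review_status") == "approved":
            writer.writerow(record["fields"])
    return buffer.getvalue()
```

A column missing from a record is written as an empty cell, which is the default behavior of csv.DictWriter.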
My role
I built Oblaira as a solo full-stack MVP, owning the product framing, upload flow, OCR and AI extraction pipeline, schema design, validation rules, correction interface, and export workflow.
The implementation scope covered file upload, document-type classification, field extraction, required-field checks, low-confidence review, duplicate detection, manual correction, original-file reference, and CSV-style export.
The key product decision was to keep the reviewer in the loop. Document automation becomes credible when users can inspect and correct uncertain values before they become official records.
Product workflow
The workflow begins when a user uploads a document such as a receipt, invoice, ID, certificate, or form. The system stores the file, runs OCR, and classifies the document so the correct extraction schema can be applied.
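The ordering of those first stages can be sketched with the storage, OCR, and classification steps passed in as plain functions. The function names, the ProcessedUpload shape, and the schema field lists are hypothetical, and the real stages would call a file store, an OCR engine, and a classifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical mapping from document type to the fields its schema expects.
SCHEMA_FIELDS: Dict[str, List[str]] = {
    "receipt": ["merchant", "date", "total"],
    "invoice": ["invoice_number", "date", "total", "tax"],
}


@dataclass
class ProcessedUpload:
    file_key: str
    ocr_text: str
    document_type: str
    schema_fields: List[str]


def process_upload(
    file_name: str,
    data: bytes,
    store: Callable[[str, bytes], str],
    run_ocr: Callable[[bytes], str],
    classify: Callable[[str], str],
) -> ProcessedUpload:
    """Store the file first, then OCR it, then classify to pick a schema."""
    file_key = store(file_name, data)      # keep the original before anything else
    ocr_text = run_ocr(data)
    document_type = classify(ocr_text)
    # An unrecognized type gets an empty schema in this sketch.
    schema_fields = SCHEMA_FIELDS.get(document_type, [])
    return ProcessedUpload(file_key, ocr_text, document_type, schema_fields)
```

Storing the file before OCR means a failure in a later stage still leaves the original available for manual handling.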
The extraction step pulls fields such as names, dates, totals, document numbers, addresses, item details, reference codes, or identification values. Validation rules then check whether required fields are present, whether formats make sense, and whether a value should be reviewed before export.
The reviewer sees the extracted fields in an editable interface, corrects errors, approves the record, and exports the clean data. That workflow turns document AI into an operations process rather than a one-time text extraction result.
System architecture
Oblaira is built on a Next.js and React frontend styled with Tailwind CSS, a FastAPI backend, and PostgreSQL for records. The remaining components are OCR processing, OpenAI API calls for extraction, file storage, validation rules, duplicate checks, and CSV export.
The data model separates documents, document types, original files, OCR text, extraction schemas, field values, validation flags, correction history, review status, and export batches. That structure keeps the source document connected to every extracted and corrected value.
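One way to keep the extracted value, the corrected value, and the correction history connected is sketched below with plain dataclasses. The class and attribute names are hypothetical and do not reflect the actual PostgreSQL schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Correction:
    previous_value: Optional[str]
    new_value: str
    corrected_by: str
    corrected_at: datetime


@dataclass
class FieldValue:
    document_id: int                 # links back to the stored original file
    name: str
    extracted_value: Optional[str]   # what extraction produced; never overwritten
    current_value: Optional[str] = None
    corrections: List[Correction] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Until a reviewer edits it, the current value is the extracted one.
        if self.current_value is None:
            self.current_value = self.extracted_value

    def correct(self, new_value: str, reviewer: str) -> None:
        """Record a manual correction without losing the earlier value."""
        self.corrections.append(
            Correction(self.current_value, new_value, reviewer, datetime.now(timezone.utc))
        )
        self.current_value = new_value
```

Keeping extracted_value immutable means a reviewer or auditor can always compare what the system read against what was finally exported.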
Schema-based extraction is important because a receipt, invoice, certificate, and ID should not be treated as the same document. Each type can have different required fields, field formats, and validation rules.
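As a rough illustration, per-type schemas could be expressed as a small registry keyed by document type. The document types come from the text above, but the field names, required flags, and regex formats are assumptions made for this sketch.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    required: bool
    pattern: Optional[str] = None  # regex the whole value must match, if any


# Hypothetical schemas: each document type carries its own fields and rules.
SCHEMAS: Dict[str, Dict[str, FieldSpec]] = {
    "receipt": {
        "merchant": FieldSpec(required=True),
        "date": FieldSpec(required=True, pattern=r"\d{4}-\d{2}-\d{2}"),
        "total": FieldSpec(required=True, pattern=r"\d+\.\d{2}"),
    },
    "id": {
        "full_name": FieldSpec(required=True),
        "id_number": FieldSpec(required=True, pattern=r"[A-Z0-9-]{6,20}"),
        "expiry_date": FieldSpec(required=False, pattern=r"\d{4}-\d{2}-\d{2}"),
    },
}


def required_fields(document_type: str) -> List[str]:
    """List the fields a given document type must contain."""
    return [name for name, spec in SCHEMAS[document_type].items() if spec.required]


def matches_format(document_type: str, name: str, value: str) -> bool:
    """Check a value against the format declared for that type and field."""
    spec = SCHEMAS[document_type][name]
    return spec.pattern is None or re.fullmatch(spec.pattern, value) is not None
```

With this shape, supporting a new document type means adding one registry entry rather than changing the validation code.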
A production version would need document-specific extraction testing, more rigorous confidence scoring, batch review, stronger file security, audit trails, and integrations with accounting, CRM, or records-management systems. The MVP proves the central workflow from upload to reviewed export.
Current status
Oblaira is a working MVP focused on making document extraction reviewable and operationally useful. It demonstrates how uploaded files can move through OCR, classification, extraction, validation, correction, and export.
The current version is strongest as a document-operations proof of concept. It should not be framed as perfect extraction across every document format; its credible value is the review workflow that handles uncertainty.
The next step would be adding stronger document-specific schemas, more rigorous confidence scoring, batch processing, audit history, and downstream integrations so reviewed records can move into business systems more directly.
Outcomes
The main outcome of Oblaira is a workflow that turns document files into structured, validated, export-ready records. It reduces manual encoding while still giving users a way to catch uncertain or incorrect values.
From an engineering perspective, the project strengthened my work with OCR workflows, AI extraction, schema design, validation rules, file-backed records, review interfaces, and export logic.
From a product perspective, Oblaira shows that automation is strongest when it respects uncertainty. Users do not need a system that pretends every field is correct; they need a system that helps them find, fix, and export data faster.
Reflection
Oblaira taught me that document AI products need correction paths as much as extraction paths. The moment a system handles real business records, users need to know where the data came from and how to fix it.
The project also showed that schema design is product design. Choosing which fields matter, which values are required, and which errors should be flagged affects how useful the system feels to operations staff.
The broader lesson is that AI extraction becomes credible when original files, extracted values, validation rules, corrections, and exports stay connected in one workflow.