AI/ML · June 22, 2026 · 9 min read

Document AI in Production: OCR, Structured Extraction, and PDF Parsing at Scale

Document AI pipelines fail in predictable ways: OCR misreads numbers, layout breaks structured extraction, and scanned PDFs from the 1990s don't behave like digital-native ones. Here's the architecture that handles all of it.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.


Document AI pipelines fail in predictable ways: OCR misreads numbers, layout variation breaks structured extraction, and scanned PDFs from decades past don't behave like digital-native ones. The teams that get burned are the ones who test on clean, modern PDFs and then deploy against a 20-year archive of faxed invoices.

This post covers the full pipeline: PDF classification, OCR selection, structured extraction, validation, and the edge cases that only show up in production.

The problem with treating all documents the same

The first mistake is assuming "PDF" is a single thing. In practice, your input will include:

Digital-native PDFs: generated by software (Word, Excel, a reporting tool). The text layer is embedded. No OCR needed — just parse.
Scanned image PDFs: a document was printed, signed, faxed, or scanned. The PDF contains an image, not text. Requires OCR.
Hybrid PDFs: partially digital, partially scanned (common in forms where someone printed a template and handwrote the fields).
PDF portfolios: containers holding multiple documents as attachments.
Password-protected or DRM-locked PDFs: require decryption before any processing.

Processing all of these through the same OCR pipeline wastes compute and introduces errors. A digital-native PDF run through OCR will produce degraded output compared to direct text extraction.

The right first step is classification: determine what type of document you're dealing with before deciding how to process it.

python

import fitz  # PyMuPDF

def classify_pdf(path: str) -> str:
    """Classify a PDF by the share of pages that carry an embedded text layer."""
    # The context manager closes the file handle even if a page fails to parse.
    with fitz.open(path) as doc:
        total_pages = len(doc)
        text_pages = 0

        for page in doc:
            text = page.get_text().strip()
            if len(text) > 50:  # page has meaningful text
                text_pages += 1

    ratio = text_pages / total_pages if total_pages > 0 else 0

    if ratio > 0.8:
        return "digital"
    elif ratio < 0.2:
        return "scanned"
    else:
        return "hybrid"

For digital PDFs, extract text directly with PyMuPDF or pdfplumber. Reserve OCR for scanned and hybrid types.
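For hybrid documents, one option is to decide per page rather than per file, so OCR only runs on pages that lack a usable text layer. The sketch below assumes the same 50-character threshold as classify_pdf; the function name pages_needing_ocr and the sample strings are illustrative, not part of any library.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 50) -> list[int]:
    """Return zero-based indexes of pages whose text layer is too thin to trust."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) <= min_chars
    ]


# Hard-coded sample text; in practice each string would come from
# page.get_text(), as in classify_pdf.
sample_pages = ["Invoice 1001 " * 10, "", "   \n", "Terms and conditions " * 5]
ocr_pages = pages_needing_ocr(sample_pages)
```

Here the second and third pages have no real text, so they would be the only ones rendered to images and sent to OCR.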

Pre-processing: image quality determines OCR quality

Before OCR runs, image quality matters. A scan at 150 DPI will produce significantly worse OCR output than the same document at 300 DPI. Common pre-processing steps that improve accuracy:

Deskewing: correct page rotation (pages scanned at a slight angle cause OCR errors at line breaks)
Binarization: convert grayscale scans to black-and-white — reduces noise and improves character recognition
Upscaling: for low-resolution scans, upscale to 300 DPI minimum before running OCR

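Libraries like opencv-python and Pillow handle these transformations. For Tesseract, --psm 6 (treat the page as a single uniform block of text) often gives better results than the default page segmentation on single-column documents. --oem 3 selects the default engine, which uses the LSTM recognizer when it is available; pass --oem 1 to force LSTM only.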
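As a rough sketch of the upscaling and binarization steps with opencv-python: the function names are illustrative, the scan DPI is assumed to be known (for example from scanner metadata), and deskewing is left out because a reliable angle estimate depends on the document type.

```python
def upscale_factor(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Scale needed to reach target_dpi; never downscale (factor >= 1.0)."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def preprocess_for_ocr(gray_image, current_dpi: float):
    """Upscale a grayscale scan toward 300 DPI, then binarize with Otsu's method.

    gray_image is expected to be a single-channel uint8 NumPy array, e.g. the
    result of cv2.imread(path, cv2.IMREAD_GRAYSCALE).
    """
    import cv2  # imported lazily so upscale_factor works without OpenCV

    factor = upscale_factor(current_dpi)
    if factor > 1.0:
        gray_image = cv2.resize(
            gray_image, None, fx=factor, fy=factor, interpolation=cv2.INTER_CUBIC
        )
    # Otsu's method picks the black/white threshold from the image histogram.
    _, binary = cv2.threshold(
        gray_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    return binary
```

A 150 DPI scan gets doubled in each dimension before thresholding, while a 600 DPI scan is left at its native resolution.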

OCR tool selection

Three tools cover most production cases:

Tesseract: open-source, runs locally, good for clean scans. Struggles with low-resolution images, skewed pages, or complex table layouts. Free. Good starting point for clean archival documents.

AWS Textract: cloud-based, handles forms and tables natively, and returns structured data (key-value pairs, table cells) rather than raw text. Plain text detection is listed at about $0.0015 per page; the forms and tables analysis used in the example below is billed at a higher per-page rate, so check current pricing before estimating volume costs. For documents with consistent form layouts (invoices, insurance forms), Textract's table extraction saves significant post-processing work.

Google Document AI: better than Textract on certain document types (receipts, identity documents, contracts) when you use a specialized processor. Pricing comparable to Textract.

For most production deployments, the decision is: Tesseract for high-volume, cost-sensitive workloads with clean scans; Textract or Document AI for complex forms and tables where structured output is the primary goal.

python

import boto3

def textract_extract(image_bytes: bytes) -> dict:
    """Run synchronous Textract analysis on one page image (raw bytes, not base64)."""
    client = boto3.client("textract", region_name="us-east-1")
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES", "FORMS"],
    )
    return response
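The response is a flat list of blocks, so getting back to readable text takes one more step. A minimal sketch, assuming only the documented Blocks, BlockType, Text, and Confidence keys; the helper name and the hand-written sample response are illustrative.

```python
def lines_with_confidence(response: dict) -> list[tuple[str, float]]:
    """Collect (text, confidence) for every LINE block in a Textract response."""
    return [
        (block["Text"], float(block["Confidence"]))
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Trimmed, hand-written stand-in for a real response.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice #: INV-1001", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.5},
        {"BlockType": "LINE", "Text": "Total: $12,345.67", "Confidence": 71.4},
    ]
}
lines = lines_with_confidence(sample_response)
raw_text = "\n".join(text for text, _ in lines)
```

Keeping the confidence next to each line makes it possible to flag weak reads later instead of discarding that signal at the first step.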

Structured extraction: beyond raw text

Raw OCR text is rarely the end goal. Usually you need specific fields: invoice number, date, total amount, line items, vendor name. Two approaches work in production:

Regex + rule-based extraction

For documents with consistent layouts (invoices from the same vendor, bank statements from one bank), regex patterns are fast, cheap, and interpretable.

python

import re
from decimal import Decimal

def extract_invoice_fields(text: str) -> dict:
    patterns = {
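        # Require a digit in the captured value so "Invoice Date" is not read as the number "Date".
        "invoice_number": r"Invoice\s*(?:#|No\.?|Number)?\s*:?\s*([A-Z0-9\-]*\d[A-Z0-9\-]*)",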
        "date": r"(?:Invoice\s+Date|Date)\s*:?\s*(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4})",
        "total": r"(?:Total|Amount\s+Due)\s*:?\s*\$?([\d,]+\.\d{2})",
    }

    results = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text, re.IGNORECASE)
        results[field] = match.group(1) if match else None

    if results.get("total"):
        results["total"] = Decimal(results["total"].replace(",", ""))

    return results

This works when you control the document source or when layouts are standardized. It breaks when a vendor changes their invoice template.
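The regex extractor returns a "total" key and a date in whatever format the invoice used, while the LLM extractor and the validator later in this post work with "total_amount" and ISO dates. One way to bridge that gap is a small normalization step. This is a sketch only: the function name is hypothetical, and the default month/day/year order is an assumption that has to be set per document source.

```python
from datetime import datetime


def normalize_regex_fields(fields: dict, date_format: str = "%m/%d/%Y") -> dict:
    """Map regex output onto the schema used by the LLM extractor and validator."""
    normalized = {
        "invoice_number": fields.get("invoice_number"),
        "date": None,
        "total_amount": fields.get("total"),
    }
    raw_date = fields.get("date")
    if raw_date:
        try:
            # Accept both 03-15-2024 and 03/15/2024 by unifying the separator.
            parsed = datetime.strptime(raw_date.replace("-", "/"), date_format)
            normalized["date"] = parsed.date().isoformat()
        except ValueError:
            # Leave the date as None so validation flags it instead of guessing.
            pass
    return normalized
```

A date that does not fit the expected format, such as a day-first date or a two-digit year, comes back as None and fails validation rather than being silently misread.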

LLM-based extraction

For documents with variable layouts — documents from many different vendors, contracts with different structures, unstructured reports — an LLM extraction step handles the variation that breaks regex.

python

import anthropic
import json

client = anthropic.Anthropic()

def llm_extract_invoice(text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Extract the following fields from this invoice text.
Return JSON only, no explanation.

Fields: invoice_number, date (ISO format), vendor_name, total_amount (number only), currency, line_items (list of {{description, quantity, unit_price, total}})

Invoice text:
{text}"""
        }]
    )

    raw = response.content[0].text.strip()
    # Strip markdown code fences if present
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(raw)

LLM extraction handles layout variation well. It costs more (roughly $0.003 per page on Sonnet, depending on page length and current pricing) and has higher latency than regex. For high-volume pipelines, use regex where layouts are consistent and LLM extraction for the remainder.
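The regex-first, LLM-for-the-remainder split can be sketched as a small wrapper that only calls the expensive extractor when required fields are missing. The wrapper and the two stub extractors are hypothetical stand-ins; in a real pipeline they would be functions like extract_invoice_fields and llm_extract_invoice, with their outputs mapped to one schema.

```python
from typing import Callable, Iterable

Extractor = Callable[[str], dict]


def extract_with_fallback(
    text: str,
    primary: Extractor,
    fallback: Extractor,
    required: Iterable[str],
) -> tuple[dict, str]:
    """Run the cheap extractor first; call the fallback only if fields are missing."""
    fields = primary(text)
    if all(fields.get(name) is not None for name in required):
        return fields, "primary"
    return fallback(text), "fallback"


# Hypothetical stand-ins for the regex and LLM extractors.
def stub_regex(text: str) -> dict:
    found = "INV-1001" in text
    return {
        "invoice_number": "INV-1001" if found else None,
        "total": "50.00" if found else None,
    }


def stub_llm(text: str) -> dict:
    return {"invoice_number": "A-77", "total": "12.00"}


required_fields = ["invoice_number", "total"]
known, known_source = extract_with_fallback(
    "Invoice #: INV-1001", stub_regex, stub_llm, required_fields
)
unknown, unknown_source = extract_with_fallback(
    "Rechnung A-77", stub_regex, stub_llm, required_fields
)
```

Returning which path produced the result is useful for the audit log and for tracking how often layouts fall through to the LLM.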

Validation: the step most teams skip

OCR and extraction can produce plausible-looking but wrong data. An invoice total read as "$1,234.67" instead of "$12,345.67" because OCR dropped a digit is a real error that still looks like a valid amount.

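Build a validation layer with checks specific to each field type. The checks below assume the field names and ISO dates returned by the LLM extractor; regex output needs to be mapped to the same schema first.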

python

from datetime import datetime
from decimal import Decimal, InvalidOperation

def validate_extracted_fields(fields: dict) -> tuple[dict, list[str]]:
    errors = []

    # Invoice number: alphanumeric, reasonable length
    inv_num = fields.get("invoice_number", "")
    if not inv_num or len(inv_num) < 3 or len(inv_num) > 30:
        errors.append(f"Suspicious invoice_number: '{inv_num}'")

    # Date: must parse and be within a reasonable range
    date_str = fields.get("date", "")
    try:
        parsed_date = datetime.fromisoformat(date_str)
        if parsed_date.year < 2010 or parsed_date.year > 2030:
            errors.append(f"Date out of range: {date_str}")
    except (ValueError, TypeError):
        errors.append(f"Could not parse date: '{date_str}'")

    # Total: must be a valid number, positive, and within realistic bounds
    total = fields.get("total_amount")
    try:
        total_dec = Decimal(str(total))
        if total_dec <= 0 or total_dec > 10_000_000:
            errors.append(f"Total amount out of range: {total}")
    except (InvalidOperation, TypeError):
        errors.append(f"Invalid total amount: '{total}'")

    return fields, errors

Documents with validation errors go to a review queue rather than straight to downstream systems. This is the difference between a pipeline that silently corrupts data and one that catches errors before they propagate.
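Field-level checks will not catch a total that is well-formed but wrong. A cross-field check can help: when line items were extracted, their totals should add up to the invoice total. The sketch below assumes the line_items and total_amount fields from the LLM prompt above, and it is an illustration rather than a complete rule, since real invoices often add tax or shipping on top of the line items.

```python
from decimal import Decimal, InvalidOperation


def line_items_match_total(
    fields: dict, tolerance: Decimal = Decimal("0.01")
) -> bool:
    """True when the line item totals sum to total_amount within tolerance."""
    try:
        expected = Decimal(str(fields["total_amount"]))
        item_sum = sum(
            (Decimal(str(item["total"])) for item in fields["line_items"]),
            Decimal("0"),
        )
    except (KeyError, TypeError, InvalidOperation):
        # Missing or malformed data counts as a mismatch, not a pass.
        return False
    return abs(item_sum - expected) <= tolerance


consistent = {
    "total_amount": 30.5,
    "line_items": [{"total": 10}, {"total": "20.50"}],
}
dropped_digit = {
    "total_amount": "1234.67",
    "line_items": [{"total": "12000.00"}, {"total": "345.67"}],
}
```

The second sample is the dropped-digit case: the total parses cleanly and sits inside the allowed range, but it does not match the line items, so the document would go to review.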

Handling tables in scanned documents

Tables are the hardest part of document extraction. OCR treats a scanned table as a grid of characters — it doesn't understand the cell structure. AWS Textract's TABLES feature handles this better than raw OCR, returning a cell-by-cell structure you can reconstruct.

For tables in digital PDFs, pdfplumber has native table extraction that usually outperforms OCR-based approaches:

python

import pdfplumber

def extract_tables_from_pdf(path: str) -> list[list]:
    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables

For complex table layouts (merged cells, nested headers, multi-page tables), post-process the raw table output with an LLM to normalize the structure.
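Simple tables often do not need the LLM step. pdfplumber returns each table as a list of rows, with each row a list of cell strings or None, so a table with one header row can be turned into records directly. A sketch, assuming the first row holds the headers; the function name is illustrative.

```python
from typing import Optional


def table_to_records(table: list[list[Optional[str]]]) -> list[dict[str, str]]:
    """Treat the first row as headers and return one dict per remaining row."""
    if not table:
        return []
    # Empty header cells get a positional placeholder name.
    headers = [
        (cell or "").strip() or f"column_{index}"
        for index, cell in enumerate(table[0])
    ]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip fully empty rows
        records.append(dict(zip(headers, cells)))
    return records


sample_table = [
    ["Description", "Qty", None],
    ["Widget", "2", "9.99"],
    [None, "", None],
]
records = table_to_records(sample_table)
```

Tables where this produces placeholder column names or ragged rows are reasonable candidates for the LLM normalization pass.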

Pipeline architecture at scale

For processing thousands of documents per day, the pipeline looks like this:

text

Incoming document
    ↓
Classify (digital / scanned / hybrid)
    ↓
Text extraction (PyMuPDF for digital, Textract for scanned and hybrid pages)
    ↓
Structured extraction (regex → LLM fallback for unmatched layouts)
    ↓
Validation (field-level checks, confidence scores)
    ↓
Route: pass → downstream system | fail → human review queue
    ↓
Audit log (store raw OCR text, extracted fields, validation result)
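The routing step can be sketched as a function that combines validation errors with per-field confidence. The 90.0 threshold is an assumption for illustration (Textract reports confidence on a 0 to 100 scale) and should be tuned against reviewed samples; the function name is hypothetical.

```python
def route_document(
    errors: list[str],
    confidences: dict[str, float],
    min_confidence: float = 90.0,
) -> tuple[str, list[str]]:
    """Return ("pass" or "review", reasons) for one extracted document."""
    reasons = list(errors)
    for field, score in sorted(confidences.items()):
        if score < min_confidence:
            reasons.append(f"Low confidence for {field}: {score:.1f}")
    return ("review" if reasons else "pass"), reasons


clean_route = route_document([], {"total_amount": 98.2, "date": 95.0})
weak_route = route_document([], {"total_amount": 71.4, "date": 95.0})
```

The second call passes validation but still goes to review because the total was read with low confidence.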

Key operational details:

Store the raw OCR text alongside extracted fields. When extraction fails, you need the raw text to debug without re-running OCR.
Log confidence scores per field when using Textract or Document AI (they return them). Low-confidence fields go to review even if validation passes.
Process pages in parallel for multi-page documents. A 50-page PDF processed page-by-page sequentially is slow; async batch processing makes it viable at scale.
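For the parallel page processing point, a thread pool is often enough when OCR is a network call to a cloud service. A minimal sketch, assuming ocr_page is whatever single-page function the pipeline already has (for example a wrapper around textract_extract); the worker count of 8 is arbitrary and should respect the service's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def ocr_pages_in_parallel(
    page_images: list[bytes],
    ocr_page: Callable[[bytes], str],
    max_workers: int = 8,
) -> list[str]:
    """Run ocr_page over every page concurrently, preserving page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(ocr_page, page_images))


# Stub OCR function standing in for a real cloud call.
def fake_ocr(image: bytes) -> str:
    return image.decode().upper()


page_texts = ocr_pages_in_parallel([b"page one", b"page two"], fake_ocr)
```

Threads suit I/O-bound calls like this one. CPU-bound local OCR may be better served by a process pool or a job queue.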

What this enables

A production document AI pipeline powers use cases across industries: automated invoice processing (AP automation), contract data extraction (legal tech), insurance claim processing, KYC document verification, and medical record parsing. The pipeline is the same — what changes is the document type, the fields to extract, and the validation rules.

The same architecture applies from hundreds to millions of documents. At scale the bottleneck is usually OCR compute cost and human review throughput, not the extraction logic.

If you're building a document processing pipeline and need it to handle variable formats, high volume, and validation at scale, tell us about the scope.
