LLM Integration for Production Apps: API Design, Latency, and Cost Control
Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.
Integrating an LLM into an application is straightforward in demo. Running it in production — with real latency targets, real cost constraints, and users who will abandon a product that feels slow — requires architecture that the demo never needed. The teams that get this right build the API layer, prompt versioning, caching, and cost monitoring before the first production incident, not after.
The latency problem
LLM API calls are slow by web application standards. A typical call to Claude Sonnet or GPT-4o takes 1.5-4 seconds to first token, with full responses taking 4-15 seconds depending on output length. For an API endpoint that a frontend is waiting on synchronously, this is a critical UX problem.
Three patterns address this:
Streaming: stream tokens to the client as they're generated instead of waiting for the full response. Perceived latency drops dramatically — the user sees output starting in 1-2 seconds even if the full response takes 10. For any user-facing text generation (chat, summaries, explanations), streaming should be the default.
import anthropic

client = anthropic.Anthropic()

def stream_completion(prompt: str):
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text

Async processing: for non-interactive use cases (batch document processing, background enrichment, scheduled analysis), don't make the user wait at all. Accept the request, queue the LLM call, and deliver results via webhook or a polling endpoint. This decouples user-facing latency from LLM processing time entirely.
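As a rough illustration of the accept, queue, and poll flow, here is a minimal in-process sketch. The `JobQueue` class and its method names are hypothetical, and the LLM call is passed in as a plain function so the sketch runs without an API key; a real deployment would more likely use a durable task queue and a results store.

```python
import queue
import threading
import uuid
from typing import Callable


class JobQueue:
    """In-process stand-in for a real task queue (illustrative only)."""

    def __init__(self, llm_call: Callable[[str], str]):
        self._llm_call = llm_call
        self._jobs: dict = {}
        self._queue: queue.Queue = queue.Queue()
        self._lock = threading.Lock()
        # One background worker drains the queue.
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, prompt: str) -> str:
        """Accept the request and return a job id immediately."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "queued", "result": None}
        self._queue.put((job_id, prompt))
        return job_id

    def status(self, job_id: str) -> dict:
        """What a polling endpoint would return for this job."""
        with self._lock:
            return dict(self._jobs[job_id])

    def wait_idle(self) -> None:
        """Block until every queued job has been processed (useful in tests)."""
        self._queue.join()

    def _worker(self) -> None:
        while True:
            job_id, prompt = self._queue.get()
            try:
                update = {"status": "done", "result": self._llm_call(prompt)}
            except Exception as exc:  # record the failure instead of killing the worker
                update = {"status": "failed", "result": str(exc)}
            with self._lock:
                self._jobs[job_id] = update
            self._queue.task_done()
```

The request handler would call `submit` and return the job id; a separate polling endpoint would call `status`. Delivering results by webhook instead would mean posting the `update` dict from the worker.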
Caching: cache LLM responses for identical or near-identical inputs. For deterministic prompts (classification, extraction from fixed templates, structured generation), caching at the application layer eliminates redundant API calls.
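A minimal sketch of application-layer caching, assuming an in-memory dict and a caller-supplied `llm_call(model, prompt)` function (both hypothetical; a shared store such as Redis would be the more likely choice across multiple workers). The key is a hash of the model name and the whitespace-normalized prompt.

```python
import hashlib
import json
from typing import Callable


class ResponseCache:
    """Caches responses keyed on (model, normalized prompt). Illustrative only."""

    def __init__(self, llm_call: Callable[[str, str], str]):
        self._llm_call = llm_call
        self._store: dict = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Collapse whitespace so trivially different inputs share a key.
        # Skip this step if whitespace is significant in your prompts.
        normalized = " ".join(prompt.split())
        payload = json.dumps({"model": model, "prompt": normalized}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._llm_call(model, prompt)
        self._store[key] = result
        return result
```

Including the model name in the key keeps a cached answer from one model from being served for another. The hit and miss counters give a quick read on whether the cache is earning its keep.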
Prompt engineering for production
Production prompts are different from demo prompts in one important way: they need to produce consistent, parseable output, not just coherent text.
For classification and extraction tasks, the prompt must specify the output format exactly and the model must return it reliably:
EXTRACTION_PROMPT = """Extract the following fields from the job posting below.
Return JSON only. No explanation, no markdown fences.

Fields:
- title (string)
- company (string)
- location (string, null if remote)
- is_remote (boolean)
- salary_min (number in USD, null if not specified)
- salary_max (number in USD, null if not specified)
- required_years_experience (number, null if not specified)

Job posting:
{text}"""

Two rules that prevent the most common production failures:
- Say "JSON only" — not "return JSON" or "format as JSON." Models will otherwise add explanation text before or after the JSON.
- Use null, not None: in the field descriptions, use JSON's null, not Python's None. The model reflects the vocabulary back.
For outputs that must be parseable, add a retry with explicit correction on parse failure:
import json
import re

def extract_with_retry(text: str, max_retries: int = 2) -> dict:
    prompt = EXTRACTION_PROMPT.format(text=text)
    for attempt in range(max_retries + 1):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.content[0].text.strip()
        # Strip accidental markdown fences
        raw = re.sub(r"^```(?:json)?\n?", "", raw)
        raw = re.sub(r"\n?```$", "", raw)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            if attempt < max_retries:
                prompt = f"The previous response was not valid JSON. Return only JSON:\n{raw}"
            else:
                raise ValueError(f"Failed to parse JSON after {max_retries + 1} attempts")

Cost control
LLM API costs scale with token consumption. At scale, prompt design has a direct cost impact.
Measure before optimizing. Add token logging from day one:
def call_with_cost_tracking(
    prompt: str,
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 1024,
    task_name: str = "unknown",
) -> tuple[str, dict]:
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    usage = {
        "task": task_name,
        "model": model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        # Rates are hardcoded for the default model; look them up per model if you pass others.
        "cost_usd": (
            response.usage.input_tokens * 0.000003  # $3/M input
            + response.usage.output_tokens * 0.000015  # $15/M output
        ),
    }
    log_to_database(usage)  # your own persistence function, not part of the SDK
    return response.content[0].text, usage

Log every call. Run weekly queries to find which tasks consume the most tokens. Two patterns cut costs significantly once you know where the spend is:
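Prompt caching (Claude-specific): for prompts with a large static prefix (a system prompt, a long document you're referencing), mark the static portion with cache_control. Calls that hit the cached prefix are billed at about 10% of the normal input token price for those tokens. The call that writes the cache costs somewhat more than an uncached call, and the cache expires after a short idle period, so this pays off for prefixes that are reused frequently.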
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_STATIC_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": user_query}],
)

Model routing: use a smaller, cheaper model (Haiku, GPT-4o Mini) for tasks that don't require frontier-model reasoning: classification, simple extraction, format conversion. Reserve Sonnet/GPT-4o for tasks that benefit from stronger reasoning. Routing by task type can cut costs by 50-70% on mixed workloads.
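Routing by task type can be as small as a lookup table. The sketch below uses placeholder tier names rather than real model IDs, and the task-type labels are hypothetical; map both to your own configuration. Unknown task types fall back to the stronger model, on the assumption that overpaying is safer than silently degrading quality.

```python
# Placeholder tier names (assumptions); substitute real model IDs from your config.
CHEAP_MODEL = "small-fast-model"
STRONG_MODEL = "frontier-model"

# Hypothetical task-type labels mapped to a model tier.
ROUTES = {
    "classification": CHEAP_MODEL,
    "simple_extraction": CHEAP_MODEL,
    "format_conversion": CHEAP_MODEL,
    "multi_step_reasoning": STRONG_MODEL,
    "long_form_generation": STRONG_MODEL,
}


def route_model(task_type: str) -> str:
    """Return the model for a task type, defaulting to the stronger tier."""
    return ROUTES.get(task_type, STRONG_MODEL)
```

Keeping the table in one place also makes it easy to log which tier served each call, so the token logs can confirm whether routing is actually reducing spend.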
Enforcing output structure with JSON Schema
The retry-on-parse-failure approach works but adds latency when the model gets it wrong. A better approach for structured extraction is to use the model's tool/function calling API, which has the model fill in a declared schema instead of emitting free text.
With Anthropic's tool use, you define the output schema once and the model returns its answer as structured tool input shaped by that schema, which makes parse-failure retries rare:
invoice_schema = {
    "name": "extract_invoice",
    "description": "Extract structured fields from invoice text",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "date": {"type": "string", "description": "ISO 8601 date"},
            "vendor_name": {"type": "string"},
            "total_amount": {"type": "number"},
            "currency": {"type": "string", "default": "USD"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "number"},
                        "unit_price": {"type": "number"},
                        "total": {"type": "number"},
                    },
                    "required": ["description", "total"],
                },
            },
        },
        "required": ["invoice_number", "date", "vendor_name", "total_amount"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[invoice_schema],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    messages=[{"role": "user", "content": f"Extract fields from this invoice:\n{text}"}],
)

# With a forced tool_choice, the response contains a tool_use block
tool_use = next(b for b in response.content if b.type == "tool_use")
extracted = tool_use.input  # already a Python dict; validate it before trusting it downstream

Tool use removes almost all parse failures for structured extraction. Without a strict or structured-output mode, the model can still occasionally omit a required field or be cut off at max_tokens, so keep a validation step. The tradeoff: it adds a small amount of overhead to the response and is slightly more verbose to set up. For high-volume extraction pipelines where parse failure handling would otherwise need to be production-hardened, tool use pays for itself quickly.
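A cheap defensive check on the extracted dict is still worth having before it reaches downstream systems, for example to catch a response cut off at max_tokens. The sketch below is one possible shape, not part of any SDK: `validate_invoice` and `REQUIRED_FIELDS` are hypothetical names, and the checks mirror the required fields in the invoice schema above.

```python
from numbers import Number

# Required top-level fields and their expected Python types (mirrors the schema).
REQUIRED_FIELDS = {
    "invoice_number": str,
    "date": str,
    "vendor_name": str,
    "total_amount": Number,
}


def validate_invoice(extracted: dict) -> list:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in extracted:
            problems.append(f"missing field: {field}")
            continue
        value = extracted[field]
        # bool is a subclass of int, so reject it explicitly for numeric fields.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {field}")

    line_items = extracted.get("line_items") or []
    if not isinstance(line_items, list):
        problems.append("line_items is not a list")
        return problems
    for index, item in enumerate(line_items):
        if not isinstance(item, dict):
            problems.append(f"line_items[{index}] is not an object")
        elif "description" not in item or "total" not in item:
            problems.append(f"line_items[{index}] is missing description or total")
    return problems
```

Returning a list of problems rather than raising on the first one makes the result easy to log alongside the input, which helps when deciding whether a failure pattern needs a prompt change or a schema change.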
Versioning prompts
Prompts are code. Treat them that way. Changes to a prompt can silently break downstream parsing or change output quality. Keep prompts in version-controlled files, not inline strings:
/prompts
  /v1
    invoice_extraction.txt
    job_classification.txt
  /v2
    invoice_extraction.txt  # updated for new vendor format

Each prompt file includes its version and the date it was last updated as a comment at the top. When changing a prompt, run the full eval suite against the new version before deploying. The eval setup from LLM Evaluation: How to Measure Production Accuracy applies here directly: the same RAGAS or task-specific metrics that gate deployment also gate prompt updates.
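A small loader keeps call sites from hardcoding paths. This sketch assumes the layout above and assumes the header comments are `#`-prefixed `key: value` lines at the top of each file; the function names and the header format are illustrative choices, not something the layout itself dictates.

```python
from pathlib import Path

PROMPT_ROOT = Path("prompts")  # assumed repo-relative location


def prompt_path(name: str, version: int, root: Path = PROMPT_ROOT) -> Path:
    """Build the path for a named prompt at a given version, e.g. prompts/v2/name.txt."""
    return root / f"v{version}" / f"{name}.txt"


def parse_prompt_file(raw: str) -> tuple:
    """Split leading '# key: value' header lines from the prompt body."""
    meta = {}
    lines = raw.splitlines()
    index = 0
    while index < len(lines) and lines[index].startswith("#"):
        key, _, value = lines[index].lstrip("# ").partition(":")
        meta[key.strip()] = value.strip()
        index += 1
    return meta, "\n".join(lines[index:]).strip()


def load_prompt(name: str, version: int, root: Path = PROMPT_ROOT) -> tuple:
    """Read a prompt file and return (metadata, body)."""
    raw = prompt_path(name, version, root).read_text(encoding="utf-8")
    return parse_prompt_file(raw)
```

Logging the returned version metadata with each LLM call makes it possible to tie a change in output quality back to a specific prompt revision.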
Handling failures
LLM API calls can fail in several ways that don't happen with conventional APIs:
- Rate limits (HTTP 429): implement exponential backoff with jitter. Both Anthropic and OpenAI return rate-limit headers on their responses; when a retry hint is present, honor it instead of guessing the delay.
- Context window exceeded: the prompt + conversation history exceeds the model's context limit. Truncate history from the oldest messages first when approaching the limit.
- Overloaded responses (HTTP 529 on Anthropic's API): provider is at capacity. Queue and retry with backoff.
- Content policy blocks: the model refuses to generate a response. Log the input (hashed if sensitive), alert, and route to a human fallback.
Use a centralized LLM client wrapper that handles all of these consistently rather than scattering retry logic across the codebase.
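The retry portion of such a wrapper might look like the sketch below. `RetryableError` is a hypothetical exception that your wrapper would raise when it sees a 429 or 529 from the provider SDK; it is not an SDK class. The delay uses "full jitter": a random wait between zero and an exponentially growing cap, unless the provider supplied a retry hint.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for 429/529-style failures; map SDK errors to it."""

    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds, if the provider sent a hint


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RetryableError; re-raise after the last attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError as exc:
            if attempt == max_attempts - 1:
                raise
            # Prefer the provider's hint when there is one.
            delay = exc.retry_after if exc.retry_after is not None else backoff_delay(attempt)
            sleep(delay)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the retry path can be tested without real waiting. Errors that are not retryable, such as a context-window overflow, pass straight through and should be handled separately.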
What production-ready LLM integration looks like
The full integration layer has:
- A prompt registry (version-controlled, evaluable before deployment)
- A client wrapper (streaming, retry, cost logging, rate limit handling)
- A caching layer (for deterministic workloads)
- An evaluation harness (accuracy gates before prompt changes go live)
- A cost dashboard (token spend by task, model, and date)
None of these are complex to build individually. The value is having all of them before the first production incident rather than building them reactively.
One operational pattern worth establishing early: a shadow mode deployment. Before enabling an LLM feature for all users, run it in parallel with your existing system (or a human fallback) and compare outputs. This surfaces model behavior on real production inputs without user impact. It also gives you baseline accuracy numbers before the feature goes live, which makes post-launch regression detection meaningful rather than speculative.
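A shadow-mode wrapper can be sketched in a few lines. All names here are hypothetical: `primary` is the existing system (or human fallback), `shadow` is the LLM path under evaluation, and `record` is whatever sink stores comparisons. The user always receives the primary result, and a failure in the shadow path is recorded rather than raised.

```python
from typing import Any, Callable


def run_in_shadow(
    request: Any,
    primary: Callable[[Any], Any],
    shadow: Callable[[Any], Any],
    record: Callable[[dict], None],
) -> Any:
    """Serve the primary result; run the shadow path only to record a comparison."""
    primary_result = primary(request)
    try:
        shadow_result = shadow(request)
        record({
            "request": request,
            "primary": primary_result,
            "shadow": shadow_result,
            "match": primary_result == shadow_result,
            "error": None,
        })
    except Exception as exc:  # the shadow path must never affect the user
        record({
            "request": request,
            "primary": primary_result,
            "shadow": None,
            "match": False,
            "error": repr(exc),
        })
    return primary_result
```

This version runs the shadow call inline for simplicity. In a live service it would more likely run in a background task so it adds no latency to the request, and exact equality would be replaced by a task-specific comparison.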
The teams that have the most success with LLM integration are the ones who treat the LLM as a fallible component — like a database that sometimes returns wrong answers — and build their architecture around that assumption. Rate limit handling, output validation, cost monitoring, and prompt versioning are not optional extras. They're the foundation.
If you're integrating LLMs into a production application and need the full architecture set up correctly from the start, tell us about the project.
Related: LLM Evaluation: How to Measure Production Accuracy | RAG Pipelines in Production: 5 Lessons from Real Deployments