Building Voice AI Agents for Production: Deepgram & ElevenLabs
How we wire Deepgram, OpenAI, and ElevenLabs over WebSockets to build voice AI agents for real inbound calls. Architecture, edge cases, and production patterns.
Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.
Most voice AI demos run well in a quiet demo environment with a fast internet connection and a single user. Production voice AI systems run on mobile networks, in noisy environments, with callers who interrupt mid-sentence, who have regional accents, and who ask things the system wasn't designed to handle.
This is the architecture we use at Creative Codes to build voice AI systems that hold up under those production conditions. We've shipped this stack for a Canadian telecom provider, an enterprise voice AI platform, and an open-source smart speaker SDK. The patterns here come from problems we actually hit in production, not from what looked good in a notebook.
What "voice AI" actually involves
A voice AI agent is a pipeline with three distinct stages: speech-to-text (STT), language model inference, and text-to-speech (TTS). Each stage has a latency budget. The total budget for the full round-trip — from the caller finishing a sentence to the AI starting its reply — is around 500ms. Beyond that, callers start to feel the delay.
The budget breakdown looks roughly like this in a well-tuned system:
- STT: 100-200ms (Deepgram nova-2 in streaming mode)
- LLM: 150-300ms (OpenAI gpt-4o with short max_tokens)
- TTS first chunk: 100-200ms (ElevenLabs Flash in streaming mode)
That adds up to 350-700ms depending on network conditions. The only way to hit the lower end consistently is to pipeline all three stages over WebSockets, stream the TTS output before synthesis is complete, and keep conversation context short.
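The arithmetic behind that range is worth keeping next to your latency measurements. A minimal sketch using the per-stage figures listed above; `BUDGET_MS` and `total_budget` are illustrative names, not part of any SDK:

```python
# Per-stage latency budget in milliseconds: (best case, worst case).
BUDGET_MS = {
    "stt": (100, 200),
    "llm": (150, 300),
    "tts_first_chunk": (100, 200),
}


def total_budget(budget: dict) -> tuple:
    """Sum the best-case and worst-case latencies across all stages."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


print(total_budget(BUDGET_MS))  # (350, 700)
```

Comparing measured per-stage timings against a table like this makes it obvious which stage is eating the budget on a slow call.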
The core pipeline
The entire pipeline runs over a single WebSocket connection. Audio comes in from the caller's phone via Twilio Media Streams or direct SIP, the AI processes it, and audio goes back out — all in real time.
# voice_agent.py: FastAPI WebSocket voice handler (simplified)
import asyncio
from fastapi import WebSocket
# Treat deepgram, openai, and elevenlabs here as thin async wrappers around
# each vendor's streaming SDK; the calls below are simplified for readability.
import deepgram, openai, elevenlabs

# SYSTEM_PROMPT (a {"role": "system", ...} message) and VOICE_ID are defined elsewhere.

async def voice_session(ws: WebSocket, session_id: str):
    await ws.accept()
    history = []
    async for audio_chunk in ws.iter_bytes():
        # Step 1: Speech to Text
        transcript = await deepgram.transcribe(
            audio_chunk,
            model="nova-2",
            language="en-US",
        )
        # Drop empty or low-confidence transcripts instead of sending them to the LLM
        if not transcript.text or transcript.confidence < 0.65:
            continue

        # Step 2: LLM response with rolling context window
        history.append({"role": "user", "content": transcript.text})
        response = await openai.chat.completions.create(
            model="gpt-4o",
            messages=[SYSTEM_PROMPT, *history[-8:]],
            max_tokens=120,
        )
        reply = response.choices[0].message.content
        history.append({"role": "assistant", "content": reply})

        # Step 3: TTS streamed back to caller
        audio_response = await elevenlabs.generate(
            text=reply,
            voice_id=VOICE_ID,
            model_id="eleven_flash_v2_5",
            stream=True,
        )
        async for chunk in audio_response:
            await ws.send_bytes(chunk)

Three decisions here that aren't obvious:
Confidence threshold at 0.65. Low-confidence transcripts get dropped rather than passed to the LLM. A transcript of "um I want to uh" with 0.4 confidence gives the LLM nothing real to answer, so it tends to produce a hallucinated response. Dropping it means the AI stays silent for a beat, which is a better user experience than a confused reply. The threshold is configurable: we tune it per deployment based on caller demographics and noise environment.
Rolling 8-turn context window. The conversation history sent to the LLM is capped at the last 8 messages (4 from the caller, 4 from the AI). Without a cap, the prompt grows with every turn, so token costs rise linearly and LLM latency increases. For long calls where context matters, we summarize older turns and prepend the summary to the messages array rather than sending raw history indefinitely.
ElevenLabs Flash, not multilingual v2. The Flash model trades some voice quality for dramatically lower first-chunk latency. In phone calls, audio quality is already limited by the PSTN codec — the voice quality difference between Flash and the higher-quality models is negligible, but the latency difference is not.
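The summarize-and-prepend approach for long calls can be sketched as a small message builder. This is a minimal sketch under assumptions: `build_messages` is a hypothetical helper, the summary is injected as a second system message (one of several reasonable choices), and producing the summary itself (for example with a separate, cheaper LLM call) is left out.

```python
from typing import Optional


def build_messages(
    system_prompt: dict,
    history: list,
    summary: Optional[str] = None,
    window: int = 8,
) -> list:
    """Return the messages array: system prompt, optional summary, recent turns."""
    messages = [system_prompt]
    # Only include the summary once some turns have fallen out of the window.
    if summary and len(history) > window:
        messages.append(
            {"role": "system", "content": f"Summary of earlier conversation: {summary}"}
        )
    messages.extend(history[-window:])
    return messages
```

With a 10-message history and a summary, the result is the system prompt, the summary, and the last 8 messages, so the prompt size stays bounded no matter how long the call runs.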
Interruption handling
The most common complaint about production voice AI is that it keeps talking after the caller interrupts. This happens because the TTS stream is still playing while the caller's new utterance is being transcribed, and the system doesn't know to stop.
The fix requires a VAD (voice activity detection) layer running in parallel with TTS output:
# detect_speech, get_llm_response, stream_tts, and SILENCE_FRAME are helpers defined elsewhere.
async def voice_session_with_interrupt(ws: WebSocket, session_id: str):
    await ws.accept()
    history = []
    tts_task = None  # handle to the TTS stream currently playing, if any
    async for audio_chunk in ws.iter_bytes():
        # VAD: caller started speaking while the AI is still talking
        if detect_speech(audio_chunk) and tts_task and not tts_task.done():
            tts_task.cancel()
            await ws.send_bytes(SILENCE_FRAME)  # flush audio buffer

        transcript = await deepgram.transcribe(audio_chunk, model="nova-2")
        if not transcript.text or transcript.confidence < 0.65:
            continue

        history.append({"role": "user", "content": transcript.text})
        reply = await get_llm_response(history)
        history.append({"role": "assistant", "content": reply})
        # Run TTS as a background task so the loop keeps reading caller audio
        tts_task = asyncio.create_task(stream_tts(ws, reply))

When the VAD detects new speech from the caller, the active TTS task is cancelled, a silence frame flushes whatever was buffered in the audio pipeline, and the new transcript goes through the full pipeline. The caller experiences a natural conversation flow where the AI actually listens.
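The `detect_speech` helper is not shown above. As a rough illustration only, here is an energy-threshold VAD, which is the simplest possible stand-in and not necessarily what a production system should use (a trained VAD model is usually more robust to background noise). It assumes the chunk is 16-bit little-endian PCM; telephony audio often arrives as 8 kHz mu-law and would need decoding first. The threshold value is an arbitrary starting point.

```python
import array
import math
import sys


def rms(pcm16: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian PCM audio."""
    samples = array.array("h")
    samples.frombytes(pcm16[: len(pcm16) - len(pcm16) % 2])  # drop a trailing odd byte
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian; array uses native order
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_speech(pcm16: bytes, threshold: float = 500.0) -> bool:
    """Treat a chunk as speech when its energy crosses the threshold."""
    return rms(pcm16) >= threshold
```

In practice you would also require several consecutive speech frames before cancelling TTS, so a cough or a door slam does not cut the AI off mid-sentence.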
Language and accent handling
Deepgram's nova-2 model handles English well across most accents. For systems that need to support multiple languages or regional variants, the configuration changes per session:
LOCALE_CONFIG = {
    "en-US": {"model": "nova-2", "voice_id": "en_us_voice_id"},
    "en-GB": {"model": "nova-2", "voice_id": "en_gb_voice_id"},
    "ar": {"model": "nova-2-general", "voice_id": "ar_voice_id"},
    "es": {"model": "nova-2-general", "voice_id": "es_voice_id"},
}

We detect the caller's locale from the inbound call metadata (country code from Twilio, or explicit selection from an IVR prompt) and load the matching STT model and TTS voice. A single codebase handles multiple languages with locale-specific configuration rather than separate deployments.
For the Zudu enterprise platform, we supported 80+ languages by maintaining a locale configuration table in Postgres and loading it per-session. Adding a new language meant adding a row, not deploying new code.
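Locale resolution itself can be a small pure function. A minimal sketch, assuming an explicit IVR choice takes priority over the country code and that unknown callers fall back to a default; `COUNTRY_TO_LOCALE`, `DEFAULT_LOCALE`, and `resolve_locale` are hypothetical names, and the country mapping is illustrative rather than complete:

```python
from typing import Optional

LOCALE_CONFIG = {
    "en-US": {"model": "nova-2", "voice_id": "en_us_voice_id"},
    "en-GB": {"model": "nova-2", "voice_id": "en_gb_voice_id"},
    "ar": {"model": "nova-2-general", "voice_id": "ar_voice_id"},
    "es": {"model": "nova-2-general", "voice_id": "es_voice_id"},
}

# Hypothetical mapping from ISO country codes to supported locales.
COUNTRY_TO_LOCALE = {"US": "en-US", "GB": "en-GB", "ES": "es", "MX": "es", "SA": "ar"}
DEFAULT_LOCALE = "en-US"


def resolve_locale(country_code: Optional[str] = None, ivr_choice: Optional[str] = None) -> str:
    """Pick a supported locale: explicit IVR choice first, then country, then default."""
    if ivr_choice in LOCALE_CONFIG:
        return ivr_choice
    locale = COUNTRY_TO_LOCALE.get((country_code or "").upper())
    return locale if locale in LOCALE_CONFIG else DEFAULT_LOCALE
```

The session then reads `LOCALE_CONFIG[resolve_locale(...)]` once at call start and passes the model and voice ID down to the STT and TTS stages.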
Conversation state management
For short calls (under 10 turns), in-memory history works fine. For longer calls or multi-session conversations, the history needs to persist somewhere durable.
We use Redis with a TTL:
import json
import redis.asyncio as redis

# Connection URL is a placeholder; point it at your own Redis instance.
r = redis.from_url("redis://localhost:6379/0")

async def get_history(session_id: str) -> list:
    data = await r.get(f"voice:history:{session_id}")
    return json.loads(data) if data else []

async def save_history(session_id: str, history: list):
    # SET with ex= rewrites the value and resets the TTL on every save,
    # so an active call keeps extending its own expiry.
    await r.set(
        f"voice:history:{session_id}",
        json.dumps(history[-20:]),  # keep last 20 turns max
        ex=3600,  # 1-hour TTL, extend on activity
    )

The TTL prevents unbounded storage growth. The 20-turn cap means Redis entries stay small. For compliance use cases where call transcripts need long-term retention, we write the full transcript to Postgres at call end rather than keeping it in Redis.
Human escalation
Every production voice AI needs a path to a human agent. The AI cannot handle everything. Trying to handle everything is the fastest way to make callers angry.
Escalation triggers:
- Confidence failure: the LLM's response includes a flag indicating it doesn't have the information to answer
- Caller request: the caller explicitly asks for a human ("can I speak to someone?", "transfer me", "I want a person")
- Loop detection: the same question is asked three times without resolution
- Sentiment: strong negative sentiment signals frustration that a human should handle
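The caller-request and loop-detection triggers can also be checked deterministically, outside the LLM, as a safety net. A minimal sketch: the phrase list is illustrative, `should_escalate` is a hypothetical helper, and "same question" is approximated by exact match after normalization, which is cruder than semantic matching.

```python
import re
from collections import Counter
from typing import Optional

# Illustrative phrases only; tune the list per deployment.
HUMAN_REQUEST = re.compile(
    r"\b(human|person|agent|representative|speak to someone|transfer me)\b",
    re.IGNORECASE,
)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(cleaned.split())


def should_escalate(user_turns: list, max_repeats: int = 3) -> Optional[str]:
    """Return an escalation reason, or None if the call should continue."""
    if user_turns and HUMAN_REQUEST.search(user_turns[-1]):
        return "caller_request"
    counts = Counter(normalize(turn) for turn in user_turns)
    if counts and max(counts.values()) >= max_repeats:
        return "loop_detected"
    return None
```

Running this on every caller turn costs nothing in latency and catches the cases where the LLM fails to call its escalation tool.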
The implementation is simple: include an escalate function in the LLM's tool list. When the LLM decides escalation is appropriate, it calls the function, the system initiates a Twilio conference transfer with the caller context attached, and the human agent receives a brief on the conversation before joining.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "escalate_to_human",
            "description": "Transfer the caller to a human agent. Use when you cannot resolve the issue.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {"type": "string", "description": "Brief reason for escalation"},
                    "summary": {"type": "string", "description": "One-sentence call summary for the agent"},
                },
                "required": ["reason", "summary"],
            },
        },
    }
]

The warm transfer sends reason and summary to the human agent's interface before connecting them to the caller. The caller gets continuity. The agent gets context. Nobody has to repeat themselves.
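On the receiving side, the tool call arrives as a function name plus a JSON string of arguments, and the JSON is not guaranteed to be valid. A minimal sketch of defensive parsing; `parse_escalation` is a hypothetical helper, the default strings are placeholders, and the actual transfer call is omitted:

```python
import json
from typing import Optional


def parse_escalation(name: str, arguments: str) -> Optional[dict]:
    """Extract reason and summary from an escalate_to_human tool call."""
    if name != "escalate_to_human":
        return None
    try:
        args = json.loads(arguments)
    except json.JSONDecodeError:
        args = {}  # malformed arguments still escalate, with defaults
    if not isinstance(args, dict):
        args = {}
    return {
        "reason": str(args.get("reason") or "unspecified"),
        "summary": str(args.get("summary") or "No summary provided"),
    }
```

With the OpenAI Python SDK, each entry in `response.choices[0].message.tool_calls` exposes `function.name` and `function.arguments`, which map directly onto the two parameters here. Falling back to defaults rather than raising means a malformed tool call never blocks the transfer.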
Compliance: HIPAA, GDPR, and call recording
For healthcare and finance clients, compliance is not optional. The main requirements:
HIPAA (US healthcare): Transcripts and call recordings that contain PHI must be access-controlled and, in practice, encrypted at rest (AES-256). Retention periods come from state law and the client's own policy; six years is a common baseline because HIPAA requires compliance documentation to be kept that long. Our pipeline writes encrypted transcripts to S3 with IAM access restricted to specific roles, and logs every access event to CloudTrail.
GDPR (EU callers): Consent must be captured before recording. We implement this as an IVR prompt at call start: "This call may be recorded for quality purposes. Press 1 to continue or 2 to opt out." The consent decision is stored per caller ID and checked before any transcript retention.
PCI DSS (payment calls): Credit card numbers spoken on calls must not be recorded or transcribed. We implement a pause/resume recording API: when the caller is asked for payment information, recording pauses, and resumes after the card entry is complete.
None of this is complex to implement — it just needs to be designed in from the start rather than retrofitted later.
What the architecture looks like end to end
A production voice AI system for inbound calls:
- Caller dials your number (Twilio routes to your WebSocket endpoint)
- WebSocket session opens, session state initializes in Redis
- Twilio streams audio chunks to your FastAPI handler
- Deepgram STT transcribes in real time
- OpenAI LLM generates response with conversation context
- ElevenLabs TTS synthesizes and streams audio back
- Twilio plays audio to caller
- On call end: transcript written to Postgres, CRM updated, recording (if enabled) saved to S3
The telephony layer (Twilio) is swappable. We've run the same pipeline with Vonage and direct SIP for clients who already had telephony infrastructure. The Deepgram/OpenAI/ElevenLabs core is the same regardless of the telephony provider.
Where this goes wrong in production
Things that don't show up in demos but matter in production:
Duplex audio management. The caller and the AI need to be on separate audio channels. Mixing them causes echo and feedback. Twilio's WebSocket Media Streams handle this with separate inbound/outbound tracks — but you need to route them correctly.
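Routing the tracks correctly starts with parsing the stream messages. The handlers earlier in this post treat the socket as raw bytes for brevity; Twilio Media Streams actually sends JSON text messages with base64-encoded audio (8 kHz mu-law). A minimal sketch based on Twilio's documented message shape; `inbound_audio` and `outbound_media` are hypothetical helper names:

```python
import base64
import json
from typing import Optional


def inbound_audio(message: str) -> Optional[bytes]:
    """Return caller audio from a Media Streams message, or None for other events."""
    event = json.loads(message)
    if event.get("event") != "media":
        return None  # connected / start / stop / mark events carry no audio
    media = event.get("media", {})
    if media.get("track", "inbound") != "inbound":
        return None  # never feed the AI's own voice back into STT
    return base64.b64decode(media["payload"])


def outbound_media(stream_sid: str, audio: bytes) -> str:
    """Wrap synthesized audio in a media message addressed to the call's stream."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })
```

The `streamSid` comes from the `start` event at the beginning of the call and has to be stored in session state so outbound audio can reference it.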
Latency spikes under load. WebSocket sessions are lightweight. A single FastAPI server handles hundreds of concurrent sessions comfortably. But LLM API latency spikes during peak hours (6-9pm US time, when OpenAI's servers are under load). Build retry logic with a fast timeout and a fallback response for when the LLM takes more than 800ms.
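The timeout-and-fallback logic can be sketched with `asyncio.wait_for`. This is a minimal sketch under assumptions: the 800ms ceiling is split into two 400ms attempts (one possible split, not a recommendation), `reply_with_timeout` is a hypothetical helper, and `make_request` is any zero-argument callable that returns an awaitable producing the reply text.

```python
import asyncio

# Hypothetical filler line spoken when the LLM misses its deadline.
FALLBACK_REPLY = "Sorry, give me one moment."


async def reply_with_timeout(
    make_request,
    attempt_timeout_s: float = 0.4,
    attempts: int = 2,
    fallback: str = FALLBACK_REPLY,
) -> str:
    """Try the LLM call a bounded number of times, then fall back."""
    for _ in range(attempts):
        try:
            # wait_for cancels the in-flight request when the timeout expires.
            return await asyncio.wait_for(make_request(), timeout=attempt_timeout_s)
        except asyncio.TimeoutError:
            continue
    return fallback
```

Keeping `attempts * attempt_timeout_s` at or below the 800ms ceiling bounds the worst-case silence the caller hears before the fallback plays.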
Session cleanup. WebSocket connections drop silently. Implement a ping/pong health check and a timeout-based cleanup job that closes orphaned sessions and flushes their Redis state.
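The bookkeeping side of that cleanup job can be sketched as a small registry of last-seen timestamps. A minimal sketch: `SessionRegistry` is a hypothetical class, the 30-second idle timeout is an arbitrary example value, and closing the socket and deleting the Redis key are left to the caller.

```python
import time
from typing import Optional


class SessionRegistry:
    """Tracks when each session last showed signs of life (audio or pong)."""

    def __init__(self, idle_timeout_s: float = 30.0):
        self.idle_timeout_s = idle_timeout_s
        self._last_seen = {}

    def touch(self, session_id: str, now: Optional[float] = None) -> None:
        """Record activity for a session."""
        self._last_seen[session_id] = time.monotonic() if now is None else now

    def remove(self, session_id: str) -> None:
        """Forget a session that closed cleanly."""
        self._last_seen.pop(session_id, None)

    def expired(self, now: Optional[float] = None) -> list:
        """Return IDs of sessions idle for longer than the timeout."""
        current = time.monotonic() if now is None else now
        return [
            sid for sid, seen in self._last_seen.items()
            if current - seen > self.idle_timeout_s
        ]
```

A periodic task would call `expired()`, close each orphaned WebSocket, delete its `voice:history:{session_id}` key, and then call `remove()`.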
STT hallucinations on silence. Deepgram sometimes transcribes silence as "Yeah." or "Mm-hmm." with high confidence. Filter short single-word transcripts that occur within 500ms of the previous response — they're almost always silence artifacts.
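That filter is a one-line predicate. A minimal sketch, assuming you track the time elapsed since the AI's previous response; `is_silence_artifact` is a hypothetical helper name:

```python
def is_silence_artifact(text: str, ms_since_reply: float, window_ms: float = 500.0) -> bool:
    """True for single-word transcripts that arrive right after the AI's reply."""
    words = text.strip().split()
    return len(words) <= 1 and ms_since_reply <= window_ms
```

Apply it after the confidence check, since these artifacts tend to pass that check on their own.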
If you're building a voice AI system for inbound calls, our Voice AI service covers this full stack. We scope the work in a 30-minute discovery call and build from the telephony integration down to the CRM write. See the Zudu and Key2 Telecom case studies for the specific problems we solved on those deployments.