What is the minimum latency achievable with Deepgram and ElevenLabs?

In our production systems, end-to-end latency from the caller finishing a sentence to the AI starting its reply is consistently under 500ms. Deepgram's nova-2 model returns transcription in 100-200ms. OpenAI's gpt-4o with a short max_tokens limit returns in 150-300ms. ElevenLabs Flash sends the first audio chunk in under 200ms using streaming. Run strictly one after another, those stages can add up to 700ms; the total stays under 500ms only when all three stream concurrently over WebSockets, so each stage starts working before the previous one has finished.

Can a voice AI agent integrate with our existing phone system?

Yes. Most production phone systems use SIP (Session Initiation Protocol) for call routing. We integrate via Twilio's SIP trunking or direct SIP media streaming, which allows the AI to handle calls on your existing number without routing through a third-party telephony layer. For hosted systems (RingCentral, Dialpad, Five9), the integration approach depends on their API surface, but most expose call routing hooks. We scope this during discovery.

Voice AI · June 20, 2026 · 10 min read

Building Voice AI Agents for Production: Deepgram & ElevenLabs

How do you handle callers who interrupt the AI mid-sentence?

Interruption handling requires detecting that the caller has started speaking while the TTS audio is still playing, then immediately stopping playback and processing the new input. We implement this with a VAD (voice activity detection) layer running in parallel with TTS output. When VAD detects speech, we cancel the pending TTS stream, clear the audio buffer, and pass the new transcript to the LLM. Without this, the AI keeps talking over the caller, which is the most common complaint about production voice systems.

How we wire Deepgram, OpenAI, and ElevenLabs over WebSockets to build voice AI agents for real inbound calls. Architecture, edge cases, and production patterns.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.

Most voice AI demos run well in a quiet demo environment with a fast internet connection and a single user. Production voice AI systems run on mobile networks, in noisy environments, with callers who interrupt mid-sentence, who have regional accents, and who ask things the system wasn't designed to handle.

This is the architecture we use at Creative Codes to build voice AI systems that work in the second category. We've shipped this stack for a Canadian telecom provider, an enterprise voice AI platform, and an open-source smart speaker SDK. The patterns here come from what actually broke in production, not from what looked good in a notebook.

What "voice AI" actually involves

A voice AI agent is a pipeline with three distinct stages: speech-to-text (STT), language model inference, and text-to-speech (TTS). Each stage has a latency budget. The total budget for the full round-trip — from the caller finishing a sentence to the AI starting its reply — is around 500ms. Beyond that, callers start to feel the delay.

The budget breakdown looks roughly like this in a well-tuned system:

STT: 100-200ms (Deepgram nova-2 in streaming mode)
LLM: 150-300ms (OpenAI gpt-4o with short max_tokens)
TTS first chunk: 100-200ms (ElevenLabs Flash in streaming mode)

That adds up to 350-700ms depending on network conditions. The only way to hit the lower end consistently is to pipeline all three stages over WebSockets, stream the TTS output before synthesis is complete, and keep conversation context short.
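To make the arithmetic concrete, the short sketch below sums the stage ranges quoted above as if the stages ran strictly back to back. The `STAGE_BUDGET_MS` table and the `budget_ms` helper are illustrative names, not part of our codebase.

```python
# Latency ranges in milliseconds, copied from the breakdown above.
STAGE_BUDGET_MS = {
    "stt": (100, 200),
    "llm": (150, 300),
    "tts_first_chunk": (100, 200),
}

def budget_ms(stages):
    """Return (best_case, worst_case) for stages that run one after another."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst

best, worst = budget_ms(STAGE_BUDGET_MS)
print(f"sequential round trip: {best}-{worst}ms")  # sequential round trip: 350-700ms
```

Only the best case fits inside a 500ms target, which is why the stages have to overlap rather than wait for each other.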

The core pipeline

The entire pipeline runs over a single WebSocket connection. Audio comes in from the caller's phone via Twilio Media Streams or direct SIP, the AI processes it, and audio goes back out — all in real time.

python

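# voice_agent.py: FastAPI WebSocket voice handler (simplified sketch)
from fastapi import WebSocket
from openai import AsyncOpenAI
# deepgram.transcribe() and elevenlabs.generate() below stand in for thin async
# wrappers around the vendors' streaming clients; they are not literal SDK calls.
import deepgram, elevenlabs

llm = AsyncOpenAI()  # async client; reads OPENAI_API_KEY from the environment
SYSTEM_PROMPT = {"role": "system", "content": "You are a concise phone assistant."}  # placeholder prompt
VOICE_ID = "your-elevenlabs-voice-id"  # placeholder

async def voice_session(ws: WebSocket, session_id: str):
    await ws.accept()
    history = []

    async for audio_chunk in ws.iter_bytes():
        # Step 1: Speech to Text
        transcript = await deepgram.transcribe(
            audio_chunk,
            model="nova-2",
            language="en-US",
        )
        # Drop empty or low-confidence transcripts instead of sending them to the LLM
        if not transcript.text or transcript.confidence < 0.65:
            continue

        # Step 2: LLM response with rolling context window
        history.append({"role": "user", "content": transcript.text})
        response = await llm.chat.completions.create(
            model="gpt-4o",
            messages=[SYSTEM_PROMPT, *history[-8:]],
            max_tokens=120,
        )
        reply = response.choices[0].message.content
        history.append({"role": "assistant", "content": reply})

        # Step 3: TTS streamed back to caller
        audio_response = await elevenlabs.generate(
            text=reply,
            voice_id=VOICE_ID,
            model_id="eleven_flash_v2_5",
            stream=True,
        )
        async for chunk in audio_response:
            await ws.send_bytes(chunk)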

Three decisions here that aren't obvious:

Confidence threshold at 0.65. Low-confidence transcripts get dropped rather than passed to the LLM. A transcript of "um I want to uh" with 0.4 confidence will generate a hallucinated response. Dropping it means the AI stays silent for a beat, which is a better user experience than a confused reply. The threshold is configurable — we tune it per deployment based on caller demographics and noise environment.

Rolling 8-turn context window. The conversation history is capped at 8 turns (4 caller, 4 AI). Beyond that, token costs grow linearly and LLM latency increases. For long calls where context matters, we summarize older turns and prepend the summary to the messages array rather than sending raw history indefinitely.

ElevenLabs Flash, not multilingual v2. The Flash model trades some voice quality for dramatically lower first-chunk latency. In phone calls, audio quality is already limited by the PSTN codec — the voice quality difference between Flash and the higher-quality models is negligible, but the latency difference is not.
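The rolling window with a summary of older turns, described in the second point above, can be sketched as follows. This is a minimal illustration: `build_messages` is a hypothetical helper, and how the summary text is produced (for example by a separate LLM call) is left out.

```python
def build_messages(system_prompt, history, summary=None, max_turns=8):
    """Assemble the messages array: system prompt, optional summary, recent turns."""
    messages = [system_prompt]
    if summary:
        # Older turns are represented by a short summary instead of raw text.
        messages.append(
            {"role": "system", "content": f"Summary of earlier conversation: {summary}"}
        )
    messages.extend(history[-max_turns:])  # only the most recent turns go out verbatim
    return messages

# Usage: a 12-turn call keeps its last 8 turns plus a one-line summary.
system = {"role": "system", "content": "You are a concise phone assistant."}
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(12)
]
messages = build_messages(system, history, summary="Caller wants to update a billing address.")
print(len(messages))  # 10: system prompt + summary + 8 recent turns
```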

Interruption handling

The most common complaint about production voice AI is that it keeps talking after the caller interrupts. This happens because the TTS stream is still playing while the caller's new utterance is being transcribed, and the system doesn't know to stop.

The fix requires a VAD (voice activity detection) layer running in parallel with TTS output:

python

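# Helpers assumed to be defined elsewhere in the module: detect_speech() (VAD on
# one audio chunk), get_llm_response() (Step 2 above), stream_tts() (Step 3 above),
# and SILENCE_FRAME (one frame of silent audio in the call's codec).
async def voice_session_with_interrupt(ws: WebSocket, session_id: str):
    await ws.accept()
    history = []
    tts_task = None  # handle to the TTS stream currently playing, if any

    async for audio_chunk in ws.iter_bytes():
        # VAD: caller started speaking while the AI is still talking
        if detect_speech(audio_chunk) and tts_task and not tts_task.done():
            tts_task.cancel()  # stop sending queued TTS audio
            await ws.send_bytes(SILENCE_FRAME)  # flush audio buffer

        transcript = await deepgram.transcribe(audio_chunk, model="nova-2")
        if not transcript.text or transcript.confidence < 0.65:
            continue

        history.append({"role": "user", "content": transcript.text})
        reply = await get_llm_response(history)
        history.append({"role": "assistant", "content": reply})

        # Run TTS as a background task so the loop keeps reading caller audio
        tts_task = asyncio.create_task(stream_tts(ws, reply))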

When the VAD detects new speech from the caller, the active TTS task is cancelled, a silence frame flushes whatever was buffered in the audio pipeline, and the new transcript goes through the full pipeline. The caller experiences a natural conversation flow where the AI actually listens.
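The cancellation step can be tried in isolation. The sketch below simulates a TTS stream with `asyncio.sleep` and cancels it partway through; `stream_tts` here is a stand-in that takes a generic `send` callable, and the timings are arbitrary.

```python
import asyncio

async def stream_tts(send, chunks, delay_s=0.01):
    """Simulated TTS playback: send one chunk, then wait for the next."""
    for chunk in chunks:
        await send(chunk)
        await asyncio.sleep(delay_s)

async def demo():
    sent = []

    async def send(chunk):
        sent.append(chunk)

    # 100 chunks would take about a second to play out in full.
    task = asyncio.create_task(stream_tts(send, [b"\x00"] * 100))
    await asyncio.sleep(0.03)  # the "caller" interrupts shortly after playback starts
    task.cancel()
    try:
        await task  # wait for the cancellation to actually take effect
    except asyncio.CancelledError:
        pass
    return len(sent), task.cancelled()

chunks_sent, was_cancelled = asyncio.run(demo())
print(chunks_sent, was_cancelled)
```

Awaiting the cancelled task before starting the next reply makes sure the old stream has stopped sending before new audio is queued.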

Language and accent handling

Deepgram's nova-2 model handles English well across most accents. For systems that need to support multiple languages or regional variants, the configuration changes per session:

python

LOCALE_CONFIG = {
    "en-US": {"model": "nova-2",         "voice_id": "en_us_voice_id"},
    "en-GB": {"model": "nova-2",         "voice_id": "en_gb_voice_id"},
    "ar":    {"model": "nova-2-general", "voice_id": "ar_voice_id"},
    "es":    {"model": "nova-2-general", "voice_id": "es_voice_id"},
}

We detect the caller's locale from the inbound call metadata (country code from Twilio, or explicit selection from an IVR prompt) and load the matching STT model and TTS voice. A single codebase handles multiple languages with locale-specific configuration rather than separate deployments.

For the Zudu enterprise platform, we supported 80+ languages by maintaining a locale configuration table in Postgres and loading it per-session. Adding a new language meant adding a row, not deploying new code.
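A minimal sketch of the per-session lookup might look like this. The `COUNTRY_TO_LOCALE` mapping, the default locale, and the function names are assumptions for illustration; in a real deployment the table would come from the database rather than a module-level dict.

```python
LOCALE_CONFIG = {
    "en-US": {"model": "nova-2", "voice_id": "en_us_voice_id"},
    "en-GB": {"model": "nova-2", "voice_id": "en_gb_voice_id"},
    "ar": {"model": "nova-2-general", "voice_id": "ar_voice_id"},
    "es": {"model": "nova-2-general", "voice_id": "es_voice_id"},
}
DEFAULT_LOCALE = "en-US"  # assumed fallback

# Hypothetical mapping from the caller's country code to a supported locale.
COUNTRY_TO_LOCALE = {"US": "en-US", "GB": "en-GB", "ES": "es", "MX": "es", "SA": "ar"}

def resolve_locale(country_code=None, ivr_choice=None):
    """An explicit IVR selection wins; otherwise fall back to the country code."""
    if ivr_choice in LOCALE_CONFIG:
        return ivr_choice
    return COUNTRY_TO_LOCALE.get((country_code or "").upper(), DEFAULT_LOCALE)

def session_config(country_code=None, ivr_choice=None):
    locale = resolve_locale(country_code, ivr_choice)
    return {"locale": locale, **LOCALE_CONFIG[locale]}

print(session_config("gb"))
print(session_config("US", ivr_choice="es"))
```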

Conversation state management

For short calls (under 10 turns), in-memory history works fine. For longer calls or multi-session conversations, the history needs to persist somewhere durable.

We use Redis with a TTL:

python

import json
import redis.asyncio as redis

r = redis.from_url("redis://localhost:6379")  # connection URL is a placeholder

async def get_history(session_id: str) -> list:
    data = await r.get(f"voice:history:{session_id}")
    return json.loads(data) if data else []

async def save_history(session_id: str, history: list):
    # SET with ex= rewrites the value and resets the TTL on every save,
    # so an active call keeps extending its own expiry.
    await r.set(
        f"voice:history:{session_id}",
        json.dumps(history[-20:]),  # keep last 20 turns max
        ex=3600,  # 1-hour TTL
    )

The TTL prevents unbounded storage growth. The 20-turn cap means Redis entries stay small. For compliance use cases where call transcripts need long-term retention, we write the full transcript to Postgres at call end rather than keeping it in Redis.

Human escalation

Every production voice AI needs a path to a human agent. The AI cannot handle everything. Trying to handle everything is the fastest way to make callers angry.

Escalation triggers:

Confidence failure: the LLM's response includes a flag indicating it doesn't have the information to answer
Caller request: the caller explicitly asks for a human ("can I speak to someone?", "transfer me", "I want a person")
Loop detection: the same question is asked three times without resolution
Sentiment: strong negative sentiment signals frustration that a human should handle

The implementation is simple: include an escalate function in the LLM's tool list. When the LLM decides escalation is appropriate, it calls the function, the system initiates a Twilio conference transfer with the caller context attached, and the human agent receives a brief on the conversation before joining.

python

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "escalate_to_human",
            "description": "Transfer the caller to a human agent. Use when you cannot resolve the issue.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {"type": "string", "description": "Brief reason for escalation"},
                    "summary": {"type": "string", "description": "One-sentence call summary for the agent"},
                },
                "required": ["reason", "summary"],
            },
        },
    }
]

The warm transfer sends reason and summary to the human agent's interface before connecting them to the caller. The caller gets continuity. The agent gets context. Nobody has to repeat themselves.
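On the receiving side, the tool call has to be turned into the brief the human agent sees. The sketch below assumes the tool name and a JSON string of arguments have already been read from the LLM response; `build_transfer_brief` and the fallback strings are hypothetical, and the telephony transfer itself is out of scope here.

```python
import json

def build_transfer_brief(tool_name, arguments_json, caller_id):
    """Turn an escalate_to_human tool call into the brief shown to the agent."""
    if tool_name != "escalate_to_human":
        return None
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        args = {}  # malformed arguments should not block the transfer
    return {
        "caller_id": caller_id,
        "reason": args.get("reason", "unspecified"),
        "summary": args.get("summary", "No summary available."),
    }

brief = build_transfer_brief(
    "escalate_to_human",
    '{"reason": "billing dispute", "summary": "Caller was charged twice in March."}',
    caller_id="+15550100",
)
print(brief)
```

Falling back to a generic brief when the arguments do not parse keeps the escalation path working even if the model emits broken JSON.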

For healthcare and finance clients, compliance is not optional. The main requirements:

HIPAA (US healthcare): Call recordings containing PHI must be encrypted at rest (AES-256), access-controlled, and retained for a minimum of 6 years. Transcripts are PHI. Our pipeline writes encrypted transcripts to S3 with IAM access restricted to specific roles, and logs every access event to CloudTrail.

GDPR (EU callers): Consent must be captured before recording. We implement this as an IVR prompt at call start: "This call may be recorded for quality purposes. Press 1 to continue or 2 to opt out." The consent decision is stored per caller ID and checked before any transcript retention.

PCI DSS (payment calls): Credit card numbers spoken on calls must not be recorded or transcribed. We implement a pause/resume recording API: when the caller is asked for payment information, recording pauses, and resumes after the card entry is complete.

None of this is complex to implement — it just needs to be designed in from the start rather than retrofitted later.
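As an illustration of designing this in from the start, the consent check and the payment pause can both be expressed as one gate in front of transcript storage. `RetentionGate` is a hypothetical class, and it only covers what gets stored; pausing the call recording itself is a separate call to the telephony provider.

```python
class RetentionGate:
    """Decides whether a transcript line may be retained."""

    def __init__(self, consent_by_caller):
        self.consent_by_caller = consent_by_caller  # caller_id -> bool
        self.paused = False
        self.kept = []

    def pause(self):
        """Call before asking for payment details."""
        self.paused = True

    def resume(self):
        """Call once card entry is complete."""
        self.paused = False

    def offer(self, caller_id, text):
        """Store the line only if the caller consented and retention is not paused."""
        if self.paused or not self.consent_by_caller.get(caller_id, False):
            return False
        self.kept.append(text)
        return True

gate = RetentionGate({"+15550100": True, "+15550101": False})
gate.offer("+15550100", "I'd like to pay my bill.")
gate.pause()
gate.offer("+15550100", "(card number spoken here)")
gate.resume()
gate.offer("+15550101", "Hello?")  # opted out, never stored
print(gate.kept)  # ["I'd like to pay my bill."]
```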

What the architecture looks like end to end

A production voice AI system for inbound calls:

Caller dials your number (Twilio routes to your WebSocket endpoint)
WebSocket session opens, session state initializes in Redis
Twilio streams audio chunks to your FastAPI handler
Deepgram STT transcribes in real time
OpenAI LLM generates response with conversation context
ElevenLabs TTS synthesizes and streams audio back
Twilio plays audio to caller
On call end: transcript written to Postgres, CRM updated, recording (if enabled) saved to S3

The telephony layer (Twilio) is swappable. We've run the same pipeline with Vonage and direct SIP for clients who already had telephony infrastructure. The Deepgram/OpenAI/ElevenLabs core is the same regardless of the telephony provider.

Where this goes wrong in production

Things that don't show up in demos but matter in production:

Duplex audio management. The caller and the AI need to be on separate audio channels. Mixing them causes echo and feedback. Twilio's WebSocket Media Streams handle this with separate inbound/outbound tracks — but you need to route them correctly.
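A sketch of the routing step, assuming the JSON message shape Twilio documents for Media Streams (an `event` field, and for `media` events a `media` object carrying a `track` name and a base64 `payload`). The function names are illustrative, and the exact fields are worth checking against the current Twilio documentation.

```python
import base64
import json
from typing import Optional, Tuple

def parse_media_message(raw: str) -> Optional[Tuple[str, bytes]]:
    """Return (track, audio_bytes) for media events, None for anything else."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None  # e.g. connected/start/stop events carry no audio
    media = msg["media"]
    return media.get("track", "inbound"), base64.b64decode(media["payload"])

def inbound_audio(raw: str) -> Optional[bytes]:
    """Only caller audio should reach STT; outbound audio is the AI's own voice."""
    parsed = parse_media_message(raw)
    if parsed is None:
        return None
    track, audio = parsed
    return audio if track == "inbound" else None

sample = json.dumps({
    "event": "media",
    "media": {"track": "inbound", "payload": base64.b64encode(b"\xff\x7f").decode()},
})
print(inbound_audio(sample))  # b'\xff\x7f'
```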

Latency spikes under load. WebSocket sessions are lightweight. A single FastAPI server handles hundreds of concurrent sessions comfortably. But LLM API latency spikes during peak hours (6-9pm US time, when OpenAI's servers are under load). Build retry logic with a fast timeout and a fallback response for when the LLM takes more than 800ms.
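One way to sketch the timeout-plus-fallback idea with `asyncio.wait_for` is shown below. The split of the 800ms budget across two attempts, the `reply_with_fallback` name, and the fallback wording are assumptions for illustration.

```python
import asyncio

FALLBACK_REPLY = "One moment please."  # placeholder wording

async def reply_with_fallback(llm_call, total_budget_s=0.8, attempts=2):
    """Try the LLM a few times inside a fixed budget, then fall back."""
    per_attempt_s = total_budget_s / attempts
    for _ in range(attempts):
        try:
            # wait_for cancels the in-flight call when the timeout expires
            return await asyncio.wait_for(llm_call(), timeout=per_attempt_s)
        except asyncio.TimeoutError:
            continue
    return FALLBACK_REPLY

async def fast_llm():
    return "Your order shipped yesterday."

async def slow_llm():
    await asyncio.sleep(5)  # simulates a latency spike
    return "too late"

print(asyncio.run(reply_with_fallback(fast_llm)))
print(asyncio.run(reply_with_fallback(slow_llm, total_budget_s=0.1)))
```

`llm_call` is passed as a zero-argument coroutine function so that each retry starts a fresh request.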

Session cleanup. WebSocket connections drop silently. Implement a ping/pong health check and a timeout-based cleanup job that closes orphaned sessions and flushes their Redis state.
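The timeout-based cleanup can be sketched as a small registry that records when each session was last heard from. `SessionRegistry` and the 30-second idle timeout are hypothetical; the caller is expected to close the socket and delete the Redis key for every session id that `reap` returns.

```python
import time

class SessionRegistry:
    """Tracks the last activity time per session and reports orphaned ones."""

    def __init__(self, idle_timeout_s=30.0):
        self.idle_timeout_s = idle_timeout_s
        self.last_seen = {}

    def touch(self, session_id, now=None):
        """Call on every audio chunk or pong received for the session."""
        self.last_seen[session_id] = time.monotonic() if now is None else now

    def reap(self, now=None):
        """Remove and return the sessions that have been idle past the timeout."""
        now = time.monotonic() if now is None else now
        stale = [
            sid for sid, seen in self.last_seen.items()
            if now - seen > self.idle_timeout_s
        ]
        for sid in stale:
            del self.last_seen[sid]
        return stale

registry = SessionRegistry(idle_timeout_s=30.0)
registry.touch("call-a", now=0.0)
registry.touch("call-b", now=25.0)
print(registry.reap(now=40.0))  # ['call-a']
```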

STT hallucinations on silence. Deepgram sometimes transcribes silence as "Yeah." or "Mm-hmm." with high confidence. Filter short single-word transcripts that occur within 500ms of the previous response — they're almost always silence artifacts.
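The filter itself is a few lines. In this sketch `is_silence_artifact` is a hypothetical helper, and `ms_since_last_response` is assumed to be measured by the caller of the function.

```python
def is_silence_artifact(text, ms_since_last_response, window_ms=500):
    """True for a single-word transcript arriving right after the previous response."""
    words = text.strip().split()
    return len(words) == 1 and ms_since_last_response < window_ms

print(is_silence_artifact("Yeah.", 200))        # True
print(is_silence_artifact("Yeah.", 900))        # False
print(is_silence_artifact("I need help", 200))  # False
```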

If you're building a voice AI system for inbound calls, our Voice AI service covers this full stack. We scope the work in a 30-minute discovery call and build from the telephony integration down to the CRM write. See the Zudu and Key2 Telecom case studies for the specific problems we solved on those deployments.
