
Quick Answer
A production voice AI agent runs three integrated layers: automatic speech recognition (ASR) to transcribe the user's speech, a large language model (LLM) for reasoning and response generation, and a text-to-speech (TTS) engine to deliver the answer. Each layer adds latency: without streaming, total end-to-end response time in production typically runs 1.5 to 3 seconds. The architecture choices that reduce this, and the failure modes that break it, are what this post covers.
Key Takeaways
When we built a voice-enabled ordering agent for a B2B commerce client, the demo took two days to build. The production system took ten weeks. The gap between those two timelines is everything this post is about.
Every voice AI agent runs on the same fundamental architecture. The implementation choices at each layer determine cost, latency, and reliability.
| Layer | Function | Tools We Use | Typical Latency |
|---|---|---|---|
| ASR (Speech Recognition) | Converts user audio to text | OpenAI Whisper, Deepgram, Google STT | 200-500ms |
| LLM (Reasoning + Response) | Generates the appropriate reply | Claude 3.5+, GPT-4o, Gemini 1.5 Pro | 400-1,500ms |
| TTS (Speech Synthesis) | Converts text response to audio | ElevenLabs, Azure Neural TTS, Google TTS | 150-400ms |
| Total end-to-end | From user stops speaking to first audio byte | (sum of the above) | 750-2,400ms |
The most instructive project we can share is the B2B WhatsApp voice ordering agent we built for a manufacturing-sector client handling orders across India, UAE, and Singapore.
The requirement: a field sales representative sends a WhatsApp voice message to place an order in their native language (English, Hindi, or Arabic). The AI processes the order, confirms the details, and routes it to the client's ERP. What we delivered: 60% faster order processing, 40% fewer order entry errors, and voice ordering that works across 3 languages on mobile networks.
In a non-streaming pipeline, the user speaks, the ASR transcribes the entire message, the LLM generates the entire response, and the TTS renders the entire audio before the user hears a single word. Total wait: 1.5-3 seconds minimum. In a streaming pipeline, the LLM starts generating tokens as soon as ASR has a partial transcript, and TTS starts rendering audio as soon as the first sentence is complete. The user hears a response within 800ms of stopping speaking.
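The key mechanism in the streaming pipeline is sentence-boundary chunking: the LLM's token stream is buffered only until a sentence closes, then that sentence is handed to TTS while generation continues. A minimal sketch, assuming `token_stream` yields text fragments as the LLM emits them (the function name and regex are our own illustration, not a specific vendor API):

```python
import re

# Split points: sentence-ending punctuation followed by whitespace.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def sentences_from_stream(token_stream):
    """Yield complete sentences as soon as they close, so TTS can start
    rendering audio before the LLM has finished the full reply."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last fragment is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence would be passed to the TTS engine immediately; the first audio byte then depends on the first sentence, not the whole reply.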
Voice activity detection (VAD) answers one question: has the user finished speaking? Most demos use a simple silence timer. In production with background noise, this fails constantly. We use Silero VAD running client-side, classifying audio frames as speech or non-speech in real time at under 10ms per frame.
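The end-of-turn decision on top of per-frame VAD output can be sketched as follows. This assumes the per-frame speech probabilities come from a model such as Silero VAD (roughly 30ms per frame); the function and thresholds are illustrative, not the model's own API:

```python
def end_of_turn(frame_probs, threshold=0.5, min_silence_frames=25):
    """Decide whether the user has finished speaking, given per-frame
    speech probabilities. Returns True once speech has been heard and
    is followed by `min_silence_frames` consecutive non-speech frames
    (~750ms at 30ms/frame) -- long enough to survive natural pauses."""
    heard_speech = False
    silence_run = 0
    for p in frame_probs:
        if p >= threshold:
            heard_speech = True
            silence_run = 0  # any speech resets the silence counter
        else:
            silence_run += 1
            if heard_speech and silence_run >= min_silence_frames:
                return True
    return False
```

Unlike a raw silence timer, this only fires after confirmed speech, so background noise before the user talks cannot trigger a premature cutoff.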
A common mistake: send all audio to English ASR, then try to detect the language of the resulting transcript. If the ASR model is English-optimized and the user spoke Hindi, the transcript is useless. The correct architecture: run a lightweight language identification model on the first 2 seconds of audio before routing to the appropriate ASR model.
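The routing step is small enough to show in full. In this sketch, `identify_language` stands in for a lightweight spoken-language-ID model run on the first ~2 seconds of audio; the route table and model names are illustrative placeholders, not specific products:

```python
# Map detected language codes to language-optimized ASR backends.
ASR_ROUTES = {
    "en": "english-optimized-asr",
    "hi": "hindi-optimized-asr",
    "ar": "arabic-optimized-asr",
}
# Unknown or code-switched speech falls back to a multilingual model.
FALLBACK_ASR = "whisper-large-v3"

def route_asr(audio_first_2s, identify_language):
    """Pick an ASR backend from the audio itself, before transcription."""
    lang = identify_language(audio_first_2s)
    return ASR_ROUTES.get(lang, FALLBACK_ASR)
```

The important property is the fallback: a detection miss degrades to a multilingual model rather than producing an unusable English transcript.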
General-purpose ASR models struggle with product codes and domain-specific terminology. For the ordering agent, we built a post-ASR correction layer: a fuzzy matcher that compares ASR output against the product catalog. For a 2,000-item catalog, this runs in under 15ms and reduces order entry errors by 40%.
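The core of such a correction layer can be sketched with the standard library alone. This is a simplified stand-in for the matcher described above -- a production version would add phonetic normalization and confidence thresholds tuned to the catalog:

```python
import difflib

def correct_product_code(asr_text, catalog, cutoff=0.6):
    """Post-ASR correction: snap a possibly garbled transcript token to
    the closest entry in the product catalog, or None if nothing is
    close enough to trust."""
    matches = difflib.get_close_matches(asr_text.upper(), catalog,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Returning `None` on a weak match matters: it is cheaper to ask the user to repeat a code than to silently route the wrong SKU to the ERP.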
Voice data is personal data under GDPR (EU), the DPDP Act (India), and PDPA (Singapore). For our B2B ordering agent: consent gate on first interaction, audit logging for every voice transaction, and audio storage on AWS Mumbai (ap-south-1) to satisfy India data residency requirements.
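The consent gate and audit trail can be expressed as a simple guard in front of the pipeline. A minimal sketch, assuming `consent_store` and `audit_log` are backed by durable storage in production (all names here are our own, not a compliance framework's API):

```python
import datetime

consent_store = set()  # user IDs with recorded consent
audit_log = []         # one entry per processed voice transaction

def record_consent(user_id):
    consent_store.add(user_id)

def handle_voice_message(user_id, audio):
    if user_id not in consent_store:
        # First interaction: do not process audio until consent is recorded.
        return "consent_prompt"
    audit_log.append({
        "user_id": user_id,
        "event": "voice_order_processed",
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return "processed"
```

The point of the structure is ordering: the consent check runs before any ASR call, so unconsented audio is never transcribed or stored.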
ASR options:
- Deepgram Nova-2: best latency for real-time streaming (~$0.0043/minute).
- OpenAI Whisper (self-hosted): most cost-effective at scale, with the best multilingual support.
- Google STT v2: performs well on Indian English.
TTS options:
- ElevenLabs: highest-quality voices (~$0.30/1,000 characters).
- Azure Neural TTS: strong price-performance at enterprise scale.
- Google WaveNet: cost-effective for high-volume GCP deployments.
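Using the prices quoted above, a back-of-envelope per-turn cost is straightforward to compute. The sketch below covers ASR and TTS only; LLM token cost is excluded because it varies by model and prompt size, and the defaults are the figures from the text, not current vendor pricing:

```python
def voice_turn_cost(audio_minutes, reply_chars,
                    asr_per_min=0.0043,      # Deepgram Nova-2, per the text
                    tts_per_1k_chars=0.30):  # ElevenLabs, per the text
    """Estimate ASR + TTS cost (USD) for one conversational turn."""
    return audio_minutes * asr_per_min + (reply_chars / 1000) * tts_per_1k_chars
```

For example, a 30-second voice message answered with a 200-character reply costs roughly six cents in TTS alone -- which is why high-volume deployments often trade ElevenLabs for a cheaper engine.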
For more on production AI agent architecture, see our guide to building AI agents with Model Context Protocol. If you are evaluating a conversational AI partner, ask them to demo the specific failure modes, not just the happy path. See also: how to choose an AI development company.
**Which TTS engine should we choose?**
ElevenLabs produces the most natural-sounding voices and is best for customer-facing use cases where voice quality drives trust. Azure Neural TTS offers the strongest price-performance at enterprise scale. For cost-sensitive, high-volume use cases, Google WaveNet is acceptable.

**How do we reduce response latency?**
Use streaming throughout the pipeline: streaming ASR, streaming LLM token generation, and streaming TTS synthesis. Sentence-boundary detection ensures the TTS only renders complete sentences. Choose lower-latency LLM variants (Claude Haiku, GPT-4o mini) for conversational turns.

**What are the compliance requirements for voice data?**
Voice recordings are personal data under GDPR (EU), the DPDP Act (India), and PDPA (Singapore). Key requirements: explicit consent before recording, transparent disclosure of AI processing, data retention limits, and audit logging. Build consent architecture before building the agent.

**Can one agent handle multiple languages?**
Yes, with the right architecture. Use a language detection model on the audio stream before routing to ASR. For code-switching, multilingual ASR models like Whisper large-v3 handle this better than routing to separate single-language models.

**What is the difference between a voice chatbot and a voice AI agent?**
A voice chatbot follows a scripted decision tree. A voice AI agent reasons: it understands intent, retrieves relevant information, makes decisions within defined parameters, and handles unexpected inputs with contextually appropriate responses.

**Which metrics matter in production?**
Track task completion rate, fallback trigger rate (escalations to human), ASR confidence distribution, and user correction rate. Latency metrics tell you the system is fast; completion rate tells you it is actually useful.
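The quality signals above can be aggregated from per-turn records. A minimal sketch, where the record field names (`completed`, `fell_back`, `corrected`, `asr_confidence`) are our own illustration of what a turn log might contain:

```python
def usefulness_metrics(turns):
    """Aggregate production quality signals from per-turn records.
    Each record: booleans `completed`, `fell_back`, `corrected`,
    and a float `asr_confidence` in [0, 1]."""
    n = len(turns)
    if n == 0:
        return {}
    return {
        "task_completion_rate": sum(t["completed"] for t in turns) / n,
        "fallback_rate": sum(t["fell_back"] for t in turns) / n,
        "user_correction_rate": sum(t["corrected"] for t in turns) / n,
        "mean_asr_confidence": sum(t["asr_confidence"] for t in turns) / n,
    }
```

Watching correction rate alongside ASR confidence is what surfaces the gap between "the system answered quickly" and "the user got what they asked for."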