Voice Agents: TL;DR of Week 1

Who is this TL;DR for?

If you're short on time but want to understand what goes into building production-grade voice AI agents, this post condenses the key insights from the first week of the Voice AI course by Daily.co.

Ideal for:

  • Builders shipping voice-first agents
  • Devs optimizing real-time latency and infra
  • Anyone exploring telephony and WebRTC use cases in AI

What is a Voice Agent?

A voice agent is an AI-powered system that listens, thinks, and speaks back. The core pipeline: audio in → speech-to-text → LLM → text-to-speech → audio out.

But this is just the surface. Real-time voice requires solving deep systems challenges:

  • Low-latency networking
  • Turn detection
  • Interruption handling
  • Context management
  • Function calling and tool use
  • Scripting and instruction following
  • Memory and retrieval
  • Integration with legacy systems

Voice AI Stack Today

The typical voice AI loop looks like this:

  1. Convert voice to text (STT)
  2. Process with LLM (context + memory)
  3. Convert response to voice (TTS)

For each of these steps, you have to choose a model that not only fits the task, but also fits the latency budget. A few good model choices for each step are listed below:

  • STT: Deepgram, Gladia, Whisper
  • LLM: GPT-4o, Gemini 2.0 Flash
  • TTS: Cartesia, Rime, ElevenLabs

Keep in mind that the whole loop has to complete in under about one second for the conversation to feel natural.
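
To make the loop concrete, here's a minimal sketch of a single turn in Python. The `stt`, `llm`, and `tts` objects are hypothetical async clients standing in for whichever providers you pick (Deepgram, GPT-4o, Cartesia, etc.), not real SDK calls.

```python
# Minimal sketch of one voice-agent turn: STT -> LLM (with context) -> TTS.
# The stt/llm/tts objects are hypothetical async clients, not real SDKs;
# in practice you would wire in Deepgram, GPT-4o, Cartesia, etc.
from dataclasses import dataclass, field


@dataclass
class ConversationContext:
    messages: list[dict] = field(default_factory=list)


async def handle_turn(audio_chunk: bytes, stt, llm, tts,
                      ctx: ConversationContext) -> bytes:
    user_text = await stt.transcribe(audio_chunk)                  # 1. voice -> text
    ctx.messages.append({"role": "user", "content": user_text})

    reply_text = await llm.complete(ctx.messages)                  # 2. LLM + context
    ctx.messages.append({"role": "assistant", "content": reply_text})

    return await tts.synthesize(reply_text)                        # 3. text -> voice
```

In a real agent, each stage streams partial results into the next rather than waiting for the previous stage to finish; that pipelining is where most of the latency budget is won back.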

Network Layer

The network layer is the most critical part of the stack: it is what moves real-time audio between client and server. Choose the transport based on your use case (a minimal sketch follows the list):

  • WebSockets: TCP-based; good for server-to-server streaming, but not ideal for client-server real-time audio
  • WebRTC: UDP-based; the gold standard for low-latency, bidirectional client-server audio/video, with built-in media controls
  • PSTN / SIP: for dial-in/dial-out phone calls and legacy telephony infrastructure
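
For the server-to-server case, a plain WebSocket is often all you need. Here's a minimal sketch using the Python `websockets` library; the endpoint URL and the end-of-stream convention are illustrative assumptions, not a specific provider's API.

```python
# Sketch: streaming audio frames server-to-server over a WebSocket.
# The endpoint URL and end-of-stream marker are illustrative assumptions.
import asyncio
import websockets

AUDIO_WS_URL = "wss://example.com/audio"   # hypothetical endpoint


async def stream_audio(pcm_frames):
    """Send raw PCM frames (e.g. 100 ms chunks of 16 kHz mono audio) downstream."""
    async with websockets.connect(AUDIO_WS_URL) as ws:
        for frame in pcm_frames:
            await ws.send(frame)           # one binary WebSocket message per chunk
        await ws.send(b"")                 # empty frame as an end-of-stream marker


# asyncio.run(stream_audio(frames))        # where `frames` is an iterable of bytes
```

For browser clients, prefer WebRTC instead: it rides on UDP, so a lost packet doesn't stall everything behind it the way TCP head-of-line blocking does over a WebSocket.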

Latency Breakdown (Worst Case ~993ms)

(Image: worst-case latency breakdown across the voice pipeline)

  • STT and the LLM are the biggest bottlenecks (~650ms combined)
  • Turn detection, sentence aggregation, and speaker output also add up
  • Small gains across the stack translate into big improvements in perceived responsiveness

Note: this worst-case breakdown image is a bit dated; models have improved and latency has come down since, but it still gives you a good ballpark for where the time goes in a voice agent.
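
A practical exercise is to write down your own per-stage budget and check it against the ~1 second target. The numbers below are illustrative placeholders (only the ~650ms STT + LLM figure echoes the breakdown above); profile your own stack and substitute measured values.

```python
# Illustrative latency budget check. The per-stage numbers are placeholders,
# not measurements; only STT + LLM ~650 ms mirrors the breakdown above.
BUDGET_MS = 1000

stage_latency_ms = {
    "audio capture + network": 60,
    "turn detection (endpointing)": 200,
    "STT final transcript": 300,
    "LLM time-to-first-token": 350,
    "sentence aggregation + TTS first byte": 60,
    "playback start": 20,
}

total = sum(stage_latency_ms.values())
print(f"total: {total} ms ({'over' if total > BUDGET_MS else 'within'} budget)")
for stage, ms in sorted(stage_latency_ms.items(), key=lambda kv: -kv[1]):
    print(f"  {stage:40s} {ms:4d} ms  ({ms / total:.0%})")
```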

Hosting & Infrastructure

Traditional Web Apps vs Voice AI

| Traditional App | Real-time Voice AI |
| --- | --- |
| Short-lived requests | Persistent connection (minutes) |
| Request/response pattern | Streaming bidirectional data |
| Stateless | Stateful sessions |
| High volume, low duration | Lower volume, longer duration |
| More predictable resource usage | Variable resource consumption |
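
The persistent, stateful row is the one that changes how you write server code: each call owns a long-lived connection with its own state instead of hitting a stateless handler. Below is a minimal sketch with aiohttp, where `echo_back` is a placeholder for the real STT → LLM → TTS loop.

```python
# Sketch: one long-lived, stateful WebSocket session per call (vs. a stateless
# request handler). echo_back is a placeholder for the real STT -> LLM -> TTS loop.
import aiohttp
from aiohttp import web


async def echo_back(audio: bytes, state: dict) -> bytes:
    state["turns"] = state.get("turns", 0) + 1    # per-call state survives across turns
    return audio                                   # placeholder: just echo the caller


async def voice_session(request: web.Request) -> web.WebSocketResponse:
    ws = web.WebSocketResponse()
    await ws.prepare(request)
    state: dict = {}                               # lives for the duration of the call
    async for msg in ws:                           # connection stays open for minutes
        if msg.type == aiohttp.WSMsgType.BINARY:   # audio frames arrive as binary
            await ws.send_bytes(await echo_back(msg.data, state))
    return ws


app = web.Application()
app.add_routes([web.get("/session", voice_session)])
# web.run_app(app, port=8080)
```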

Resource Allocation

The following rule of thumb for per-agent resource allocation is a reasonable starting point that you can tune for your specific workload (see the sanity check after the list):

  • ~0.5 vCPU & 1GB RAM per voice agent
  • ~40kbps per agent via WebRTC
  • Optimize for cold-starts: warm instances, scheduling, caching
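
As a quick sanity check, those numbers translate into an agents-per-instance estimate. The instance shape below (8 vCPU / 16 GB / ~1 Gbps) is a made-up example, not a recommendation.

```python
# Rough capacity estimate from the rule of thumb above.
# The instance shape is a hypothetical example, not a sizing recommendation.
CPU_PER_AGENT = 0.5               # vCPU
RAM_PER_AGENT_GB = 1.0
BANDWIDTH_PER_AGENT_KBPS = 40     # WebRTC audio

INSTANCE = {"vcpus": 8, "ram_gb": 16, "bandwidth_kbps": 1_000_000}

agents_by_cpu = INSTANCE["vcpus"] / CPU_PER_AGENT
agents_by_ram = INSTANCE["ram_gb"] / RAM_PER_AGENT_GB
agents_by_net = INSTANCE["bandwidth_kbps"] / BANDWIDTH_PER_AGENT_KBPS

max_agents = int(min(agents_by_cpu, agents_by_ram, agents_by_net))
print(f"~{max_agents} concurrent agents per instance (CPU/RAM bound here, not bandwidth)")
```

In practice, leave headroom below that ceiling so spikes in per-call processing don't degrade every other active conversation on the box.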

Deployment Strategy

Keep in mind that time to first word matters a lot. Callers expect a phone call to be answered within roughly 3-5 seconds, so optimize for that metric.

As for the deployment strategy, here are some general tips:

  1. Start simple: Begin with a basic container setup and lightweight tooling (Docker, Fly.io, etc.); you can always move to AWS, GCP, etc. as you scale.
  2. Optimize: Warm pools, scheduling, drain timers
  3. Scale: Reactive (demand-based) or Predictive (historical patterns)

Run warm instances to ensure that resources are always available to handle requests. Pre-warm your instances when you expect a spike in traffic.

Understand the resource usage patterns and predictively scale your instances based on demand. Invest in ways to reduce cold-starts.
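
One way to make "pre-warm for spikes" concrete is a simple warm-pool target: keep enough idle instances to absorb the expected arrivals during one cold-start window. The numbers and function below are an illustrative sketch, not a formula from the course.

```python
# Sketch: sizing a warm pool so cold starts don't show up in time-to-first-word.
# All numbers are illustrative; derive expected_calls_per_min from your own
# historical traffic patterns (predictive scaling).
import math

COLD_START_SECONDS = 20           # time to boot an instance and load models


def warm_pool_target(expected_calls_per_min: float, safety_factor: float = 1.5) -> int:
    """Idle instances needed to cover arrivals during one cold-start window."""
    arrivals_during_cold_start = expected_calls_per_min * (COLD_START_SECONDS / 60)
    return math.ceil(arrivals_during_cold_start * safety_factor)


print(warm_pool_target(expected_calls_per_min=30))   # -> 15 warm instances
```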

Infra Providers

  • ML-native: Modal, Cerebrium, Baseten
  • Hyperscalers: AWS, GCP
  • Voice-native: Vapi, Pipecat, Layercode

Telephony: Connecting to Real Phones

PSTN (Public Switched Telephone Network)

  • Connects to real phone numbers
  • Use cases: call centers, general public
  • Providers: Twilio, Telnyx, SignalWire

SIP (Session Initiation Protocol)

  • Enterprise IP telephony
  • Greater control over complex call flows

WebRTC

  • For high-fidelity audio in browsers
  • Track-level control, better quality

Advanced Call Features

  • DTMF Handling: Capture the caller's keypad input (e.g., "press 1 for sales")
  • Cold Transfer: Redirect the call to another number immediately
  • Warm Transfer: The bot briefs the human agent before handing the caller off
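
A DTMF menu usually reduces to a digit-to-handler map plus a fallback. The sketch below is hypothetical; in practice the keypress events arrive from your telephony provider (Twilio, Telnyx, SignalWire, ...).

```python
# Sketch: routing DTMF keypresses to actions. The event source is hypothetical;
# real digits arrive via your telephony provider's events or webhooks.
from typing import Callable


def route_to_sales(call_id: str) -> str:
    return f"cold-transferring {call_id} to sales"


def route_to_support(call_id: str) -> str:
    return f"warm-transferring {call_id} to support"


DTMF_MENU: dict[str, Callable[[str], str]] = {
    "1": route_to_sales,
    "2": route_to_support,
}


def on_dtmf(call_id: str, digit: str) -> str:
    handler = DTMF_MENU.get(digit)
    if handler is None:
        return "unrecognized input; replaying menu"
    return handler(call_id)


print(on_dtmf("call-123", "1"))
```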

Key Takeaways

  • Voice agents are real-time, multimodal, and hard.
  • You’re not just doing LLMs, you’re solving infra + audio + network + UX.
  • WebRTC and pre-warmed infra are critical.
  • Telephony is a first-class citizen, not an afterthought.
  • Everything breaks without tight latency control.