
Voice Agents
Design and implement low-latency voice agents using speech-to-speech or STT→LLM→TTS pipelines with usable interruption and turn-taking.
Overview
voice-agents is an agent skill, most often used in the Build phase (also Ship and Operate), that guides the design of speech-to-speech and pipeline voice AI with a sub-800ms latency target and interruption-aware conversation design.
Install
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill voice-agents
What is this skill?
- Two architectures: speech-to-speech (Realtime API) vs STT→LLM→TTS pipeline
- Latency budget: target sub-800ms end-to-end; jitter matters as much as averages
- VAD, turn-taking, and barge-in handling called out as experience breakers
- Best-in-class component stacking (e.g., Deepgram STT + ElevenLabs TTS)
- MVP-first iteration from real conversation logs—not premature full platform scope
- 84% of organizations increasing voice AI budgets in 2025 (per skill readme)
Adoption & trust: 653 installs on skills.sh; 40.1k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want natural phone or app voice conversations but stitched STT, LLM, and TTS stacks feel laggy, brittle on interrupts, and painful to debug.
Who is it for?
Indie builders adding voice to a SaaS or agent product who must hit conversational latency and handle barge-in on real audio.
Skip if: you are building a text-only chatbot, your team has no network/API budget for audio providers, or your project does not deal with telephony/WebRTC audio constraints at all.
When should I use this skill?
You are designing or implementing a voice agent and need architecture, latency, VAD, and interruption guidance.
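One way to reason about the interruption guidance is a small turn-taking state machine. The sketch below is illustrative only: the state names, event names, and the `step` function are hypothetical and do not come from the skill or from any provider SDK. It assumes a VAD component emits speech start/end events and that the caller stops audio playback whenever `cancelPlayback` is true.

```typescript
// Hypothetical turn-taking state machine with barge-in handling.
// States and events are illustrative names, not a real SDK's API.
type AgentState = "listening" | "thinking" | "speaking";

type VoiceEvent =
  | { type: "user_speech_started" } // VAD detected the user talking
  | { type: "user_speech_ended" } // VAD detected end of the user's turn
  | { type: "response_ready" } // first agent audio is available
  | { type: "playback_finished" }; // agent audio finished playing

interface Transition {
  next: AgentState;
  cancelPlayback: boolean; // true when the caller should stop agent audio now
}

function step(state: AgentState, event: VoiceEvent): Transition {
  switch (event.type) {
    case "user_speech_started":
      // Barge-in: the user always wins the floor. Playback is cancelled
      // only if the agent was actually speaking.
      return { next: "listening", cancelPlayback: state === "speaking" };
    case "user_speech_ended":
      return { next: state === "listening" ? "thinking" : state, cancelPlayback: false };
    case "response_ready":
      return { next: state === "thinking" ? "speaking" : state, cancelPlayback: false };
    case "playback_finished":
      return { next: state === "speaking" ? "listening" : state, cancelPlayback: false };
  }
}

// Replays a list of events from the "listening" state and returns every transition.
function replay(events: VoiceEvent[]): Transition[] {
  const transitions: Transition[] = [];
  let state: AgentState = "listening";
  for (const event of events) {
    const t = step(state, event);
    transitions.push(t);
    state = t.next;
  }
  return transitions;
}
```

Keeping the transition logic pure (no audio I/O inside `step`) makes it possible to replay logged event sequences from real calls when debugging interruption behavior.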
What do I get? / Deliverables
You choose a realistic architecture (Realtime vs pipeline), set latency and VAD requirements, and integrate proven STT/TTS components with an MVP scope you can iterate from real calls.
- Architecture decision: speech-to-speech vs pipeline
- Integration plan for VAD, turn-taking, and barge-in
- MVP call flow with latency budget
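A latency budget for the pipeline architecture can be written down as data and checked against the sub-800ms target. This is a minimal sketch: the stage names and millisecond figures are hypothetical placeholders, not measurements or values from the skill, and the helper functions are illustrative.

```typescript
// Hypothetical per-stage latency budget for an STT→LLM→TTS pipeline.
// Every figure below is a placeholder chosen for illustration.
interface StageBudget {
  stage: string;
  budgetMs: number;
}

const TARGET_MS = 800; // end-to-end target named in the skill

const pipelineBudget: StageBudget[] = [
  { stage: "vad_end_of_turn", budgetMs: 200 },
  { stage: "stt_final_transcript", budgetMs: 150 },
  { stage: "llm_first_token", budgetMs: 250 },
  { stage: "tts_first_audio_byte", budgetMs: 100 },
  { stage: "network_and_playout", budgetMs: 80 },
];

// Sum of all stage budgets in milliseconds.
function totalBudgetMs(stages: StageBudget[]): number {
  return stages.reduce((sum, s) => sum + s.budgetMs, 0);
}

// Remaining headroom against the target; negative means the plan is over budget.
function headroomMs(stages: StageBudget[], targetMs: number): number {
  return targetMs - totalBudgetMs(stages);
}

// Names of stages whose measured latency exceeds their budget.
// Stages with no measurement are skipped.
function overBudgetStages(
  stages: StageBudget[],
  measuredMs: Record<string, number>,
): string[] {
  return stages
    .filter((s) => {
      const measured = measuredMs[s.stage];
      return measured !== undefined && measured > s.budgetMs;
    })
    .map((s) => s.stage);
}
```

Usage might look like `overBudgetStages(pipelineBudget, { llm_first_token: 400 })`, which flags the LLM stage so tuning effort goes to the stage that actually breaks the budget.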
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Voice agents take shape when you wire up Realtime APIs, telephony, and audio providers, so the canonical shelf is Build/integrations, even though latency tuning continues into Ship. The integrations subphase covers Deepgram, ElevenLabs, OpenAI Realtime, VAD, and the phone-system hooks described in the skill.
Where it fits
Build: Wire OpenAI Realtime or Deepgram+ElevenLabs for your first inbound call flow.
Ship: Profile end-to-end milliseconds and jitter before exposing voice to paying users.
Operate: Trace failed sessions and latency spikes from production conversation logs.
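Profiling jitter alongside averages can start from a plain list of per-turn latencies pulled from conversation logs. The sketch below is one possible approach under stated assumptions: jitter is taken to be the population standard deviation, percentiles use the nearest-rank method, and the `LatencySummary` shape and function names are hypothetical.

```typescript
// Hypothetical summary of per-turn end-to-end latencies (milliseconds).
interface LatencySummary {
  count: number;
  meanMs: number;
  p50Ms: number;
  p95Ms: number;
  jitterMs: number; // assumption: population standard deviation
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  const value = sorted[index];
  if (value === undefined) throw new Error("no latency samples");
  return value;
}

function summarizeLatency(samplesMs: number[]): LatencySummary {
  if (samplesMs.length === 0) throw new Error("no latency samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const meanMs = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  const variance =
    sorted.reduce((sum, x) => sum + (x - meanMs) * (x - meanMs), 0) / sorted.length;
  return {
    count: sorted.length,
    meanMs,
    p50Ms: percentile(sorted, 50),
    p95Ms: percentile(sorted, 95),
    jitterMs: Math.sqrt(variance),
  };
}
```

For samples such as `[300, 400, 500, 600, 1200]`, the mean sits at 600 ms while the 95th percentile is 1200 ms, which is the kind of spike an average alone would hide.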
How it compares
End-to-end voice systems design—not a single-vendor TTS snippet or generic REST backend skill.
Common Questions / FAQ
Who is voice-agents for?
Solo builders shipping voice-enabled agents or apps who need architecture, latency, and provider integration guidance beyond basic speech APIs.
When should I use voice-agents?
Use it in Build/integrations when wiring Realtime or pipeline stacks, in Ship/perf when tuning sub-800ms budgets, and in Operate/monitoring when tracing jitter and failed sessions from production calls.
Is voice-agents safe to install?
Review the Security Audits panel on this Prism page; voice stacks use network APIs and often secrets—scope agent permissions and rotate keys used for STT/TTS providers.
SKILL.md
# Voice Agents

Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis, it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance.

This skill covers two architectures: speech-to-speech (OpenAI Realtime API, lowest latency, most natural) and pipeline (STT→LLM→TTS, more control, easier to debug).

Key insight: latency is the constraint. Humans expect responses in 500ms. Every millisecond matters.

84% of organizations are increasing voice AI budgets in 2025. This is the year voice agents go mainstream.

## Principles

- Latency is the constraint - target <800ms end-to-end
- Jitter (variance) matters as much as absolute latency
- VAD quality determines conversation flow
- Interruption handling makes or breaks the experience
- Start with focused MVP, iterate based on real conversations
- Combine best-in-class components (Deepgram STT + ElevenLabs TTS)

## Capabilities

- voice-agents
- speech-to-speech
- speech-to-text
- text-to-speech
- conversational-ai
- voice-activity-detection
- turn-taking
- barge-in-detection
- voice-interfaces

## Scope

- phone-system-integration → backend
- audio-processing-dsp → audio-specialist
- music-generation → audio-specialist
- accessibility-compliance → accessibility-specialist

## Tooling

### Speech_to_speech

- OpenAI Realtime API
  - When: Lowest latency, most natural conversation
  - Note: gpt-4o-realtime-preview, native voice, sub-500ms
- Pipecat
  - When: Open-source voice orchestration
  - Note: Daily-backed, enterprise-grade, modular

### Speech_to_text

- OpenAI Whisper
  - When: Highest accuracy, multilingual
  - Note: gpt-4o-transcribe for best results
- Deepgram Nova-3
  - When: Production workloads, 54% lower WER
  - Note: 150-184ms TTFT, 90%+ accuracy on noisy audio
- AssemblyAI
  - When: Real-time streaming, speaker diarization
  - Note: Good accuracy-latency balance

### Text_to_speech

- ElevenLabs
  - When: Most natural voice, emotional control
  - Note: Flash model 75ms latency, V3 for expression
- OpenAI TTS
  - When: Integrated with OpenAI stack
  - Note: gpt-4o-mini-tts, 13 voices, streaming
- Deepgram Aura-2
  - When: Cost-effective production TTS
  - Note: 40% cheaper than ElevenLabs, 184ms TTFB

### Frameworks

- Pipecat
  - When: Open-source voice agent orchestration
  - Note: Silero VAD, SmartTurn, interruption handling
- Vapi
  - When: Managed voice agent platform
  - Note: No infrastructure management
- Retell AI
  - When: Low-latency voice agents
  - Note: Best context preservation on interruption

## Patterns

### Speech-to-Speech Architecture

Direct audio-to-audio processing for lowest latency

**When to use**: Maximum naturalness, emotional preservation, real-time conversation

# SPEECH-TO-SPEECH ARCHITECTURE:
"""
[User Audio] → [S2S Model] → [Agent Audio]

Advantages:
- Lowest latency (sub-500ms)
- Preserves emotion, emphasis, accents
- Most natural conversation flow

Disadvantages:
- Less control over responses
- Harder to debug/audit
- Can't easily modify what's said
"""

## OpenAI Realtime API
"""
import { RealtimeClient } from '@openai/realtime-api-beta';

const client = new RealtimeClient({
  apiKey: process.env.OPENAI_API_KEY,
});

// Configure for voice conversation
client.updateSession({
  modalities: ['text', 'audio'],
  voice: 'alloy',
  input_audio_format: 'pcm16',
  output_audio_format: 'pcm16',
  instructions: `You are a helpful customer service agent. Be concise and friendly. If you don't know something, say so rather than making things up.`,
  turn_detection: {
    type: 'server_vad', // or 'semantic_vad'
    threshold: 0.5,
    prefix_padding_ms: 300,
    silence_duration_ms: 500,
  },
});

// Handle audio st