
ElevenLabs
Generate voiceovers, SFX, and music via ElevenLabs when your agent is assembling video, podcast, or game audio assets.
Overview
ElevenLabs is an agent skill for the Build phase that generates AI voiceovers, sound effects, and music through the ElevenLabs APIs for video, podcast, and game audio.
Install
npx skills add https://github.com/digitalsamba/claude-code-video-toolkit --skill elevenlabs
What is this skill?
- Text-to-speech with four documented models (multilingual_v2, flash_v2_5, turbo_v2_5, v3) plus VoiceSettings tuning
- SSML-style control on flash/turbo models via `<break>` and `<phoneme>` tags
- Sound effects and music generation from text descriptions for video and game soundtracks
- Voice cloning, plus multilingual_v2 for production-stable narration
- Requires ELEVENLABS_API_KEY in `.env`; includes Python client examples for save-to-file workflows
- Documents 4 TTS model IDs with explicit quality, SSML, and latency notes
- eleven_multilingual_v2 lists 29 languages for production-stable speech
Adoption & trust: 614 installs on skills.sh; 1.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have scripts and scene descriptions but no fast way for your coding agent to produce consistent narration, SFX, or beds without leaving the repo.
Who is it for?
Indie builders automating narration and soundtrack layers for marketing videos, demos, or small games with an ElevenLabs account.
Skip if: you need broadcast licensing review, zero cloud spend, or lip-sync/avatar pipelines; the skill covers API audio synthesis only.
When should I use this skill?
Creating audio for videos, podcasts, or games—voiceovers, narration, dialogue, SFX from descriptions, background music, soundtrack generation, or voice cloning.
What do I get? / Deliverables
Your agent emits configured TTS, effects, or music files via the documented Python client patterns, ready to drop into your video or audio toolchain.
- Saved audio files (e.g. voiceover.mp3) from TTS, SFX, or music API calls
- Model and VoiceSettings choices documented in the agent run
Journey fit
Audio synthesis hooks into the build phase where solo builders integrate third-party APIs into media pipelines before ship. ElevenLabs is an external API integration—TTS, music, and SFX generation—not a standalone launch or growth tactic.
How it compares
API integration skill for scripted audio generation, not a local open-source TTS CLI or a full video editor.
Common Questions / FAQ
Who is elevenlabs for?
Solo and indie builders using agentic coding tools to add voiceovers, SFX, and music to videos, podcasts, or games via ElevenLabs.
When should I use elevenlabs?
During Build when you are integrating media APIs—e.g. generating narration for a launch teaser, podcast intro beds, or game UI voice lines from text.
Is elevenlabs safe to install?
Review the Security Audits panel on this Prism page before installing; the skill expects a secrets-bearing API key and outbound calls to ElevenLabs.
SKILL.md
# ElevenLabs Audio Generation

Requires `ELEVENLABS_API_KEY` in `.env`.

## Text-to-Speech

```python
from elevenlabs.client import ElevenLabs
from elevenlabs import save, VoiceSettings
import os

client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))

audio = client.text_to_speech.convert(
    text="Welcome to my video!",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    voice_settings=VoiceSettings(
        stability=0.5,
        similarity_boost=0.75,
        style=0.5,
        speed=1.0
    )
)
save(audio, "voiceover.mp3")
```

### Models

| Model | Quality | SSML Support | Notes |
|-------|---------|--------------|-------|
| `eleven_multilingual_v2` | Highest consistency | None | Stable, production-ready, 29 languages |
| `eleven_flash_v2_5` | Good | `<break>`, `<phoneme>` | Fast, supports pause/pronunciation tags |
| `eleven_turbo_v2_5` | Good | `<break>`, `<phoneme>` | Fastest latency |
| `eleven_v3` | Most expressive | None | Alpha: unreliable, needs prompt engineering |

**Choose:** multilingual_v2 for reliability, flash/turbo for SSML control, v3 for maximum expressiveness (expect retakes).

### Voice Settings by Style

| Style | stability | similarity | style | speed |
|-------|-----------|------------|-------|-------|
| Natural/professional | 0.75-0.85 | 0.9 | 0.0-0.1 | 1.0 |
| Conversational | 0.5-0.6 | 0.85 | 0.3-0.4 | 0.9-1.0 |
| Energetic/YouTuber | 0.3-0.5 | 0.75 | 0.5-0.7 | 1.0-1.1 |

### Pauses Between Sections

**With flash/turbo models:** Use SSML break tags inline:
```
...end of section. <break time="1.5s" /> Start of next...
```
Max 3 seconds per break. Excessive breaks can cause speed artifacts.

**With multilingual_v2 / v3:** No SSML support. Options:
- Paragraph breaks (blank lines): creates ~0.3-0.5s natural pause
- Post-process with ffmpeg: split audio and insert silence

**WARNING:** `...` (ellipsis) is NOT a reliable pause; it can be vocalized as a word/sound. Do not use ellipsis as a pause mechanism.

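The pause rules above can be sketched as a small text-preparation helper. This is a minimal illustration, not part of the skill or the ElevenLabs SDK: `join_sections`, `SSML_MODELS`, and `MAX_BREAK_SECONDS` are hypothetical names. The helper assumes only what the section states: flash/turbo models accept `<break>` tags of at most 3 seconds, while the other models get paragraph breaks.

```python
# Hypothetical helper (not part of the ElevenLabs SDK): joins script
# sections with a pause mechanism that matches the chosen model.

# Models that the table above lists as supporting SSML <break> tags.
SSML_MODELS = {"eleven_flash_v2_5", "eleven_turbo_v2_5"}

# The section above states a maximum of 3 seconds per break.
MAX_BREAK_SECONDS = 3.0


def join_sections(sections, model_id, pause_seconds=1.5):
    """Join script sections into one text string for a TTS request.

    For flash/turbo models, sections are separated by an SSML break tag
    clamped to the 3-second maximum. For other models, sections are
    separated by a blank line, which yields only a short natural pause.
    """
    # Drop empty sections so no doubled separators are emitted.
    cleaned = [s.strip() for s in sections if s.strip()]

    if model_id in SSML_MODELS:
        pause = min(max(pause_seconds, 0.0), MAX_BREAK_SECONDS)
        separator = f' <break time="{pause:g}s" /> '
    else:
        # No SSML support: fall back to a paragraph break.
        separator = "\n\n"

    return separator.join(cleaned)


if __name__ == "__main__":
    script = ["End of section.", "Start of next."]
    print(join_sections(script, "eleven_flash_v2_5"))
    print(repr(join_sections(script, "eleven_multilingual_v2")))
```

The returned string would be passed as the `text` argument of `client.text_to_speech.convert(...)`. When a paragraph break is too short for multilingual_v2 or v3, inserting silence with ffmpeg in post remains the more precise option.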
### Pronunciation Control

**Phonetic spelling (any model):** Write words as you want them pronounced:
- `Janus` → `Jan-us`
- `nginx` → `engine-x`
- Use dashes, capitals, apostrophes to guide pronunciation

**SSML phoneme tags (flash/turbo only):**
```
<phoneme alphabet="ipa" ph="ˈdʒeɪnəs">Janus</phoneme>
```

### Iterative Workflow

1. Generate → listen → identify pronunciation/pacing issues
2. Adjust: phonetic spellings, break tags, voice settings
3. Regenerate. If pauses aren't precise enough, add silence in post with ffmpeg rather than fighting the TTS engine.

## Voice Cloning

### Instant Voice Clone

```python
with open("sample.mp3", "rb") as f:
    voice = client.voices.ivc.create(
        name="My Voice",
        files=[f],
        remove_background_noise=True
    )
print(f"Voice ID: {voice.voice_id}")
```

- Use `client.voices.ivc.create()` (not `client.voices.clone()`)
- Pass file handles in binary mode (`"rb"`), not paths
- Convert m4a first: `ffmpeg -i input.m4a -codec:a libmp3lame -qscale:a 2 output.mp3`
- Multiple samples (2-3 clips) improve accuracy
- Save voice ID for reuse

**Professional Voice Clone:** Requires Creator plan+, 30+ min audio. See [reference.md](reference.md).

## Sound Effects

Max 22 seconds per generation.

```python
result = client.text_to_sound_effects.convert(
    text="Thunder rumbling followed by heavy rain",
    duration_seconds=10,
    prompt_influence=0.3
)
with open("thunder.mp3", "wb") as f:
    for chunk in result:
        f.write(chunk)
```

**Prompt tips:** Be specific: "Heavy footsteps on wooden floorboards, slow and deliberate, with creaking"

## Music Generation