Elevenlabs

Name: Elevenlabs
Author: digitalsamba

digitalsamba/claude-code-video-toolkit

817 installs
1.8k repo stars
Updated July 6, 2026

elevenlabs is an agent skill that generates AI voiceovers, sound effects, and music with the ElevenLabs Python client so video, podcast, and game pipelines can produce narration and soundtrack assets programmatically.

About

The elevenlabs skill in digitalsamba/claude-code-video-toolkit teaches agents to generate voiceovers, sound effects, and background music through the ElevenLabs Python client, with an ELEVENLABS_API_KEY stored in .env. It documents text_to_speech.convert with VoiceSettings presets for natural, conversational, and energetic delivery across four models: eleven_multilingual_v2 for stable 29-language output, eleven_flash_v2_5 and eleven_turbo_v2_5 for SSML break and phoneme tags, and the expressive but alpha-stage eleven_v3, which tends to need retakes. Instant voice cloning uses client.voices.ivc.create with audio samples opened in binary mode. text_to_sound_effects.convert generates sound effects of up to 22 seconds, and music.compose produces tracks from 10 seconds to 5 minutes, with force_instrumental for background beds. The toolkit ties into Remotion through tools/voiceover.py and the /generate-voiceover command, which produce per-scene MP3 files plus a manifest.json whose durations drive Series.Sequence frame counts. Developers reach for elevenlabs when video, podcast, or game pipelines need programmatic narration, described effects, or synced soundtrack assets.

Four TTS models with guidance on multilingual_v2, flash, turbo, and v3 tradeoffs
VoiceSettings tables for natural, conversational, and energetic delivery styles
Instant voice cloning via client.voices.ivc.create with binary audio samples
Sound effects up to 22 seconds and music.compose tracks from 10 seconds to 5 minutes
Remotion workflow via tools/voiceover.py and /generate-voiceover scene manifests

Elevenlabs by the numbers

817 all-time installs (skills.sh)
+19 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #310 of 1,340 Generative Media skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

elevenlabs capabilities & compatibility

Use cases: video generation · orchestration

From the docs

What elevenlabs says it does

Generate AI voiceovers, sound effects, and music using ElevenLabs APIs.

SKILL.md

VOICEOVER-SCRIPT.md → voiceover.py → public/audio/ → Remotion composition

SKILL.md

npx skills add https://github.com/digitalsamba/claude-code-video-toolkit --skill elevenlabs

Installs	817
Repo stars	1.8k
Security audit	3 / 3 scanners passed
Last updated	July 6, 2026
Repository	digitalsamba/claude-code-video-toolkit

How do you produce consistent voiceovers, described sound effects, and background music for video scenes without manual recording sessions or stock audio hunting?

Generate voiceovers, sound effects, and music with ElevenLabs APIs for video, podcast, and game audio pipelines inside Claude Code video projects.

Who is it for?

Developers building Remotion or Claude Code video toolkit projects who need ElevenLabs TTS, cloning, SFX, or music with scene-level timing manifests.

Skip if: Teams without an ELEVENLABS_API_KEY, non-Python stacks, or audio work that does not map to the documented ElevenLabs client calls.

When should I use this skill?

The user asks for voiceovers, narration, dialogue, voice cloning, sound effects from descriptions, background music, or soundtrack generation for videos, podcasts, or games.

What you get

MP3 voiceover, SFX, and music files plus a manifest.json with per-scene durations ready for Remotion Series.Sequence timing.

Scene MP3 voiceover files under public/audio/scenes/
manifest.json with per-file durations
Optional SFX and instrumental music tracks

By the numbers

Documents 4 TTS models (3 production-ready plus the alpha eleven_v3), with SSML support called out for flash and turbo
Sound effects capped at 22 seconds; music.compose supports 10 seconds to 5 minutes

Files

SKILL.md (Markdown)

ElevenLabs Audio Generation

Requires ELEVENLABS_API_KEY in .env.

Text-to-Speech

from elevenlabs.client import ElevenLabs
from elevenlabs import save, VoiceSettings
import os

client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))

audio = client.text_to_speech.convert(
    text="Welcome to my video!",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    voice_settings=VoiceSettings(
        stability=0.5,
        similarity_boost=0.75,
        style=0.5,
        speed=1.0
    )
)
save(audio, "voiceover.mp3")

Models

Model	Quality	SSML Support	Notes
`eleven_multilingual_v2`	Highest consistency	None	Stable, production-ready, 29 languages
`eleven_flash_v2_5`	Good	`<break>`, `<phoneme>`	Fast, supports pause/pronunciation tags
`eleven_turbo_v2_5`	Good	`<break>`, `<phoneme>`	Fastest latency
`eleven_v3`	Most expressive	None	Alpha — unreliable, needs prompt engineering

Choose: multilingual_v2 for reliability, flash/turbo for SSML control, v3 for maximum expressiveness (expect retakes).

Voice Settings by Style

Style	stability	similarity	style	speed
Natural/professional	0.75-0.85	0.9	0.0-0.1	1.0
Conversational	0.5-0.6	0.85	0.3-0.4	0.9-1.0
Energetic/YouTuber	0.3-0.5	0.75	0.5-0.7	1.0-1.1
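
The table above can be captured as reusable presets. The following is a minimal sketch, assuming one value picked from inside each range; the `STYLE_PRESETS` name and the `settings_for` helper are hypothetical and not part of the toolkit or the ElevenLabs client.

```python
# Hypothetical presets: one value chosen from inside each range in the table above.
STYLE_PRESETS = {
    "natural": {"stability": 0.8, "similarity_boost": 0.9, "style": 0.05, "speed": 1.0},
    "conversational": {"stability": 0.55, "similarity_boost": 0.85, "style": 0.35, "speed": 0.95},
    "energetic": {"stability": 0.4, "similarity_boost": 0.75, "style": 0.6, "speed": 1.05},
}


def settings_for(style: str) -> dict:
    """Return a copy of the preset so callers can tweak values without mutating it."""
    if style not in STYLE_PRESETS:
        raise ValueError(f"unknown style {style!r}; expected one of {sorted(STYLE_PRESETS)}")
    return dict(STYLE_PRESETS[style])
```

The returned dict uses the same keyword names as the VoiceSettings call in the Text-to-Speech example, so it should unpack as `VoiceSettings(**settings_for("conversational"))`.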

Pauses Between Sections

With flash/turbo models: Use SSML break tags inline:

...end of section. <break time="1.5s" /> Start of next...

Max 3 seconds per break. Excessive breaks can cause speed artifacts.

With multilingual_v2 / v3: No SSML support. Options:

Paragraph breaks (blank lines) — creates ~0.3-0.5s natural pause
Post-process with ffmpeg: split audio and insert silence

WARNING: ... (ellipsis) is NOT a reliable pause — it can be vocalized as a word/sound. Do not use ellipsis as a pause mechanism.
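
The pause rules above can be sketched as a small helper that picks the pause mechanism from the model ID. This is an illustrative sketch only; `join_sections` and the constant names are hypothetical, and the 3-second clamp simply mirrors the limit stated above.

```python
SSML_MODELS = {"eleven_flash_v2_5", "eleven_turbo_v2_5"}
MAX_BREAK_SECONDS = 3.0  # per-break limit stated above


def join_sections(sections: list[str], model_id: str, pause_seconds: float = 1.0) -> str:
    """Join script sections with a pause mechanism the given model understands."""
    cleaned = [s.strip() for s in sections if s.strip()]
    if model_id in SSML_MODELS:
        # Clamp the requested pause into the 0..3 second range.
        pause = min(max(pause_seconds, 0.0), MAX_BREAK_SECONDS)
        tag = f' <break time="{pause:g}s" /> '
        return tag.join(cleaned)
    # No SSML support: fall back to a blank line between sections.
    return "\n\n".join(cleaned)
```

For example, `join_sections(["End of section.", "Start of next."], "eleven_flash_v2_5", 1.5)` produces a single string with one 1.5 s break tag between the two sentences.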

Pronunciation Control

Phonetic spelling (any model): Write words as you want them pronounced:

Janus → Jan-us
nginx → engine-x
Use dashes, capitals, apostrophes to guide pronunciation

SSML phoneme tags (flash/turbo only):

<phoneme alphabet="ipa" ph="ˈdʒeɪnəs">Janus</phoneme>
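
Phonetic respelling works on any model, so it can be applied as a preprocessing step before the text is sent for synthesis. A minimal sketch, assuming a hand-maintained dictionary; `RESPELLINGS` and `apply_respellings` are hypothetical names and the matching is case-sensitive.

```python
import re

# Example respellings taken from the list above.
RESPELLINGS = {"Janus": "Jan-us", "nginx": "engine-x"}


def apply_respellings(text: str, respellings: dict[str, str] = RESPELLINGS) -> str:
    """Replace whole words only, so 'nginx' changes but 'nginxconf' does not."""
    if not respellings:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, respellings)) + r")\b")
    return pattern.sub(lambda match: respellings[match.group(1)], text)
```

Keeping the dictionary in one place makes the regenerate step repeatable: each new problem word found while listening becomes one more entry.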

Iterative Workflow

1. Generate → listen → identify pronunciation/pacing issues
2. Adjust: phonetic spellings, break tags, voice settings
3. Regenerate

If pauses aren't precise enough, add silence in post with ffmpeg rather than fighting the TTS engine.

Voice Cloning

Instant Voice Clone

with open("sample.mp3", "rb") as f:
    voice = client.voices.ivc.create(
        name="My Voice",
        files=[f],
        remove_background_noise=True
    )
print(f"Voice ID: {voice.voice_id}")

Use client.voices.ivc.create() (not client.voices.clone())
Pass file handles in binary mode ("rb"), not paths
Convert m4a first: ffmpeg -i input.m4a -codec:a libmp3lame -qscale:a 2 output.mp3
Multiple samples (2-3 clips) improve accuracy
Save voice ID for reuse

Professional Voice Clone: Requires Creator plan+, 30+ min audio. See reference.md.

Sound Effects

Max 22 seconds per generation.

result = client.text_to_sound_effects.convert(
    text="Thunder rumbling followed by heavy rain",
    duration_seconds=10,
    prompt_influence=0.3
)
with open("thunder.mp3", "wb") as f:
    for chunk in result:
        f.write(chunk)

Prompt tips: Be specific — "Heavy footsteps on wooden floorboards, slow and deliberate, with creaking"

Music Generation

10 seconds to 5 minutes. Use client.music.compose() (not .generate()).

result = client.music.compose(
    prompt="Upbeat indie rock, catchy guitar riff, energetic drums, travel vlog",
    music_length_ms=60000,
    force_instrumental=True
)
with open("music.mp3", "wb") as f:
    for chunk in result:
        f.write(chunk)

Prompt structure: Genre, mood, instruments, tempo, use case. Add "no vocals" or use force_instrumental=True for background music.

Remotion Integration

Complete Workflow: Script to Synchronized Scene

VOICEOVER-SCRIPT.md → voiceover.py → public/audio/ → Remotion composition
        ↓                  ↓               ↓                 ↓
  Scene narration    Generate MP3    Audio files     <Audio> component
  with durations     per scene       with timing     synced to scenes

Step 1: Generate Per-Scene Audio

Use the toolkit's voiceover tool to generate audio for each scene:

# Generate voiceover files for each scene
python tools/voiceover.py --scene-dir public/audio/scenes --json

# Output:
# public/audio/scenes/
#   ├── scene-01-title.mp3
#   ├── scene-02-problem.mp3
#   ├── scene-03-solution.mp3
#   └── manifest.json  (durations for each file)

The manifest.json contains timing info:

{
  "scenes": [
    { "file": "scene-01-title.mp3", "duration": 4.2 },
    { "file": "scene-02-problem.mp3", "duration": 12.8 },
    { "file": "scene-03-solution.mp3", "duration": 15.3 }
  ],
  "totalDuration": 32.3
}
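
To show how these durations turn into frame counts, here is a short Python sketch that applies the same round-up rule as the Remotion examples in this document. The `frames_from_manifest` helper is hypothetical, and 30 fps is the frame rate those examples assume.

```python
import json
import math

FPS = 30  # frame rate assumed by the Remotion examples in this document


def frames_from_manifest(manifest: dict, fps: int = FPS) -> dict[str, int]:
    """Map each scene file to a frame count, rounding up so audio is never cut off."""
    frames = {}
    for scene in manifest["scenes"]:
        # round() first so floating-point noise cannot push a whole number up by one frame
        frames[scene["file"]] = math.ceil(round(scene["duration"] * fps, 6))
    return frames


manifest = json.loads("""
{
  "scenes": [
    { "file": "scene-01-title.mp3", "duration": 4.2 },
    { "file": "scene-02-problem.mp3", "duration": 12.8 },
    { "file": "scene-03-solution.mp3", "duration": 15.3 }
  ],
  "totalDuration": 32.3
}
""")
frames = frames_from_manifest(manifest)
print(frames)
# {'scene-01-title.mp3': 126, 'scene-02-problem.mp3': 384, 'scene-03-solution.mp3': 459}
```

Rounding up rather than down means a scene can run a fraction of a frame longer than its audio, but never shorter.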

Step 2: Use Audio in Remotion Composition

// src/Composition.tsx
import { Audio, staticFile, Series } from 'remotion';

// Import scene components
import { TitleSlide } from './scenes/TitleSlide';
import { ProblemSlide } from './scenes/ProblemSlide';
import { SolutionSlide } from './scenes/SolutionSlide';

// Scene durations (from manifest.json, converted to frames at 30fps)
const SCENE_DURATIONS = {
  title: Math.ceil(4.2 * 30),      // 126 frames
  problem: Math.ceil(12.8 * 30),   // 384 frames
  solution: Math.ceil(15.3 * 30),  // 459 frames
};

export const MainComposition: React.FC = () => {
  return (
    <>
      {/* Scene sequence */}
      <Series>
        <Series.Sequence durationInFrames={SCENE_DURATIONS.title}>
          <TitleSlide />
        </Series.Sequence>
        <Series.Sequence durationInFrames={SCENE_DURATIONS.problem}>
          <ProblemSlide />
        </Series.Sequence>
        <Series.Sequence durationInFrames={SCENE_DURATIONS.solution}>
          <SolutionSlide />
        </Series.Sequence>
      </Series>

      {/* Audio track - plays continuously across all scenes */}
      <Audio src={staticFile('audio/voiceover.mp3')} volume={1} />

      {/* Optional: Background music at lower volume */}
      <Audio src={staticFile('audio/music.mp3')} volume={0.15} />
    </>
  );
};

Step 3: Per-Scene Audio (Alternative)

For more control, add audio to each scene individually:

// src/scenes/ProblemSlide.tsx
import { Audio, staticFile } from 'remotion';

export const ProblemSlide: React.FC = () => {
  return (
    <div style={{ /* slide styles */ }}>
      <h1>The Problem</h1>
      {/* Scene content */}

      {/* Audio starts when this scene starts (frame 0 of this sequence) */}
      <Audio src={staticFile('audio/scenes/scene-02-problem.mp3')} />
    </div>
  );
};

Syncing Visuals to Voiceover

Calculate scene duration from audio, not the other way around:

// src/config/timing.ts
import manifest from '../../public/audio/scenes/manifest.json';

const FPS = 30;

// Convert audio durations to frame counts
export const sceneDurations = manifest.scenes.reduce((acc, scene) => {
  const name = scene.file.replace(/^scene-\d+-/, '').replace('.mp3', '');
  acc[name] = Math.ceil(scene.duration * FPS);
  return acc;
}, {} as Record<string, number>);

// Usage in composition:
// <Series.Sequence durationInFrames={sceneDurations.title}>

Audio Timing Patterns

import { Audio, Sequence, interpolate, useCurrentFrame } from 'remotion';

// Fade in audio
export const FadeInAudio: React.FC<{ src: string; fadeFrames?: number }> = ({
  src,
  fadeFrames = 30
}) => {
  const frame = useCurrentFrame();
  const volume = interpolate(frame, [0, fadeFrames], [0, 1], {
    extrapolateRight: 'clamp',
  });
  return <Audio src={src} volume={volume} />;
};

// Delayed audio start
export const DelayedAudio: React.FC<{ src: string; delayFrames: number }> = ({
  src,
  delayFrames
}) => (
  <Sequence from={delayFrames}>
    <Audio src={src} />
  </Sequence>
);

// Usage:
// <FadeInAudio src={staticFile('audio/music.mp3')} fadeFrames={60} />
// <DelayedAudio src={staticFile('audio/sfx/whoosh.mp3')} delayFrames={45} />

Voiceover + Demo Video Sync

When a scene has both voiceover and demo video:

import { Audio, OffthreadVideo, staticFile, useVideoConfig } from 'remotion';

export const DemoScene: React.FC = () => {
  const { durationInFrames, fps } = useVideoConfig();

  // Calculate playback rate to fit demo into voiceover duration
  const demoDuration = 45; // seconds (original demo length)
  const sceneDuration = durationInFrames / fps; // seconds (from voiceover)
  const playbackRate = demoDuration / sceneDuration;

  return (
    <>
      <OffthreadVideo
        src={staticFile('demos/feature-demo.mp4')}
        playbackRate={playbackRate}
      />
      <Audio src={staticFile('audio/scenes/scene-04-demo.mp3')} />
    </>
  );
};
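
The playback-rate arithmetic in the component above can be checked outside Remotion. A worked sketch in Python, with a hypothetical `playback_rate` helper and illustrative numbers:

```python
def playback_rate(demo_seconds: float, scene_frames: int, fps: int = 30) -> float:
    """Speed factor that makes a demo clip last exactly as long as the scene."""
    if demo_seconds <= 0 or scene_frames <= 0 or fps <= 0:
        raise ValueError("durations and fps must be positive")
    scene_seconds = scene_frames / fps
    return demo_seconds / scene_seconds


# A 45 s demo squeezed into a 450-frame (15 s) scene plays at 3x speed.
print(playback_rate(45, 450))  # 3.0
```

A result above 1 speeds the demo up and a result below 1 slows it down, so a very large or very small value is a hint that the demo should be trimmed or the narration rewritten instead.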

Error Handling

import { Audio, staticFile, delayRender, continueRender } from 'remotion';
import { useEffect, useState } from 'react';

export const SafeAudio: React.FC<{ src: string }> = ({ src }) => {
  const [handle] = useState(() => delayRender());
  const [audioReady, setAudioReady] = useState(false);

  useEffect(() => {
    const audio = new window.Audio(src);
    audio.oncanplaythrough = () => {
      setAudioReady(true);
      continueRender(handle);
    };
    audio.onerror = () => {
      console.error(`Failed to load audio: ${src}`);
      continueRender(handle); // Continue without audio rather than hang
    };
  }, [src, handle]);

  if (!audioReady) return null;
  return <Audio src={src} />;
};

Toolkit Command: /generate-voiceover

The /generate-voiceover command handles the full workflow:

/generate-voiceover

1. Reads VOICEOVER-SCRIPT.md
2. Extracts narration for each scene
3. Generates audio via ElevenLabs API
4. Saves to public/audio/scenes/
5. Creates manifest.json with durations
6. Updates project.json with timing info
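
Step 5 can be pictured with a small sketch that assembles a manifest in the shape shown in Step 1. This is not the toolkit's implementation: `build_manifest` is a hypothetical helper, the durations are assumed to have been measured elsewhere, and rounding to one decimal is an assumption based on the sample manifest.

```python
import json


def build_manifest(durations: list[tuple[str, float]]) -> dict:
    """Assemble a manifest dict from (file name, duration in seconds) pairs."""
    scenes = [{"file": name, "duration": round(seconds, 1)} for name, seconds in durations]
    total = round(sum(scene["duration"] for scene in scenes), 1)
    return {"scenes": scenes, "totalDuration": total}


manifest = build_manifest([
    ("scene-01-title.mp3", 4.2),
    ("scene-02-problem.mp3", 12.8),
    ("scene-03-solution.mp3", 15.3),
])
print(json.dumps(manifest, indent=2))
```

Summing the already-rounded scene values keeps totalDuration consistent with the per-scene numbers a reader sees in the file.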

Popular Voices

George: JBFqnCBsd6RMkjVDRZzb (warm narrator)
Rachel: 21m00Tcm4TlvDq8ikWAM (clear female)
Adam: pNInz6obpgDQGcFmaJgB (professional male)

List all: client.voices.get_all()

For full API docs, see reference.md.

ElevenLabs API Reference

Detailed API documentation for ElevenLabs audio generation services.

Authentication

import os

from elevenlabs.client import ElevenLabs

# Read the key from the environment rather than hard-coding it
client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))

Text-to-Speech Models

Model ID	Description	Languages	Latency
`eleven_flash_v2_5`	Ultra-low latency streaming	32	~75ms
`eleven_multilingual_v2`	Highest quality	29	Standard
`eleven_turbo_v2_5`	Fast, good quality	32	Low
`eleven_v3`	Best emotional range (alpha)	32+	Higher

Voice Settings

Parameter	Range	Default	Effect
`stability`	0.0-1.0	0.5	Lower = more expressive/variable
`similarity_boost`	0.0-1.0	0.75	Higher = closer to original voice
`style`	0.0-1.0	0.0	Style exaggeration (v2 models)
`speed`	0.7-1.2	1.0	Speech speed multiplier

Output Formats

Format Code	Sample Rate	Bitrate	Tier Required
`mp3_44100_128`	44.1kHz	128kbps	Free (default)
`mp3_44100_192`	44.1kHz	192kbps	Creator+
`pcm_44100`	44.1kHz	-	Pro+
`ulaw_8000`	8kHz	-	Free (telephony)

Long-form Audio (Stitching)

For continuity across multiple generations:

result1 = client.text_to_speech.convert_with_timestamps(
    text="First paragraph...",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2"
)
request_id_1 = result1.request_id

result2 = client.text_to_speech.convert(
    text="Second paragraph...",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    previous_request_ids=[request_id_1]
)

Professional Voice Cloning (PVC)

Requires Creator plan+. Creates a fine-tuned model (3-6 hours training).

Requirements:

30 min minimum, 2-3 hours optimal audio
Professional XLR mic recommended
Pop filter, ~20cm distance
Peak levels: -6dB to -3dB
Consistent performance style

Workflow:

# 1. Create PVC with samples
pvc = client.voices.create_professional_voice_clone(
    name="My Pro Voice",
    files=["recording1.mp3", "recording2.mp3", ...],
)

# 2. Get verification captcha
captcha = client.voices.get_pvc_verification_captcha(voice_id=pvc.voice_id)
# Read the captcha text aloud and record

# 3. Submit verification
client.voices.verify_pvc(
    voice_id=pvc.voice_id,
    recording=open("captcha_reading.mp3", "rb")
)

# 4. Start training
client.voices.start_pvc_training(voice_id=pvc.voice_id)

Sound Effects Parameters

Parameter	Type	Required	Description
`text`	string	Yes	Description of sound effect
`duration_seconds`	float	No	1-22 seconds (auto if omitted)
`prompt_influence`	float	No	0.0-1.0 (default 0.3)

Billing: 100 chars/generation (auto) or 25 chars/second (fixed duration)
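
As a worked example of the billing rule above, the character cost of a sound effect can be estimated before generating it. The helper name `sfx_character_cost` is hypothetical; the constants come straight from the parameter table and billing line.

```python
from typing import Optional

AUTO_COST = 100        # characters per generation when duration is omitted
COST_PER_SECOND = 25   # characters per second for a fixed duration
MIN_SECONDS, MAX_SECONDS = 1, 22


def sfx_character_cost(duration_seconds: Optional[float] = None) -> float:
    """Estimate the characters billed for one sound-effect generation."""
    if duration_seconds is None:
        return AUTO_COST
    if not MIN_SECONDS <= duration_seconds <= MAX_SECONDS:
        raise ValueError(f"duration must be between {MIN_SECONDS} and {MAX_SECONDS} seconds")
    return COST_PER_SECOND * duration_seconds


print(sfx_character_cost())    # 100
print(sfx_character_cost(10))  # 250
```

Under these numbers, a fixed duration is cheaper than auto only below 4 seconds.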

Example prompts:

Environmental: "Rain on a tin roof, steady and rhythmic"
Action: "Sword being drawn from sheath, metallic ring"
Mechanical: "Old car engine struggling to start then roaring to life"

Music Parameters

Parameter	Type	Required	Description
`prompt`	string	Yes*	Natural language music description
`composition_plan`	object	Yes*	Detailed composition structure
`music_length_ms`	int	No	10000-300000 (10s-5min)
`force_instrumental`	bool	No	Force instrumental output

*Either prompt or composition_plan required, not both.
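
The constraints in this table can be checked before a request is sent. A minimal sketch, assuming the keyword names used in the client.music.compose example in the Music Generation section; `music_kwargs` is a hypothetical helper, not part of the ElevenLabs client.

```python
from typing import Any, Optional

MIN_MS, MAX_MS = 10_000, 300_000  # 10 s to 5 min, per the table above


def music_kwargs(
    prompt: Optional[str] = None,
    composition_plan: Optional[dict] = None,
    music_length_ms: Optional[int] = None,
    force_instrumental: bool = False,
) -> dict[str, Any]:
    """Validate the arguments and return only the ones that should be passed on."""
    # Exactly one of prompt / composition_plan must be given.
    if (prompt is None) == (composition_plan is None):
        raise ValueError("pass exactly one of prompt or composition_plan")
    kwargs: dict[str, Any] = (
        {"prompt": prompt} if prompt is not None else {"composition_plan": composition_plan}
    )
    if music_length_ms is not None:
        if not MIN_MS <= music_length_ms <= MAX_MS:
            raise ValueError(f"music_length_ms must be between {MIN_MS} and {MAX_MS}")
        kwargs["music_length_ms"] = music_length_ms
    if force_instrumental:
        kwargs["force_instrumental"] = True
    return kwargs
```

The result is intended to be unpacked into the compose call, for example `client.music.compose(**music_kwargs(prompt="Lo-fi hip hop, calm", music_length_ms=60000))`.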

Effective prompts include:

1. Genre/Style: "indie rock", "lo-fi hip hop", "orchestral"
2. Mood: "uplifting", "melancholic", "tense"
3. Instruments: "acoustic guitar", "synth pads", "strings"
4. Tempo/Energy: "slow", "upbeat", "driving"
5. Context: "for a travel vlog", "podcast intro"

Rate Limits by Tier

Tier	TTS Concurrent	SFX Concurrent	Music Concurrent
Free	2	2	1
Starter	3	3	2
Creator	5	5	3
Pro	10	10	5
Scale	15	15	10
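
One way to stay under these limits when generating many scenes is to cap the worker pool at the tier's concurrency. The sketch below is an assumption about how a pipeline might do this, not toolkit code; `LIMITS` copies the table above, and `run_with_limit` is a hypothetical helper that takes zero-argument callables.

```python
from concurrent.futures import ThreadPoolExecutor

# Concurrent request limits copied from the table above.
LIMITS = {
    "free": {"tts": 2, "sfx": 2, "music": 1},
    "starter": {"tts": 3, "sfx": 3, "music": 2},
    "creator": {"tts": 5, "sfx": 5, "music": 3},
    "pro": {"tts": 10, "sfx": 10, "music": 5},
    "scale": {"tts": 15, "sfx": 15, "music": 10},
}


def run_with_limit(jobs, tier: str, kind: str) -> list:
    """Run zero-argument callables with at most the tier's concurrency.

    Results are returned in the same order as the jobs were given.
    """
    max_workers = LIMITS[tier][kind]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(job) for job in jobs]
        return [future.result() for future in futures]
```

Each job would typically wrap one generation call, for example a lambda that calls text_to_speech.convert for a single scene and saves the file.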

Voice Management

# List voices
voices = client.voices.get_all()
for voice in voices.voices:
    print(f"{voice.name}: {voice.voice_id}")

# Delete voice
client.voices.delete(voice_id="your_voice_id")

Error Handling

from elevenlabs.core.api_error import ApiError

try:
    audio = client.text_to_speech.convert(...)
except ApiError as e:
    if e.status_code == 429:
        print("Rate limited - wait and retry")
    elif e.status_code == 401:
        print("Invalid API key")

Code	Meaning	Action
401	Invalid API key	Check API key
403	Feature not available	Upgrade tier
422	Invalid parameters	Check request body
429	Rate limited	Wait and retry
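
The "wait and retry" action for 429 can be sketched as exponential backoff. This is a generic illustration rather than the toolkit's behavior: `call_with_retry` is a hypothetical helper, the delays are arbitrary, and it relies only on the exception exposing a `status_code` attribute, as ApiError does in the example above.

```python
import time


def call_with_retry(fn, max_attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(); on a 429 error, wait and retry with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Only rate limits are worth retrying; 401, 403, and 422 need a
            # key, plan, or request fix, so they are re-raised immediately.
            if getattr(exc, "status_code", None) != 429 or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1 s, 2 s, 4 s, ...
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use it can be left at its default.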

Related skills

Remotion Best Practices: Get Remotion-specific coding guidance that prevents common video rendering mistakes when creating animated React videos. (442k · 4.1k)

Remotion Render: Generate high-quality MP4 videos from React code using Remotion inside an AI coding agent. (363k · 648)

Ai Video Generation: Turn written prompts into short videos using AI video generation models directly from Cursor or Claude. (363k · 648)

Ai Avatar Video: Generate short talking-head videos of custom AI avatars from text prompts. (363k · 648)

Ai Image Generation: Let their coding agent generate, iterate on, and insert high-quality images directly into web apps, marketing assets, or product features. (363k · 648)

Video Edit: Intelligently route video editing requests to the best RunComfy model without trial-and-error. (357k · 31)

How it compares

elevenlabs is an audio generation skill for the video toolkit, not a general video editing or motion graphics template pack.

FAQ

Who is elevenlabs for?

elevenlabs is for developers using the Claude Code video toolkit or Remotion projects who need programmatic narration, cloning, sound effects, or music from ElevenLabs APIs.

When should I use elevenlabs?

Use elevenlabs when creating video voiceovers, podcast narration, game audio, described sound effects, or background music, especially when scene-level MP3 files and manifest.json timing should feed Remotion compositions.

Is elevenlabs safe to install?

Review the Security Audits panel on this page, keep ELEVENLABS_API_KEY out of version control, and confirm cloned voice samples comply with your content and licensing policies.

Tags: Generative Media · automation · agents