Voicebox Voice Synthesis

Name: Voicebox Voice Synthesis
Author: aradotso

aradotso/trending-skills

1.4k installs
66 repo stars
Updated July 9, 2026

voicebox-voice-synthesis is an agent skill for Voicebox, the open-source local voice cloning and TTS studio built with Tauri, React, and FastAPI.

About

voicebox-voice-synthesis is an agent skill from aradotso/trending-skills for Voicebox, the open-source local voice cloning and TTS studio built with Tauri, React, and FastAPI. Developers invoke it during idea and discovery work on AI and agent building tasks. The skill documents triggers, prerequisites, and step-by-step workflows grounded in SKILL.md. It is compatible with Claude Code, Cursor, and Codex agent runtimes that load marketplace skills. Review the Security Audits panel on this listing before installing in production environments.


Voicebox Voice Synthesis by the numbers

1,360 all-time installs (skills.sh)
+14 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #846 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

voicebox-voice-synthesis capabilities & compatibility

Use cases: orchestration

From the docs

npx skills add https://github.com/aradotso/trending-skills --skill voicebox-voice-synthesis

Installs	1.4k
repo stars	★ 66
Security audit	3 / 3 scanners passed
Last updated	July 9, 2026
Repository	aradotso/trending-skills ↗

What it does

Expert skill for Voicebox — the open-source local voice cloning and TTS studio built with Tauri, React, and FastAPI

Who is it for?

Developers working on AI and agent building tasks during the idea stage.

Skip if: the task falls outside the AI & Agent Building scope described in SKILL.md.

What you get

A completed AI and agent building workflow aligned with the steps in SKILL.md.

Configured Voicebox instance
Voice clone profiles
Synthesized audio via API

By the numbers

Voicebox stack uses Tauri, React, and FastAPI

Files

SKILL.md	Markdown	GitHub ↗

Voicebox Voice Synthesis Studio

Skill by ara.so — Daily 2026 Skills collection.

Voicebox is a local-first, open-source voice cloning and TTS studio — a self-hosted alternative to ElevenLabs. It runs entirely on your machine (macOS MLX/Metal, Windows/Linux CUDA, CPU fallback), exposes a REST API on localhost:17493, and ships with 5 TTS engines, 23 languages, post-processing effects, and a multi-track Stories editor.

---

Installation

Pre-built Binaries (Recommended)

Platform	Link
macOS Apple Silicon	https://voicebox.sh/download/mac-arm
macOS Intel	https://voicebox.sh/download/mac-intel
Windows	https://voicebox.sh/download/windows
Docker	`docker compose up`

Linux requires building from source: https://voicebox.sh/linux-install

Build from Source

Prerequisites: Bun, Rust, Python 3.11+, Tauri prerequisites

git clone https://github.com/jamiepine/voicebox.git
cd voicebox

# Install the just task runner (pick one)
brew install just        # macOS
cargo install just       # any platform

# Set up Python venv + all dependencies
just setup

# Start backend + desktop app in dev mode
just dev

# List all available commands
just --list

---

Architecture

Layer	Technology
Desktop App	Tauri (Rust)
Frontend	React + TypeScript + Tailwind CSS
State	Zustand + React Query
Backend	FastAPI (Python) on port 17493
TTS Engines	Qwen3-TTS, LuxTTS, Chatterbox, Chatterbox Turbo, TADA
Effects	Pedalboard (Spotify)
Transcription	Whisper / Whisper Turbo
Inference	MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU)
Database	SQLite

The Python FastAPI backend handles all ML inference. The Tauri Rust shell wraps the frontend and manages the backend process lifecycle. The API is accessible directly at http://localhost:17493 even when using the desktop app.

---

REST API Reference

Base URL: http://localhost:17493
Interactive docs: http://localhost:17493/docs
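
Since every endpoint hangs off the same base URL, a client can centralize URL construction. The following is a minimal Python sketch, not part of Voicebox itself: the `endpoint` helper is hypothetical, and reading `VOICEBOX_API_URL` from the environment simply mirrors the convention used by the TypeScript client in this document.

```python
import os
from typing import Optional

DEFAULT_BASE_URL = "http://localhost:17493"


def endpoint(path: str, base_url: Optional[str] = None) -> str:
    """Join a path onto the Voicebox base URL without doubling slashes.

    An explicit base_url wins; otherwise the VOICEBOX_API_URL environment
    variable is used (a client-side convention, assumed here), falling back
    to the documented default.
    """
    base = base_url or os.environ.get("VOICEBOX_API_URL", DEFAULT_BASE_URL)
    return base.rstrip("/") + "/" + path.lstrip("/")
```

For example, `endpoint("/docs", "http://localhost:17493")` yields the interactive docs URL listed above.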

Generate Speech

# Basic generation
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello world, this is a voice clone.",
    "profile_id": "abc123",
    "language": "en"
  }'

# With engine selection
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Speak slowly and with gravitas.",
    "profile_id": "abc123",
    "language": "en",
    "engine": "qwen3-tts"
  }'

# With paralinguistic tags (Chatterbox Turbo only)
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "That is absolutely hilarious! [laugh] I cannot believe it.",
    "profile_id": "abc123",
    "engine": "chatterbox-turbo",
    "language": "en"
  }'
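
The same request bodies can be assembled in Python before sending. This is a hedged sketch: the field names and engine identifiers are taken from the examples in this document, while the `build_generate_payload` helper and its validation rules are illustrative assumptions rather than anything the API enforces.

```python
import json

# Engine identifiers as they appear elsewhere in this document.
ENGINES = {"qwen3-tts", "luxtts", "chatterbox", "chatterbox-turbo", "tada"}


def build_generate_payload(text, profile_id, language="en", engine=None):
    """Return a dict shaped like the JSON bodies in the curl examples."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if engine is not None and engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    payload = {"text": text, "profile_id": profile_id, "language": language}
    if engine is not None:
        # Omit the key entirely when no engine is chosen, as in the basic example.
        payload["engine"] = engine
    return payload


def to_json_body(payload):
    """Encode the payload as UTF-8 JSON bytes for an HTTP POST."""
    return json.dumps(payload).encode("utf-8")
```

Validating the engine name on the client side catches typos before a request is queued.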

Voice Profiles

# List all profiles
curl http://localhost:17493/profiles

# Create a new profile
curl -X POST http://localhost:17493/profiles \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Narrator",
    "language": "en",
    "description": "Deep narrative voice"
  }'

# Upload audio sample to a profile
curl -X POST http://localhost:17493/profiles/{profile_id}/samples \
  -F "file=@/path/to/voice-sample.wav"

# Export a profile
curl http://localhost:17493/profiles/{profile_id}/export \
  --output narrator-profile.zip

# Import a profile
curl -X POST http://localhost:17493/profiles/import \
  -F "file=@narrator-profile.zip"
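
The `-F "file=@..."` uploads above send a multipart form. For readers without a multipart-capable HTTP library, here is a rough standard-library sketch of what such a request looks like. The helper names are hypothetical, the `audio/wav` content type is an assumption, and `upload_sample` assumes the samples endpoint accepts a single `file` field as in the curl example.

```python
import os
import urllib.request
import uuid


def encode_multipart_file(field_name, filename, data,
                          content_type="audio/wav", boundary=None):
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_sample(profile_id, path, base_url="http://localhost:17493"):
    """POST a local audio file to the profile's samples endpoint."""
    with open(path, "rb") as fh:
        body, content_type = encode_multipart_file(
            "file", os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        f"{base_url}/profiles/{profile_id}/samples",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status
```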

Generation Queue & Status

# Get generation status (SSE stream)
curl -N http://localhost:17493/generate/{generation_id}/status

# List recent generations
curl http://localhost:17493/generations

# Retry a failed generation
curl -X POST http://localhost:17493/generations/{generation_id}/retry

# Download generated audio
curl http://localhost:17493/generations/{generation_id}/audio \
  --output output.wav
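
The status endpoint is described as an SSE stream, so a non-browser client has to parse `data:` lines itself. The sketch below handles the basic event framing only. It assumes each event carries a JSON object with a `status` field, as the TypeScript example in this document does, and `iter_sse_data` is a hypothetical helper name.

```python
import json


def iter_sse_data(lines):
    """Yield the decoded JSON payload of each event in an iterable of text lines.

    Events are separated by blank lines; lines starting with ":" are comments
    (often keep-alives) and are ignored, as are other SSE fields.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format strips a single leading space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            yield json.loads("\n".join(buffer))
            buffer = []
    if buffer:  # the stream ended without a trailing blank line
        yield json.loads("\n".join(buffer))
```

The function accepts any iterable of decoded lines, so it can be fed from a streamed HTTP response or from a list in a unit test.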

Models

# List available models and download status
curl http://localhost:17493/models

# Unload a model from GPU memory (without deleting)
curl -X POST http://localhost:17493/models/{model_id}/unload

---

TypeScript/JavaScript Integration

Basic TTS Client

const VOICEBOX_URL = process.env.VOICEBOX_API_URL ?? "http://localhost:17493";

interface GenerateRequest {
  text: string;
  profile_id: string;
  language?: string;
  engine?: "qwen3-tts" | "luxtts" | "chatterbox" | "chatterbox-turbo" | "tada";
}

interface GenerateResponse {
  generation_id: string;
  status: "queued" | "processing" | "complete" | "failed";
  audio_url?: string;
}

async function generateSpeech(req: GenerateRequest): Promise<GenerateResponse> {
  const response = await fetch(`${VOICEBOX_URL}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });

  if (!response.ok) {
    throw new Error(`Voicebox API error: ${response.status} ${await response.text()}`);
  }

  return response.json();
}

// Usage
const result = await generateSpeech({
  text: "Welcome to our application.",
  profile_id: "abc123",
  language: "en",
  engine: "qwen3-tts",
});

console.log("Generation ID:", result.generation_id);

Poll for Completion

async function waitForGeneration(
  generationId: string,
  timeoutMs = 60_000
): Promise<string> {
  const start = Date.now();

  while (Date.now() - start < timeoutMs) {
    const res = await fetch(`${VOICEBOX_URL}/generations/${generationId}`);
    if (!res.ok) {
      throw new Error(`Status check failed: ${res.status}`);
    }
    const data = await res.json();

    if (data.status === "complete") {
      return `${VOICEBOX_URL}/generations/${generationId}/audio`;
    }
    if (data.status === "failed") {
      throw new Error(`Generation failed: ${data.error}`);
    }

    await new Promise((r) => setTimeout(r, 1000));
  }

  throw new Error("Generation timed out");
}

Stream Status with SSE

function streamGenerationStatus(
  generationId: string,
  onStatus: (status: string) => void
): () => void {
  const eventSource = new EventSource(
    `${VOICEBOX_URL}/generate/${generationId}/status`
  );

  eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    onStatus(data.status);

    if (data.status === "complete" || data.status === "failed") {
      eventSource.close();
    }
  };

  eventSource.onerror = () => eventSource.close();

  // Return cleanup function
  return () => eventSource.close();
}

// Usage
const cleanup = streamGenerationStatus("gen_abc123", (status) => {
  console.log("Status update:", status);
});

Download Audio as Blob

async function downloadAudio(generationId: string): Promise<Blob> {
  const response = await fetch(
    `${VOICEBOX_URL}/generations/${generationId}/audio`
  );

  if (!response.ok) {
    throw new Error(`Failed to download audio: ${response.status}`);
  }

  return response.blob();
}

// Play in browser
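async function playGeneratedAudio(generationId: string): Promise<void> {
  const blob = await downloadAudio(generationId);
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  // Register cleanup before playback starts so the object URL is always released
  audio.onended = () => URL.revokeObjectURL(url);
  try {
    await audio.play(); // play() returns a promise that rejects if playback is blocked
  } catch (err) {
    URL.revokeObjectURL(url);
    throw err;
  }
}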

---

Python Integration

import httpx
import asyncio

VOICEBOX_URL = "http://localhost:17493"

async def generate_speech(
    text: str,
    profile_id: str,
    language: str = "en",
    engine: str = "qwen3-tts"
) -> bytes:
    async with httpx.AsyncClient(timeout=120.0) as client:
        # Submit generation
        resp = await client.post(
            f"{VOICEBOX_URL}/generate",
            json={
                "text": text,
                "profile_id": profile_id,
                "language": language,
                "engine": engine,
            }
        )
        resp.raise_for_status()
        generation_id = resp.json()["generation_id"]

        # Poll until complete
        for _ in range(120):
            status_resp = await client.get(
                f"{VOICEBOX_URL}/generations/{generation_id}"
            )
            status_resp.raise_for_status()
            status_data = status_resp.json()

            if status_data["status"] == "complete":
                audio_resp = await client.get(
                    f"{VOICEBOX_URL}/generations/{generation_id}/audio"
                )
                audio_resp.raise_for_status()
                return audio_resp.content

            if status_data["status"] == "failed":
                raise RuntimeError(f"Generation failed: {status_data.get('error')}")

            await asyncio.sleep(1.0)

        raise TimeoutError("Generation timed out after 120s")


# Usage
audio_bytes = asyncio.run(
    generate_speech(
        text="The quick brown fox jumps over the lazy dog.",
        profile_id="your-profile-id",
        language="en",
        engine="chatterbox",
    )
)

with open("output.wav", "wb") as f:
    f.write(audio_bytes)
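
The fixed one-second polling loop above can be factored into a reusable helper. The sketch below is one possible shape, not the project's own client: `poll_generation` is a hypothetical name, the exponential backoff is an added design choice, and the status values (`complete`, `failed`) are the ones used in this document. The status fetcher is injected so the logic can be exercised without a running backend.

```python
import time


def poll_generation(fetch_status, max_wait_s=120.0, initial_delay_s=0.5,
                    max_delay_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the generation completes, fails, or times out.

    fetch_status must return a dict with a "status" key. The delay doubles
    after each attempt, capped at max_delay_s.
    """
    deadline = clock() + max_wait_s
    delay = initial_delay_s
    while True:
        data = fetch_status()
        status = data.get("status")
        if status == "complete":
            return data
        if status == "failed":
            raise RuntimeError(f"Generation failed: {data.get('error')}")
        if clock() + delay > deadline:
            raise TimeoutError(f"Generation timed out after {max_wait_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)
```

Backing off reduces request volume for long generations while still reacting quickly to short ones.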

---

TTS Engine Selection Guide

Engine	Best For	Languages	VRAM	Notes
`qwen3-tts` (0.6B/1.7B)	Quality + instructions	10	Medium	Supports delivery instructions in text
`luxtts`	Fast CPU generation	English only	~1GB	150x realtime on CPU, 48kHz
`chatterbox`	Multilingual coverage	23	Medium	Arabic, Hindi, Swahili, CJK + more
`chatterbox-turbo`	Expressive/emotion	English only	Low (350M)	Use `[laugh]`, `[sigh]`, `[gasp]` tags
`tada` (1B/3B)	Long-form coherence	10	High	700s+ audio, HumeAI model
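
The table can be turned into a simple selection heuristic. The following sketch is an interpretation, not an official recommendation: the `pick_engine` function and its priority order are assumptions, and because the table does not say which 10 languages `qwen3-tts` and `tada` cover, any non-English request falls back to `chatterbox`.

```python
def pick_engine(language="en", *, expressive=False, long_form=False, cpu_only=False):
    """Suggest an engine name based on the selection table.

    The priority order (language, then expressiveness, then hardware,
    then length) is an assumption made for this sketch.
    """
    is_english = language.lower().startswith("en")
    if not is_english:
        # luxtts and chatterbox-turbo are English only; chatterbox lists
        # the broadest language coverage (23).
        return "chatterbox"
    if expressive:
        return "chatterbox-turbo"  # supports [laugh], [sigh], [gasp] tags
    if cpu_only:
        return "luxtts"  # described as fast on CPU
    if long_form:
        return "tada"  # described as best for long-form coherence
    return "qwen3-tts"
```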

Delivery Instructions (Qwen3-TTS)

Embed natural language instructions directly in the text:

await generateSpeech({
  text: "(whisper) I have a secret to tell you.",
  profile_id: "abc123",
  engine: "qwen3-tts",
});

await generateSpeech({
  text: "(speak slowly and clearly) Step one: open the application.",
  profile_id: "abc123",
  engine: "qwen3-tts",
});
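
When instructions come from user input or configuration, a small helper keeps the parenthesized prefix consistent. This is an illustrative sketch: `with_instruction` is a hypothetical name, and the format it produces simply copies the `(instruction) text` pattern shown above.

```python
def with_instruction(text, instruction):
    """Prefix text with a parenthesized delivery instruction.

    Surrounding whitespace and any parentheses the caller already added
    are stripped first; an empty instruction leaves the text unchanged.
    """
    instruction = instruction.strip().strip("()").strip()
    if not instruction:
        return text
    return f"({instruction}) {text.lstrip()}"
```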

Paralinguistic Tags (Chatterbox Turbo)

const tags = [
  "[laugh]", "[chuckle]", "[gasp]", "[cough]",
  "[sigh]", "[groan]", "[sniff]", "[shush]", "[clear throat]"
];

await generateSpeech({
  text: "Oh really? [gasp] I had no idea! [laugh] That's incredible.",
  profile_id: "abc123",
  engine: "chatterbox-turbo",
});
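
A quick check for unrecognized tags can catch typos before text is sent. The sketch below treats the tag list above as the full set, which is an assumption, and `unknown_tags` is a hypothetical helper rather than part of the API.

```python
import re

# Tags listed in this document for Chatterbox Turbo.
KNOWN_TAGS = {
    "laugh", "chuckle", "gasp", "cough",
    "sigh", "groan", "sniff", "shush", "clear throat",
}
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def unknown_tags(text):
    """Return bracketed tags in text that are not in KNOWN_TAGS, in order."""
    return [
        tag for tag in TAG_RE.findall(text)
        if tag.strip().lower() not in KNOWN_TAGS
    ]
```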

---

Environment & Configuration

# Custom models directory (set before launching)
export VOICEBOX_MODELS_DIR=/path/to/models

# For AMD ROCm GPU (auto-configured, but can override)
export HSA_OVERRIDE_GFX_VERSION=11.0.0

Docker configuration (docker-compose.yml override):

services:
  voicebox:
    environment:
      - VOICEBOX_MODELS_DIR=/models
    volumes:
      - /host/models:/models
    ports:
      - "17493:17493"
    # For NVIDIA GPU passthrough:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

---

Common Patterns

Voice Profile Creation Flow

// 1. Create profile
const profile = await fetch(`${VOICEBOX_URL}/profiles`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ name: "My Voice", language: "en" }),
}).then((r) => r.json());

// 2. Upload audio sample (WAV/MP3, ideally 5–30 seconds clean speech)
const formData = new FormData();
formData.append("file", audioBlob, "sample.wav");

await fetch(`${VOICEBOX_URL}/profiles/${profile.id}/samples`, {
  method: "POST",
  body: formData,
});

// 3. Generate with the new profile
const gen = await generateSpeech({
  text: "Testing my cloned voice.",
  profile_id: profile.id,
});

Batch Generation with Queue

async function batchGenerate(
  items: Array<{ text: string; profileId: string }>,
  engine: GenerateRequest["engine"] = "qwen3-tts"
): Promise<string[]> {
  // Submit all — Voicebox queues them serially to avoid GPU contention
  const submissions = await Promise.all(
    items.map((item) =>
      generateSpeech({ text: item.text, profile_id: item.profileId, engine })
    )
  );

  // Wait for all completions
  const audioUrls = await Promise.all(
    submissions.map((s) => waitForGeneration(s.generation_id))
  );

  return audioUrls;
}
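
The same submit-then-wait pattern can be written in Python with `asyncio`. In this sketch the `submit` and `wait` coroutines are injected, so they are placeholders for whatever client functions you use. The `max_in_flight` cap is an added assumption that limits how many requests the client holds open; it is not something the backend is documented to require, since it queues generations serially.

```python
import asyncio


async def batch_generate(items, submit, wait, max_in_flight=4):
    """Submit every item and wait for its result, preserving input order.

    submit(item) must return a generation id; wait(generation_id) must
    return the finished result (for example an audio URL).
    """
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(item):
        async with semaphore:
            generation_id = await submit(item)
            return await wait(generation_id)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(run_one(item) for item in items))
```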

Long-Form Text (Auto-Chunking)

Voicebox auto-chunks at sentence boundaries — just send the full text:

// Up to 50,000 characters supported
const longScript = `
  Chapter one. The morning fog rolled across the valley floor...
`;

await generateSpeech({
  text: longScript,
  profile_id: "narrator-profile-id",
  engine: "tada", // Best for long-form coherence
  language: "en",
});
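
To build intuition for what chunking at sentence boundaries means, here is a rough Python sketch. It is not Voicebox's actual algorithm: the regular expression, the `max_chunk_chars` budget, and the `chunk_sentences` name are all assumptions, and only the 50,000-character limit comes from this document.

```python
import re

MAX_CHARS = 50_000  # upper bound mentioned in this document
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_sentences(text, max_chunk_chars=400):
    """Group whole sentences into chunks of at most max_chunk_chars.

    A single sentence longer than the budget is kept intact as its own chunk.
    """
    if len(text) > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters")
    chunks, current = [], ""
    for sentence in SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chunk_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because the server already does this, client-side chunking like the above is only useful for previewing how a script might be split or for enforcing the length limit before submitting.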

---

Troubleshooting

API not responding

# Check if backend is running
curl http://localhost:17493/health

# Restart backend only (dev mode)
just backend

# Check logs
just logs
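
The health check can also be scripted, for example as a guard before submitting work. This standard-library sketch assumes that `/health` answers with a 2xx status when the backend is up. The `is_backend_up` name and the injectable `opener` parameter are illustrative choices that make the function testable without a server.

```python
import urllib.request


def is_backend_up(base_url="http://localhost:17493", timeout=2.0,
                  opener=urllib.request.urlopen):
    """Return True if GET {base_url}/health answers with a 2xx status."""
    try:
        with opener(f"{base_url}/health", timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, refused connections, and socket timeouts
        # are all OSError subclasses.
        return False
```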

GPU not detected

# Check detected backend
curl http://localhost:17493/system/info

# Force CPU mode (set before launch)
export VOICEBOX_FORCE_CPU=1

Model download fails / slow

# Set custom models directory with more space
export VOICEBOX_MODELS_DIR=/path/with/space
just dev

# Cancel stuck download via API
curl -X DELETE http://localhost:17493/models/{model_id}/download

Out of VRAM — unload models

# List loaded models
curl http://localhost:17493/models | jq '.[] | select(.loaded == true)'

# Unload specific model
curl -X POST http://localhost:17493/models/{model_id}/unload
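
The jq filter and unload call can be combined in a script. The sketch below assumes, as the jq filter does, that `/models` returns a JSON array of objects with a boolean `loaded` field. The `id` field name and both helper names are additional assumptions, so check the interactive docs for the real response shape.

```python
def loaded_model_ids(models):
    """Return the ids of models whose 'loaded' flag is true."""
    return [model["id"] for model in models if model.get("loaded") is True]


def unload_urls(models, base_url="http://localhost:17493"):
    """Build the POST URLs that would unload every currently loaded model."""
    return [
        f"{base_url}/models/{model_id}/unload"
        for model_id in loaded_model_ids(models)
    ]
```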

Audio quality issues

Use 5–30 seconds of clean, noise-free speech for voice samples
Multiple samples improve clone quality — upload 3–5 different sentences
For multilingual cloning, use chatterbox engine
Ensure sample audio is 16kHz+ mono WAV for best results
Use luxtts for the highest output sample rate (48kHz) in English
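
The sample guidelines above can be checked automatically before upload. This sketch uses Python's standard `wave` module, so it only handles uncompressed WAV files. The thresholds come from the list above, and the `check_voice_sample` name and message wording are assumptions.

```python
import wave


def check_voice_sample(wav_source):
    """Return a list of guideline problems for a WAV sample (empty if none).

    wav_source may be a path string or a binary file object.
    """
    with wave.open(wav_source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(rate)

    problems = []
    if not 5.0 <= duration <= 30.0:
        problems.append(f"duration {duration:.1f}s is outside 5-30 seconds")
    if rate < 16000:
        problems.append(f"sample rate {rate} Hz is below 16 kHz")
    if channels != 1:
        problems.append(f"{channels} channels found, mono expected")
    return problems
```

An empty list means the file meets the duration, sample rate, and channel guidelines; it says nothing about background noise, which still needs a listen.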

Generation stuck in queue after crash

Voicebox auto-recovers stale generations on startup. If the issue persists:

curl -X POST http://localhost:17493/generations/{generation_id}/retry

---

Frontend Integration (React Example)

import { useState } from "react";

const VOICEBOX_URL = import.meta.env.VITE_VOICEBOX_URL ?? "http://localhost:17493";

export function VoiceGenerator({ profileId }: { profileId: string }) {
  const [text, setText] = useState("");
  const [audioUrl, setAudioUrl] = useState<string | null>(null);
  const [loading, setLoading] = useState(false);

  const handleGenerate = async () => {
    setLoading(true);
    try {
      const res = await fetch(`${VOICEBOX_URL}/generate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, profile_id: profileId, language: "en" }),
      });
      if (!res.ok) {
        throw new Error(`Voicebox API error: ${res.status}`);
      }
      const { generation_id } = await res.json();

      // Poll for completion, giving up after about two minutes
      for (let attempt = 0; attempt < 120; attempt++) {
        await new Promise((r) => setTimeout(r, 1000));
        const statusRes = await fetch(`${VOICEBOX_URL}/generations/${generation_id}`);
        const { status } = await statusRes.json();
        if (status === "complete") {
          setAudioUrl(`${VOICEBOX_URL}/generations/${generation_id}/audio`);
          return;
        }
        if (status === "failed") {
          throw new Error("Generation failed");
        }
      }
      throw new Error("Generation timed out");
    } catch (err) {
      // Surface the failure instead of leaving an unhandled promise rejection
      console.error(err);
    } finally {
      setLoading(false);
    }
  };

  return (
    <div>
      <textarea value={text} onChange={(e) => setText(e.target.value)} />
      <button onClick={handleGenerate} disabled={loading}>
        {loading ? "Generating..." : "Generate Speech"}
      </button>
      {audioUrl && <audio controls src={audioUrl} />}
    </div>
  );
}