Omnivoice Tts

Name: Omnivoice Tts
Author: aradotso

aradotso/trending-skills

598 installs
66 repo stars
Updated July 9, 2026

omnivoice-tts is a generative media agent skill that integrates the OmniVoice zero-shot TTS model for developers who need multilingual speech synthesis and instant voice cloning in Python agents or scripts.

About

OmniVoice TTS is an expert agent skill that turns text into high-quality speech across more than 600 languages using a state-of-the-art diffusion language model. It enables instant voice cloning from just a few seconds of reference audio, text-driven voice design, and efficient batch generation. Designed for Python-based agents and scripts, it gives solo builders the ability to add realistic multilingual voice output to products, prototypes, accessibility features, or interactive experiences without needing large training datasets or complex pipelines.

Supports 600+ languages with zero-shot TTS
Voice cloning from a short reference audio sample
Voice design using simple text attribute prompts
Batch inference with RTF as low as 0.025
Works on CUDA, Apple Silicon MPS, or CPU
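
Real-time factor (RTF) is the time spent synthesizing divided by the duration of the audio produced, so lower is faster. As a rough illustration of what the quoted 0.025 figure implies (the helper names below are hypothetical and not part of OmniVoice):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Time needed to generate `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


# At the best-case RTF of 0.025, one minute of speech takes roughly 1.5 s.
print(synthesis_time(60.0, 0.025))
```

The 0.025 figure is quoted for batch inference, so single-utterance runs may well be slower.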

Omnivoice Tts by the numbers

598 all-time installs (skills.sh)
+10 installs in the week ending Jun 27, 2026 (Skillselion tracking)
Ranked #365 of 1,340 Generative Media skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 19, 2026 (Skillselion catalog sync)

npx skills add https://github.com/aradotso/trending-skills --skill omnivoice-tts

Installs	598
Repo stars	66
Security audit	3 / 3 scanners passed
Last updated	July 9, 2026
Repository	aradotso/trending-skills

How do you add multilingual voice cloning to agents?

Generate natural-sounding speech in 600+ languages, with instant voice cloning, directly from your agents or scripts.

Who is it for?

Developers building Python agents, automation scripts, or apps that need zero-shot multilingual TTS with voice cloning across hundreds of languages.

Skip if: you need real-time streaming TTS with sub-100 ms latency, on-device mobile speech synthesis, or a simple cloud TTS service covering only a handful of languages.

When should I use this skill?

User asks to clone a voice with OmniVoice, generate multilingual TTS in Python, or run zero-shot voice design batch inference.

What you get

OmniVoice Python inference scripts, cloned voice audio files, batch-generated speech outputs, and agent-integrated TTS calls.

TTS inference scripts
Cloned voice audio files
Batch-generated speech outputs

By the numbers

Supports 600+ languages for zero-shot TTS
Includes voice cloning and voice design capabilities

Files

SKILL.md (Markdown)

OmniVoice TTS Skill

Skill by ara.so — Daily 2026 Skills collection.

OmniVoice is a state-of-the-art zero-shot TTS model supporting 600+ languages, built on a diffusion language model-style architecture. It supports voice cloning (from reference audio), voice design (via text attributes), and auto voice generation with RTF as low as 0.025.

---

Installation

Requirements

Python 3.9+
PyTorch 2.8+
CUDA (recommended), Apple Silicon (MPS), or CPU

pip (recommended)

# Step 1: Install PyTorch for your platform

# NVIDIA GPU (CUDA 12.8)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

# Apple Silicon
pip install torch==2.8.0 torchaudio==2.8.0

# Step 2: Install OmniVoice
pip install omnivoice

# Or from source (latest)
pip install git+https://github.com/k2-fsa/OmniVoice.git

# Or editable dev install
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .

uv

git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync
# With mirror: uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"

HuggingFace Mirror (if blocked)

export HF_ENDPOINT="https://hf-mirror.com"
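
The same mirror setting can be applied from inside a Python script instead of the shell. This is a minimal sketch, assuming the variable is set before `huggingface_hub` (and therefore `omnivoice`) is imported, since the endpoint is read at import time; `use_hf_mirror` is a hypothetical helper name.

```python
import os
from typing import MutableMapping

HF_MIRROR = "https://hf-mirror.com"


def use_hf_mirror(env: MutableMapping[str, str] = os.environ,
                  url: str = HF_MIRROR) -> str:
    """Set HF_ENDPOINT unless one is already configured; return the value in effect."""
    return env.setdefault("HF_ENDPOINT", url)


use_hf_mirror()
# Import omnivoice only after this point so the mirror is picked up.
```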

---

Core Concepts

Mode	What you provide	Use case
Voice Cloning	`ref_audio` + `ref_text`	Clone a speaker from a short audio clip
Voice Design	`instruct` string	Describe speaker attributes (no audio needed)
Auto Voice	nothing extra	Model picks a random voice
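
The table above boils down to a simple rule on which arguments are present. A small sketch of that rule follows; `pick_mode` is a hypothetical helper, and rejecting the combination of `ref_audio` and `instruct` is an assumption made here because this page does not describe how the two interact.

```python
from typing import Optional


def pick_mode(ref_audio: Optional[str] = None,
              instruct: Optional[str] = None) -> str:
    """Name the generation mode implied by the arguments provided."""
    if ref_audio and instruct:
        # Assumption: combining both is not documented here, so refuse to guess.
        raise ValueError("pass either ref_audio or instruct, not both")
    if ref_audio:
        return "voice cloning"
    if instruct:
        return "voice design"
    return "auto voice"


print(pick_mode(ref_audio="ref.wav"))
print(pick_mode(instruct="female, low pitch"))
print(pick_mode())
```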

---

Python API

Load the Model

from omnivoice import OmniVoice
import torch
import torchaudio

# NVIDIA GPU
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16
)

# Apple Silicon
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="mps",
    dtype=torch.float16
)

# CPU (slower)
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cpu",
    dtype=torch.float32
)
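
Rather than hard-coding one of the three variants, a script can choose the load arguments at runtime. The sketch below mirrors the device and dtype pairings shown above; `pick_load_args` is a hypothetical helper that takes plain booleans so it runs without PyTorch installed.

```python
def pick_load_args(has_cuda: bool, has_mps: bool) -> dict:
    """Return device_map and dtype name following the pairings shown above."""
    if has_cuda:
        return {"device_map": "cuda:0", "dtype": "float16"}
    if has_mps:
        return {"device_map": "mps", "dtype": "float16"}
    # CPU fallback uses full precision, as in the CPU example.
    return {"device_map": "cpu", "dtype": "float32"}


print(pick_load_args(has_cuda=False, has_mps=True))
```

With PyTorch available, the booleans would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the dtype name can be resolved with `getattr(torch, args["dtype"])` before passing it to `from_pretrained`.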

Voice Cloning

# With manual reference transcription (faster, more accurate)
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

# Without ref_text — Whisper auto-transcribes ref_audio
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
)

# audio is a list of torch.Tensor, shape (1, T) at 24kHz
torchaudio.save("out.wav", audio[0], 24000)

Voice Design

# Describe speaker via comma-separated attributes
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
torchaudio.save("out.wav", audio[0], 24000)

Supported attributes:

Gender: male, female
Age: child, young, middle-aged, elderly
Pitch: very low pitch, low pitch, high pitch, very high pitch
Style: whisper
English accents: american accent, british accent, australian accent, etc.
Chinese dialects: 四川话, 陕西话, etc.
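
Because `instruct` is free text, a typo in an attribute is easy to miss. The following sketch assembles the comma-separated string and checks the closed categories listed above; `build_instruct` is a hypothetical helper, and accents, dialects, and styles are passed through unchecked because the lists above end in "etc.".

```python
from typing import Iterable, Optional

GENDERS = {"male", "female"}
AGES = {"child", "young", "middle-aged", "elderly"}
PITCHES = {"very low pitch", "low pitch", "high pitch", "very high pitch"}


def build_instruct(gender: Optional[str] = None,
                   age: Optional[str] = None,
                   pitch: Optional[str] = None,
                   extras: Iterable[str] = ()) -> str:
    """Join validated attributes into a comma-separated instruct string."""
    parts = []
    checks = ((gender, GENDERS, "gender"), (age, AGES, "age"), (pitch, PITCHES, "pitch"))
    for value, allowed, label in checks:
        if value is None:
            continue
        if value not in allowed:
            raise ValueError(f"unknown {label}: {value!r}")
        parts.append(value)
    # Accents, dialects, and styles such as "whisper" are not validated here.
    parts.extend(extras)
    return ", ".join(parts)


print(build_instruct("female", pitch="low pitch", extras=["british accent"]))
```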

Auto Voice

audio = model.generate(text="This is a sentence without any voice prompt.")
torchaudio.save("out.wav", audio[0], 24000)

Generation Parameters

audio = model.generate(
    text="Hello world.",
    ref_audio="ref.wav",
    ref_text="Reference text.",
    num_step=32,      # diffusion steps; use 16 for faster (slightly lower quality)
    speed=1.2,        # speaking rate multiplier (>1 faster, <1 slower)
    duration=8.0,     # fix output duration in seconds (overrides speed)
)
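
Since `duration` overrides `speed`, passing both hides the fact that one is ignored. A minimal sketch that makes the precedence explicit before calling `model.generate` is shown here; `generation_kwargs` is a hypothetical helper, and the positivity checks are an assumption rather than documented behavior.

```python
from typing import Optional


def generation_kwargs(text: str,
                      num_step: int = 32,
                      speed: Optional[float] = None,
                      duration: Optional[float] = None) -> dict:
    """Build keyword arguments, keeping only the timing control that takes effect."""
    if num_step <= 0:
        raise ValueError("num_step must be positive")
    kwargs = {"text": text, "num_step": num_step}
    if duration is not None:
        if duration <= 0:
            raise ValueError("duration must be positive")
        kwargs["duration"] = duration  # duration wins, so speed is dropped
    elif speed is not None:
        if speed <= 0:
            raise ValueError("speed must be positive")
        kwargs["speed"] = speed
    return kwargs


print(generation_kwargs("Hello world.", speed=1.2, duration=8.0))
```

The resulting dictionary can be unpacked into the call, for example `model.generate(**generation_kwargs("Hello world.", speed=1.2))`.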

Non-Verbal Symbols

# Insert expressive non-verbal sounds inline
audio = model.generate(
    text="[laughter] You really got me. I didn't see that coming at all."
)

Supported tags: [laughter], [sigh], [confirmation-en], [question-en], [question-ah], [question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo], [dissatisfaction-hnn]
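
A misspelled tag would presumably be read as ordinary text, so it can help to check input before synthesis. The sketch below flags lowercase bracketed tags that are not in the list above; `unknown_tags` is a hypothetical helper, and uppercase brackets are skipped on purpose because that form is used for pronunciation overrides.

```python
import re

SUPPORTED_TAGS = {
    "laughter", "sigh", "confirmation-en", "question-en", "question-ah",
    "question-oh", "question-ei", "question-yi", "surprise-ah",
    "surprise-oh", "surprise-wa", "surprise-yo", "dissatisfaction-hnn",
}

# Lowercase words with optional hyphens inside square brackets.
_TAG = re.compile(r"\[([a-z][a-z-]*)\]")


def unknown_tags(text: str) -> list:
    """Return bracketed lowercase tags that are not in the supported list."""
    return [tag for tag in _TAG.findall(text) if tag not in SUPPORTED_TAGS]


print(unknown_tags("[laugh] You really got me. [sigh]"))
```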

Pronunciation Control

# Chinese: pinyin with tone numbers (inline, uppercase)
audio = model.generate(
    text="这批货物打ZHE2出售后他严重SHE2本了，再也经不起ZHE1腾了。"
)

# English: CMU dict pronunciation in brackets (uppercase)
audio = model.generate(
    text="You could probably still make [IH1 T] look good."
)
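
Both override formats are plain string conventions, so they can be produced programmatically. A short sketch under stated assumptions: `pinyin_token` and `cmu_bracket` are hypothetical helpers, and accepting tone numbers 1 to 5 (with 5 as the neutral tone) is an assumption not confirmed by this page.

```python
from typing import Iterable


def pinyin_token(syllable: str, tone: int) -> str:
    """Format a pinyin override as uppercase letters plus a tone number."""
    if not syllable.isalpha():
        raise ValueError("syllable must contain letters only")
    if tone not in (1, 2, 3, 4, 5):  # assumption: 5 marks the neutral tone
        raise ValueError("tone must be between 1 and 5")
    return f"{syllable.upper()}{tone}"


def cmu_bracket(phones: Iterable[str]) -> str:
    """Format CMU dict phones as an uppercase, space-separated bracket."""
    return "[" + " ".join(p.upper() for p in phones) + "]"


text_zh = "这批货物打" + pinyin_token("zhe", 2) + "出售"
text_en = f"You could probably still make {cmu_bracket(['ih1', 't'])} look good."
print(text_zh)
print(text_en)
```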

---

CLI Tools

Web Demo

omnivoice-demo --ip 0.0.0.0 --port 8001
omnivoice-demo --help  # all options

Single Inference

# Voice Cloning (ref_text optional; omit for Whisper auto-transcription)
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --ref_audio ref.wav \
    --ref_text "Transcription of the reference audio." \
    --output hello.wav

# Voice Design
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --instruct "male, British accent" \
    --output hello.wav

# Auto Voice
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --output hello.wav
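
When an agent drives the CLI rather than the Python API, building the argument list in code avoids shell-quoting mistakes. This sketch uses only the flags shown above; `infer_command` is a hypothetical helper and does not run anything itself.

```python
import shlex
from typing import List, Optional


def infer_command(text: str,
                  output: str,
                  model: str = "k2-fsa/OmniVoice",
                  ref_audio: Optional[str] = None,
                  ref_text: Optional[str] = None,
                  instruct: Optional[str] = None) -> List[str]:
    """Build an omnivoice-infer argument list from the flags shown above."""
    cmd = ["omnivoice-infer", "--model", model, "--text", text]
    if ref_audio:
        cmd += ["--ref_audio", ref_audio]
    if ref_text:
        cmd += ["--ref_text", ref_text]
    if instruct:
        cmd += ["--instruct", instruct]
    cmd += ["--output", output]
    return cmd


cmd = infer_command("This is a test for text to speech.", "hello.wav",
                    instruct="male, British accent")
print(shlex.join(cmd))  # shell-safe rendering for logs
```

The list can be passed to `subprocess.run(cmd, check=True)` once the package is installed.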

Batch Inference (Multi-GPU)

omnivoice-infer-batch \
    --model k2-fsa/OmniVoice \
    --test_list test.jsonl \
    --res_dir results/

JSONL format (test.jsonl):

{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"}
{"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"}
{"id": "sample_003", "text": "Auto voice example"}
{"id": "sample_004", "text": "Speed controlled", "ref_audio": "/path/to/ref.wav", "speed": 1.2}
{"id": "sample_005", "text": "Duration fixed", "ref_audio": "/path/to/ref.wav", "duration": 10.0}
{"id": "sample_006", "text": "With language hint", "ref_audio": "/path/to/ref.wav", "language_id": "en", "language_name": "English"}

JSONL field reference:

Field	Required	Description
`id`	✅	Unique identifier
`text`	✅	Text to synthesize
`ref_audio`	❌	Path to reference audio (voice cloning)
`ref_text`	❌	Transcript of ref audio
`instruct`	❌	Speaker attributes (voice design)
`language_id`	❌	Language code, e.g. `"en"`
`language_name`	❌	Language name, e.g. `"English"`
`duration`	❌	Fixed output duration in seconds
`speed`	❌	Speaking rate multiplier (ignored if duration set)
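
A test list can be generated and sanity-checked before it is handed to `omnivoice-infer-batch`. The sketch below follows the field table above; `to_jsonl` is a hypothetical helper, and rejecting unknown fields and duplicate ids is a choice made here, not documented CLI behavior.

```python
import json
from typing import Iterable, Mapping

REQUIRED = ("id", "text")
OPTIONAL = ("ref_audio", "ref_text", "instruct", "language_id",
            "language_name", "duration", "speed")


def to_jsonl(items: Iterable[Mapping]) -> str:
    """Validate items against the field table and serialize one JSON object per line."""
    seen = set()
    lines = []
    for n, item in enumerate(items, start=1):
        missing = [key for key in REQUIRED if not item.get(key)]
        if missing:
            raise ValueError(f"item {n}: missing required field(s) {missing}")
        unknown = sorted(set(item) - set(REQUIRED) - set(OPTIONAL))
        if unknown:
            raise ValueError(f"item {n}: unknown field(s) {unknown}")
        if item["id"] in seen:
            raise ValueError(f"item {n}: duplicate id {item['id']!r}")
        seen.add(item["id"])
        # ensure_ascii=False keeps non-Latin text readable in the file.
        lines.append(json.dumps(dict(item), ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


print(to_jsonl([
    {"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav"},
    {"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"},
]), end="")
```

Writing the returned string to `test.jsonl` with `encoding="utf-8"` gives a file in the format shown above.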

---

Common Patterns

Full Voice Cloning Pipeline

from omnivoice import OmniVoice
import torch
import torchaudio
from pathlib import Path

def clone_voice(ref_audio_path: str, texts: list[str], output_dir: str):
    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice",
        device_map="cuda:0",
        dtype=torch.float16
    )
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    for i, text in enumerate(texts):
        audio = model.generate(
            text=text,
            ref_audio=ref_audio_path,
            # ref_text omitted: Whisper auto-transcribes
            num_step=32,
            speed=1.0,
        )
        # Build the path with pathlib so a trailing slash in output_dir is harmless
        out_path = Path(output_dir) / f"output_{i:04d}.wav"
        torchaudio.save(str(out_path), audio[0], 24000)
        print(f"Saved: {out_path}")

clone_voice(
    ref_audio_path="speaker.wav",
    texts=["Hello world.", "Second sentence.", "Third sentence."],
    output_dir="outputs/"
)

Batch Processing from a List

import json
from omnivoice import OmniVoice
import torch
import torchaudio

model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)

items = [
    {"id": "s1", "text": "English sentence.", "instruct": "female, american accent"},
    {"id": "s2", "text": "Another sentence.", "ref_audio": "ref.wav"},
    {"id": "s3", "text": "Auto voice."},
]

for item in items:
    kwargs = {"text": item["text"]}
    if "ref_audio" in item:
        kwargs["ref_audio"] = item["ref_audio"]
    if "ref_text" in item:
        kwargs["ref_text"] = item["ref_text"]
    if "instruct" in item:
        kwargs["instruct"] = item["instruct"]

    audio = model.generate(**kwargs)
    torchaudio.save(f"{item['id']}.wav", audio[0], 24000)

Voice Design Combinations

designs = [
    "male, elderly, low pitch",
    "female, child, high pitch",
    "male, whisper",
    "female, british accent, high pitch",
    "male, american accent, middle-aged",
]

for design in designs:
    audio = model.generate(
        text="The quick brown fox jumps over the lazy dog.",
        instruct=design,
    )
    safe_name = design.replace(", ", "_").replace(" ", "-")
    torchaudio.save(f"design_{safe_name}.wav", audio[0], 24000)

Fast Inference (Lower Diffusion Steps)

# Default: num_step=32 (high quality)
# Fast: num_step=16 (slightly lower quality, ~2x faster)
audio = model.generate(
    text="Fast inference example.",
    ref_audio="ref.wav",
    num_step=16,
)

---

Output Format

Sample rate: 24,000 Hz
Type: list[torch.Tensor], each tensor shape (1, T)
Save: use torchaudio.save(path, audio[0], 24000)
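
Given the fixed 24,000 Hz sample rate, the length of a clip follows directly from the tensor's last dimension. A small worked example; `duration_seconds` is a hypothetical helper that takes the sample count as a plain integer.

```python
SAMPLE_RATE = 24_000  # Hz, as stated above


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Convert a sample count T into seconds of audio."""
    if num_samples < 0 or sample_rate <= 0:
        raise ValueError("num_samples must be >= 0 and sample_rate > 0")
    return num_samples / sample_rate


# A tensor of shape (1, 48000) holds two seconds of audio.
print(duration_seconds(48_000))
```

For a generated result, the sample count would be `audio[0].shape[-1]`.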

---

Troubleshooting

HuggingFace download fails

export HF_ENDPOINT="https://hf-mirror.com"

CUDA out of memory

# Use float16 (not float32)
model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)
# Or reduce batch size / text length in batch inference

Whisper ASR not available for ref_text auto-transcription

pip install openai-whisper

Wrong pronunciation in Chinese

Use inline pinyin with tone numbers directly in the text string:

# Format: uppercase pinyin followed by its tone number (e.g. ZHE2), inline in the sentence
text = "这批货物打ZHE2出售"

Audio quality issues

Set num_step back to 32 (the default) if you lowered it, or try 64
Provide ref_text manually instead of relying on auto-transcription
Use a clean, noise-free reference audio clip (3–15 seconds recommended)
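
The recommended reference length can be checked before cloning. This is a minimal sketch using only the standard library, assuming the reference is an uncompressed PCM WAV file (the `wave` module does not read other formats); both function names are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from frame count and frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_length_ok(path: str, min_s: float = 3.0, max_s: float = 15.0) -> bool:
    """True when the clip falls inside the recommended 3 to 15 second window."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

A call such as `reference_length_ok("ref.wav")` returns `False` for clips that are too short or too long; it says nothing about noise or recording quality.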

Apple Silicon (MPS) issues

# Use mps device explicitly
model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="mps", dtype=torch.float16)

---

Model & Resources

Resource	Link
HuggingFace Model	`k2-fsa/OmniVoice`
HuggingFace Space	https://huggingface.co/spaces/k2-fsa/OmniVoice
Paper (arXiv)	https://arxiv.org/abs/2604.00688
Demo Page	https://zhu-han.github.io/omnivoice
Supported Languages	`docs/languages.md` in repo
Voice Design Attributes	`docs/voice-design.md` in repo
Generation Parameters	`docs/generation-parameters.md` in repo
Training/Eval Examples	`examples/` in repo

How it compares

Pick this over basic cloud TTS wrappers when the project requires zero-shot voice cloning across hundreds of languages from Python agent scripts.

FAQ

How many languages does OmniVoice TTS support?

OmniVoice TTS supports 600+ languages through a zero-shot text-to-speech model. The omnivoice-tts skill documents Python installation, inference, voice cloning, and batch generation across that multilingual coverage.

Can omnivoice-tts clone a voice from a sample?

omnivoice-tts covers OmniVoice zero-shot voice cloning, letting developers generate speech in a cloned voice from reference audio without per-language fine-tuning. Voice design capabilities are also documented for custom voice profiles.

Is Omnivoice Tts safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Tags: Generative Media, agents, automation