Hyperframes Media

Name: Hyperframes Media
Author: heygen-com

heygen-com/hyperframes

Preprocess narration, captions, and transparent overlay assets locally for HyperFrames video compositions without paid TTS or cloud APIs.

Overview

HyperFrames Media is an agent skill for the Build phase that preprocesses narration, transcripts, and transparent overlays for HyperFrames compositions via local Kokoro, Whisper, and u2net CLIs.

Install

npx skills add https://github.com/heygen-com/hyperframes --skill hyperframes-media

What is this skill?

Three CLI commands (tts, transcribe, and remove-background), with models cached under ~/.cache/hyperframes/ on first run
Kokoro-82M text-to-speech with 54 voices (listed via --list) and no API key
Whisper transcription for speech-to-text and caption timestamps
u2net background removal for transparent video or image overlays
Chains naturally as TTS → transcribe → captions, and references the hyperframes skill's element conventions

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 58.9k installs on skills.sh; 25.6k GitHub stars; 2/3 security scanners passed (skills.sh audits).

What problem does it solve?

You need voiceover, captions, and cut-out overlays for a HyperFrames video but do not want to juggle separate cloud APIs or guess how to chain TTS into timed subtitles.

Who is it for?

Indie builders composing HyperFrames marketing or demo videos who prefer on-device TTS and captions with predictable CLI flags.

Skip if: you need studio voice talent, real-time streaming synthesis, or video editing outside the HyperFrames asset pipeline.

When should I use this skill?

Generating voiceover from text, transcribing speech for captions, removing backgrounds for transparent overlays, choosing TTS voice or Whisper model, or chaining TTS → transcribe → captions for HyperFrames.

What do I get? / Deliverables

You get cached local model runs that output WAV narration, transcript timestamps, and alpha-ready media you can reference from composition HTML using the hyperframes skill conventions.

WAV narration from Kokoro TTS
Transcript with timestamps for captions
Transparent video or image assets for overlays

Journey fit

Primary fit

Build · Integrations & version control

Media preprocessing sits in Build because it produces composition-ready audio and video assets before you ship marketing or product demos in HyperFrames HTML. Integrations fits the three CLI pipelines (Kokoro TTS, Whisper transcription, u2net background removal) that plug into the broader HyperFrames workflow.

Also useful

Launch · Distribution & launch channels

How it compares

Local CLI preprocessing for composition assets—not a hosted HeyGen avatar API or a general-purpose video editor.

Common Questions / FAQ

Who is hyperframes-media for?

Solo and indie builders using HyperFrames who need offline narration, caption timing, and transparent overlays before assembling composition HTML.

When should I use hyperframes-media?

During Build when generating voiceover from text, transcribing speech for captions, removing backgrounds for overlays, picking a Kokoro voice or Whisper model, or chaining TTS through transcription into captions.

Is hyperframes-media safe to install?

It runs local CLIs and downloads models to your cache. Review the Security Audits panel on this Prism page and verify the source of the npx hyperframes package before running it on sensitive machines.

Workflow Chain

Requires first: the hyperframes skill (heygen-com/hyperframes)

Then invoke: the hyperframes skill (heygen-com/hyperframes)

SKILL.md

# HyperFrames Media Preprocessing

Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`. Drop the output into the project, then reference it from the composition HTML — see the `hyperframes` skill for the audio/video element conventions.
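As a rough illustration of how the first two commands might be chained for a single narration clip, the sketch below wraps them in one function. The function name `narrate_and_caption` and the `HYPERFRAMES` override variable are hypothetical names introduced here, not part of the CLI. Only the `tts` and `transcribe` invocations come from this document, and where `transcribe` writes `transcript.json` is not specified here.

```bash
# Sketch: narration -> transcript for one script file.
# narrate_and_caption and HYPERFRAMES are hypothetical names, not CLI features.
narrate_and_caption() {
  local script="$1" voice="${2:-af_heart}" wav="${3:-narration.wav}"

  # Step 1: synthesize speech locally with Kokoro.
  # The expansion is left unquoted so "npx hyperframes" splits into two words.
  ${HYPERFRAMES:-npx hyperframes} tts "$script" --voice "$voice" --output "$wav" || return 1

  # Step 2: transcribe the generated audio to get word-level timestamps.
  ${HYPERFRAMES:-npx hyperframes} transcribe "$wav" || return 1
}
```

Calling `narrate_and_caption script.txt bf_emma narration.wav` would run both steps in order. Setting `HYPERFRAMES=echo` first turns the call into a dry run that only prints the two command lines.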

## Text-to-Speech (`tts`)

Generate speech audio locally with Kokoro-82M. No API key.

```bash
npx hyperframes tts "Text here" --voice af_nova --output narration.wav
npx hyperframes tts script.txt --voice bf_emma --output narration.wav
npx hyperframes tts --list                       # all 54 voices
```

### Voice Selection

Match voice to content. Default is `af_heart`.

| Content type      | Voice                 | Why                           |
| ----------------- | --------------------- | ----------------------------- |
| Product demo      | `af_heart`/`af_nova`  | Warm, professional            |
| Tutorial / how-to | `am_adam`/`bf_emma`   | Neutral, easy to follow       |
| Marketing / promo | `af_sky`/`am_michael` | Energetic or authoritative    |
| Documentation     | `bf_emma`/`bm_george` | Clear British English, formal |
| Casual / social   | `af_heart`/`af_sky`   | Approachable, natural         |

### Multilingual

Voice IDs encode language in the first letter: `a`=American English, `b`=British English, `e`=Spanish, `f`=French, `h`=Hindi, `i`=Italian, `j`=Japanese, `p`=Brazilian Portuguese, `z`=Mandarin. The CLI auto-detects the phonemizer locale from the prefix — no `--lang` needed when the voice matches the text.

```bash
npx hyperframes tts "La reunión empieza a las nueve" --voice ef_dora --output es.wav
npx hyperframes tts "今日はいい天気ですね" --voice jf_alpha --output ja.wav
```

Use `--lang` only to override auto-detection (stylized accents). Valid codes: `en-us`, `en-gb`, `es`, `fr-fr`, `hi`, `it`, `pt-br`, `ja`, `zh`. Non-English phonemization requires `espeak-ng` system-wide (`brew install espeak-ng` / `apt-get install espeak-ng`).
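To make the prefix rule concrete, here is a minimal sketch of the lookup it implies. `voice_lang` is a hypothetical helper, and the pairing of each prefix with a `--lang` code is inferred from the two lists above rather than taken from the CLI's source.

```bash
# Sketch: derive the phonemizer locale from a voice ID's first letter.
# voice_lang is a hypothetical helper; the prefix-to-code pairing is inferred
# from the lists above, not read from the CLI.
voice_lang() {
  case "$1" in
    a*) echo "en-us" ;;
    b*) echo "en-gb" ;;
    e*) echo "es" ;;
    f*) echo "fr-fr" ;;
    h*) echo "hi" ;;
    i*) echo "it" ;;
    j*) echo "ja" ;;
    p*) echo "pt-br" ;;
    z*) echo "zh" ;;
    *)  echo "unknown voice prefix: $1" >&2; return 1 ;;
  esac
}

voice_lang ef_dora    # es
voice_lang jf_alpha   # ja
```

A helper like this is mainly useful for sanity-checking that a chosen voice matches the script's language before synthesis.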

### Speed

- `0.7-0.8` — tutorial, complex content, accessibility
- `1.0` — natural pace (default)
- `1.1-1.2` — intros, transitions, upbeat content
- `1.5+` — rarely appropriate; test carefully

### Long Scripts

For more than a few paragraphs, write to a `.txt` file and pass the path. Inputs over ~5 minutes of speech may benefit from splitting into segments.
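One possible way to split a long script is on blank lines, giving one segment file per paragraph. In the sketch below, `split_script`, `synthesize_segments`, and the `seg_NNN.txt` naming are hypothetical, and splitting by paragraph is only one choice of segment boundary. The `tts` call uses only the flags shown earlier.

```bash
# Sketch: split a long script on blank lines into numbered segment files.
# split_script, synthesize_segments, and seg_NNN.txt are hypothetical names.
split_script() {
  local src="$1" outdir="${2:-segments}"
  mkdir -p "$outdir" || return 1
  # An empty RS puts awk in paragraph mode: each blank-line-separated block
  # becomes one record, written to its own file.
  awk -v RS= -v dir="$outdir" '{
    f = sprintf("%s/seg_%03d.txt", dir, NR)
    print > f
    close(f)
  }' "$src"
}

# Synthesize each segment to a WAV file next to its text file.
synthesize_segments() {
  local outdir="${1:-segments}" voice="${2:-af_heart}" f
  for f in "$outdir"/seg_*.txt; do
    npx hyperframes tts "$f" --voice "$voice" --output "${f%.txt}.wav" || return 1
  done
}
```

Running `split_script script.txt segments` followed by `synthesize_segments segments bf_emma` would leave one WAV per paragraph. Joining or sequencing those files in the composition is left to the reader.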

### Requirements

Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). Model downloads on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).
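A quick preflight check along these lines can catch missing prerequisites before the first run. `check_tts_requirements` is a hypothetical helper, and it assumes the interpreter is available as `python3` on `PATH`, which this document does not state.

```bash
# Sketch: report which TTS prerequisites look missing on this machine.
# check_tts_requirements is a hypothetical helper; it assumes the interpreter
# is invoked as python3.
check_tts_requirements() {
  local missing=0
  if ! command -v python3 >/dev/null 2>&1; then
    echo "missing: python3"; missing=1
  elif ! python3 -c 'import sys; sys.exit(sys.version_info < (3, 8))'; then
    echo "missing: Python 3.8 or newer"; missing=1
  elif ! python3 -c 'import kokoro_onnx, soundfile' >/dev/null 2>&1; then
    echo "missing: kokoro-onnx or soundfile (pip install kokoro-onnx soundfile)"; missing=1
  fi
  # espeak-ng matters only for non-English voices, so it is reported as a note.
  command -v espeak-ng >/dev/null 2>&1 ||
    echo "note: espeak-ng not found (needed for non-English phonemization)"
  return "$missing"
}
```

The function returns 0 when the Python side looks ready and 1 otherwise, so it can gate a build script with `check_tts_requirements || exit 1`.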

## Transcription (`transcribe`)

Produce a normalized `transcript.json` with word-level timestamps.

```bash
npx hyperframes transcribe audio.mp3
npx hyperframes transcribe video.mp4 --model small --language es
npx hyperframes transcribe subtitles.srt          # import existing
npx hyperframes transcribe subtitles.vtt
npx hyperframes transcribe openai-response.json
```

### Language Rule (Non-Negotiable)

**Never use `.en` models unless the user explicitly states the audio is English.** `.en` models (`small.en`, `medium.en`) **translate** non-English audio into English instead of transcribing it. This silently destroys the original language.

1. Language known and non-English → `--model small --language <code>` (no `.en` suffix)
2. Language known and English → `--model small.en`
3. Language unknown → `--model small` (no `
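The three rules above can be sketched as a small flag chooser. `whisper_flags` is a hypothetical helper, and it assumes the language is passed as a short code such as `es` or `en`, or left empty when unknown.

```bash
# Sketch: choose transcribe flags from the three rules above.
# whisper_flags is a hypothetical helper; pass a language code, or nothing
# when the language is unknown.
whisper_flags() {
  case "${1:-}" in
    "")      echo "--model small" ;;                # rule 3: language unknown
    en|en-*) echo "--model small.en" ;;             # rule 2: known English
    *)       echo "--model small --language $1" ;;  # rule 1: known non-English
  esac
}

whisper_flags es   # --model small --language es
whisper_flags en   # --model small.en
whisper_flags      # --model small
```

The output can be spliced into a call unquoted, for example `npx hyperframes transcribe audio.mp3 $(whisper_flags es)`, so that the flags split into separate arguments.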

