
HyperFrames Media
Preprocess narration, captions, and transparent overlay assets locally for HyperFrames video compositions without paid TTS or cloud APIs.
Overview
HyperFrames Media is an agent skill for the Build phase that preprocesses narration, transcripts, and transparent overlays for HyperFrames compositions via local Kokoro, Whisper, and u2net CLIs.
Install
npx skills add https://github.com/heygen-com/hyperframes --skill hyperframes-media
What is this skill?
- Three CLI commands: tts, transcribe, and remove-background with models cached under ~/.cache/hyperframes/
- Kokoro-82M text-to-speech with 54 listed voices and no API key
- Whisper transcription for speech-to-text and caption timestamps
- u2net background removal for transparent video or image overlays
- Chains naturally from TTS to transcription to captions, and follows the audio/video element conventions of the hyperframes skill
Adoption & trust: 58.9k installs on skills.sh; 25.6k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need voiceover, captions, and cut-out overlays for a HyperFrames video but do not want to juggle separate cloud APIs or guess how to chain TTS into timed subtitles.
Who is it for?
Indie builders composing HyperFrames marketing or demo videos who prefer on-device TTS and captions with predictable CLI flags.
Skip if: you need studio voice talent, real-time streaming synthesis, or video editing outside the HyperFrames asset pipeline.
When should I use this skill?
Generating voiceover from text, transcribing speech for captions, removing backgrounds for transparent overlays, choosing TTS voice or Whisper model, or chaining TTS → transcribe → captions for HyperFrames.
What do I get? / Deliverables
You get cached local model runs that output WAV narration, transcript timestamps, and alpha-ready media you can reference from composition HTML using the hyperframes skill conventions.
- WAV narration from Kokoro TTS
- Transcript with timestamps for captions
- Transparent video or image assets for overlays
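The narration-to-captions chain described above can be sketched as a small shell helper. This is a minimal sketch, not part of the hyperframes CLI: `run` and `narrate_and_transcribe` are hypothetical names, and the only flags used are the ones shown in SKILL.md (`--voice`, `--output`). It also assumes `transcribe` accepts the WAV file that `tts` writes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the `npx hyperframes` commands come from SKILL.md.

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: narrate_and_transcribe <text-or-script.txt> [voice] [output.wav]
# Step 1 synthesizes narration; step 2 transcribes it for caption timestamps.
narrate_and_transcribe() {
  local script="$1"
  local voice="${2:-af_heart}"      # af_heart is the documented default voice
  local wav="${3:-narration.wav}"
  run npx hyperframes tts "$script" --voice "$voice" --output "$wav" &&
    run npx hyperframes transcribe "$wav"
}
```

Calling `narrate_and_transcribe script.txt bf_emma` would run both steps in order. Setting `DRY_RUN=1` first prints the two commands instead, which is a cheap way to check the chain before any model download starts.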
Journey fit
Media preprocessing sits in Build because it produces composition-ready audio and video assets before you ship marketing or product demos in HyperFrames HTML. The Integrations category also fits, because the three CLI pipelines (Kokoro TTS, Whisper transcription, u2net background removal) plug into the broader HyperFrames workflow.
How it compares
Local CLI preprocessing for composition assets—not a hosted HeyGen avatar API or a general-purpose video editor.
Common Questions / FAQ
Who is hyperframes-media for?
Solo and indie builders using HyperFrames who need offline narration, caption timing, and transparent overlays before assembling composition HTML.
When should I use hyperframes-media?
During Build when generating voiceover from text, transcribing speech for captions, removing backgrounds for overlays, picking a Kokoro voice or Whisper model, or chaining TTS through transcription into captions.
Is hyperframes-media safe to install?
It runs local CLIs and downloads models to your cache; review the Security Audits panel on this Prism page and verify npx hyperframes sources before running on sensitive machines.
Workflow Chain
Requires first: the hyperframes skill (heygen-com/hyperframes)
Then invoke: the hyperframes skill (heygen-com/hyperframes)
SKILL.md
# HyperFrames Media Preprocessing

Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`. Drop the output into the project, then reference it from the composition HTML; see the `hyperframes` skill for the audio/video element conventions.

## Text-to-Speech (`tts`)

Generate speech audio locally with Kokoro-82M. No API key.

```bash
npx hyperframes tts "Text here" --voice af_nova --output narration.wav
npx hyperframes tts script.txt --voice bf_emma --output narration.wav
npx hyperframes tts --list  # all 54 voices
```

### Voice Selection

Match voice to content. Default is `af_heart`.

| Content type      | Voice                 | Why                           |
| ----------------- | --------------------- | ----------------------------- |
| Product demo      | `af_heart`/`af_nova`  | Warm, professional            |
| Tutorial / how-to | `am_adam`/`bf_emma`   | Neutral, easy to follow       |
| Marketing / promo | `af_sky`/`am_michael` | Energetic or authoritative    |
| Documentation     | `bf_emma`/`bm_george` | Clear British English, formal |
| Casual / social   | `af_heart`/`af_sky`   | Approachable, natural         |

### Multilingual

Voice IDs encode language in the first letter: `a`=American English, `b`=British English, `e`=Spanish, `f`=French, `h`=Hindi, `i`=Italian, `j`=Japanese, `p`=Brazilian Portuguese, `z`=Mandarin. The CLI auto-detects the phonemizer locale from the prefix, so no `--lang` is needed when the voice matches the text.

```bash
npx hyperframes tts "La reunión empieza a las nueve" --voice ef_dora --output es.wav
npx hyperframes tts "今日はいい天気ですね" --voice jf_alpha --output ja.wav
```

Use `--lang` only to override auto-detection (stylized accents). Valid codes: `en-us`, `en-gb`, `es`, `fr-fr`, `hi`, `it`, `pt-br`, `ja`, `zh`.

Non-English phonemization requires `espeak-ng` system-wide (`brew install espeak-ng` / `apt-get install espeak-ng`).
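To make the prefix rule concrete, here is a minimal sketch of the lookup it implies. `voice_lang` is a hypothetical helper, not part of the CLI. The pairing of each prefix with a `--lang` code is inferred by matching the language names above against the list of valid codes, so the CLI's internal table may differ.

```shell
# voice_lang: hypothetical helper mapping a Kokoro voice ID to a --lang code.
# The prefix-to-code pairing is an assumption inferred from the two lists above.
voice_lang() {
  case "$1" in
    a*) echo "en-us" ;;   # American English
    b*) echo "en-gb" ;;   # British English
    e*) echo "es" ;;      # Spanish
    f*) echo "fr-fr" ;;   # French
    h*) echo "hi" ;;      # Hindi
    i*) echo "it" ;;      # Italian
    j*) echo "ja" ;;      # Japanese
    p*) echo "pt-br" ;;   # Brazilian Portuguese
    z*) echo "zh" ;;      # Mandarin
    *)  echo "unknown voice prefix: $1" >&2; return 1 ;;
  esac
}
```

A script could use this to check that a chosen voice matches the script's language before synthesis, or to decide when an explicit `--lang` override is actually a change from the default.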
### Speed

- `0.7-0.8`: tutorial, complex content, accessibility
- `1.0`: natural pace (default)
- `1.1-1.2`: intros, transitions, upbeat content
- `1.5+`: rarely appropriate; test carefully

### Long Scripts

For more than a few paragraphs, write to a `.txt` file and pass the path. Inputs over ~5 minutes of speech may benefit from splitting into segments.

### Requirements

Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). Model downloads on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).

## Transcription (`transcribe`)

Produce a normalized `transcript.json` with word-level timestamps.

```bash
npx hyperframes transcribe audio.mp3
npx hyperframes transcribe video.mp4 --model small --language es
npx hyperframes transcribe subtitles.srt  # import existing
npx hyperframes transcribe subtitles.vtt
npx hyperframes transcribe openai-response.json
```

### Language Rule (Non-Negotiable)

**Never use `.en` models unless the user explicitly states the audio is English.** `.en` models (`small.en`, `medium.en`) **translate** non-English audio into English instead of transcribing it. This silently destroys the original language.

1. Language known and non-English → `--model small --language <code>` (no `.en` suffix)
2. Language known and English → `--model small.en`
3. Language unknown → `--model small` (no `