
Speech To Text
Transcribe meetings, podcasts, and voice notes into searchable text or subtitles through inference.sh without wiring Whisper or ElevenLabs APIs by hand.
Overview
Speech-to-text is an agent skill for the Build phase that transcribes remote audio URLs through inference.sh belt apps (ElevenLabs Scribe and Whisper models).
Install
npx skills add https://github.com/inference-sh/skills --skill speech-to-text
What is this skill?
- Runs ElevenLabs Scribe v2, Fast Whisper Large V3, and Whisper V3 Large via inference.sh app IDs
- Supports diarization, timestamps, multi-language transcription, translation, and audio event tagging
- Meeting, podcast, subtitle, and voice-note oriented trigger phrases in SKILL.md
- Quick start: `belt login` then `belt app run` with an `audio_url` JSON payload
- Install path documented: `npx skills add belt-sh/cli` for the belt CLI skill
- 3 listed STT models (ElevenLabs Scribe v2, Fast Whisper V3, Whisper V3 Large)
- ElevenLabs Scribe v2 marketed at 98%+ accuracy and 90+ languages
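The quick-start bullet above passes an `audio_url` JSON payload to `belt app run`. As a minimal sketch, the payload can be built by a small shell function so that a URL containing quotes or backslashes does not break the JSON. `build_payload` is a hypothetical helper name, not part of belt, and the `timestamps` field is taken from the Fast Whisper example in SKILL.md; other apps may not accept it.

```bash
# Hypothetical helper: only the `audio_url` and `timestamps` field names
# come from SKILL.md; the function itself is an illustration.
# Usage: build_payload URL [true]
build_payload() {
  # Escape backslashes and double quotes so the URL is a valid JSON string.
  url=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  if [ "${2:-false}" = "true" ]; then
    printf '{"audio_url": "%s", "timestamps": true}\n' "$url"
  else
    printf '{"audio_url": "%s"}\n' "$url"
  fi
}
```

Assuming the CLI accepts inline JSON as in the quick start, the result could be passed along as `belt app run infsh/fast-whisper-large-v3 --input "$(build_payload "https://example.com/meeting.mp3" true)"`.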
Adoption & trust: 505 installs on skills.sh; 512 GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have MP3s or meeting recordings but no quick, agent-invokable path to accurate transcripts with optional speakers and timestamps.
Who is it for?
Indie builders batching podcast episodes, user interviews, or Loom exports into text using belt without maintaining separate Whisper containers.
Skip if: You need real-time streaming dictation inside the IDE, or your team cannot use an external CLI login and cloud inference.
When should I use this skill?
Meeting transcription, subtitles, podcast transcripts, voice notes; triggers include speech to text, whisper, stt, transcribe meeting, elevenlabs stt, scribe.
What do I get? / Deliverables
You get structured transcription output from a chosen STT app run, ready to paste into notes, subtitles, or a follow-on summarization skill.
- Transcript text (and model-specific metadata such as timestamps or diarization when using Scribe)
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Build because the skill is a belt/inference.sh CLI integration that wires external STT models into an agent workflow. Integrations fits best: it connects ElevenLabs Scribe and Whisper apps via `belt app run`, not frontend UI or pure docs.
Where it fits
Pipe customer interview URLs through `infsh/fast-whisper-large-v3` before tagging insights in your CRM.
Transcribe a weekly podcast episode to draft blog posts and newsletter quotes.
Turn recorded demo narration into accurate README or tutorial copy.
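Each scenario above comes down to running the same STT app over a list of hosted audio URLs. A minimal sketch of such a batch loop follows, assuming one URL per line in a text file. The function name `transcribe_batch`, the `BELT` override variable, and the `transcript-N.json` naming are all hypothetical; only the `belt app run <app> --input <json>` form comes from SKILL.md.

```bash
# Hypothetical batch helper (not part of belt).
# Usage: transcribe_batch URL_FILE OUT_DIR [APP_ID]
# Assumes URLs contain no double quotes or backslashes.
transcribe_batch() {
  url_file=$1
  out_dir=$2
  app=${3:-infsh/fast-whisper-large-v3}
  belt_cmd=${BELT:-belt}   # set BELT=echo to print commands instead of running them
  mkdir -p "$out_dir" || return 1
  n=0
  while IFS= read -r url || [ -n "$url" ]; do
    # Skip blank lines and lines starting with '#'.
    case $url in ''|'#'*) continue ;; esac
    n=$((n + 1))
    # Redirect stdin so the command cannot consume the rest of the URL list.
    "$belt_cmd" app run "$app" --input "{\"audio_url\": \"$url\"}" \
      < /dev/null > "$out_dir/transcript-$n.json" || echo "failed: $url" >&2
  done < "$url_file"
  # Print how many URLs were attempted.
  echo "$n"
}
```

Running it once with `BELT=echo` writes the would-be commands into the output files, which is a cheap way to check the URL list before any recording is sent to a third-party app.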
How it compares
Use this skill package for belt-orchestrated STT jobs instead of hand-rolling ffmpeg plus local Whisper scripts in chat.
Common Questions / FAQ
Who is speech-to-text for?
Solo and indie builders using agent coding tools who already use or can install the inference.sh belt CLI and need transcription from hosted audio URLs.
When should I use speech-to-text?
During Build when wiring content or support pipelines (integrations), and during Grow when turning podcasts or webinars into written content. Use it whenever a trigger such as transcribe meeting, subtitles generation, or audio to text matches your task.
Is speech-to-text safe to install?
It documents Bash/belt usage and sends audio URLs to third-party inference apps; review the Security Audits panel on this Prism page and your inference.sh account policies before passing sensitive recordings.
SKILL.md
> **Install the belt CLI skill:** `npx skills add belt-sh/cli`

# Speech-to-Text

Transcribe audio to text via [inference.sh](https://inference.sh) CLI.

## Quick Start

> Requires inference.sh CLI (`belt`). [Install instructions](https://raw.githubusercontent.com/inference-sh/skills/refs/heads/main/cli-install.md)

```bash
belt login

belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://audio.mp3"}'
```

## Available Models

| Model | App ID | Best For |
|-------|--------|----------|
| ElevenLabs Scribe v2 | `elevenlabs/stt` | 98%+ accuracy, diarization, 90+ languages |
| Fast Whisper V3 | `infsh/fast-whisper-large-v3` | Fast transcription |
| Whisper V3 Large | `infsh/whisper-v3-large` | Highest accuracy |

## Examples

### Basic Transcription

```bash
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://meeting.mp3"}'
```

### With Timestamps

```bash
belt app sample infsh/fast-whisper-large-v3 --save input.json

# {
#   "audio_url": "https://podcast.mp3",
#   "timestamps": true
# }

belt app run infsh/fast-whisper-large-v3 --input input.json
```

### Translation (to English)

```bash
belt app run infsh/whisper-v3-large --input '{
  "audio_url": "https://french-audio.mp3",
  "task": "translate"
}'
```

### From Video

```bash
# Extract audio from video first
belt app run infsh/video-audio-extractor --input '{"video_url": "https://video.mp4"}' > audio.json

# Transcribe the extracted audio
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "<audio-url>"}'
```

## Workflow: Video Subtitles

```bash
# 1. Transcribe video audio
belt app run infsh/fast-whisper-large-v3 --input '{
  "audio_url": "https://video.mp4",
  "timestamps": true
}' > transcript.json

# 2. Use transcript for captions
belt app run infsh/caption-videos --input '{
  "video_url": "https://video.mp4",
  "captions": "<transcript-from-step-1>"
}'
```

## Supported Languages

Whisper supports 99+ languages including: English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, and many more.

## Use Cases

- **Meetings**: Transcribe recordings
- **Podcasts**: Generate transcripts
- **Subtitles**: Create captions for videos
- **Voice Notes**: Convert to searchable text
- **Interviews**: Transcription for research
- **Accessibility**: Make audio content accessible

## Output Format

Returns JSON with:

- `text`: Full transcription
- `segments`: Timestamped segments (if requested)
- `language`: Detected language

## Related Skills

```bash
# ElevenLabs STT (98%+ accuracy, diarization)
npx skills add inference-sh/skills@elevenlabs-stt

# ElevenLabs TTS (reverse direction)
npx skills add inference-sh/skills@elevenlabs-tts

# Full platform skill (all 250+ apps)
npx skills add inference-sh/skills@infsh-cli

# Text-to-speech (reverse direction)
npx skills add inference-sh/skills@text-to-speech

# Video generation (add captions)
npx skills add inference-sh/skills@ai-video-generation

# AI avatars (lipsync with transcripts)
npx skills add inference-sh/skills@ai-avatar-video
```

Browse all audio apps: `belt app store --category audio`

## Documentation

- [Running Apps](https://inferen