Speech To Text

The canonical shelf is Build, because the skill is a belt (inference.sh) CLI integration that wires external STT models into an agent workflow. The Integrations category fits best: the skill connects ElevenLabs Scribe and Whisper apps via `belt app run` rather than providing frontend UI or pure documentation.

Also useful

Where it fits

Example uses

Pipe customer interview URLs through `infsh/fast-whisper-large-v3` before tagging insights in your CRM.

Transcribe a weekly podcast episode to draft blog posts and newsletter quotes.

Turn recorded demo narration into accurate README or tutorial copy.

How it compares

Use this skill package for belt-orchestrated STT jobs instead of hand-rolling ffmpeg plus local Whisper scripts in chat.

Common Questions / FAQ

Who is speech-to-text for?

Solo and indie builders using agent coding tools who already use or can install the inference.sh belt CLI and need transcription from hosted audio URLs.

When should I use speech-to-text?

Use it during Build when wiring content or support pipelines (integrations), and during Grow when turning podcasts or webinars into written content. It applies whenever your task matches a trigger such as "transcribe meeting", "subtitles generation", or "audio to text".

Is speech-to-text safe to install?

It documents Bash/belt usage and sends audio URLs to third-party inference apps; review the Security Audits panel on this Prism page and your inference.sh account policies before passing sensitive recordings.

SKILL.md

> **Install the belt CLI skill:** `npx skills add belt-sh/cli`

# Speech-to-Text

Transcribe audio to text via the [inference.sh](https://inference.sh) CLI.

![Speech-to-Text](https://cloud.inference.sh/u/4mg21r6ta37mpaz6ktzwtt8krr/01jz025e88nkvw55at1rqtj5t8.png)

## Quick Start

> Requires inference.sh CLI (`belt`). [Install instructions](https://raw.githubusercontent.com/inference-sh/skills/refs/heads/main/cli-install.md)

```bash
belt login

belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://audio.mp3"}'
```
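
When the audio URL lives in a shell variable, hand-quoting the inline JSON gets error-prone. The following is a minimal sketch and not part of the skill itself: `stt_payload` and `transcribe` are hypothetical helper names, and the sketch assumes the URL contains no double quotes or backslashes (true for properly percent-encoded URLs).

```bash
#!/usr/bin/env bash

# Build the --input JSON for a transcription run from a URL argument.
# Assumption: the URL has no double quotes or backslashes, so printf suffices.
stt_payload() {
  printf '{"audio_url": "%s"}' "$1"
}

# Hypothetical wrapper: transcribe one URL with the fast Whisper app
# shown in the Quick Start above.
transcribe() {
  belt app run infsh/fast-whisper-large-v3 --input "$(stt_payload "$1")"
}
```

With these defined, `transcribe "$AUDIO_URL" > transcript.json` would run the same command as the Quick Start, with the URL taken from a variable.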


## Available Models

| Model | App ID | Best For |
|-------|--------|----------|
| ElevenLabs Scribe v2 | `elevenlabs/stt` | 98%+ accuracy, diarization, 90+ languages |
| Fast Whisper Large V3 | `infsh/fast-whisper-large-v3` | Fast transcription |
| Whisper V3 Large | `infsh/whisper-v3-large` | Highest accuracy |
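
In scripts, the table can be encoded as a small lookup so the app ID is chosen by need. This is only a sketch: the function name `stt_app_id` and the keywords `diarization`, `fast`, and `accurate` are made up for illustration, while the app IDs are the ones listed in the table.

```bash
#!/usr/bin/env bash

# Map a (hypothetical) need keyword to one of the app IDs from the table.
stt_app_id() {
  case "$1" in
    diarization) echo "elevenlabs/stt" ;;
    fast)        echo "infsh/fast-whisper-large-v3" ;;
    accurate)    echo "infsh/whisper-v3-large" ;;
    *)           echo "unknown need: $1" >&2; return 1 ;;
  esac
}
```

A call such as `belt app run "$(stt_app_id fast)" --input input.json` then keeps the model choice in one place.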

## Examples

### Basic Transcription

```bash
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://meeting.mp3"}'
```

### With Timestamps

```bash
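# Generate a sample input file for the app
belt app sample infsh/fast-whisper-large-v3 --save input.json

# Edit input.json so that it contains:
# {
#   "audio_url": "https://podcast.mp3",
#   "timestamps": true
# }

# Run the app with the edited input file
belt app run infsh/fast-whisper-large-v3 --input input.json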
```

### Translation (to English)

```bash
belt app run infsh/whisper-v3-large --input '{
  "audio_url": "https://french-audio.mp3",
  "task": "translate"
}'
```

### From Video

```bash
# Extract audio from video first
belt app run infsh/video-audio-extractor --input '{"video_url": "https://video.mp4"}' > audio.json

# Transcribe the extracted audio
# (replace <audio-url> with the audio URL reported in audio.json)
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "<audio-url>"}'
```
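
The two steps can be chained without copying the URL by hand. The sketch below rests on assumptions: this document does not specify the extractor's output schema, so the hypothetical `first_url` helper simply takes the first http(s) URL found in the saved result. `video_to_text` is likewise an illustrative name.

```bash
#!/usr/bin/env bash

# Print the first http(s) URL found in a saved JSON result file.
# Assumption: the result contains the audio URL as a quoted string.
first_url() {
  grep -oE 'https?://[^"[:space:]]+' "$1" | head -n 1
}

# Hypothetical chaining helper: extract audio from a video, then transcribe it.
video_to_text() {
  belt app run infsh/video-audio-extractor \
    --input "{\"video_url\": \"$1\"}" > audio.json || return 1
  belt app run infsh/fast-whisper-large-v3 \
    --input "{\"audio_url\": \"$(first_url audio.json)\"}"
}
```

If the extractor's result also echoes the input video URL before the audio URL, `first_url` would pick the wrong one. Inspect `audio.json` once and narrow the pattern accordingly.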

## Workflow: Video Subtitles

```bash
# 1. Transcribe video audio
belt app run infsh/fast-whisper-large-v3 --input '{
  "audio_url": "https://video.mp4",
  "timestamps": true
}' > transcript.json

# 2. Use transcript for captions
belt app run infsh/caption-videos --input '{
  "video_url": "https://video.mp4",
  "captions": "<transcript-from-step-1>"
}'
```

## Supported Languages

Whisper supports 99+ languages including:
English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, and many more.

## Use Cases

- **Meetings**: Transcribe recordings
- **Podcasts**: Generate transcripts
- **Subtitles**: Create captions for videos
- **Voice Notes**: Convert to searchable text
- **Interviews**: Transcription for research
- **Accessibility**: Make audio content accessible

## Output Format

Returns JSON with:
- `text`: Full transcription
- `segments`: Timestamped segments (if requested)
- `language`: Detected language

## Related Skills

```bash
# ElevenLabs STT (98%+ accuracy, diarization)
npx skills add inference-sh/skills@elevenlabs-stt

# ElevenLabs TTS (reverse direction)
npx skills add inference-sh/skills@elevenlabs-tts

# Full platform skill (all 250+ apps)
npx skills add inference-sh/skills@infsh-cli

# Text-to-speech (reverse direction)
npx skills add inference-sh/skills@text-to-speech

# Video generation (add captions)
npx skills add inference-sh/skills@ai-video-generation

# AI avatars (lipsync with transcripts)
npx skills add inference-sh/skills@ai-avatar-video
```

Browse all audio apps: `belt app store --category audio`

## Documentation

- [Running Apps](https://inferen

What is this skill?

Runs ElevenLabs Scribe v2, Fast Whisper Large V3, and Whisper V3 Large via inference.sh app IDs

Supports diarization, timestamps, multi-language transcription, translation, and audio event tagging

Meeting, podcast, subtitle, and voice-note oriented trigger phrases in SKILL.md

Quick start: `belt login` then `belt app run` with an `audio_url` JSON payload

Install path documented: `npx skills add belt-sh/cli` for the belt CLI skill

3 listed STT models (ElevenLabs Scribe v2, Fast Whisper V3, Whisper V3 Large)

ElevenLabs Scribe v2 marketed at 98%+ accuracy and 90+ languages

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 505 installs on skills.sh; 512 GitHub stars; 1/3 security scanners passed (skills.sh audits).

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit
