Parlor On Device Ai

Name: Parlor On Device Ai
Author: aradotso

aradotso/trending-skills

565 installs
66 repo stars
Updated July 9, 2026

parlor-on-device-ai is a Claude Code skill that configures Parlor, a fully local multimodal voice and vision AI assistant using Gemma 4 E2B and Kokoro TTS over a FastAPI WebSocket server, for developers who need real-tim

About

Parlor On-Device AI is a real-time multimodal voice and vision assistant that runs entirely on your own hardware. It uses Gemma 4 E2B through LiteRT-LM to process speech and camera input, while Kokoro TTS generates spoken replies. The stack consists of a browser client that captures microphone and camera input, a FastAPI WebSocket server for bidirectional streaming, Silero VAD for hands-free activation, and barge-in support so users can interrupt the assistant mid-response. It needs no API keys, incurs no recurring costs, and sends no data off your machine. Developers use it to build private AI companions, local voice interfaces, and on-device multimodal prototypes; the skill's reference figures report roughly 2.5–3.0 s end to end on an Apple M3 Pro.

Combines Gemma 4 E2B via LiteRT-LM for real-time speech and vision understanding
Uses Kokoro TTS for streamed voice output with sentence-level early playback
Runs entirely on-device with Silero VAD for hands-free barge-in interaction
Platform-aware TTS backends: MLX on Apple Silicon and ONNX on Linux
FastAPI WebSocket server streams PCM audio and JPEG vision frames bidirectionally

Parlor On Device Ai by the numbers

565 all-time installs (skills.sh)
+13 installs in the week ending Jun 23, 2026 (Skillselion tracking)
Ranked #1,624 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 19, 2026 (Skillselion catalog sync)

npx skills add https://github.com/aradotso/trending-skills --skill parlor-on-device-ai

How do you run local voice and vision AI?

Run a fully local multimodal voice and vision AI assistant with no cloud costs or API keys.

Who is it for?

ML engineers and backend developers building privacy-first or offline multimodal assistants on Apple Silicon Macs or Linux machines with a supported GPU.

Skip if: you need managed cloud LLM APIs, lack a local GPU or Apple Silicon machine, or only need text chat without voice or camera streams.

When should I use this skill?

The task mentions Parlor, on-device voice AI, local Gemma 4 with Kokoro TTS, or a WebSocket multimodal assistant without cloud keys.

What you get

A locally hosted Parlor WebSocket server delivering real-time on-device speech, vision inference, and multimodal responses without cloud API keys.

Configured Parlor WebSocket server
Local voice and vision assistant runtime

Files

SKILL.md (Markdown)

Parlor On-Device AI

Skill by ara.so — Daily 2026 Skills collection.

Parlor is a real-time, on-device multimodal AI assistant. It combines Gemma 4 E2B (via LiteRT-LM) for speech and vision understanding with Kokoro TTS for voice output. Everything runs locally — no API keys, no cloud calls, no cost per request.

Architecture

Browser (mic + camera)
    │
    │  WebSocket (audio PCM + JPEG frames)
    ▼
FastAPI server
    ├── Gemma 4 E2B via LiteRT-LM (GPU)  →  understands speech + vision
    └── Kokoro TTS (MLX on Mac, ONNX on Linux)  →  speaks back
    │
    │  WebSocket (streamed audio chunks)
    ▼
Browser (playback + transcript)

Key features:

Silero VAD in browser — hands-free, no push-to-talk
Barge-in — interrupt AI mid-sentence by speaking
Sentence-level TTS streaming — audio starts before full response is ready
Platform-aware TTS — MLX backend on Apple Silicon, ONNX on Linux
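
Barge-in is easiest to picture as task cancellation. The following is a minimal sketch, assuming playback runs as an asyncio task that can be cancelled when the user starts speaking; the names `speak` and `demo_barge_in` are hypothetical and the sleeps stand in for real audio streaming, so this is not Parlor's actual implementation.

```python
import asyncio


async def speak(sentences, played):
    """Pretend to play each sentence; record what was actually played."""
    for sentence in sentences:
        await asyncio.sleep(0.1)  # stands in for streaming one sentence of audio
        played.append(sentence)


async def demo_barge_in():
    played = []
    # Run playback in the background so the loop stays free to hear the user.
    task = asyncio.create_task(speak(["One.", "Two.", "Three.", "Four."], played))
    await asyncio.sleep(0.15)  # hypothetical: the user starts talking here
    task.cancel()              # barge-in: stop playback right away
    try:
        await task
    except asyncio.CancelledError:
        pass                   # expected when playback is interrupted
    return played


if __name__ == "__main__":
    print(asyncio.run(demo_barge_in()))
```

Because playback is interrupted between sentences, only the sentences that finished before the interruption end up in `played`.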

Requirements

Python 3.12+
macOS with Apple Silicon or Linux with a supported GPU
~3 GB free RAM
`uv` package manager

Installation

git clone https://github.com/fikrikarim/parlor.git
cd parlor

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

cd src
uv sync
uv run server.py

Open http://localhost:8000, grant camera and microphone permissions, and start talking.

Models download automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).

Configuration

Set environment variables before running:

# Use a pre-downloaded model instead of auto-downloading
export MODEL_PATH=/path/to/gemma-4-E2B-it.litertlm

# Change server port (default: 8000)
export PORT=9000

uv run server.py

Variable	Default	Description
`MODEL_PATH`	auto-download from HuggingFace	Path to local `.litertlm` model file
`PORT`	`8000`	Server port
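
As a rough illustration of how a server might read these two variables, here is a minimal sketch. `ParlorConfig` and `load_config` are hypothetical names introduced for this example; only `MODEL_PATH`, `PORT`, and the default of 8000 come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ParlorConfig:
    model_path: Optional[str]  # None means "let the server auto-download"
    port: int


def load_config(env: Mapping[str, str] = os.environ) -> ParlorConfig:
    """Read MODEL_PATH and PORT, falling back to the documented defaults."""
    model_path = env.get("MODEL_PATH") or None  # treat an empty string as unset
    port = int(env.get("PORT", "8000"))
    if not 0 < port < 65536:
        raise ValueError(f"PORT out of range: {port}")
    return ParlorConfig(model_path=model_path, port=port)
```

Passing the environment as a parameter keeps the function easy to test, for example `load_config({"PORT": "9000"})`.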

Project Structure

src/
├── server.py              # FastAPI WebSocket server + Gemma 4 inference
├── tts.py                 # Platform-aware TTS (MLX on Mac, ONNX on Linux)
├── index.html             # Frontend UI (VAD, camera, audio playback)
├── pyproject.toml         # Dependencies
└── benchmarks/
    ├── bench.py           # End-to-end WebSocket benchmark
    └── benchmark_tts.py   # TTS backend comparison

Key Components

server.py — FastAPI WebSocket Server

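The server exchanges data with the browser over WebSocket in both directions: it receives audio and camera frames and streams synthesized audio back. The simplified pattern below handles both directions on a single /ws endpoint.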

# Simplified pattern from server.py
from fastapi import FastAPI, WebSocket
import asyncio

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    async for data in websocket.iter_bytes():
        # data contains PCM audio + optional JPEG frame
        response_text = await run_gemma_inference(data)
        audio_chunks = await run_tts(response_text)
        for chunk in audio_chunks:
            await websocket.send_bytes(chunk)

tts.py — Platform-Aware TTS

tts.py selects the Kokoro TTS backend based on the platform:

# tts.py uses platform detection
import platform

def get_tts_backend():
    # Apple Silicon reports Darwin + arm64; Intel Macs take the ONNX path
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        # Apple Silicon: use MLX backend for GPU acceleration
        from kokoro_mlx import KokoroMLX
        return KokoroMLX()
    else:
        # Linux and other platforms: use ONNX backend
        from kokoro import KokoroPipeline
        return KokoroPipeline(lang_code='a')

tts = get_tts_backend()

# Sentence-level streaming — yields audio as each sentence is ready
async def synthesize_streaming(text: str):
    for sentence in split_sentences(text):
        audio = tts.synthesize(sentence)
        yield audio
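
The snippet above calls `split_sentences` without defining it. A minimal sketch of such a helper, assuming a simple punctuation-based split is good enough (the splitter Parlor actually uses may differ):

```python
import re

# Split after ., ! or ? when followed by whitespace. Deliberately naive:
# abbreviations such as "e.g." will be split too.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Return non-empty sentences in order, with surrounding whitespace removed."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]
```

Splitting early matters for latency: the first sentence can be synthesized and played while later ones are still being processed.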

Gemma 4 E2B Inference via LiteRT-LM

# LiteRT-LM inference pattern
from litert_lm import LiteRTLM
import os

model_path = os.environ.get("MODEL_PATH", None)

# Auto-downloads if MODEL_PATH not set
model = LiteRTLM.from_pretrained(
    "google/gemma-4-E2B-it",
    local_path=model_path
)

async def run_gemma_inference(audio_pcm: bytes, image_jpeg: bytes | None = None):
    inputs = {"audio": audio_pcm}
    if image_jpeg:
        inputs["image"] = image_jpeg
    
    response = ""
    async for token in model.generate_stream(**inputs):
        response += token
    return response

Running Benchmarks

cd src

# End-to-end WebSocket latency benchmark
uv run benchmarks/bench.py

# Compare TTS backends (MLX vs ONNX)
uv run benchmarks/benchmark_tts.py

Performance Reference (Apple M3 Pro)

Stage	Time
Speech + vision understanding	~1.8–2.2s
Response generation (~25 tokens)	~0.3s
Text-to-speech (1–3 sentences)	~0.3–0.7s
Total end-to-end	~2.5–3.0s

Decode speed: ~83 tokens/sec on GPU.
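
The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers from the table and the decode speed; the variable names are introduced here for illustration.

```python
# Figures copied from the table above (Apple M3 Pro).
DECODE_TOKENS_PER_SEC = 83
RESPONSE_TOKENS = 25

# Generating ~25 tokens at ~83 tokens/sec takes about 0.30 s.
generation_s = RESPONSE_TOKENS / DECODE_TOKENS_PER_SEC

understanding_s = (1.8, 2.2)  # speech + vision understanding
tts_s = (0.3, 0.7)            # text-to-speech for 1-3 sentences

# Sum the low ends and the high ends of each stage.
best_case_s = understanding_s[0] + generation_s + tts_s[0]
worst_case_s = understanding_s[1] + generation_s + tts_s[1]

print(f"generation: {generation_s:.2f} s")
print(f"stage sum:  {best_case_s:.1f}-{worst_case_s:.1f} s")
```

The stage sum comes out slightly wider than the reported ~2.5–3.0 s total, which is plausible since the extremes of each stage need not coincide in a single request.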

Common Patterns

Extending the System Prompt

Modify the prompt in server.py to change the AI's persona or task:

SYSTEM_PROMPT = """You are a helpful language tutor. 
Respond conversationally in 1-3 sentences.
If the user makes a grammar mistake, gently correct them.
You can see through the user's camera and discuss what you observe."""

Adding a New Language for TTS

Kokoro supports multiple language codes. Set lang_code in tts.py:

# Language codes: 'a' = American English, 'b' = British English
# 'e' = Spanish, 'f' = French, 'z' = Chinese, 'j' = Japanese
pipeline = KokoroPipeline(lang_code='e')  # Spanish

Customizing VAD Sensitivity (index.html)

The Silero VAD threshold can be tuned in the frontend:

// In index.html: a lower positiveSpeechThreshold makes detection more sensitive
const vad = await MicVAD.new({
  positiveSpeechThreshold: 0.6,   // speech probability above this starts a segment; lower = triggers more easily
  negativeSpeechThreshold: 0.35,  // probability below this counts as silence and can end the segment
  minSpeechFrames: 3,             // shorter segments are discarded as misfires
  onSpeechStart: () => { /* UI feedback */ },
  onSpeechEnd: (audio) => sendAudioToServer(audio),
});
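
To see how the two thresholds interact, here is a simplified Python model of threshold hysteresis over per-frame speech probabilities. It is a sketch for intuition only: `detect_segments` is a hypothetical function, and the real browser library applies additional smoothing that this model omits.

```python
def detect_segments(probs, positive=0.6, negative=0.35, min_speech_frames=3):
    """Return (start, end) frame index pairs for detected speech segments.

    Simplified model: speech starts when a frame's probability rises above
    `positive` and ends when it falls below `negative`. Segments shorter than
    `min_speech_frames` are dropped as misfires.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        if start is None and p > positive:
            start = i  # speech onset
        elif start is not None and p < negative:
            if i - start >= min_speech_frames:
                segments.append((start, i))  # end index is exclusive
            start = None
    if start is not None and len(probs) - start >= min_speech_frames:
        segments.append((start, len(probs)))  # still speaking at end of input
    return segments
```

Values between the two thresholds keep the current state, which prevents a segment from flickering on and off when the probability hovers around a single cutoff.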

Sending Frames Programmatically (WebSocket Client Example)

import asyncio
import websockets
import json
import base64

async def send_audio_frame(audio_pcm_bytes: bytes, jpeg_bytes: bytes | None = None):
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:
        payload = {
            "audio": base64.b64encode(audio_pcm_bytes).decode(),
        }
        if jpeg_bytes:
            payload["image"] = base64.b64encode(jpeg_bytes).decode()
        
        await ws.send(json.dumps(payload))
        
        # Receive streamed audio response
        async for message in ws:
            audio_chunk = message  # raw PCM bytes
            # play or save audio_chunk
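
This client sends a JSON text message with base64 fields, whereas the simplified server pattern earlier reads binary frames, so check server.py for the wire format Parlor actually expects. Assuming the JSON shape used by this client, a receiver could decode it roughly as follows; `encode_payload` and `decode_payload` are hypothetical helpers, not part of Parlor.

```python
import base64
import json
from typing import Optional


def encode_payload(audio_pcm: bytes, jpeg: Optional[bytes] = None) -> str:
    """Build the JSON text message in the shape the client example sends."""
    payload = {"audio": base64.b64encode(audio_pcm).decode()}
    if jpeg is not None:
        payload["image"] = base64.b64encode(jpeg).decode()
    return json.dumps(payload)


def decode_payload(message: str) -> tuple[bytes, Optional[bytes]]:
    """Recover (audio_pcm, jpeg_or_None) from a JSON text message."""
    payload = json.loads(message)
    audio = base64.b64decode(payload["audio"])
    image_b64 = payload.get("image")
    jpeg = base64.b64decode(image_b64) if image_b64 is not None else None
    return audio, jpeg
```

Base64 inflates the payload by about a third, so a binary framing would be more compact; JSON is used here only because it matches the client example.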

Troubleshooting

Model download fails

# Pre-download manually via huggingface_hub
uv run python -c "
from huggingface_hub import hf_hub_download
path = hf_hub_download('google/gemma-4-E2B-it', 'gemma-4-E2B-it.litertlm')
print(path)
"
export MODEL_PATH=/path/shown/above
uv run server.py

Microphone/camera not working in browser

Open the page at http://localhost rather than a LAN IP address. Browsers expose camera and microphone APIs only in secure contexts, which means HTTPS or localhost.
Check browser permissions: address bar → lock icon → reset permissions

TTS not loading on Linux

# Ensure ONNX runtime is installed
uv add onnxruntime
# Or for GPU:
uv add onnxruntime-gpu

High latency or slow inference

Verify GPU is being used: check for Metal (Mac) or CUDA (Linux) in startup logs
Close other GPU-heavy applications
On Linux, confirm that the installed CUDA and cuDNN versions are supported by your onnxruntime-gpu version

Port already in use

export PORT=8080
uv run server.py
# Or kill the existing process:
lsof -ti:8000 | xargs kill
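
To check from Python whether a port is already taken before starting the server, a small probe like the one below can help. `is_port_free` is a hypothetical helper built on the standard `socket` module; it only tests for a TCP listener on the given host.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when a listener accepted the connection.
        return sock.connect_ex((host, port)) != 0
```

For example, `is_port_free(8000)` returning False suggests another process, possibly an earlier Parlor instance, is still bound to the default port.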

`uv sync` fails — Python version mismatch

# Parlor requires Python 3.12+
python3 --version
# Install 3.12 via pyenv or system package manager, then:
uv python pin 3.12
uv sync

Dependencies (pyproject.toml)

Key dependencies (installed by uv sync unless noted otherwise):

litert-lm — Google AI Edge inference runtime for Gemma
fastapi + uvicorn — async web/WebSocket server
kokoro — Kokoro TTS ONNX backend
kokoro-mlx — Kokoro TTS MLX backend (Mac only)
silero-vad — voice activity detection (browser-side via CDN)
huggingface-hub — model auto-download

How it compares

Choose parlor-on-device-ai for local multimodal stacks, not for hosted OpenAI or Anthropic API integrations.

FAQ

What models power parlor-on-device-ai?

parlor-on-device-ai configures Parlor with Gemma 4 E2B for on-device vision and reasoning plus Kokoro TTS for local speech output, served through a FastAPI WebSocket server for real-time multimodal interaction.

Does Parlor require cloud API keys?

No. Parlor runs fully locally with no cloud costs or API keys, which makes it suitable for on-device speech, camera vision, and WebSocket streaming on hardware such as Apple Silicon.

Is Parlor On Device Ai safe to install?

skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Tags: AI & Agent Building, agents, automation