
Parlor On-Device AI
Stand up a local FastAPI WebSocket voice-and-vision assistant with Gemma 4 E2B and Kokoro TTS—no API keys and no per-request cloud cost.
Overview
Parlor On-Device AI is an agent skill for the Build phase that configures a local FastAPI WebSocket multimodal assistant using Gemma 4 E2B and Kokoro TTS without cloud API calls.
Install
npx skills add https://github.com/aradotso/trending-skills --skill parlor-on-device-ai
What is this skill?
- FastAPI WebSocket server bridging browser PCM audio and JPEG frames to Gemma 4 E2B via LiteRT-LM
- Kokoro TTS with MLX on Apple Silicon and ONNX on Linux
- Browser Silero VAD for hands-free capture and barge-in interruption
- Sentence-level TTS streaming so playback starts before the full reply is generated
- Fully local stack—no API keys, no cloud inference, no usage billing per request
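The sentence-level TTS streaming listed above can be sketched roughly as follows. This is a minimal illustration rather than Parlor's actual code: `split_sentences` is a hypothetical helper, the boundary rule (`.`, `!` or `?` followed by whitespace) is a deliberately naive assumption, and the token stream is simulated with a plain list where a real server would consume model output.

```python
import re
from typing import Iterable, Iterator

# Hypothetical boundary rule: ., ! or ? followed by whitespace ends a sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part is still incomplete; keep accumulating.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


# Simulated model output arriving in small pieces.
tokens = ["Hi", " there", ". ", "I can", " see a", " red mug", "! ", "Want", " more?"]
sentences = list(split_sentences(tokens))
```

Each yielded sentence could be handed to the TTS backend immediately, so playback of the first sentence can begin while later ones are still being generated. A production splitter would also need to handle abbreviations and decimals, which this rule would split incorrectly.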
Adoption & trust: 536 installs on skills.sh; 31 GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want a real-time voice-and-vision copilot, but cloud APIs add cost and latency and send your data off the device.
Who is it for?
Builders on Apple Silicon or Linux with GPU headroom who need hands-free VAD, barge-in, and camera-aware dialogue without API keys.
Skip if: you only need managed, hosted speech APIs, or you cannot install the local Gemma/Kokoro model stacks and WebSocket infrastructure.
When should I use this skill?
Triggers include set up parlor on-device AI, run local voice AI with camera, configure parlor multimodal assistant, use Gemma 4 with Kokoro TTS locally, build real-time voice assistant on device, parlor websocket voice v
What do I get? / Deliverables
You run a browser-to-FastAPI pipeline with on-device speech and vision understanding and streamed Kokoro TTS replies, suitable for local demos and private assistant prototypes.
- Running FastAPI WebSocket multimodal server
- Browser client with mic, camera, VAD, playback, and transcript UI
Journey fit
Parlor is an integration-heavy build artifact (local server, models, browser client), so it is filed under build → integrations, ahead of any launch or growth work. The skill wires the browser mic and camera to a FastAPI WebSocket backend with on-device inference and TTS, which is classic product integration work for a multimodal assistant.
How it compares
Use for self-hosted multimodal WebSocket assistants instead of wiring OpenAI Realtime or other cloud-only voice APIs.
Common Questions / FAQ
Who is parlor-on-device-ai for?
Solo developers building private or offline-capable voice-and-vision assistants who want Gemma 4 E2B plus Kokoro TTS on their own machine via FastAPI and WebSockets.
When should I use parlor-on-device-ai?
In the build phase when integrating a real-time assistant—e.g. setting up Parlor on-device AI, running local voice AI with camera, or configuring the Parlor websocket voice-vision server on Apple Silicon.
Is parlor-on-device-ai safe to install?
It runs local models and opens WebSocket servers on your machine; review the Security Audits panel on this Prism page and lock down network exposure before sharing beyond localhost.
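While prototyping, one way to limit network exposure is to bind the server to the loopback interface only. The sketch below is an assumption about how the app could be launched, not Parlor's own startup code: `bind_settings` and `serve` are hypothetical helpers, `PORT` is the variable documented in SKILL.md, and the `"server:app"` import string assumes `server.py` defines a FastAPI instance named `app` that is served with uvicorn.

```python
import os
from typing import Mapping


def bind_settings(env: Mapping[str, str]) -> tuple[str, int]:
    """Return (host, port): loopback only, port taken from PORT or 8000."""
    port = int(env.get("PORT", "8000"))
    if not 0 < port < 65536:
        raise ValueError(f"PORT out of range: {port}")
    # 127.0.0.1 accepts connections from this machine only;
    # 0.0.0.0 would expose the WebSocket server to the whole network.
    return "127.0.0.1", port


def serve() -> None:
    """Start the app on loopback (assumes uvicorn is installed)."""
    import uvicorn  # imported lazily so bind_settings works without it

    host, port = bind_settings(os.environ)
    # Assumption: server.py exposes a FastAPI instance called `app`.
    uvicorn.run("server:app", host=host, port=port)
```

Calling `serve()` from the `src` directory would then listen on `127.0.0.1` only. Exposing the assistant to other machines would need a deliberate change of host plus whatever authentication or reverse proxy you choose to put in front of it.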
SKILL.md
# Parlor On-Device AI

> Skill by [ara.so](https://ara.so) (Daily 2026 Skills collection).

Parlor is a real-time, on-device multimodal AI assistant. It combines Gemma 4 E2B (via LiteRT-LM) for speech and vision understanding with Kokoro TTS for voice output. Everything runs locally: no API keys, no cloud calls, no cost per request.

## Architecture

```
Browser (mic + camera)
    │
    │  WebSocket (audio PCM + JPEG frames)
    ▼
FastAPI server
    ├── Gemma 4 E2B via LiteRT-LM (GPU)  →  understands speech + vision
    └── Kokoro TTS (MLX on Mac, ONNX on Linux)  →  speaks back
    │
    │  WebSocket (streamed audio chunks)
    ▼
Browser (playback + transcript)
```

Key features:

- **Silero VAD** in browser: hands-free, no push-to-talk
- **Barge-in**: interrupt AI mid-sentence by speaking
- **Sentence-level TTS streaming**: audio starts before full response is ready
- **Platform-aware TTS**: MLX backend on Apple Silicon, ONNX on Linux

## Requirements

- Python 3.12+
- macOS with Apple Silicon **or** Linux with a supported GPU
- ~3 GB free RAM
- [`uv`](https://github.com/astral-sh/uv) package manager

## Installation

```bash
git clone https://github.com/fikrikarim/parlor.git
cd parlor

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

cd src
uv sync
uv run server.py
```

Open [http://localhost:8000](http://localhost:8000), grant camera and microphone permissions, and start talking.

Models download automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).
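
Before running the installation steps, the requirements listed above can be checked with a short script. This is a sketch rather than part of Parlor: `check_requirements` is a hypothetical helper that only tests the Python version, the platform, and whether `uv` is on `PATH`. It does not verify GPU support or free RAM.

```python
import platform
import shutil
import sys


def check_requirements(
    version: tuple[int, int] = sys.version_info[:2],
    system: str = platform.system(),
    machine: str = platform.machine(),
    has_uv: bool = shutil.which("uv") is not None,
) -> list[str]:
    """Return a list of unmet requirements (an empty list means OK)."""
    problems = []
    if version < (3, 12):
        problems.append(f"Python 3.12+ required, found {version[0]}.{version[1]}")
    # Native Python on Apple Silicon reports Darwin/arm64
    # (a Rosetta-translated interpreter reports x86_64 instead).
    apple_silicon = system == "Darwin" and machine == "arm64"
    if not (apple_silicon or system == "Linux"):
        problems.append(f"Unsupported platform: {system} {machine}")
    if not has_uv:
        problems.append("uv not found on PATH")
    return problems


# Report anything missing on the current machine.
for problem in check_requirements():
    print("MISSING:", problem)
```

The parameters default to values detected on the current machine, so calling `check_requirements()` with no arguments checks the local environment, while passing explicit values makes the function easy to exercise in isolation.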

## Configuration

Set environment variables before running:

```bash
# Use a pre-downloaded model instead of auto-downloading
export MODEL_PATH=/path/to/gemma-4-E2B-it.litertlm

# Change server port (default: 8000)
export PORT=9000

uv run server.py
```

| Variable     | Default                         | Description                          |
|--------------|---------------------------------|--------------------------------------|
| `MODEL_PATH` | auto-download from Hugging Face | Path to local `.litertlm` model file |
| `PORT`       | `8000`                          | Server port                          |

## Project Structure

```
src/
├── server.py        # FastAPI WebSocket server + Gemma 4 inference
├── tts.py           # Platform-aware TTS (MLX on Mac, ONNX on Linux)
├── index.html       # Frontend UI (VAD, camera, audio playback)
├── pyproject.toml   # Dependencies
└── benchmarks/
    ├── bench.py           # End-to-end WebSocket benchmark
    └── benchmark_tts.py   # TTS backend comparison
```

## Key Components

### server.py: FastAPI WebSocket Server

The server handles WebSocket traffic in both directions: it receives audio and video from the browser and streams audio back.

```python
# Simplified pattern from server.py
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    # run_gemma_inference and run_tts are placeholders for the real
    # inference and TTS helpers, which are not shown here.
    async for data in websocket.iter_bytes():
        # data contains PCM audio + optional JPEG frame
        response_text = await run_gemma_inference(data)
        audio_chunks = await run_tts(response_text)
        for chunk in audio_chunks:
            await websocket.send_bytes(chunk)
```

### tts.py: Platform-Aware TTS

Kokoro TTS selects its backend based on the platform:

```python
# tts.py uses platform detection
import platform

def get_tts_backend():