
Gemma Dev
Spin up a local streaming Gradio chat UI on Google Gemma instruction-tuned models for quick model smoke-tests and demos.
Overview
gemma-dev is an agent skill for the Build phase that provides a Gradio plus transformers streaming chat template for Google Gemma models.
Install
npx skills add https://github.com/google-gemma/gemma-skills --skill gemma-dev
What is this skill?
- Hugging Face text-generation pipeline with device_map="auto" and dtype="auto"
- TextIteratorStreamer plus background Thread for token-by-token Gradio yields
- Maps Gradio chat history into role/content messages for chat templates
- GenerationConfig max_new_tokens=256 with swappable model_id (default google/gemma-4-E2B-it)
- gr.ChatInterface scaffold titled Gemma Chatbot for one-file local demos
- Default model_id google/gemma-4-E2B-it
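The history-mapping step in the list above can be sketched in isolation. This is a minimal illustration rather than the skill's exact code: `to_chat_messages` is a hypothetical helper name, and the history shape (dicts with `role` and `content`, where `content` is either a string or a list of typed parts) is assumed from the comment in SKILL.md.

```python
from typing import Any


def to_chat_messages(history: list[dict[str, Any]], message: str) -> list[dict[str, str]]:
    """Flatten Gradio-style history into plain role/content messages."""
    messages = []
    for msg in history:
        content = msg["content"]
        if isinstance(content, list):
            # Keep only the text parts, e.g. [{"type": "text", "text": "hello"}]
            content = "".join(
                part["text"] for part in content if part.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})
    # The new user turn always goes last.
    messages.append({"role": "user", "content": message})
    return messages


history = [
    {"role": "user", "content": [{"type": "text", "text": "Hi"}]},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
print(to_chat_messages(history, "Tell me a joke"))
```

Keeping this step as a pure function makes it easy to check without loading any model weights.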
Adoption & trust: 1 install on skills.sh; 529 GitHub stars.
What problem does it solve?
You want to talk to a Gemma model locally but do not have a small, copy-paste chat app that streams tokens correctly.
Who is it for?
Solo builders validating Gemma behavior on their machine with a few dozen lines of Python.
Skip if: your team needs production inference, autoscaling, billing, or hardened security around model access.
When should I use this skill?
When you need a minimal streaming Gradio chat wired to a Google Gemma Hugging Face checkpoint for local experimentation.
What do I get? / Deliverables
You get a runnable Gradio ChatInterface wired to Gemma with streamed replies so you can iterate on prompts and model IDs before harder integration work.
- Runnable Gradio ChatInterface script targeting a Gemma model_id
- Streaming chat loop with history-aware message formatting
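The streaming chat loop follows a producer/consumer pattern: generation runs in a background thread while the chat function yields the accumulated reply after each new chunk. The sketch below shows that pattern with the standard library only. `fake_generate` is a hypothetical stand-in for the model call, and a `Queue` plays the role that `TextIteratorStreamer` plays in the real script.

```python
from queue import Queue
from threading import Thread
from typing import Iterator

_DONE = object()  # sentinel marking the end of generation


def fake_generate(tokens: list[str], out: Queue) -> None:
    """Hypothetical stand-in for the model call: pushes chunks, then the sentinel."""
    for token in tokens:
        out.put(token)
    out.put(_DONE)


def stream_reply(tokens: list[str]) -> Iterator[str]:
    """Yield the accumulated reply after every chunk."""
    out: Queue = Queue()
    worker = Thread(target=fake_generate, args=(tokens, out))
    worker.start()
    text = ""
    # Block on the queue until the producer signals completion.
    while (item := out.get()) is not _DONE:
        text += item
        yield text
    worker.join()


print(list(stream_reply(["Hel", "lo", " there"])))
# → ['Hel', 'Hello', 'Hello there']
```

Each yield carries the full text so far rather than only the newest chunk, because a streaming Gradio chat function replaces the displayed message with every yielded value.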
Journey fit
The canonical shelf is Build because the artifact is runnable inference UI code, not distribution or ops tooling. The agent-tooling category fits because the skill covers Hugging Face pipelines, token streaming, and chat-shaped agent prototypes.
How it compares
Use as a starter script rather than a managed inference API or full agent framework.
Common Questions / FAQ
Who is gemma-dev for?
Indie developers and agent builders who already use Python and want the fastest path to a streaming Gemma chat demo on Hugging Face stacks.
When should I use gemma-dev?
During Validate when you prototype model UX, and during Build when you wire agent-tooling or compare Gemma checkpoints before production APIs.
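When comparing checkpoints, one low-friction option is to read the model ID from the environment instead of editing the script each time. The sketch below assumes a hypothetical `GEMMA_MODEL_ID` variable and a hypothetical `resolve_model_id` helper; neither is part of the skill, which hard-codes `model_id`.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL_ID = "google/gemma-4-E2B-it"  # default used by the skill's script


def resolve_model_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the model ID from GEMMA_MODEL_ID (hypothetical name), else the default."""
    env = os.environ if env is None else env
    value = env.get("GEMMA_MODEL_ID", "").strip()
    return value or DEFAULT_MODEL_ID


print(resolve_model_id({}))
print(resolve_model_id({"GEMMA_MODEL_ID": "google/gemma-4-31B-it"}))
```

The returned string can then be passed as `model=` when the pipeline is created.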
Is gemma-dev safe to install?
Review the Security Audits panel on this Prism page and treat downloaded weights and local GPU access as sensitive before running in shared environments.
SKILL.md
import gradio as gr
from transformers import pipeline, TextIteratorStreamer, GenerationConfig
from threading import Thread

# Load the pipeline
# Replace "google/gemma-4-E2B-it" with other available models
model_id = "google/gemma-4-E2B-it"
pipe = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
    dtype="auto",
)

def chat(message, history):
    messages = []

    # Add conversation history
    for msg in history:
        role = msg["role"]
        # Extract text from the content list (e.g. [{'text': 'hello', 'type': 'text'}])
        if isinstance(msg["content"], list):
            content_text = "".join([item["text"] for item in msg["content"] if item["type"] == "text"])
        else:
            content_text = msg["content"]
        messages.append({"role": role, "content": content_text})

    # Add current user message
    messages.append({"role": "user", "content": message})

    streamer = TextIteratorStreamer(pipe.tokenizer, skip_prompt=True, skip_special_tokens=True)
    config = GenerationConfig(max_new_tokens=256)

    thread = Thread(target=pipe, args=(messages,), kwargs=dict(
        generation_config=config,
        streamer=streamer
    ))
    thread.start()

    # Generate response
    generated_text = ""
    for new_text in streamer:
        generated_text += new_text
        yield generated_text

# Create the ChatInterface
demo = gr.ChatInterface(
    fn=chat,
    title="Gemma Chatbot",
    description="Ask Gemma anything!",
)

if __name__ == "__main__":
    demo.launch()

import { pipeline, TextStreamer } from '@huggingface/transformers';
import cliProgress from 'cli-progress';
import inquirer from 'inquirer';

let generator;

async function initializeGemma() {
  console.log('Initializing Gemma model...');
  const progressBar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
  progressBar.start(100, 0);

  generator = await pipeline('text-generation', 'onnx-community/gemma-4-E2B-it-ONNX', {
    device: 'webgpu',
    dtype: 'q4',
    progress_callback: (progress) => {
      progressBar.update(progress.progress);
    },
  });

  progressBar.stop();
  console.log('Gemma model initialized!');
}

async function* generate(question) {
  const messages = [
    {role: 'user', content: question}
  ];

  const prompt = generator.tokenizer.apply_chat_template(messages, {
    tokenize: false,
    add_generation_prompt: true,
  });

  const streamer = new TextStreamer(generator.tokenizer, {
    skip_prompt: true, // Don't stream the user's prompt back
    skip_special_tokens: true,
  });

  await generator(prompt, {
    max_new_tokens: 256,
    streamer: streamer,
  });
}

async function main() {
  console.clear();
  await initializeGemma();

  while (true) {
    const { question } = await inquirer.prompt({
      type: 'input',
      name: 'question',
      message: "Ask Gemma anything:",
    });

    if (question.toLowerCase() === 'exit') {
      console.log('See you!');
      break;
    }

    console.log('\nGemma: ');
    for await (const chunk of generate(question)) {
      console.log(chunk);
    }
    console.log('\n');
  }
}

main().catch(err => {
  console.error('An error occurred:', err);
});

import os
from transformers import AutoTokenizer
from google.cloud import aiplatform

PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")
LOCATION = os.environ.get("GOOGLE_CLOUD_LOCATION")
ENDPOINT_ID = os.environ.get("GOOGLE_CLOUD_ENDPOINT_ID")
MODEL_ID = "google/gemma-4-31B-it"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def predict_gemma(project: str, endpoint_id: str, prompt: str, location: str = "us-central1"):
    # Initialize the Vertex AI client
    aiplatform.init(project=project, location=location)

    # Reference the deployed endpoint
    endpoint = aiplatform.Endpoint(endpoint_id)

    # F