
Gemini Image
Run Gemini vision on screenshots and assets to describe UI, OCR text, recover code from images, and compare visuals from the terminal or agent workflow.
Overview
Gemini Image is an agent skill most often used in Build (also Validate and Ship) that analyzes images with Gemini Pro vision for description, OCR, code extraction, and UI understanding via CLI prompts.
Install
npx skills add https://github.com/johnlindquist/claude --skill gemini-imageWhat is this skill?
- CLI flows: gemini -m pro -f image with single or multi-image prompts
- OCR-oriented prompts for buttons, labels, and full-screen text extraction
- Code-from-screenshot recovery with indentation and uncertainty callouts
- Structured UI analysis template (app identity, layout, errors, affordances)
- Prerequisites: pip install google-generativeai and GEMINI_API_KEY
- 5-part comprehensive image description prompt template
- 5-part structured UI analysis prompt template
Adoption & trust: 1.1k installs on skills.sh; 24 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have a screenshot or design capture but need accurate text, layout, or code context inside your agent session without manual transcription.
Who is it for?
Solo builders using Claude Code, Cursor, or Codex who already have a Gemini API key and want quick multimodal analysis from image files.
Skip if: Workflows that require batch OCR pipelines, on-device-only privacy, or pixel-perfect diff testing without an external vision API.
When should I use this skill?
Analyze images using Gemini's vision capabilities—for image analysis, text extraction from screenshots, and visual content understanding.
What do I get? / Deliverables
You get structured vision-backed answers—descriptions, OCR, or code blocks—that agents can paste into fixes, docs, or validation notes.
- OCR or description text from image inputs
- Formatted code or UI analysis blocks suitable for agent follow-up
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Builders most often install this while extending agent tooling during implementation and debug loops. Agent-tooling is the primary shelf because the skill is CLI- and API-driven vision analysis invoked alongside coding agents—not a standalone design or launch play.
Where it fits
Describe a landing-page mock and visible copy to sanity-check messaging before you build the real page.
Feed a stack-trace screenshot into Gemini to extract the error string for your coding agent.
Turn a diagram or whiteboard photo into structured bullet notes for README or internal docs.
Compare two UI captures after a release candidate to spot visual regressions described in plain language.
How it compares
Use as a lightweight vision integration skill—not a full computer-vision pipeline or hosted MCP media server.
Common Questions / FAQ
Who is gemini-image for?
Indie developers and agent users who routinely work from PNG/JPG screenshots and want Gemini Pro to interpret them from the command line.
When should I use gemini-image?
In Validate when reviewing prototype or landing mocks; in Build when extracting errors or code from screenshots; in Ship when documenting UI bugs or comparing before/after captures.
Is gemini-image safe to install?
It implies network calls and API keys for Google Gemini—never commit GEMINI_API_KEY; review the Security Audits panel on this Prism page before piping sensitive screenshots.
SKILL.md
READMESKILL.md - Gemini Image
# Gemini Image Analysis Analyze images using Gemini Pro's vision capabilities. ## Prerequisites ```bash pip install google-generativeai export GEMINI_API_KEY=your_api_key ``` ## CLI Reference ### Basic Image Analysis ```bash # Analyze an image gemini -m pro -f /path/to/image.png "Describe this image in detail" # With specific question gemini -m pro -f screenshot.png "What error message is shown?" # Multiple images gemini -m pro -f image1.png -f image2.png "Compare these two images" ``` ## Analysis Operations ### General Description ```bash gemini -m pro -f image.png "Describe this image comprehensively: 1. Main subject/content 2. Colors and composition 3. Text visible (if any) 4. Context and purpose 5. Notable details" ``` ### Extract Text (OCR) ```bash gemini -m pro -f screenshot.png "Extract all text from this image. Format as plain text, preserving layout where possible. Include any text in buttons, labels, or UI elements." ``` ### Code from Screenshot ```bash gemini -m pro -f code-screenshot.png "Extract the code from this screenshot. Provide as properly formatted code with correct indentation. Note any parts that are unclear or partially visible." ``` ### UI Analysis ```bash gemini -m pro -f ui-screenshot.png "Analyze this UI: 1. What application/website is this? 2. What page/screen is shown? 3. Main UI elements and their purpose 4. User flow/actions available 5. Any UX issues or suggestions" ``` ### Error Analysis ```bash gemini -m pro -f error-screenshot.png "Analyze this error: 1. What error is shown? 2. What is the likely cause? 3. How to fix it? 4. Any related information visible?" ``` ### Diagram Understanding ```bash gemini -m pro -f diagram.png "Explain this diagram: 1. What type of diagram is this? 2. Main components and their relationships 3. Data/process flow 4. Key takeaways" ``` ## Specific Use Cases ### Debug Screenshot ```bash gemini -m pro -f debug-screen.png "I'm debugging an issue. From this screenshot: 1. What is the current state? 2. What errors or warnings are visible? 3. What should I look at? 4. Suggested next steps" ``` ### Compare Before/After ```bash gemini -m pro -f before.png -f after.png "Compare these before and after images: 1. What changed? 2. Is this an improvement? 3. Any issues in the 'after' version? 4. Anything missing?" ``` ### Design Feedback ```bash gemini -m pro -f design.png "Provide design feedback: 1. Visual hierarchy 2. Color usage 3. Typography 4. Spacing and alignment 5. Accessibility concerns 6. Suggestions for improvement" ``` ### Data Extraction ```bash gemini -m pro -f chart.png "Extract data from this chart: 1. Chart type 2. Data series and values 3. Axes labels and ranges 4. Key trends or insights 5. Output as structured data if possible" ``` ### Form Analysis ```bash gemini -m pro -f form.png "Analyze this form: 1. Form purpose 2. Fields and their types 3. Required vs optional 4. Validation rules visible 5. UX suggestions" ``` ## Workflow Patterns ### Screenshot to Issue ```bash # Capture screenshot (macOS) screencapture -i /tmp/bug.png # Analyze and format as issue gemini -m pro -f /tmp/bug.png "Create a bug report from this screenshot: ## Summary [One-line description] ## Steps to Reproduce [Inferred from screenshot] ## Expected Behavior [What should happen] ## Actual Behavior [What the screenshot shows] ## Environment [Any visible system info]" ``` ### UI to Code ```bash gemini -m pro -f ui-design.png "Generate React component code that recreates this UI: - Use Tailwind CSS for styling - Make it responsive - Include proper TypeScript types - Add appropriate accessibility attributes" ``` ### Documentation ```bash gemini -m pro -f app-screen.png "Write user documentation for this screen: - What this screen is for - How to use each feature - Co