
MiniMax Image Understanding
Call MiniMax VLM understand_image on screenshots, diagrams, and UI photos when text context is missing or you need OCR-style extraction.
Overview
MiniMax Image Understanding is an agent skill most often used in Build (also Ship) that analyzes images via the MiniMax understand_image VLM tool.
Install
npx skills add https://github.com/imsus/pi-extension-minimax-coding-plan-mcp --skill minimax-image-understanding
What is this skill?
- Single understand_image tool: prompt plus image_url (HTTPS or data:image base64)
- POST {api_host}/v1/coding_plan/vlm with a JSON body containing prompt and image_url (the single endpoint path documented in SKILL.md)
- Documented use cases: error screenshots, UI/UX review, charts, OCR, visual debugging
- Explicit anti-patterns: skip when the image is already described, is a simple icon, has an inaccessible URL, or is redundant with existing file context
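As a rough illustration of the endpoint described above, the following sketch assembles the URL and JSON body for a call to `v1/coding_plan/vlm`. The helper name `build_vlm_request` and the host `https://api.example.com` are hypothetical, and the sketch deliberately stops before sending anything, because the authentication headers are not described in this listing.

```python
import json

# Endpoint path as listed above; the host is supplied by the caller.
VLM_PATH = "/v1/coding_plan/vlm"


def build_vlm_request(api_host: str, prompt: str, image_url: str) -> tuple[str, bytes]:
    """Return the target URL and the UTF-8 JSON body for a VLM request.

    Hypothetical helper: only the two documented fields are included.
    """
    url = api_host.rstrip("/") + VLM_PATH
    body = json.dumps({"prompt": prompt, "image_url": image_url}).encode("utf-8")
    return url, body


# Example with a placeholder host (an assumption, not a real MiniMax host).
url, body = build_vlm_request(
    "https://api.example.com",
    "Read the error message in this screenshot",
    "https://example.com/screenshot.png",
)
print(url)
print(body.decode("utf-8"))
```

In normal use the `understand_image` tool builds this request itself; the sketch only makes the shape of the payload concrete.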
Adoption & trust: 520 installs on skills.sh; 13 GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your agent cannot see the screenshot, diagram, or UI bug you are staring at, so debugging and design feedback stall.
Who is it for?
Developers using the pi-extension MiniMax MCP who routinely paste screenshots and want consistent tool invocation rules.
Skip if: Sessions where the image is already fully described, the asset is a trivial icon, or no valid image_url is available.
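The "no valid image_url" condition can be approximated before invoking the tool. The sketch below is a heuristic, assuming the formats SKILL.md lists as supported (JPEG, PNG, WebP); the function name `is_usable_image_ref` is hypothetical, and the check does not test whether a URL is actually reachable.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats SKILL.md lists as supported.
SUPPORTED_MIME = {"image/jpeg", "image/png", "image/webp"}
SUPPORTED_EXT = {".jpg", ".jpeg", ".png", ".webp"}


def is_usable_image_ref(ref: str) -> bool:
    """Heuristically decide whether ref looks like a supported image input."""
    ref = ref.strip()
    if not ref:
        return False
    if ref.startswith("data:"):
        # Data URL: "data:<mime>;base64,<payload>"
        header, _, payload = ref[5:].partition(",")
        mime = header.split(";", 1)[0].lower()
        return mime in SUPPORTED_MIME and header.endswith(";base64") and bool(payload)
    # HTTP(S) URL or local path: judge by file extension only.
    path = urlparse(ref).path if "://" in ref else ref
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXT


print(is_usable_image_ref("https://example.com/screenshot.png"))
print(is_usable_image_ref("https://example.com/logo.svg"))
```

A URL without a file extension may still serve a supported image, so a `False` result here is a prompt to double-check rather than proof that the call would fail.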
When should I use this skill?
You need to analyze, describe, or extract information from an image using the understand_image tool.
What do I get? / Deliverables
You get a text analysis from MiniMax VLM for the provided image URL or base64 payload, usable for fixes, OCR, or UX notes.
- VLM text content field summarizing or answering about the image
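To show how that content field might be consumed, here is a minimal sketch that parses a response in the format SKILL.md documents (`content` plus `base_resp`). The function name `extract_analysis` is hypothetical, and treating any non-zero `status_code` as a failure is an assumption; the source only shows the success case.

```python
import json


def extract_analysis(raw: str) -> str:
    """Return the VLM text from a raw JSON response string.

    Assumption: base_resp.status_code == 0 means success, anything else is an error.
    """
    data = json.loads(raw)
    base = data.get("base_resp", {})
    code = base.get("status_code")
    if code != 0:
        raise RuntimeError(f"VLM call failed: {code} {base.get('status_msg')}")
    return data.get("content", "")


# Sample taken from the response format shown in SKILL.md.
sample = (
    '{"content": "AI analysis of the image...", '
    '"base_resp": {"status_code": 0, "status_msg": "success"}}'
)
print(extract_analysis(sample))
```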
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Build › integrations because the skill is a thin wrapper around the MiniMax coding_plan VLM HTTP tool inside an MCP extension stack. Integrations subphase is where agent skills bind to hosted multimodal APIs rather than local-only logic.
Where it fits
Upload a layout screenshot and ask which component broke responsive spacing.
Parse a diagram of a third-party webhook flow before wiring handlers.
Interpret a production error dialog image during pre-release QA.
How it compares
This skill adds guardrails around a hosted VLM call; it is not a replacement for a prose description when the context is already complete.
Common Questions / FAQ
Who is minimax-image-understanding for?
Solo builders running Claude Code, Cursor, or similar agents with the MiniMax coding-plan MCP who need vision on demand.
When should I use minimax-image-understanding?
During Build for UI and integration debugging from screenshots, and during Ship review when visual regressions or error dialogs need interpretation.
Is minimax-image-understanding safe to install?
Review the Security Audits panel on this Prism page; the skill sends image data to MiniMax APIs so confirm keys, data policy, and MCP extension source.
SKILL.md
# MiniMax Image Understanding Skill

Use this skill when you need to analyze, describe, or extract information from images.

## How to Use

Call the `understand_image` tool directly with a prompt and image URL:

```
understand_image({
  prompt: "Your question about the image",
  image_url: "https://example.com/image.png"
})
```

## When to Use

Use `understand_image` when:

- **Screenshots**: Error messages, UI issues, code in screenshots
- **Visual content**: Photos, diagrams, charts, graphs
- **Documents**: Extracting text from images (OCR), understanding layouts
- **UI/UX analysis**: Evaluating designs, identifying components
- **Visual debugging**: Understanding visual bugs or layout issues

## When NOT to Use

Do NOT use `understand_image` when:

- **Image is already described** in the conversation
- **The image is a simple icon** or emoji you recognize
- **No image is provided** or the image URL is inaccessible
- **Redundant with existing context** (e.g., file contents already visible)

## Usage

```
understand_image({
  prompt: "What do you see in this image?",
  image_url: "https://example.com/screenshot.png"
})
```

## API Details

**Endpoint**: `POST {api_host}/v1/coding_plan/vlm`

**Request Body**:

```json
{
  "prompt": "Your question about the image",
  "image_url": "data:image/jpeg;base64,/9j/4AAQ..."
}
```

**Response Format**:

```json
{
  "content": "AI analysis of the image...",
  "base_resp": {
    "status_code": 0,
    "status_msg": "success"
  }
}
```

## Image Processing

The tool automatically handles three types of image inputs:

1. **HTTP/HTTPS URLs**: Downloads the image and converts to base64
   - Example: `https://example.com/image.jpg`
2. **Local file paths**: Reads local files and converts to base64
   - Absolute: `/Users/username/Documents/image.png`
   - Relative: `images/photo.png`
   - Removes `@` prefix if present
3. **Base64 data URLs**: Passes through existing base64 data
   - Example: `data:image/png;base64,iVBORw0KGgo...`

## Image Formats

Supported:

- **JPEG** (.jpg, .jpeg)
- **PNG** (.png)
- **WebP** (.webp)

Not supported:

- PDF, GIF, PSD, SVG, and other formats

## Crafting Effective Prompts

### For Descriptions

- "Describe what's in this image in detail"
- "What is the main subject of this image?"
- "Describe the visual style and composition"

### For Code/Technical

- "What code is shown in this screenshot?"
- "Extract all text from this image"
- "Identify the UI framework/components used"

### For Analysis

- "Analyze this UI design. What is working well and what could be improved?"
- "What emotions or mood does this image convey?"
- "Compare this design to Material Design principles"

### For OCR/Text Extraction

- "Extract all text from this image"
- "Read the error message in this screenshot"
- "What does the label say in this image?"

## Examples

### Error Analysis

```
understand_image({
  prompt: "What is the error message and where is it located in this screenshot?",
  image_url: "./error-screenshot.png"
})
```

### Code Screenshot

```
understand_image({
  prompt: "What code is shown in this screenshot? Please transcribe it exactly.",
  image_url: "https://example.com/code.png"
})
```

### Design Review

```
understand_image({
  prompt: "Analyze this UI design. What is working well and what could be improved?",
  image_url: "https://example.com/mockup.png"
})
```

### OCR

```
understand_image({
  prompt: "Extract all text from this image",
  image_url: "/Users/username/Documents/scan.png"
})
```

## Tips

1. **Be specific** in your prompt about what you want to know
2. **Mention format** if you need structured output (e.g., "list all elements")
3. **Include context** if the image is part of a larger task
4. **For screenshots**, specify if you need full-page or just a specific area
5. **Complex analysis** may trigger a confirmation prompt (analyze, extract, describe, r