
Gemini Computer Use
Wire Gemini 2.5 Computer Use preview actions to a Playwright-controlled browser so your coding agent can navigate, click, type, and screenshot-verify real web UIs.
Overview
Gemini Computer Use is an agent skill for the Build phase that connects Gemini’s computer-use preview model to Playwright so agents execute and confirm browser actions with screenshots.
Install
npx skills add https://github.com/am-will/codex-skills --skill gemini-computer-use
What is this skill?
- Requires model gemini-2.5-computer-use-preview-10-2025 with Computer Use tool
- Playwright sync client: open browser, screenshot, return function_response after each action
- Documented browser actions: navigate, click_at, type_text_at, scroll, drag_and_drop, key_combination, and more
- Safety gate: honor safety_decision require_confirmation before executing risky steps
- Env configuration for GEMINI_API_KEY and optional Chrome/Edge/Brave executables
- Model pinned to gemini-2.5-computer-use-preview-10-2025
- Default viewport 1440×900 with supported action set including open_web_browser, navigate, click_at, type_text_at, scroll
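The script shipped with the skill scales model coordinates to pixels with `int(value / 1000 * size)`. A minimal sketch of what that means at the default viewport, assuming the model reports `click_at` coordinates on a 0 to 999 grid (an inference from the division by 1000, not something this page states); `to_pixels` is a hypothetical helper added for illustration:

```python
from typing import Tuple

DEFAULT_SCREEN_WIDTH = 1440
DEFAULT_SCREEN_HEIGHT = 900


def denormalize(value: int, size: int) -> int:
    """Scale a normalized coordinate to a pixel offset (same formula as the script)."""
    return int(value / 1000 * size)


def to_pixels(
    x: int,
    y: int,
    width: int = DEFAULT_SCREEN_WIDTH,
    height: int = DEFAULT_SCREEN_HEIGHT,
) -> Tuple[int, int]:
    """Hypothetical helper: convert a normalized (x, y) pair to viewport pixels."""
    return denormalize(x, width), denormalize(y, height)


print(to_pixels(500, 500))  # (720, 450): the centre of a 1440x900 viewport
print(to_pixels(0, 0))      # (0, 0): the top-left corner
```

Because the result is truncated with `int`, the largest normalized value maps just inside the viewport edge rather than onto it.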
Adoption & trust: 1.2k installs on skills.sh; 941 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your agent can reason about web tasks but has no documented loop to run Gemini computer-use function calls and feed back live browser state.
Who is it for?
Developers building or debugging agent workflows that must interact with real websites during integration or QA-style walkthroughs.
Skip if: You run headless CI smoke tests that must not call external LLM APIs, or your workflow cannot pause for human confirmation on sensitive actions.
When should I use this skill?
You need to integrate Gemini Computer Use tool actions with a Playwright browser and return screenshots after each step.
What do I get? / Deliverables
You get a runnable Python + Playwright pattern that executes supported browser actions and returns a screenshot after each one until the model finishes the task, pausing for user confirmation on gated steps.
- Configurable browser automation runner for computer-use actions
- Documented action loop with screenshot + URL function responses
- Environment template for API key and browser channel/executable
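The action loop listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's actual implementation: function calls are modelled as plain `(name, args)` pairs rather than Gemini SDK objects, the argument names `url`, `x`, and `y` are assumed, and `FakePage` is a hypothetical stand-in for a Playwright page so the sketch runs without a browser.

```python
from typing import Any, Dict, List


class FakeMouse:
    """Hypothetical stand-in for Playwright's page.mouse."""

    def __init__(self, log: List[str]) -> None:
        self._log = log

    def click(self, x: int, y: int) -> None:
        self._log.append(f"click {x},{y}")


class FakePage:
    """Hypothetical stand-in for a Playwright page (goto, mouse.click, screenshot)."""

    def __init__(self) -> None:
        self.url = "about:blank"
        self.log: List[str] = []
        self.mouse = FakeMouse(self.log)

    def goto(self, url: str) -> None:
        self.url = url
        self.log.append(f"goto {url}")

    def screenshot(self) -> bytes:
        return b"fake-png-bytes"


def denormalize(value: int, size: int) -> int:
    return int(value / 1000 * size)


def execute_action(page: Any, name: str, args: Dict[str, Any],
                   width: int = 1440, height: int = 900) -> Dict[str, Any]:
    """Run one model-issued action, then report the current URL and a screenshot."""
    if name == "navigate":
        page.goto(args["url"])  # assumed argument name
    elif name == "click_at":
        page.mouse.click(denormalize(args["x"], width), denormalize(args["y"], height))
    else:
        # Unknown or unimplemented action: report it instead of guessing.
        return {"name": name, "response": {"error": f"unsupported action: {name}",
                                           "url": page.url}}
    return {"name": name, "response": {"url": page.url}, "screenshot": page.screenshot()}


page = FakePage()
calls = [("navigate", {"url": "https://example.com"}), ("click_at", {"x": 500, "y": 500})]
responses = [execute_action(page, name, args) for name, args in calls]
print(page.log)
```

Returning an explicit error for unsupported actions keeps the model informed instead of silently skipping a step; in a real run each returned dictionary would be wrapped in the SDK's function-response type before being sent back.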
Journey fit
Agent-tooling is the canonical shelf because the skill implements the client-side execution loop for model-issued browser actions, not a one-off marketing or ship checklist. Computer-use tooling extends what agents can do during product build—automating flows, demos, and UI verification through executable function responses.
How it compares
This is a Gemini API + Playwright integration skill—not a packaged MCP browser server or a no-code RPA recorder.
Common Questions / FAQ
Who is gemini-computer-use for?
Solo builders and agent integrators using Codex-style repos who need a reference implementation for Gemini Computer Use with Playwright.
When should I use gemini-computer-use?
During build when prototyping agent-driven browser automation, validating staging UIs, or wiring function_response screenshot loops for the preview computer-use model.
Is gemini-computer-use safe to install?
Treat API keys and browser control as high trust: review Security Audits on this page, never commit GEMINI_API_KEY, and require user confirmation when safety_decision demands it.
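A minimal sketch of the confirmation gate described here. It assumes the model attaches `safety_decision` to the function call's arguments as a mapping with `decision` and `explanation` keys; that shape is an assumption, so check it against the current Gemini documentation. The names `needs_confirmation` and `confirm_or_skip` are hypothetical.

```python
from typing import Any, Callable, Dict


def needs_confirmation(args: Dict[str, Any]) -> bool:
    """True when the (assumed) safety_decision field asks for user confirmation."""
    decision = args.get("safety_decision") or {}
    return decision.get("decision") == "require_confirmation"


def confirm_or_skip(args: Dict[str, Any], ask: Callable[[str], str] = input) -> bool:
    """Return True if the action may run; prompt the user only when the model requires it."""
    if not needs_confirmation(args):
        return True
    explanation = args["safety_decision"].get("explanation", "no explanation given")
    answer = ask(f"The model flagged this step: {explanation}\nProceed? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


# Usage with a canned answer instead of real keyboard input.
flagged = {
    "x": 500,
    "y": 500,
    "safety_decision": {"decision": "require_confirmation", "explanation": "Submits a form"},
}
print(confirm_or_skip(flagged, ask=lambda prompt: "n"))
print(confirm_or_skip({"x": 10, "y": 10}, ask=lambda prompt: "n"))
```

Defaulting to "no" on anything other than an explicit yes keeps an accidental Enter key press from approving a risky step.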
SKILL.md - Gemini Computer Use
# Copy to env.sh and source it before running.
export GEMINI_API_KEY=""

# Optional: Use a Playwright browser channel (e.g., chrome, msedge).
# Leave empty to use Playwright's bundled Chromium.
export COMPUTER_USE_BROWSER_CHANNEL=""

# Optional: Point to a Chromium-based browser executable (e.g., Brave).
# Takes precedence over COMPUTER_USE_BROWSER_CHANNEL if set.
export COMPUTER_USE_BROWSER_EXECUTABLE=""

# Gemini Computer Use Notes

- Model: `gemini-2.5-computer-use-preview-10-2025` (required when using the Computer Use tool).
- The model emits `function_call` actions that must be executed client-side.
- After each action, return a `function_response` with the latest screenshot + URL.
- If a response includes `safety_decision: require_confirmation`, you must ask the user to confirm before executing the action.

Supported actions (browser environment):
- open_web_browser
- wait_5_seconds
- go_back
- go_forward
- search
- navigate
- click_at
- hover_at
- type_text_at
- key_combination
- scroll_document
- scroll_at
- drag_and_drop

#!/usr/bin/env python3
import argparse
import os
import sys
import time
from typing import Any, Dict, List, Tuple

from playwright.sync_api import sync_playwright
from google import genai
from google.genai import types
from google.genai.types import Content, Part

MODEL_NAME = "gemini-2.5-computer-use-preview-10-2025"
DEFAULT_START_URL = "https://www.google.com"
DEFAULT_SCREEN_WIDTH = 1440
DEFAULT_SCREEN_HEIGHT = 900

SUPPORTED_ACTIONS = {
    "open_web_browser",
    "wait_5_seconds",
    "go_back",
    "go_forward",
    "search",
    "navigate",
    "click_at",
    "hover_at",
    "type_text_at",
    "key_combination",
    "scroll_document",
    "scroll_at",
    "drag_and_drop",
}


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run a Gemini Computer Use browser automation loop via Playwright.",
    )
    parser.add_argument("--prompt", required=True, help="User goal to send to the model")
    parser.add_argument(
        "--start-url",
        default=DEFAULT_START_URL,
        help=f"Initial page to load (default: {DEFAULT_START_URL})",
    )
    parser.add_argument("--turn-limit", type=int, default=6, help="Max turns")
    parser.add_argument(
        "--headless",
        action="store_true",
        help="Run browser in headless mode",
    )
    parser.add_argument(
        "--screen-width",
        type=int,
        default=DEFAULT_SCREEN_WIDTH,
        help="Viewport width in pixels",
    )
    parser.add_argument(
        "--screen-height",
        type=int,
        default=DEFAULT_SCREEN_HEIGHT,
        help="Viewport height in pixels",
    )
    parser.add_argument(
        "--exclude",
        action="append",
        default=[],
        help="Exclude predefined Computer Use actions (can repeat)",
    )
    return parser.parse_args()


def require_env() -> str:
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        print("Missing GEMINI_API_KEY. Export it before running.", file=sys.stderr)
        sys.exit(1)
    return api_key


def denormalize(value: int, size: int) -> int:
    return int(value / 1000 * size)


def normalize_keys(keys: str) -> str:
    mapping = {
        "ctrl": "Control",
        "control": "Control",
        "cmd": "Meta",
        "command": "Meta",
        "meta": "Meta",
        "alt": "Alt",
        "shift": "Shift",
        "enter": "Enter",
        "return": "Enter",
        "tab": "Tab",
        "backspace": "Backspace",
        "delete": "Delete",
        "esc": "Escape",
        "escape": "Escape",
        "space": "Space",
    }
    parts = [p.strip() for p in keys.split("+")]
    normalized_parts = []
    for part in parts:
        lower = part.lower()
        if lower in mapping:
            normalized_parts.append(mapping[lower])
        elif len(part) == 1:
            normalized_parts.append(part.upper())
        else:
            normalized_parts.append(part.capitalize())
    return "