Gemini Computer Use

Name: Gemini Computer Use
Author: am-will

am-will/codex-skills

Wire Gemini 2.5 Computer Use preview actions to a Playwright-controlled browser so your coding agent can navigate, click, type, and screenshot-verify real web UIs.

Overview

Gemini Computer Use is an agent skill for the Build phase that connects Gemini’s computer-use preview model to Playwright so agents execute and confirm browser actions with screenshots.

Install

npx skills add https://github.com/am-will/codex-skills --skill gemini-computer-use

What is this skill?

Requires the model gemini-2.5-computer-use-preview-10-2025 with the Computer Use tool enabled; the script pins this model name
Playwright sync client: opens a browser, takes a screenshot, and returns a function_response after each action
Supported browser actions: open_web_browser, navigate, click_at, hover_at, type_text_at, key_combination, scroll_document, scroll_at, drag_and_drop, and more
Safety gate: honors safety_decision require_confirmation before executing risky steps
Env configuration for GEMINI_API_KEY and optional Chrome/Edge/Brave executables
Default viewport of 1440×900

Compatible agents: Codex, Claude Code, Cursor, any compatible agent

Adoption & trust: 1.2k installs on skills.sh; 941 GitHub stars; 2/3 security scanners passed (skills.sh audits).

What problem does it solve?

Your agent can reason about web tasks but has no documented loop to run Gemini computer-use function calls and feed back live browser state.

Who is it for?

Developers building or debugging agent workflows that must interact with real websites during integration or QA-style walkthroughs.

Skip if: you are writing headless CI smoke tests that must not call external LLM APIs or that cannot pause for human confirmation on sensitive actions.

When should I use this skill?

You need to integrate Gemini Computer Use tool actions with a Playwright browser and return screenshots after each step.
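As a rough illustration of that integration, the sketch below shows the shape of a turn-limited loop: ask the model for actions, execute them, and record the browser state after each step. The names `run_loop`, `next_actions`, `execute`, and `observe` are hypothetical stand-ins, not the skill's actual functions. The model and the browser are passed in as plain callables so the sketch runs without an API key. The default of 6 turns mirrors the script's `--turn-limit` default.

```python
from typing import Any, Callable, Dict, List, Tuple

Action = Tuple[str, Dict[str, Any]]  # (action name, arguments)


def run_loop(
    next_actions: Callable[[List[Dict[str, Any]]], List[Action]],
    execute: Callable[[str, Dict[str, Any]], None],
    observe: Callable[[], Dict[str, Any]],
    turn_limit: int = 6,
) -> List[Dict[str, Any]]:
    """Run up to turn_limit model turns and return the recorded step results."""
    history: List[Dict[str, Any]] = []
    for _ in range(turn_limit):
        actions = next_actions(history)
        if not actions:
            break  # no function calls requested: treat the task as finished
        for name, args in actions:
            execute(name, args)
            # Pair each executed action with the browser state seen afterwards.
            history.append({"name": name, **observe()})
    return history


# Scripted stand-in for the model: two turns of actions, then nothing.
script = iter([
    [("navigate", {"url": "https://example.com"})],
    [("click_at", {"x": 500, "y": 500})],
    [],
])
steps = run_loop(
    next_actions=lambda history: next(script),
    execute=lambda name, args: None,
    observe=lambda: {"url": "https://example.com", "screenshot": b""},
)
print([step["name"] for step in steps])
```

In a real run, `next_actions` would call the Gemini API and `observe` would capture a Playwright screenshot and the current page URL.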

What do I get? / Deliverables

You get a runnable Python + Playwright pattern that executes supported browser actions and returns a screenshot after each one until the model finishes the task, pausing for user confirmation on gated steps.

Configurable browser automation runner for computer-use actions
Documented action loop with screenshot + URL function responses
Environment template for API key and browser channel/executable

Journey fit

Primary fit

Build: Agent skills & templates

This skill belongs under agent tooling because it implements the client-side execution loop for model-issued browser actions rather than a one-off marketing or ship checklist. Computer-use tooling extends what agents can do during product build: automating flows, demos, and UI verification through executable function responses.

Also useful

Ship: Testing & QA

How it compares

This is a Gemini API + Playwright integration skill—not a packaged MCP browser server or a no-code RPA recorder.

Common Questions / FAQ

Who is gemini-computer-use for?

Solo builders and agent integrators using Codex-style repos who need a reference implementation for Gemini Computer Use with Playwright.

When should I use gemini-computer-use?

During build when prototyping agent-driven browser automation, validating staging UIs, or wiring function_response screenshot loops for the preview computer-use model.

Is gemini-computer-use safe to install?

Treat API keys and browser control as high trust: review Security Audits on this page, never commit GEMINI_API_KEY, and require user confirmation when safety_decision demands it.

SKILL.md

README | SKILL.md - Gemini Computer Use

# Copy to env.sh and source it before running.
export GEMINI_API_KEY=""

# Optional: Use a Playwright browser channel (e.g., chrome, msedge).
# Leave empty to use Playwright's bundled Chromium.
export COMPUTER_USE_BROWSER_CHANNEL=""

# Optional: Point to a Chromium-based browser executable (e.g., Brave).
# Takes precedence over COMPUTER_USE_BROWSER_CHANNEL if set.
export COMPUTER_USE_BROWSER_EXECUTABLE=""
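One way to read the precedence rule in this template is sketched below: a helper that turns the two optional variables into keyword arguments for Playwright's `chromium.launch()`. The helper name `browser_launch_kwargs` is hypothetical and is not taken from the skill's script. `channel`, `executable_path`, and `headless` are real launch parameters in Playwright for Python.

```python
from typing import Any, Dict, Mapping


def browser_launch_kwargs(env: Mapping[str, str], headless: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for Playwright's chromium.launch() from env vars."""
    kwargs: Dict[str, Any] = {"headless": headless}
    executable = env.get("COMPUTER_USE_BROWSER_EXECUTABLE", "").strip()
    channel = env.get("COMPUTER_USE_BROWSER_CHANNEL", "").strip()
    if executable:
        # An explicit executable wins over a channel, as the template comments state.
        kwargs["executable_path"] = executable
    elif channel:
        kwargs["channel"] = channel
    # Neither set: Playwright falls back to its bundled Chromium.
    return kwargs


print(browser_launch_kwargs({"COMPUTER_USE_BROWSER_CHANNEL": "msedge"}))
```

With Playwright installed, the result could be passed as `playwright.chromium.launch(**browser_launch_kwargs(os.environ))`.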


# Gemini Computer Use Notes

- Model: `gemini-2.5-computer-use-preview-10-2025` (required when using the Computer Use tool).
- The model emits `function_call` actions that must be executed client-side.
- After each action, return a `function_response` with the latest screenshot + URL.
- If a response includes `safety_decision: require_confirmation`, you must ask the user to confirm before executing the action.
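The notes above can be sketched as a small safety gate plus a response builder. This is a minimal illustration using plain dictionaries instead of the `google.genai` types. The function names are hypothetical, and two details are assumptions: the nested shape of `safety_decision`, and the `safety_acknowledgement` field sent back after the user confirms. Check both against the current Gemini API documentation before relying on them.

```python
from typing import Any, Callable, Dict, Optional


def needs_confirmation(args: Dict[str, Any]) -> bool:
    """True when the call's arguments carry a require_confirmation safety decision."""
    decision = args.get("safety_decision")
    if isinstance(decision, dict):  # assumed shape: {"decision": ..., "explanation": ...}
        decision = decision.get("decision")
    return decision == "require_confirmation"


def gate_action(
    name: str,
    args: Dict[str, Any],
    confirm: Callable[[str, Dict[str, Any]], bool],
) -> Optional[Dict[str, str]]:
    """Return extra response fields if the action may run, or None if declined."""
    if not needs_confirmation(args):
        return {}
    if confirm(name, args):
        return {"safety_acknowledgement": "true"}  # assumed acknowledgement field
    return None


def build_function_response(
    name: str, url: str, screenshot_png: bytes, extra: Dict[str, str]
) -> Dict[str, Any]:
    """Plain-dict stand-in for a function_response carrying screenshot + URL."""
    return {
        "name": name,
        "response": {"url": url, **extra},
        "screenshot_png": screenshot_png,
    }


risky_args = {
    "x": 400,
    "y": 300,
    "safety_decision": {
        "decision": "require_confirmation",
        "explanation": "Clicking a purchase button",
    },
}
extra = gate_action("click_at", risky_args, confirm=lambda name, args: True)
if extra is not None:
    reply = build_function_response("click_at", "https://example.com", b"", extra)
    print(reply["response"])
```

In an interactive runner, `confirm` would prompt the user, and a `None` result would mean the action is skipped rather than executed.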

Supported actions (browser environment):
- open_web_browser
- wait_5_seconds
- go_back
- go_forward
- search
- navigate
- click_at
- hover_at
- type_text_at
- key_combination
- scroll_document
- scroll_at
- drag_and_drop


#!/usr/bin/env python3
import argparse
import os
import sys
import time
from typing import Any, Dict, List, Tuple

from playwright.sync_api import sync_playwright
from google import genai
from google.genai import types
from google.genai.types import Content, Part

MODEL_NAME = "gemini-2.5-computer-use-preview-10-2025"
DEFAULT_START_URL = "https://www.google.com"
DEFAULT_SCREEN_WIDTH = 1440
DEFAULT_SCREEN_HEIGHT = 900

SUPPORTED_ACTIONS = {
    "open_web_browser",
    "wait_5_seconds",
    "go_back",
    "go_forward",
    "search",
    "navigate",
    "click_at",
    "hover_at",
    "type_text_at",
    "key_combination",
    "scroll_document",
    "scroll_at",
    "drag_and_drop",
}


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run a Gemini Computer Use browser automation loop via Playwright.",
    )
    parser.add_argument("--prompt", required=True, help="User goal to send to the model")
    parser.add_argument(
        "--start-url",
        default=DEFAULT_START_URL,
        help=f"Initial page to load (default: {DEFAULT_START_URL})",
    )
    parser.add_argument("--turn-limit", type=int, default=6, help="Max turns")
    parser.add_argument(
        "--headless",
        action="store_true",
        help="Run browser in headless mode",
    )
    parser.add_argument(
        "--screen-width",
        type=int,
        default=DEFAULT_SCREEN_WIDTH,
        help="Viewport width in pixels",
    )
    parser.add_argument(
        "--screen-height",
        type=int,
        default=DEFAULT_SCREEN_HEIGHT,
        help="Viewport height in pixels",
    )
    parser.add_argument(
        "--exclude",
        action="append",
        default=[],
        help="Exclude predefined Computer Use actions (can repeat)",
    )
    return parser.parse_args()


def require_env() -> str:
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        print("Missing GEMINI_API_KEY. Export it before running.", file=sys.stderr)
        sys.exit(1)
    return api_key


def denormalize(value: int, size: int) -> int:
    return int(value / 1000 * size)


def normalize_keys(keys: str) -> str:
    mapping = {
        "ctrl": "Control",
        "control": "Control",
        "cmd": "Meta",
        "command": "Meta",
        "meta": "Meta",
        "alt": "Alt",
        "shift": "Shift",
        "enter": "Enter",
        "return": "Enter",
        "tab": "Tab",
        "backspace": "Backspace",
        "delete": "Delete",
        "esc": "Escape",
        "escape": "Escape",
        "space": "Space",
    }
    parts = [p.strip() for p in keys.split("+")]
    normalized_parts = []
    for part in parts:
        lower = part.lower()
        if lower in mapping:
            normalized_parts.append(mapping[lower])
        elif len(part) == 1:
            normalized_parts.append(part.upper())
        else:
            normalized_parts.append(part.capitalize())
    return "+".join(normalized_parts)
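To make the two helpers concrete, here is a short worked example. It assumes the normalized key parts are rejoined with "+", which is how Playwright writes key combinations. `denormalize` uses the same arithmetic as the script, mapping a coordinate on a 0-1000 grid to viewport pixels. `normalize_combo` and `SPECIAL_KEYS` are hypothetical additions: multi-word key names such as "pagedown" would come out of `str.capitalize()` as "Pagedown", while Playwright expects "PageDown", so the sketch looks those up explicitly.

```python
def denormalize(value: int, size: int) -> int:
    """Scale a 0-1000 grid value to pixels, truncating toward zero."""
    return int(value / 1000 * size)


# Subset of the script's modifier mapping, enough for the examples below.
MODIFIERS = {"ctrl": "Control", "control": "Control", "cmd": "Meta",
             "command": "Meta", "alt": "Alt", "shift": "Shift", "enter": "Enter"}

# Hypothetical extension: key names that str.capitalize() would get wrong.
SPECIAL_KEYS = {"pagedown": "PageDown", "pageup": "PageUp",
                "arrowleft": "ArrowLeft", "arrowright": "ArrowRight",
                "arrowup": "ArrowUp", "arrowdown": "ArrowDown"}


def normalize_combo(keys: str) -> str:
    """Normalize a combo like 'ctrl+a' into Playwright's 'Control+A' form."""
    normalized = []
    for part in (p.strip() for p in keys.split("+")):
        lower = part.lower()
        if lower in MODIFIERS:
            normalized.append(MODIFIERS[lower])
        elif lower in SPECIAL_KEYS:
            normalized.append(SPECIAL_KEYS[lower])
        elif len(part) == 1:
            normalized.append(part.upper())
        else:
            normalized.append(part.capitalize())
    return "+".join(normalized)


# Centre of the default 1440x900 viewport.
print(denormalize(500, 1440), denormalize(500, 900))  # 720 450
print(normalize_combo("ctrl+a"))                      # Control+A
print(normalize_combo("cmd + shift + t"))             # Meta+Shift+T
print(normalize_combo("pagedown"))                    # PageDown
```

Because `int()` truncates, a grid value of 999 on a 900-pixel axis maps to pixel 899, so results stay inside the viewport.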

