Desktop Computer Automation

Name: Desktop Computer Automation
Author: web-infra-dev

web-infra-dev/midscene-skills

Run vision-driven desktop UI tests on Electron, Qt, or native apps—and remote Windows over RDP—when browser automation cannot see the screen.

Overview

Desktop Computer Automation is an agent skill, used mostly in the Ship phase and also in Build, that drives macOS, Windows, or Linux desktops, as well as Windows hosts over RDP, from screenshots and natural-language instructions via Midscene.

Install

npx skills add https://github.com/web-infra-dev/midscene-skills --skill desktop-computer-automation

What is this skill?

Vision-driven control from screenshots—no DOM or accessibility tree required
Local desktop on macOS, Windows, and Linux plus remote Windows via RDP
Explicit guardrail: prefer Browser Automation for web apps; reserve this for desktop-native stacks
CRITICAL: run one synchronous Midscene command at a time and never in the background, so the screenshot-analyze-act loop stays intact
Natural-language triggers: open app, click on screen, keyboard shortcuts, window switch, screen capture

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 2.9k installs on skills.sh; 240 GitHub stars; 1/3 security scanners passed (skills.sh audits).

What problem does it solve?

You need to test or operate a desktop-native app that has no reliable DOM, and browser automation cannot click what users actually see on screen.

Who is it for?

Indie builders validating Electron or native desktop builds, or exercising a remote Windows desktop over RDP from an agent session.

Skip if: you are testing standard web apps in a browser (use browser automation), you need unattended multi-command fan-out, or taking over the user's live keyboard and mouse is unacceptable in your environment.

When should I use this skill?

Triggers include open app, press key, desktop, click on screen, type text, screenshot desktop, launch application, switch window, desktop automation, rdp, remote desktop, test Electron app.

What do I get? / Deliverables

Your agent completes synchronous screenshot-analyze-act loops to open apps, type, click, and verify windows—without breaking the workflow by running Midscene commands in parallel or in the background.

Executed desktop UI steps with screenshot-backed verification
Documented synchronous command sequence following Midscene output

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Desktop takeover automation is primarily a Ship concern: validating desktop-native builds before release, with secondary use while wiring agent-driven integration tests in Build. End-to-end verification of non-web UIs fits the testing shelf; the skill drives real mouse, keyboard, and screenshots rather than DOM selectors.

Also useful

Build: Integrations & version control

Where it fits

Ship: Testing & QA
Example use: Run through installer, login, and settings flows in an Electron app before tagging a release.

Build: Integrations & version control
Example use: Wire an agent workflow that opens a native design tool and exports assets when no API exists.

Operate: Iteration & experiments
Example use: Reproduce a customer-reported desktop-only bug by replaying clicks on the exact window layout they described.

How it compares

Vision-based desktop control skill—not DOM-based browser automation and not a headless-only CI harness unless you isolate the machine.

Common Questions / FAQ

Who is desktop-computer-automation for?

Solo builders and small teams who ship desktop-native software and want Claude Code, Cursor, or similar agents to drive the real UI from screenshots when web drivers fall short.

When should I use desktop-computer-automation?

Use it in Ship testing for Electron, Qt, or native E2E checks, and in Build integrations when wiring agent tooling for desktop workflows. Use RDP mode when the target is a remote Windows host. Do not use it for routine marketing sites in Chrome.

Is desktop-computer-automation safe to install?

It allows Bash and can control your live desktop—review the Security Audits panel on this page, run in VMs when possible, and never background Midscene commands per the skill’s critical rules.

SKILL.md

# Desktop Computer Automation

> **CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:**
>
> 1. **Never run midscene commands in the background.** Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
> 2. **Run only one midscene command at a time.** Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
> 3. **Allow enough time for each command to complete.** Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex `act` commands may need even longer.
> 4. **Always report task results before finishing.** After completing the automation task, you MUST proactively summarize the results to the user — including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
> 5. **Only minimize windows, never close them unless explicitly asked.** When you need to dismiss or get a window out of the way, minimize it instead of closing it. Do not close any app or window unless the user explicitly asks you to do so.

Control your desktop (macOS, Windows, Linux) using `npx -y @midscene/computer@1`. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
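
The first two critical rules (foreground only, one command at a time) could be enforced with a small wrapper like the sketch below. `run_step` and `LOCK_DIR` are hypothetical names introduced here for illustration; they are not part of Midscene or this skill, and the lock is only a simple guard, not a full process manager.

```bash
# Hypothetical wrapper (not part of Midscene): run exactly one command in the
# foreground and refuse to start while another wrapped command is running.
LOCK_DIR="${TMPDIR:-/tmp}/midscene-step.lock"

run_step() {
  # mkdir is atomic, so the directory doubles as a simple lock.
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "another step is still running; wait for it to finish" >&2
    return 1
  fi
  # Foreground execution: no trailing '&', so the command's output is
  # complete before the caller decides on the next action.
  "$@"
  rc=$?
  rmdir "$LOCK_DIR"
  return "$rc"
}

# Intended use, where <command> is a placeholder for whichever CLI command
# the skill documents:
#   run_step npx -y @midscene/computer@1 <command>
```

If a step is interrupted, the lock directory may be left behind and must be removed by hand. The wrapper does not impose a timeout, so the time allowance in rule 3 is left to the calling agent.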

## What `act` Can Do

Inside a single `act` call on desktop, Midscene can move the mouse, click, double-click, right-click, drag items, type or clear text, scroll, press single keys or keyboard shortcuts, and work through multi-step interactions on whatever is visible on the selected display.

## Prerequisites

Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a `.env` file in the current working directory (Midscene loads `.env` automatically):

```bash
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
```
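
Before starting a session, it may help to confirm that the four variables above are actually set. The following is a minimal sketch, assuming the values are exported in the current shell; `missing_midscene_vars` is a hypothetical helper, not a Midscene command. It does not read a `.env` file, so values that live only there will be reported as missing even though Midscene would load them.

```bash
# Hypothetical helper (not part of Midscene): print the names of required
# variables that are unset or empty in the current shell environment.
# The body runs in a subshell, so its temporary variables do not leak.
missing_midscene_vars() (
  missing=""
  for name in MIDSCENE_MODEL_API_KEY MIDSCENE_MODEL_NAME \
              MIDSCENE_MODEL_BASE_URL MIDSCENE_MODEL_FAMILY; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  # Strip the leading space; empty output means everything is set.
  echo "${missing# }"
)
```

A caller can treat any non-empty output as a reason to stop and ask the user for the missing configuration before running a Midscene command.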

Example: Gemini (Gemini-3-Flash)

```bash
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
```

Example: Qwen 3.5

```bash
MIDSCENE_MODEL_API_KEY="your-aliyun-api-key"
MIDSCENE_MODEL_NAME="qwen3.5-plus"
MIDSCENE_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MIDSCENE_MODEL_FAMILY="qwen3.5"
MIDSCENE_MODEL_REASONING_ENABLED="false"
# If using OpenRouter, set:
# MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
# MIDSCEN
```
