
Desktop Computer Automation
Run vision-driven desktop UI tests on Electron, Qt, or native apps—and remote Windows over RDP—when browser automation cannot see the screen.
Overview
Desktop Computer Automation is an agent skill most often used in Ship (also Build) that drives macOS, Windows, or Linux desktops—and RDP Windows hosts—from screenshots and natural language via Midscene.
Install
npx skills add https://github.com/web-infra-dev/midscene-skills --skill desktop-computer-automation
What is this skill?
- Vision-driven control from screenshots—no DOM or accessibility tree required
- Local desktop on macOS, Windows, and Linux plus remote Windows via RDP
- Explicit guardrail: prefer Browser Automation for web apps; reserve this for desktop-native stacks
- CRITICAL: run Midscene commands synchronously and one at a time; never background the screenshot-analyze-act loop
- Natural-language triggers: open app, click on screen, keyboard shortcuts, window switch, screen capture
Adoption & trust: 2.9k installs on skills.sh; 240 GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need to test or operate a desktop-native app that has no reliable DOM, and browser automation cannot click what users actually see on screen.
Who is it for?
Indie builders validating Electron or native desktop builds, or exercising a remote Windows desktop over RDP from an agent session.
Skip if: you are testing standard web apps in a browser (use browser automation), you need unattended multi-command fan-out, or taking over the user's live keyboard and mouse is unacceptable in your environment.
When should I use this skill?
Triggers include open app, press key, desktop, click on screen, type text, screenshot desktop, launch application, switch window, desktop automation, rdp, remote desktop, test Electron app.
What do I get? / Deliverables
Your agent completes synchronous screenshot-analyze-act loops to open apps, type, click, and verify windows—without breaking the workflow by running Midscene commands in parallel or in the background.
- Executed desktop UI steps with screenshot-backed verification
- Documented synchronous command sequence following Midscene output
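The synchronous, one-command-at-a-time pattern can be sketched as a small shell wrapper. This is a minimal illustration, not part of the skill: `run_step` is a hypothetical helper name, and the `echo` call stands in for a real Midscene invocation so the sketch runs without Midscene installed.

```bash
# Hypothetical helper (not part of Midscene): run exactly one command in the
# foreground, wait for it to exit, and report its exit status.
run_step() {
  local label="$1"
  shift
  printf '[start] %s\n' "$label"
  local status=0
  "$@" || status=$?   # no trailing '&': the call blocks until the command exits
  printf '[exit %d] %s\n' "$status" "$label"
  return "$status"
}

# Stand-in command; a real step would pass a Midscene CLI invocation instead.
run_step "demo step" echo "stand-in for one Midscene command"
```

The wrapper is meant to be called once per step, with the output (including any screenshot) read before the next call is issued. Scripting several steps back to back would defeat the purpose of the loop.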
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Desktop takeover automation is primarily a Ship concern: validating desktop-native builds before release, with secondary use while wiring agent-driven integration tests in Build. End-to-end verification of non-web UIs fits the testing shelf; the skill drives real mouse, keyboard, and screenshots rather than DOM selectors.
Where it fits
Run through installer, login, and settings flows in an Electron app before tagging a release.
Wire an agent workflow that opens a native design tool and exports assets when no API exists.
Reproduce a customer-reported desktop-only bug by replaying clicks on the exact window layout they described.
How it compares
This is a vision-based desktop control skill, not DOM-based browser automation. It is also not a headless-only CI harness unless you isolate the machine it runs on.
Common Questions / FAQ
Who is desktop-computer-automation for?
Solo builders and small teams who ship desktop-native software and want Claude Code, Cursor, or similar agents to drive the real UI from screenshots when web drivers fall short.
When should I use desktop-computer-automation?
Use it in Ship testing for Electron, Qt, or native E2E checks, and in Build integrations when wiring agent tooling for desktop workflows. Use RDP mode when the target is a remote Windows host. Do not use it for routine marketing sites in Chrome.
Is desktop-computer-automation safe to install?
The skill is allowed to run Bash and can control your live desktop. Review the Security Audits panel on this page, run it in a VM when possible, and never background Midscene commands, per the skill's critical rules.
SKILL.md
# Desktop Computer Automation

> **CRITICAL RULES - VIOLATIONS WILL BREAK THE WORKFLOW:**
>
> 1. **Never run midscene commands in the background.** Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
> 2. **Run only one midscene command at a time.** Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
> 3. **Allow enough time for each command to complete.** Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex `act` commands may need even longer.
> 4. **Always report task results before finishing.** After completing the automation task, you MUST proactively summarize the results to the user, including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
> 5. **Only minimize windows, never close them unless explicitly asked.** When you need to dismiss or get a window out of the way, minimize it instead of closing it. Do not close any app or window unless the user explicitly asks you to do so.

Control your desktop (macOS, Windows, Linux) using `npx -y @midscene/computer@1`. Each CLI command maps directly to an MCP tool; you (the AI agent) act as the brain, deciding which actions to take based on screenshots.

## What `act` Can Do

Inside a single `act` call on desktop, Midscene can move the mouse, click, double-click, right-click, drag items, type or clear text, scroll, press single keys or keyboard shortcuts, and work through multi-step interactions on whatever is visible on the selected display.

## Prerequisites

Midscene requires models with strong visual grounding capabilities.
The following environment variables must be configured, either as system environment variables or in a `.env` file in the current working directory (Midscene loads `.env` automatically):

```bash
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
```

Example: Gemini (Gemini-3-Flash)

```bash
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
```

Example: Qwen 3.5

```bash
MIDSCENE_MODEL_API_KEY="your-aliyun-api-key"
MIDSCENE_MODEL_NAME="qwen3.5-plus"
MIDSCENE_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MIDSCENE_MODEL_FAMILY="qwen3.5"
MIDSCENE_MODEL_REASONING_ENABLED="false"
# If using OpenRouter, set:
# MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
# MIDSCEN