
Computer Automation
Drive native desktop and Electron UIs from natural language when browser DOM automation is not available.
Overview
Computer-automation is an agent skill for the Build phase that drives local macOS, Windows, and Linux desktops, or a remote Windows desktop over RDP, from screenshots with Midscene.js, running one synchronous screenshot-analyze-act command at a time.
Install
npx skills add https://github.com/web-infra-dev/midscene-skills --skill computer-automation
What is this skill?
- Vision-driven control from screenshots—no DOM or accessibility tree required
- Local macOS, Windows, and Linux desktops plus remote Windows over RDP
- Strict one-command-at-a-time synchronous loop; never run Midscene in the background
- Documented hard rule: prefer Browser Automation for web apps; desktop skill only for native/Electron/RDP
- Powered by Midscene.js with Bash as the allowed tool surface
- 5 critical workflow rules in SKILL.md, led by: no background Midscene runs; one command at a time
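The one-command-at-a-time rule can be pictured as a small Bash wrapper. This is only a sketch: `run_step` is a hypothetical helper, not part of Midscene, and it takes whatever single command the agent has chosen, so no Midscene subcommands or flags are assumed.

```bash
# Hypothetical helper (not part of Midscene): run exactly ONE command in the
# foreground, then report its exit status and duration on stderr.
run_step() {
  local start=$SECONDS status=0
  # No trailing '&' and no chaining: the call blocks until the command exits,
  # so its output can be read before the next action is chosen.
  "$@" || status=$?
  echo "exit=$status elapsed=$((SECONDS - start))s" >&2
  return "$status"
}
```

The agent would call it once per action and inspect the result before issuing another call. The elapsed-time line is there to help confirm that the agent's command timeout leaves room for the roughly one minute SKILL.md says a typical command needs.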
Adoption & trust: 745 installs on skills.sh; 240 GitHub stars; 0/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need to test or operate desktop-native or Electron apps where DOM-based browser automation cannot see or control the interface.
Who is it for?
Solo builders automating Electron, Qt, or native desktop apps, or a remote Windows box over RDP, who accept the risk described in SKILL.md that local mode takes over the real mouse and keyboard.
Skip if: Teams shipping standard web apps in a browser; SKILL.md explicitly prefers Browser Automation there and warns against background or parallel Midscene runs.
When should I use this skill?
Triggers include open app, desktop, Electron test, mouse click, keyboard shortcut, screen capture, RDP, or remote Windows—when browser automation cannot apply.
What do I get? / Deliverables
After the skill runs, the agent completes ordered Midscene desktop actions with verified screen state instead of guessing UI state from logs alone.
- Completed desktop action sequences driven by Midscene command output and screenshots
- Verified on-screen UI state after each synchronous step
Journey fit
Fits Build because it extends the coding agent with vision-driven desktop control for apps that never run in a browser. Agent-tooling is the canonical shelf for Midscene-style skills that add synchronous screenshot-analyze-act loops to the agent toolchain.
How it compares
Use instead of Playwright-style browser skills when the UI is not in a DOM—at the cost of controlling the real desktop input devices.
Common Questions / FAQ
Who is computer-automation for?
Indie developers and agent users who must interact with desktop-native or Electron UIs, or a Windows server via RDP, when screenshot-driven control is the only viable path.
When should I use computer-automation?
During the Build phase, under agent-tooling, when a trigger matches (open app, desktop click, Electron test, RDP). Do not use it for routine web flows, where Browser Automation is the documented default.
Is computer-automation safe to install?
Review the Security Audits panel on this Prism page before enabling Bash execution; local mode can capture screens and drive your real mouse and keyboard.
SKILL.md
# Desktop Computer Automation

> **CRITICAL RULES (VIOLATIONS WILL BREAK THE WORKFLOW):**
>
> 1. **Never run midscene commands in the background.** Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
> 2. **Run only one midscene command at a time.** Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
> 3. **Allow enough time for each command to complete.** Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex `act` commands may need even longer.
> 4. **Always report task results before finishing.** After completing the automation task, you MUST proactively summarize the results to the user, including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
> 5. **Only minimize windows, never close them unless explicitly asked.** When you need to dismiss or get a window out of the way, minimize it instead of closing it. Do not close any app or window unless the user explicitly asks you to do so.

Control your desktop (macOS, Windows, Linux) using `npx -y @midscene/computer@1`. Each CLI command maps directly to an MCP tool; you (the AI agent) act as the brain, deciding which actions to take based on screenshots.

## What `act` Can Do

Inside a single `act` call on desktop, Midscene can move the mouse, click, double-click, right-click, drag items, type or clear text, scroll, press single keys or keyboard shortcuts, and work through multi-step interactions on whatever is visible on the selected display.

## Prerequisites

Midscene requires models with strong visual grounding capabilities.
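A pre-flight check along these lines may catch a missing model setting before the first command runs. It is a sketch under stated assumptions: `midscene_missing_vars` is a hypothetical helper, not part of Midscene, and it only looks for the four required `MIDSCENE_MODEL_*` names this section covers, either exported in the shell or assigned in a `.env` file.

```bash
# Hypothetical helper (not part of Midscene): print the name of each required
# variable that is neither exported in the environment nor assigned in the
# given .env file (default: ./.env). Prints nothing when all four are found.
midscene_missing_vars() {
  local env_file="${1:-.env}" name
  for name in MIDSCENE_MODEL_API_KEY MIDSCENE_MODEL_NAME \
              MIDSCENE_MODEL_BASE_URL MIDSCENE_MODEL_FAMILY; do
    # Exported and non-empty in the current environment?
    if [ -n "$(printenv "$name")" ]; then
      continue
    fi
    # Otherwise accept a NAME=value line (optionally prefixed by 'export').
    if [ -f "$env_file" ] &&
       grep -Eq "^[[:space:]]*(export[[:space:]]+)?${name}=." "$env_file"; then
      continue
    fi
    echo "$name"
  done
}
```

Calling `midscene_missing_vars` with no argument checks `./.env`, and empty output means all four names were found. The `.env` match is a rough line-based test and does not validate the values themselves.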
The following environment variables must be configured, either as system environment variables or in a `.env` file in the current working directory (Midscene loads `.env` automatically):

```bash
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
```

Example: Gemini (Gemini-3-Flash)

```bash
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
```

Example: Qwen 3.5

```bash
MIDSCENE_MODEL_API_KEY="your-aliyun-api-key"
MIDSCENE_MODEL_NAME="qwen3.5-plus"
MIDSCENE_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MIDSCENE_MODEL_FAMILY="qwen3.5"
MIDSCENE_MODEL_REASONING_ENABLED="false"
# If using OpenRouter, set:
# MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
# MIDSCEN