Computer Automation

Name: Computer Automation
Author: web-infra-dev

web-infra-dev/midscene-skills

Drive native desktop and Electron UIs from natural language when browser DOM automation is not available.

Overview

Computer-automation is an agent skill for the Build phase. It uses Midscene.js to drive local macOS, Windows, and Linux desktops, or a remote Windows desktop over RDP, from screenshots, running one synchronous screenshot-analyze-act command at a time.

Install

npx skills add https://github.com/web-infra-dev/midscene-skills --skill computer-automation

What is this skill?

Vision-driven control from screenshots—no DOM or accessibility tree required
Local macOS, Windows, and Linux desktops plus remote Windows over RDP
Strict one-command-at-a-time synchronous loop; never run Midscene in the background
Documented hard rule: prefer Browser Automation for web apps; desktop skill only for native/Electron/RDP
Powered by Midscene.js with Bash as the allowed tool surface
5 critical workflow rules in SKILL.md, led by: no background Midscene runs; one command at a time

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 745 installs on skills.sh; 240 GitHub stars; 0/3 security scanners passed (skills.sh audits).

What problem does it solve?

You need to test or operate desktop-native or Electron apps where DOM-based browser automation cannot see or control the interface.

Who is it for?

Solo builders automating Electron, Qt, or native desktop apps, or a remote Windows box over RDP, who accept that local mode takes over the real screen, mouse, and keyboard.

Skip if: Teams shipping standard web apps in a browser; SKILL.md explicitly prefers Browser Automation there and warns against background or parallel Midscene runs.

When should I use this skill?

Triggers include open app, desktop, Electron test, mouse click, keyboard shortcut, screen capture, RDP, or remote Windows—when browser automation cannot apply.

What do I get? / Deliverables

After the skill runs, the agent completes ordered Midscene desktop actions with verified screen state instead of guessing UI state from logs alone.

Completed desktop action sequences driven by Midscene command output and screenshots
Verified on-screen UI state after each synchronous step

Recommended Skills

Agent Browser (vercel-labs/agent-browser): 428k installs, 35.5k stars

agent-browser is a Node-installed browser automation CLI built for AI agents that need dependable programmatic web inter…

Lark IM (larksuite/cli): 210k installs, 13.7k stars

Lark IM is a Larksuite agent skill that exposes Feishu/Lark instant messaging to Claude Code, Cursor, and similar agents…

Lark Calendar (larksuite/cli): 209k installs, 13.7k stars

lark-calendar is an agent skill for Feishu/Lark Calendar v4 exposed via lark-cli. Solo builders and small teams who alre…

Lark Sheets (larksuite/cli): 209k installs, 13.7k stars

Skill for programmatic Feishu spreadsheet and worksheet management—create tables, bulk data IO, lookup, and export—using…

Lark VC (larksuite/cli): 208k installs, 13.7k stars

lark-vc is an agent skill for Feishu/Lark video conferencing history and artifacts through lark-cli. After calls end, so…

Lark Contact (larksuite/cli): 208k installs, 13.7k stars

CLI skill for Lark directory lookup: search employees and fetch metadata by open_id, with clear boundaries vs IM, calend…

Journey fit

Primary fit

Build: Agent skills & templates

Fits Build because it extends the coding agent with vision-driven desktop control for apps that never run in a browser. Agent-tooling is the canonical shelf for Midscene-style skills that add synchronous screenshot-analyze-act loops to the agent toolchain.

Also useful

Ship: Testing & QA

How it compares

Use instead of Playwright-style browser skills when the UI is not in a DOM—at the cost of controlling the real desktop input devices.

Common Questions / FAQ

Who is computer-automation for?

Indie developers and agent users who must interact with desktop-native or Electron UIs, or a Windows server via RDP, when screenshot-driven control is the only viable path.

When should I use computer-automation?

Use it during the Build phase when a trigger matches (open app, desktop click, Electron test, RDP). Do not use it for routine web flows, where Browser Automation is the documented default.

Is computer-automation safe to install?

Review the Security Audits panel on this Prism page before enabling Bash execution; local mode can capture screens and drive your real mouse and keyboard.

SKILL.md

SKILL.md - Computer Automation

# Desktop Computer Automation

> **CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:**
>
> 1. **Never run midscene commands in the background.** Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
> 2. **Run only one midscene command at a time.** Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
> 3. **Allow enough time for each command to complete.** Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex `act` commands may need even longer.
> 4. **Always report task results before finishing.** After completing the automation task, you MUST proactively summarize the results to the user — including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
> 5. **Only minimize windows, never close them unless explicitly asked.** When you need to dismiss or get a window out of the way, minimize it instead of closing it. Do not close any app or window unless the user explicitly asks you to do so.

Control your desktop (macOS, Windows, Linux) using `npx -y @midscene/computer@1`. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
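
The synchronous, one-command-at-a-time rules can be sketched as a small shell wrapper. This is only an illustration, not part of the skill: `run_step` and `STEP_TIMEOUT` are hypothetical names, and the sketch assumes the `timeout` utility from GNU coreutils is installed (it is not available by default on macOS). The wrapper runs exactly one command in the foreground, blocks until it exits, and returns its exit status.

```bash
# Hypothetical helper: run exactly one command in the foreground with a
# generous time limit, so its output can be read before the next step.
# STEP_TIMEOUT is an assumed setting in seconds; the default of 180
# leaves room beyond the roughly 1 minute a typical command needs.
run_step() {
  local limit="${STEP_TIMEOUT:-180}"
  # No trailing '&': the call blocks until the command exits or times out.
  timeout "$limit" "$@"
  local status=$?
  # GNU timeout exits with 124 when the time limit was reached.
  if [ "$status" -eq 124 ]; then
    echo "step timed out after ${limit}s: $*" >&2
  fi
  return "$status"
}
```

A call would pass the whole Midscene command line as arguments, for example `run_step npx -y @midscene/computer@1` followed by whichever subcommand and arguments the skill documents. Because the function never backgrounds the command, the next step cannot start until the previous output is available to read.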

## What `act` Can Do

Inside a single `act` call on desktop, Midscene can move the mouse, click, double-click, right-click, drag items, type or clear text, scroll, press single keys or keyboard shortcuts, and work through multi-step interactions on whatever is visible on the selected display.

## Prerequisites

Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a `.env` file in the current working directory (Midscene loads `.env` automatically):

```bash
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
```
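
Before the first command, it can help to confirm that all four variables are visible to child processes. The following is a minimal sketch under stated assumptions: `missing_midscene_vars` is a hypothetical helper, not part of Midscene, and it inspects only the exported shell environment, so values that live solely in a `.env` file will be reported as missing even though Midscene would load them.

```bash
# Hypothetical pre-flight check: print the name of each required variable
# that is unset or empty in the exported environment, one name per line.
missing_midscene_vars() {
  local name
  for name in MIDSCENE_MODEL_API_KEY MIDSCENE_MODEL_NAME \
              MIDSCENE_MODEL_BASE_URL MIDSCENE_MODEL_FAMILY; do
    # printenv prints the value of an exported variable, or nothing if unset.
    if [ -z "$(printenv "$name")" ]; then
      echo "$name"
    fi
  done
}
```

If the function prints nothing, all four variables are exported and non-empty; otherwise each printed name still needs a value.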

Example: Gemini (Gemini-3-Flash)

```bash
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
```

Example: Qwen 3.5

```bash
MIDSCENE_MODEL_API_KEY="your-aliyun-api-key"
MIDSCENE_MODEL_NAME="qwen3.5-plus"
MIDSCENE_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MIDSCENE_MODEL_FAMILY="qwen3.5"
MIDSCENE_MODEL_REASONING_ENABLED="false"
# If using OpenRouter, set:
# MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
# MIDSCEN
```
