
Browser Automation
Run vision-driven Midscene browser automation from your agent to navigate sites, scrape data, fill forms, and QA freshly built UI without brittle DOM selectors.
Overview
browser-automation is an agent skill most often used in Ship (also Build and Validate) that runs vision-driven Midscene browser steps from screenshots to navigate, scrape, interact, and QA web UI.
Install
npx skills add https://github.com/web-infra-dev/midscene-skills --skill browser-automation
What is this skill?
- Vision-driven Midscene.js automation from screenshots—no DOM or accessibility labels required
- Three run modes: default headless Puppeteer, CDP attach, and Bridge mode for an existing Chrome (does not hijack mouse/k
- 3 critical workflow rules: never run Midscene in the background, one command at a time, allow each command to finish bef
- Bash-invoked flows for browse, scrape, forms, clicks, screenshots, and multi-step web workflows
- Connect to the user's Chrome over CDP (the Chrome DevTools Protocol, via remote debugging) when local inspection matters
- 3 critical workflow rules (no background runs, one command at a time, allow completion)
- 3 browser connection modes: headless Puppeteer, CDP, and Bridge
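The workflow rules in the list above (no background runs, one command at a time, allow each command to finish) can be sketched as a small shell wrapper. This is a minimal sketch, not part of the skill: `run_step` and `STEP_TIMEOUT` are hypothetical names, the 120-second default is an illustrative assumption based on the skill's note that a typical command needs about a minute, and it assumes the GNU coreutils `timeout` utility is installed.

```shell
# Hypothetical helper (not part of the skill): run exactly one command in the
# foreground and wait for it to exit before returning.
# STEP_TIMEOUT is an assumed variable name; 120 seconds is an illustrative default.
run_step() {
  local limit="${STEP_TIMEOUT:-120}"
  local rc=0
  # No trailing '&': the shell blocks here until the command exits,
  # or until `timeout` stops it after $limit seconds.
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    # GNU timeout exits with status 124 when the time limit is reached.
    echo "step exceeded ${limit}s: $*" >&2
  fi
  return "$rc"
}
```

One possible use is to prefix each Midscene invocation, for example `run_step npx -y @midscene/web@1` followed by the subcommand and arguments taken from the skill's own documentation, then read the output before deciding on the next call. Because the function never backgrounds the command, a second step cannot start until the first has exited.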
Adoption & trust: 3.4k installs on skills.sh; 240 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need to exercise real websites and fresh UI flows, but DOM selectors, manual clicking, and flaky automation make scraping, forms, and post-build verification slow and unreliable.
Who is it for?
Solo builders shipping web apps who want agent-run browser QA, scraping, and form workflows using Midscene headless Puppeteer or attached Chrome via CDP/Bridge.
Skip if: you need always-on production monitoring, work only with non-web clients, or want fully unattended parallel browser farms. The skill explicitly forbids background and chained Midscene runs.
When should I use this skill?
User wants to browse or navigate web pages, scrape or extract site data, fill forms or click UI, verify or QA frontend behavior, take screenshots, automate multi-step web workflows, test what was just built, or connect t
What do I get? / Deliverables
Your agent completes synchronous Midscene browser commands—navigation, interaction, extraction, and screenshots—with observable output after each step so you can validate UI behavior and finish multi-step web tasks confi
- Browser screenshots from each synchronous Midscene step
- Scraped or extracted web data from completed flows
- Documented pass/fail observations from UI verification runs
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
The skill’s headline triggers are verify, validate, test, and QA frontend behavior in a browser—work that solo builders do most often right before or after shipping changes. Testing is the canonical shelf because the documented loop is screenshot-analyze-act for checking whether pages and flows behave correctly, not for ideation or long-term ops monitoring.
Where it fits
Walk a staging landing page with headless Midscene to confirm the signup form submits before you scope the full product.
After implementing a settings screen, run one synchronous command at a time to click toggles and capture screenshots for your agent to compare against the spec.
Regression-test checkout and navigation paths in headless Puppeteer before you tag a release.
Collect page screenshots through Bridge mode attached to Chrome as visual evidence during a pre-launch review.
Reproduce a customer-reported UI bug in the browser and save screenshots to attach to a fix prompt.
How it compares
Choose this vision-first Midscene skill for screenshot-driven web actions. It is not a generic Playwright script generator, nor an MCP browser server that you leave running in parallel.
Common Questions / FAQ
Who is browser-automation for?
It is for solo and indie builders using Claude Code, Cursor, or Codex who want their agent to browse, scrape, test, and screenshot real web UIs via Midscene without writing selector-heavy automation.
When should I use browser-automation?
Use it during Validate to click through a prototype landing page, during Build to test what you just shipped to staging, and during Ship to QA forms, navigation, and multi-step flows—with headless Puppeteer or your Chrome over CDP/Bridge.
Is browser-automation safe to install?
It invokes Bash and drives a real browser with network access, so review what URLs and credentials your agent uses; check the Security Audits panel on this Prism page before trusting it in production repos.
SKILL.md
# Browser Automation

> **CRITICAL RULES: VIOLATIONS WILL BREAK THE WORKFLOW**
>
> 1. **Never run midscene commands in the background.** Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
> 2. **Run only one midscene command at a time.** Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
> 3. **Allow enough time for each command to complete.** Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex `act` commands may need even longer.
> 4. **Always report task results before finishing.** After completing the automation task, you MUST proactively summarize the results to the user, including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.

Automate web browsing using `npx -y @midscene/web@1`. By default, this launches a headless Chrome via Puppeteer that **persists across CLI calls**, so there is no session loss between commands. It also supports **CDP mode** and **Bridge mode** to connect to an existing Chrome browser.

## What `act` Can Do

Inside a single `act` call in the browser, Midscene can click, right-click, double-click, hover, type or clear text, press keys, scroll, drag, long-press, and continue through multi-step page flows based on what is currently visible. When touch input is enabled, it can also handle swipe- or pinch-style interactions on touch-oriented pages.

## When to Use

This skill has three modes. Choose based on the user's intent:

### Mode Selection Guide

| Mode | When to use | How it works |
|------|------------|-------------|
| **Puppeteer (default)** | User wants to browse a URL, scrape data, or test UI, with no need for their own browser | Launches a new headless Chrome, isolated from user's browser |
| **CDP mode** | User says "connect to my Chrome", "control my browser", "CDP", "remote debugging", or wants to operate their existing browser. Also use when the task **implicitly requires login state** (e.g., "check my orders", "open my dashboard", "look at my account") | Connects to user's Chrome via DevTools Protocol. Requires remote debugging enabled (`chrome://inspect` > "Allow remote debugging"). No extension needed |
| **Bridge mode** | User explicitly mentions "bridge", "extension", or has Midscene Chrome Extension installed and prefers to use it | Connects to user's Chrome via the Midscene Chrome Extension |

**CDP vs Bridge**: Both control the user's real Chrome with login sessions preserved. CDP only needs a Chrome setting toggle; Bridge needs a Chrome Extension installed. If the user doesn't specify, prefer **CDP mode** as it has fewer prerequisites.

### Precheck: detect available CDP target

Before using CDP mode, run a quick precheck to verify Ch