
A/B Test Design
Design statistically grounded A/B tests with a clear hypothesis, isolated variants, primary and guardrail metrics, and sample size before you ship experiments on a live product.
Overview
A/B Test Design is an agent skill that defines rigorous experiments with hypotheses, metrics, and sample size calculations. It is most often used in the Grow phase, and it also fits launch validation in Ship and pricing tests in Validate.
Install
npx skills add https://github.com/owl-listener/designer-skills --skill a-b-test-design
What is this skill?
- Hypothesis template: If we [change], then [outcome] will [improve/decrease] because [rationale]
- Isolated A/B variants with one-variable discipline and explicit guardrail secondary metrics
- Sample size inputs: MDE, baseline rate, 95% significance, 80% power
- Duration guidance: run in full weekly cycles until the sample size is reached, typically for at least 1–2 weeks
- Explicit when-not-to-test list for low traffic, ethics, and foundational changes
- 6-part test structure (hypothesis through duration)
- Default statistical significance 95% and power 80%
- Typical minimum run duration 1–2 full weeks
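The sample size inputs listed above (baseline rate, MDE, 95% significance, 80% power) can be turned into a number with a standard normal-approximation formula for comparing two proportions. The sketch below is an illustration, not the skill's own method: the function name `sample_size_per_variant` is hypothetical, the MDE is assumed to be relative (a 10% lift on a 10% baseline means 11%), and the test is assumed to be two-sided with equal traffic in each variant.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed in EACH variant (hypothetical helper).

    Uses the normal approximation for a two-sided test of two proportions.
    alpha=0.05 corresponds to 95% significance; power=0.80 to 80% power.
    """
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    if relative_mde <= 0:
        raise ValueError("relative_mde must be positive")

    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # assumed: MDE is a relative lift
    if p2 >= 1:
        raise ValueError("baseline_rate * (1 + relative_mde) must stay below 1")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # summed Bernoulli variances
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Example inputs are made up: 10% baseline conversion, 10% relative lift.
    print(sample_size_per_variant(baseline_rate=0.10, relative_mde=0.10))
```

With these example inputs the result is on the order of 15,000 users per variant. A smaller MDE or a lower baseline rate raises the requirement quickly, which is why low-traffic sites appear on the when-not-to-test list.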
Adoption & trust: 564 installs on skills.sh; 1.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want to test a product or marketing change but lack a clear hypothesis, a success metric, or the sample size planning needed to trust the result.
Who is it for?
Solo builders with enough traffic to run adequately powered tests who need a disciplined design before turning on a feature flag or UI variant in production.
Skip if: your site has very low traffic, the change is a one-shot brand redesign, or withholding a clear user improvement from the control group would be unethical.
When should I use this skill?
Before launching any controlled A/B or multivariate test when you need hypotheses, metrics, and sample size defined.
What do I get? / Deliverables
You receive a complete experiment brief—hypothesis, variants, primary and guardrail metrics, sample size, and duration—ready to configure in your A/B platform.
- Structured experiment spec with hypothesis and variants
- Sample size and duration recommendation with pitfall checklist
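One way to picture the deliverable is as a small record that holds the brief's fields and flags anything left undefined. This is only a sketch under assumptions: the class `ExperimentBrief`, its field names, and the `problems` check are hypothetical and are not the skill's actual output format, which is planning text.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentBrief:
    """Hypothetical container for the parts of an experiment brief."""

    hypothesis: str
    control: str
    treatment: str
    primary_metric: str
    guardrail_metrics: list[str] = field(default_factory=list)
    sample_size_per_variant: int = 0
    duration_weeks: int = 0

    def problems(self) -> list[str]:
        """Return a list of gaps to fix before the test is configured."""
        issues = []
        text = self.hypothesis.lower()
        # Loose check against the "If we ..., then ... because ..." template.
        if not (text.startswith("if we") and "then" in text and "because" in text):
            issues.append("hypothesis does not follow the If/then/because template")
        if not self.primary_metric.strip():
            issues.append("primary metric is missing")
        if not self.guardrail_metrics:
            issues.append("no guardrail metrics listed")
        if self.sample_size_per_variant <= 0:
            issues.append("sample size is not set")
        if self.duration_weeks < 1:
            issues.append("duration must be at least one full week")
        return issues


if __name__ == "__main__":
    # All values below are invented for illustration.
    brief = ExperimentBrief(
        hypothesis="If we shorten the signup form, then signup completion "
                   "will improve because there is less friction.",
        control="Current five-field form",
        treatment="Three-field form",
        primary_metric="signup completion rate",
        guardrail_metrics=["support ticket rate"],
        sample_size_per_variant=15000,
        duration_weeks=3,
    )
    print(brief.problems())
```

An empty list from `problems()` means every part of the brief has been filled in; it says nothing about whether the chosen values are sensible.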
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Experiment design belongs on the Grow/analytics shelf where solo builders measure what changed after Ship/Launch. The skill specifies metrics, power, significance, duration, and pitfalls—core analytics experiment planning, not UI implementation.
Where it fits
Frame a pricing-page variant test with a revenue or signup primary metric and ethical guardrails.
Specify onboarding copy A/B before go-live with duration tied to weekly traffic cycles.
Document MDE and sample size for a checkout button experiment to avoid peeking pitfalls.
Add guardrail metrics when testing email CTA copy so unsubscribe rate does not spike unnoticed.
How it compares
This skill produces an experiment design spec. It is not a feature-flag integration or an automatic results-analysis dashboard.
Common Questions / FAQ
Who is a-b-test-design for?
Indie SaaS and ecommerce operators who run their own growth experiments and need statistical structure without hiring a data scientist full time.
When should I use a-b-test-design?
In Grow/analytics when planning funnel tests; in Ship/launch when validating onboarding changes; in Validate/pricing when testing offer copy. In every case, make sure you have a measurable primary metric first.
Is a-b-test-design safe to install?
It outputs planning text only; review the Security Audits panel on this Prism page before installing any third-party skill package.
SKILL.md
# A/B Test Design

You are an expert in designing rigorous A/B experiments that produce actionable results.

## What You Do

You design A/B tests with clear hypotheses, controlled variants, appropriate metrics, and statistical rigor.

## Test Structure

### 1. Hypothesis

Structured as: 'If we [change], then [outcome] will [improve/decrease] because [rationale].'

### 2. Variants

- Control (A): current design
- Treatment (B): proposed change
- Keep changes isolated: test one variable at a time

### 3. Primary Metric

The single most important measure of success. Must be measurable, relevant, and sensitive to the change.

### 4. Secondary Metrics

Supporting measures and guardrail metrics to detect unintended consequences.

### 5. Sample Size

Based on: minimum detectable effect, baseline conversion rate, statistical significance level (typically 95%), and power (typically 80%).

### 6. Duration

Run until sample size is reached. Account for weekly cycles (run in full weeks). Minimum 1-2 weeks typically.

## Common Pitfalls

- Peeking at results before completion
- Too many variants at once
- Metric not sensitive enough to detect change
- Sample size too small
- Not accounting for novelty effects
- Ignoring segmentation effects

## When Not to A/B Test

- Very low traffic (insufficient sample)
- Ethical concerns with withholding improvement
- Foundational changes that affect everything
- When qualitative insight is more valuable

## Best Practices

- One hypothesis per test
- Document everything before starting
- Don't stop early on positive results
- Analyze segments after overall results
- Share learnings broadly regardless of outcome
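The duration guidance (run until the sample size is reached, in full weeks) reduces to a short calculation. The sketch below is illustrative only: `duration_in_full_weeks` is a hypothetical helper, and it assumes every eligible user is enrolled in the test and that traffic is split evenly and arrives at a steady daily rate.

```python
import math


def duration_in_full_weeks(sample_size_per_variant, num_variants,
                           daily_eligible_users, min_weeks=1):
    """Weeks to run so the sample is reached, rounded up to whole weeks.

    Assumes (for illustration) that all eligible users are enrolled and
    that daily traffic is roughly constant.
    """
    if daily_eligible_users <= 0:
        raise ValueError("daily_eligible_users must be positive")
    if sample_size_per_variant <= 0 or num_variants < 2:
        raise ValueError("need a positive sample size and at least two variants")

    total_needed = sample_size_per_variant * num_variants
    days_needed = math.ceil(total_needed / daily_eligible_users)
    weeks_needed = math.ceil(days_needed / 7)  # round up to cover full weekly cycles
    return max(weeks_needed, min_weeks)


if __name__ == "__main__":
    # Made-up inputs: 15,000 users per variant, 2 variants, 1,500 users per day.
    print(duration_in_full_weeks(15000, 2, 1500))  # 20 days of traffic -> 3 weeks
```

Rounding up to whole weeks keeps weekday and weekend behavior equally represented in both variants. Setting `min_weeks=2` applies the more conservative end of the 1-2 week minimum even when traffic would reach the sample sooner.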