Ab Test Setup

Growth experimentation's canonical shelf is Grow analytics, which covers hypotheses, duration, and significance, while implementation often touches Build and Ship surfaces. The Analytics subphase fits ICE-scored backlogs, experiment velocity, and measurement discipline rather than one-off QA test cases.

Where it fits

Example use

Prioritize three pricing-page tests with ICE and define minimum runtime before calling winners.

Example use

Compare launch announcement CTAs with a guarded primary metric and rollback criteria.

Example use

Scope an onboarding step A/B that engineering can ship behind a simple flag.

How it compares

Covers experiment design and program discipline, not event plumbing (analytics-tracking) or whole-page conversion rewrites (page-cro).

Common Questions / FAQ

Who is ab-test-setup for?

Indie operators and small growth teams who own product copy and funnel metrics and need structured experiments without a dedicated data science team.

When should I use ab-test-setup?

Use it in Grow when running analytics-driven experiments, in Ship when validating launch messaging variants, and in Build when comparing onboarding or UI implementations. In short, use it whenever you need statistical significance or an experiment backlog; it is not meant for unit or integration QA.

Is ab-test-setup safe to install?

It is guidance-only with no required tooling permissions; verify package integrity using the Security Audits panel on this Prism page.

SKILL.md

# A/B Test Setup

You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.

## Initial Assessment

**Check for product marketing context first:**
If `.agents/product-marketing-context.md` exists (or `.claude/product-marketing-context.md` in older setups), read it before asking questions. Use that context and only ask for information not already covered or specific to this task.

Before designing a test, understand:

1. **Test Context** - What are you trying to improve? What change are you considering?
2. **Current State** - Baseline conversion rate? Current traffic volume?
3. **Constraints** - Technical complexity? Timeline? Tools available?

---

## Core Principles

### 1. Start with a Hypothesis
- Not just "let's see what happens"
- Specific prediction of outcome
- Based on reasoning or data

### 2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked

### 3. Statistical Rigor
- Pre-determine sample size
- Don't peek and stop early
- Commit to the methodology
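
To illustrate why peeking is risky, the sketch below simulates A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. It is not part of the original skill: the function names, the 5% rate, the ten interim looks, and the 1.96 threshold are all illustrative assumptions.

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se

def false_positive_rates(runs=300, n_per_arm=1000, checks=10, rate=0.05, seed=7):
    """Simulate A/A tests and return (peeking_rate, single_look_rate)."""
    rng = random.Random(seed)
    step = n_per_arm // checks
    peek_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        any_significant = False
        significant = False
        for check in range(1, checks + 1):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = check * step
            significant = abs(z_stat(conv_a, n, conv_b, n)) > 1.96
            any_significant = any_significant or significant
        peek_hits += any_significant  # a peeker stops at the first "win"
        final_hits += significant     # disciplined: only the final look counts
    return peek_hits / runs, final_hits / runs

peeking, single_look = false_positive_rates()
print(f"peeking: {peeking:.1%}, single look: {single_look:.1%}")
```

Because the final look is one of the interim looks, the peeking rate can never be lower than the single-look rate. With many interim looks it is typically several times the nominal 5%.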

### 4. Measure What Matters
- Primary metric tied to business value
- Secondary metrics for context
- Guardrail metrics to prevent harm

---

## Hypothesis Framework

### Structure

```
Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
```

### Example

**Weak**: "Changing the button color might increase clicks."

**Strong**: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
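
As a rough sketch, the structure above can be captured in a small template so that every hypothesis in a backlog has the same five parts. The `Hypothesis` class and its field names are hypothetical, chosen here to mirror the bracketed slots in the template, and are not defined by the skill itself.

```python
from dataclasses import asdict, dataclass

@dataclass
class Hypothesis:
    observation: str
    change: str
    expected_outcome: str
    audience: str
    metrics: str

    def __post_init__(self):
        # Reject blank slots so "let's see what happens" cannot slip through.
        for name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"{name} must not be empty")

    def render(self) -> str:
        """Fill the five-line hypothesis template."""
        return (
            f"Because {self.observation},\n"
            f"we believe {self.change}\n"
            f"will cause {self.expected_outcome}\n"
            f"for {self.audience}.\n"
            f"We'll know this is true when {self.metrics}."
        )

strong = Hypothesis(
    observation="users report difficulty finding the CTA (per heatmaps and feedback)",
    change="making the button larger and using a contrasting color",
    expected_outcome="a 15%+ increase in CTA clicks",
    audience="new visitors",
    metrics="click-through rate from page view to signup start rises",
)
print(strong.render())
```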

---

## Test Types

| Type | Description | Traffic Needed |
|------|-------------|----------------|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |
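
The "Very high" traffic need for MVT follows from how quickly combinations multiply. The sketch below enumerates the cells of a full-factorial test; the factor names and values are made up for illustration, and it assumes every cell needs roughly the same per-variant sample as a simple A/B arm.

```python
from itertools import product

def mvt_cells(factors):
    """Return every combination of factor values as a list of dicts."""
    names = list(factors)
    return [
        dict(zip(names, combo))
        for combo in product(*(factors[name] for name in names))
    ]

# Hypothetical factors: three elements with two options each.
factors = {
    "headline": ["control", "benefit-led"],
    "cta_color": ["blue", "green"],
    "hero_image": ["photo", "illustration"],
}

cells = mvt_cells(factors)
print(len(cells))  # 2 * 2 * 2 = 8 cells, versus 2 arms for a plain A/B test
```

Under that assumption, three two-option changes need about four times the traffic of a single A/B test, which is why low-traffic pages are usually better served by sequential A/B tests.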

---

## Sample Size

### Quick Reference

| Baseline | 10% Lift | 20% Lift | 50% Lift |
|----------|----------|----------|----------|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |

**Calculators:**
- [Evan Miller's](https://www.evanmiller.org/ab-testing/sample-size.html)
- [Optimizely's](https://www.optimizely.com/sample-size-calculator/)

**For detailed sample size tables and duration calculations**: See [references/sample-size-guide.md](references/sample-size-guide.md)
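
For a rough cross-check of the quick-reference table, the sketch below uses the standard normal-approximation formula for comparing two proportions. It assumes the lift columns are relative lifts, a two-sided 5% significance level, and 80% power; none of these are stated in the table, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a relative lift."""
    if relative_lift == 0:
        raise ValueError("relative_lift must be non-zero")
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    if not 0 < p1 < 1 or not 0 < p2 < 1:
        raise ValueError("rates must be strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

def duration_days(n_per_variant, variants, daily_visitors):
    """Days needed if all daily visitors are split across the variants."""
    return math.ceil(n_per_variant * variants / daily_visitors)

n = sample_size_per_variant(baseline=0.03, relative_lift=0.20)
print(n, "visitors per variant")
print(duration_days(n, variants=2, daily_visitors=1500), "days at 1,500 visitors/day")
```

The results land in the same ballpark as the table rather than matching it exactly, since calculators differ in formula and defaults. Treat either source as a planning estimate and confirm with the calculator you intend to use.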

---

## Metrics Selection

### Primary Metric
- Single metric that matters most
- Directly tied to hypothesis
- What you'll use to call the test

### Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked

### Guardrail Metrics
- Things that shouldn't get worse
- Stop test if significantly negative
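
One way to make "significantly negative" concrete is a one-sided two-proportion z-test on the guardrail. The sketch below assumes a guardrail where higher is better (for example, a checkout completion rate) and a 5% threshold; the function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def guardrail_breached(control_events, control_n, variant_events, variant_n, alpha=0.05):
    """One-sided test: is the variant's rate significantly LOWER than control's?"""
    p_control = control_events / control_n
    p_variant = variant_events / variant_n
    pooled = (control_events + variant_events) / (control_n + variant_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    if se == 0:
        return False  # no variation observed, nothing to flag
    z = (p_variant - p_control) / se
    p_value = NormalDist().cdf(z)  # probability of a drop this large by chance
    return p_value < alpha

# Hypothetical counts: 9.0% control vs 7.8% variant on the guardrail metric.
print(guardrail_breached(900, 10_000, 780, 10_000))  # True: clear drop
print(guardrail_breached(900, 10_000, 890, 10_000))  # False: within noise
```

Checking a guardrail repeatedly is itself a form of peeking, so a stricter threshold or a fixed check schedule may be appropriate.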

### Example: Pricing Page Test
- **Primary**: Plan selection rate
- **Secondary**: Time on page, plan di

What is this skill?

Covers single A/B tests and broader growth experimentation programs (backlog, ICE, playbook)

Requires baseline conversion context before variant design

Addresses statistical significance, runtime, and multivariate vs split-test scope

Reads product marketing context first when available

Points to analytics-tracking for implementation and page-cro for page-level optimization

Skill metadata version 1.2.0

ICE scoring for experiment backlog prioritization

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 52.4k installs on skills.sh; 32.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).

Who is it for?

Founders with measurable funnel traffic (even modest) who will actually instrument events and wait for significance before rolling out winners.

Skip if: you have no baseline data and no plan to instrument events (use analytics-tracking first), or your problem is best solved by a full-page CRO audit without isolated variants.

What do I get? / Deliverables

You get a test or program design (hypothesis, metrics, runtime guidance, and prioritization hooks) that is ready for tracking implementation via analytics-tracking or broader page work via page-cro.

Test hypothesis with primary and guardrail metrics

Runtime and significance guidance appropriate to traffic

Experiment program framing (backlog, ICE, velocity) when requested
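
As a minimal sketch of ICE prioritization, the example below scores each idea on Impact, Confidence, and Ease from 1 to 10 and ranks by the average. Averaging is one common convention (some teams multiply the three instead); the `Idea` class, the idea names, and the scores are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Idea:
    name: str
    impact: int      # expected effect on the primary metric, 1-10
    confidence: int  # strength of supporting evidence, 1-10
    ease: int        # how cheap it is to build and run, 1-10

    @property
    def ice(self) -> float:
        """Average of the three scores (one common ICE convention)."""
        return (self.impact + self.confidence + self.ease) / 3

# Hypothetical pricing-page backlog.
backlog = [
    Idea("Default to annual billing toggle", impact=8, confidence=5, ease=8),
    Idea("Add social proof near plans", impact=6, confidence=7, ease=9),
    Idea("Rewrite plan names", impact=4, confidence=4, ease=6),
]

ranked = sorted(backlog, key=lambda idea: idea.ice, reverse=True)
for idea in ranked:
    print(f"{idea.ice:.2f}  {idea.name}")
```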
