
A/B Test Setup
Lock a statistically valid A/B test plan with hypothesis gates, power checks, and execution readiness before you ship experiment code.
Overview
A/B Test Setup is an agent skill most often used in Grow (also Validate, Launch) that structures experiment design with mandatory hypothesis, metrics, and validity gates before implementation.
Install
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill ab-test-setup
What is this skill?
- Hard gate: final hypothesis must be confirmed before variants or metrics design
- Hypothesis quality checklist covers observation, single change, audience, and measurable success criteria
- Mandatory assumptions review for traffic stability, randomization, seasonality, and releases
- Blocks invalid tests and enforces statistical power before any implementation
- Structured flow from prerequisites through test type selection and execution readiness
- 5-part hypothesis quality checklist
- 3 mandatory pre-gate sections before variant design
Adoption & trust: 611 installs on skills.sh; 40.1k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want to run an A/B test but risk invalid hypotheses, peeking, and underpowered results if you jump straight to code or copy changes.
Who is it for?
Solo builders planning conversion, onboarding, or pricing experiments who have analytics and rough traffic estimates and want process discipline before coding.
Skip if: Quick one-off UI tweaks with no metrics plan, or teams that already have an approved experiment spec and only need engineering execution.
When should I use this skill?
Before designing A/B variants or writing experiment code when you need valid hypotheses, metrics, and statistical readiness.
What do I get? / Deliverables
You leave with a locked hypothesis, documented assumptions, chosen test type, and execution readiness so implementation and analysis stay statistically defensible.
- Locked final hypothesis with audience and primary metric
- Assumptions and validity review
- Test type and execution readiness plan
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Growth analytics is the canonical shelf because the skill centers on experiment metrics, MDE, and peeking prevention—core optimization work after you have traffic. Analytics subphase matches hypothesis locking, primary metrics, and assumptions about traffic and metric reliability.
Where it fits
Frame a pricing-page test with a locked primary metric before you build alternate checkout copy.
Validate assumptions about campaign traffic before comparing two landing hero variants.
Run the hypothesis lock and MDE step before instrumenting onboarding funnel variants.
Document independence and seasonality assumptions before an email or paywall experiment.
How it compares
Use this skill instead of ad-hoc “let’s try two versions” chat planning when you need statistical gates rather than a generic analytics dashboard skill.
Common Questions / FAQ
Who is ab-test-setup for?
Indie and solo product builders shipping SaaS, mobile, or content products who run experiments without a full growth engineering team.
When should I use ab-test-setup?
In Grow when defining analytics experiments; in Validate when testing pricing or scope hypotheses; in Launch when optimizing landing or distribution conversion—before writing experiment code.
Is ab-test-setup safe to install?
It is procedural guidance with no built-in shell or network calls in the skill itself; review the Security Audits panel on this Prism page before installing from the upstream repo.
SKILL.md
# A/B Test Setup

## 1️⃣ Purpose & Scope

Ensure every A/B test is **valid, rigorous, and safe** before a single line of code is written.

- Prevents "peeking"
- Enforces statistical power
- Blocks invalid hypotheses

---

## 2️⃣ Pre-Requisites

You must have:

- A clear user problem
- Access to an analytics source
- Roughly estimated traffic volume

### Hypothesis Quality Checklist

A valid hypothesis includes:

- Observation or evidence
- Single, specific change
- Directional expectation
- Defined audience
- Measurable success criteria

---

### 3️⃣ Hypothesis Lock (Hard Gate)

Before designing variants or metrics, you MUST:

- Present the **final hypothesis**
- Specify:
  - Target audience
  - Primary metric
  - Expected direction of effect
  - Minimum Detectable Effect (MDE)

Ask explicitly:

> “Is this the final hypothesis we are committing to for this test?”

**Do NOT proceed until confirmed.**

---

### 4️⃣ Assumptions & Validity Check (Mandatory)

Explicitly list assumptions about:

- Traffic stability
- User independence
- Metric reliability
- Randomization quality
- External factors (seasonality, campaigns, releases)

If assumptions are weak or violated:

- Warn the user
- Recommend delaying or redesigning the test

---

### 5️⃣ Test Type Selection

Choose the simplest valid test:

- **A/B Test** – single change, two variants
- **A/B/n Test** – multiple variants, higher traffic required
- **Multivariate Test (MVT)** – interaction effects, very high traffic
- **Split URL Test** – major structural changes

Default to **A/B** unless there is a clear reason otherwise.
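
As an illustration of the Hypothesis Quality Checklist and the Hypothesis Lock gate above, the following is a minimal sketch of how the required fields might be recorded and checked. The `Hypothesis` class, its field names, and both helper functions are hypothetical and are not part of the skill itself; the mapping of "measurable success criteria" to a primary metric plus an MDE is an assumption.

```python
from dataclasses import dataclass, fields


@dataclass
class Hypothesis:
    """Hypothetical record for the hypothesis lock; field names are illustrative."""
    observation: str = ""      # observation or evidence motivating the test
    change: str = ""           # the single, specific change being tested
    direction: str = ""        # expected direction: "increase" or "decrease"
    audience: str = ""         # target audience
    primary_metric: str = ""   # measurable success criterion (assumed mapping)
    mde: float = 0.0           # Minimum Detectable Effect, e.g. 0.02
    confirmed: bool = False    # set only after the user confirms the final hypothesis


def missing_items(h: Hypothesis) -> list:
    """Return the names of checklist fields that are still empty or invalid."""
    missing = [
        f.name
        for f in fields(h)
        if f.name != "confirmed" and not getattr(h, f.name)
    ]
    if h.direction and h.direction not in ("increase", "decrease"):
        missing.append("direction")
    return missing


def can_design_variants(h: Hypothesis) -> bool:
    """Hard gate: every field is filled in AND the hypothesis is explicitly confirmed."""
    return not missing_items(h) and h.confirmed
```

A draft with an empty `mde`, or one that has not been explicitly confirmed, fails `can_design_variants`, which mirrors the "Do NOT proceed until confirmed" rule.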

---

### 6️⃣ Metrics Definition

#### Primary Metric (Mandatory)

- Single metric used to evaluate success
- Directly tied to the hypothesis
- Pre-defined and frozen before launch

#### Secondary Metrics

- Provide context
- Explain _why_ results occurred
- Must not override the primary metric

#### Guardrail Metrics

- Metrics that must not degrade
- Used to prevent harmful wins
- Trigger test stop if significantly negative

---

### 7️⃣ Sample Size & Duration

Define upfront:

- Baseline rate
- MDE
- Significance level (typically α = 0.05, i.e. 95% confidence)
- Statistical power (typically 80%)

Estimate:

- Required sample size per variant
- Expected test duration

**Do NOT proceed without a realistic sample size estimate.**

---

### 8️⃣ Execution Readiness Gate (Hard Stop)

You may proceed to implementation **only if all are true**:

- Hypothesis is locked
- Primary metric is frozen
- Sample size is calculated
- Test duration is defined
- Guardrails are set
- Tracking is verified

If any item is missing, stop and resolve it.
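
The sample size and duration estimates from section 7 can be sketched as follows. This is a rough illustration, not the skill's own method: it assumes a conversion-rate (binary) primary metric, a two-sided two-proportion z-test with the standard normal approximation, an MDE expressed as an absolute lift, and traffic split evenly across variants. The function names and parameters are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per variant (two-sided two-proportion z-test).

    baseline_rate: control conversion rate, e.g. 0.10
    mde_abs: MDE as an absolute lift (assumption), e.g. 0.02 for 10% -> 12%
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if mde_abs == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie in (0, 1) and the MDE must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2                          # pooled rate under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def estimated_duration_days(n_per_variant, daily_visitors, variants=2):
    """Days needed if all daily visitors are split evenly across the variants."""
    return ceil(n_per_variant * variants / daily_visitors)


n = sample_size_per_variant(baseline_rate=0.10, mde_abs=0.02)
days = estimated_duration_days(n, daily_visitors=1000)
print(f"{n} users per variant, about {days} days at 1000 visitors/day")
```

Halving the MDE roughly quadruples the required sample, which is why the MDE is fixed during the hypothesis lock rather than adjusted after traffic estimates come in.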

---

## Running the Test

### During the Test

**DO:**

- Monitor technical health
- Document external factors

**DO NOT:**

- Stop early due to “good-looking” results
- Change variants mid-test
- Add new traffic sources
- Redefine success criteria

---

## Analyzing Results

### Analysis Discipline

When interpreting results:

- Do NOT generalize beyond the tested population
- Do NOT claim causality beyond the tested change
- Do NOT override guardrail failures
- Separate statistical significance from business judgment

### Interpretation Outcomes

| Result               | Action                                 |
| -------------------- | -------------------------------------- |
| Significant positive | Consider rollout                       |
| Significant negative | Reject variant, document learning      |
| Inconclusive         | Consider more traffic or bolder change |
| Guardrail failure    | Do not ship, even if primary wins      |

---

## Documentation & Learning

### Test Record (Mandatory)

Document:

- Hypothesis
- Variants
- Metrics
- Sample size vs achieved
- Results
- Decision
- Learnings
- Follow-up ideas

Store records in a share