Ab Test Setup

Name: Ab Test Setup
Author: sickn33

sickn33/antigravity-awesome-skills

803 installs
44k repo stars
Updated July 27, 2026

ab-test-setup is an agent skill workflow that enforces statistically valid A/B test design with hypothesis locks, power checks, and execution readiness gates for developers who need rigorous experiments before writing code.

About

ab-test-setup is a structured agent skill that walks developers through eight sequential sections, from hypothesis lock and assumptions checks through metrics definition, sample-size estimation, and a five-point tracking verification checklist, before any experiment code is written. The workflow forbids peeking, mid-test variant changes, and undefined primary metrics, and it documents refusal conditions for cases where the baseline rate is unknown or traffic is insufficient. Developers reach for ab-test-setup when launching feature flags, onboarding flows, or pricing experiments on SaaS products, where invalid tests waste traffic and produce false positives. The skill outputs a frozen hypothesis record, primary and guardrail metric definitions, sample-size estimates at a 95% confidence level (α = 0.05) and 80% power, and a post-test analysis and learning document stored for future reference.

Hard gate: final hypothesis must be confirmed before variants or metrics design
Hypothesis quality checklist covers observation, single change, audience, and measurable success criteria
Mandatory assumptions review for traffic stability, randomization, seasonality, and releases
Blocks invalid tests and enforces statistical power before any implementation
Structured flow from prerequisites through test type selection and execution readiness

Ab Test Setup by the numbers

803 all-time installs (skills.sh)
+19 installs in the week ending Jul 27, 2026 (Skillselion tracking)
Ranked #385 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill ab-test-setup

How do you design a statistically valid A/B test?

Lock a statistically valid A/B test plan with hypothesis gates, power checks, and execution readiness before you ship experiment code.

Who is it for?

Developers and growth engineers planning SaaS experiments who need statistical rigor before writing feature-flag or analytics instrumentation code.

Skip if: you only need quick qualitative user interviews, or you already have a locked experiment design ready for immediate implementation.

When should I use this skill?

Use it when a developer asks to set up, plan, or validate an A/B test, split test, or experiment before writing code or launching variants.

What you get

Locked hypothesis document, frozen primary metric, sample-size estimate, guardrail definitions, tracking verification checklist, and post-test learning record.

hypothesis lock document
sample-size estimate
test record with learnings

By the numbers

Defines 8 sequential workflow sections ending in an execution readiness hard stop
Uses a 95% confidence level (α = 0.05) and 80% statistical power as the typical defaults for sample-size estimates
Includes a 5-point tracking verification checklist before launch

Files

SKILL.md (Markdown)

A/B Test Setup

1️⃣ Purpose & Scope

Ensure every A/B test is valid, rigorous, and safe before a single line of code is written.

Prevents "peeking"
Enforces statistical power
Blocks invalid hypotheses

---

2️⃣ Pre-Requisites

You must have:

A clear user problem
Access to an analytics source
Roughly estimated traffic volume

Hypothesis Quality Checklist

A valid hypothesis includes:

Observation or evidence
Single, specific change
Directional expectation
Defined audience
Measurable success criteria

---

3️⃣ Hypothesis Lock (Hard Gate)

Before designing variants or metrics, you MUST:

Present the final hypothesis
Specify:
Target audience
Primary metric
Expected direction of effect
Minimum Detectable Effect (MDE)

Ask explicitly:

“Is this the final hypothesis we are committing to for this test?”

Do NOT proceed until confirmed.
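
One way to make the lock concrete is to store the confirmed hypothesis in an immutable record. The following is a minimal sketch, not part of the skill itself: the `LockedHypothesis` class, the `lock_hypothesis` helper, and every field name are hypothetical and chosen only to mirror the items listed above.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class LockedHypothesis:
    """Hypothetical record of the fields the Hypothesis Lock asks for."""
    statement: str
    target_audience: str
    primary_metric: str
    expected_direction: str  # "increase" or "decrease"
    mde_relative: float      # minimum detectable effect, e.g. 0.10 = 10% relative change

    def __post_init__(self):
        # Reject records that leave the direction or the MDE unspecified.
        if self.expected_direction not in ("increase", "decrease"):
            raise ValueError("expected_direction must be 'increase' or 'decrease'")
        if self.mde_relative <= 0:
            raise ValueError("mde_relative must be positive")


def lock_hypothesis(draft: dict, confirmed: bool) -> LockedHypothesis:
    """Freeze the draft only after the user has explicitly confirmed it."""
    if not confirmed:
        raise RuntimeError("Hypothesis not confirmed; do not proceed.")
    return LockedHypothesis(**draft)


# Illustrative values only.
draft = {
    "statement": "Shortening the signup form will increase completed signups",
    "target_audience": "new visitors on the pricing page",
    "primary_metric": "signup_completion_rate",
    "expected_direction": "increase",
    "mde_relative": 0.10,
}
locked = lock_hypothesis(draft, confirmed=True)
```

Because the dataclass is frozen, any later attempt to reassign `locked.primary_metric` raises `FrozenInstanceError`, which matches the intent of committing before variants or metrics are designed.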

---

4️⃣ Assumptions & Validity Check (Mandatory)

Explicitly list assumptions about:

Traffic stability
User independence
Metric reliability
Randomization quality
External factors (seasonality, campaigns, releases)

If assumptions are weak or violated:

Warn the user
Recommend delaying or redesigning the test

---

5️⃣ Test Type Selection

Choose the simplest valid test:

A/B Test – single change, two variants
A/B/n Test – multiple variants, higher traffic required
Multivariate Test (MVT) – interaction effects, very high traffic
Split URL Test – major structural changes

Default to A/B unless there is a clear reason otherwise.

---

6️⃣ Metrics Definition

Primary Metric (Mandatory)

Single metric used to evaluate success
Directly tied to the hypothesis
Pre-defined and frozen before launch

Secondary Metrics

Provide context
Explain _why_ results occurred
Must not override the primary metric

Guardrail Metrics

Metrics that must not degrade
Used to prevent harmful wins
Trigger test stop if significantly negative

---

7️⃣ Sample Size & Duration

Define upfront:

Baseline rate
MDE
Significance level (typically α = 0.05, i.e. 95% confidence)
Statistical power (typically 80%)

Estimate:

Required sample size per variant
Expected test duration

Do NOT proceed without a realistic sample size estimate.
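
The inputs above can be turned into an estimate with the standard normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions: the primary metric is a conversion rate, the test is two-sided, variants receive equal traffic, and the MDE is expressed as a relative change. The function names and the example traffic figure are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde_relative, alpha=0.05, power=0.80):
    """Users needed per variant for a two-sided, two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)  # rate we want to be able to detect
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def estimated_duration_days(n_per_variant, daily_eligible_users, variants=2):
    """Days needed to fill every variant at the given daily traffic."""
    return math.ceil(n_per_variant * variants / daily_eligible_users)


# Illustrative inputs: 10% baseline, 10% relative MDE, 2,000 eligible users per day.
n_per_variant = sample_size_per_variant(baseline_rate=0.10, mde_relative=0.10)
days = estimated_duration_days(n_per_variant, daily_eligible_users=2000)
```

With these inputs the estimate is on the order of 15,000 users per variant. Halving the MDE roughly quadruples the requirement, which is why an unrealistic MDE usually shows up as an unrealistic duration.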

---

Tracking Verification (Required before Gate 8)

Before entering the Execution Readiness Gate below, run through this checklist to make "Tracking is verified" mean something concrete:

1. Event firing: Trigger each event the primary and secondary metrics depend on (sign-up, add-to-cart, custom event) on staging or a debug page, and confirm it lands in your analytics destination within 30 seconds.
2. Variant attribution: Verify that the variant assignment ID is attached to every fired event, not just the entry event. Use your analytics' raw event view to compare a sample of 5+ events per variant.
3. De-duplication: Confirm that a user reloading the page does not cause double-counted events. If your stack uses client-side de-duping, the variant ID must be part of the dedup key.
4. Sample randomization: Pull the first 100 assignment records from your assignment table; the variant split should be within ±5% of the configured allocation.
5. Guardrail metric pipeline: Each guardrail metric defined in §6️⃣ must have a working dashboard or alert by the time the test launches.

If any of the above fails, stop and resolve it before Gate 8.
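
Two of the checklist items lend themselves to a small script: the variant attribution check and the allocation check. The sketch below assumes assignment records are available as a list of variant names and events as dictionaries with a `variant_id` key; those shapes and all function names are hypothetical. The chi-square helper is an optional complement that the checklist does not ask for, useful once far more than 100 assignments exist.

```python
import math
from collections import Counter


def allocation_within_tolerance(assignments, expected_share, tolerance=0.05):
    """True when each variant's observed share is within the tolerance of its configured share."""
    counts = Counter(assignments)
    total = len(assignments)
    return all(
        abs(counts.get(variant, 0) / total - share) <= tolerance
        for variant, share in expected_share.items()
    )


def events_missing_variant(events):
    """Return the events that carry no variant assignment ID."""
    return [event for event in events if not event.get("variant_id")]


def srm_p_value_two_arm(count_a, count_b, share_a=0.5):
    """Chi-square goodness-of-fit p-value for a two-variant split (1 degree of freedom)."""
    total = count_a + count_b
    expected_a = total * share_a
    expected_b = total * (1 - share_a)
    chi2 = (count_a - expected_a) ** 2 / expected_a + (count_b - expected_b) ** 2 / expected_b
    # With 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Illustrative data only.
first_100 = ["control"] * 52 + ["treatment"] * 48
split_ok = allocation_within_tolerance(first_100, {"control": 0.5, "treatment": 0.5})
sample_events = [
    {"name": "sign_up", "variant_id": "control"},
    {"name": "add_to_cart", "variant_id": None},
]
unattributed = events_missing_variant(sample_events)
```

A very small p-value from `srm_p_value_two_arm` on the full assignment table suggests the split deviates from the configured allocation by more than chance would explain, which is worth investigating before trusting any result.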

---

8️⃣ Execution Readiness Gate (Hard Stop)

You may proceed to implementation only if all are true:

Hypothesis is locked
Primary metric is frozen
Sample size is calculated
Test duration is defined
Guardrails are set
Tracking is verified

If any item is missing, stop and resolve it.
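
The gate can be expressed as a checklist in which every item defaults to unmet. This is an illustrative sketch; the `ReadinessChecklist` class and its field names are hypothetical and simply mirror the six items above.

```python
from dataclasses import dataclass, fields


@dataclass
class ReadinessChecklist:
    """Hypothetical mirror of the six Execution Readiness items."""
    hypothesis_locked: bool = False
    primary_metric_frozen: bool = False
    sample_size_calculated: bool = False
    test_duration_defined: bool = False
    guardrails_set: bool = False
    tracking_verified: bool = False


def missing_items(checklist):
    """Names of the items that are still unresolved."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def may_implement(checklist):
    """Hard stop: implementation is allowed only when nothing is missing."""
    return not missing_items(checklist)


# Illustrative state: everything done except tracking verification.
checklist = ReadinessChecklist(
    hypothesis_locked=True,
    primary_metric_frozen=True,
    sample_size_calculated=True,
    test_duration_defined=True,
    guardrails_set=True,
)
blockers = missing_items(checklist)
```

Defaulting every field to `False` means a forgotten item blocks the gate instead of silently passing it.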

---

Running the Test

During the Test

DO:

Monitor technical health
Document external factors

DO NOT:

Stop early due to “good-looking” results
Change variants mid-test
Add new traffic sources
Redefine success criteria
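
The rule against stopping early can be illustrated with a small simulation of A/A tests, where both arms share the same true conversion rate and every significant result is therefore a false positive. This is a sketch under assumptions that are not in the skill: a pooled two-proportion z-test, ten interim looks, and illustrative traffic numbers. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns 1.0 when the variance is zero."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests=300, looks=10, users_per_look=200,
                      rate=0.10, alpha=0.05, seed=7):
    """Compare 'stop at the first significant look' with 'test once at the end'."""
    rng = random.Random(seed)
    flagged_with_peeking = 0
    flagged_at_end = 0
    for _test in range(n_tests):
        conv_a = conv_b = n = 0
        any_look_significant = False
        for _look in range(looks):
            # Both arms draw from the same true rate, so there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = two_sided_p_value(conv_a, n, conv_b, n)
            if p < alpha:
                any_look_significant = True
        flagged_with_peeking += any_look_significant
        flagged_at_end += p < alpha  # p now holds the final, full-sample look
    return flagged_with_peeking / n_tests, flagged_at_end / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
```

The peeking rate can never be lower than the fixed-horizon rate, because the final look is one of the looks. In practice it tends to be several times higher, which is the false-positive inflation the rule is meant to prevent.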

---

Analyzing Results

Analysis Discipline

When interpreting results:

Do NOT generalize beyond the tested population
Do NOT claim causality beyond the tested change
Do NOT override guardrail failures
Separate statistical significance from business judgment

Interpretation Outcomes

Result	Action
Significant positive	Consider rollout
Significant negative	Reject variant, document learning
Inconclusive	Consider more traffic or bolder change
Guardrail failure	Do not ship, even if primary wins
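
The table above can be read as a decision function in which the guardrail check runs first, so that a primary-metric win cannot override it. This is a minimal sketch; the function name and its boolean inputs are hypothetical, and "positive" is taken to mean the direction the hypothesis predicted.

```python
def recommend_action(primary_significant, effect_is_positive, guardrail_failed):
    """Map a test outcome to the action listed in the interpretation table."""
    if guardrail_failed:
        # Checked first: a guardrail failure outranks any primary-metric result.
        return "Do not ship, even if primary wins"
    if not primary_significant:
        return "Consider more traffic or bolder change"
    if effect_is_positive:
        return "Consider rollout"
    return "Reject variant, document learning"


# Illustrative outcome: the primary metric won, but a guardrail degraded.
action = recommend_action(
    primary_significant=True,
    effect_is_positive=True,
    guardrail_failed=True,
)
```

The returned string is a recommendation only; the rollout decision still involves the business judgment that the analysis discipline keeps separate from statistical significance.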

---

Documentation & Learning

Test Record (Mandatory)

Document:

Hypothesis
Variants
Metrics
Sample size vs achieved
Results
Decision
Learnings
Follow-up ideas

Store records in a shared, searchable location to avoid repeated failures.

---

Refusal Conditions (Safety)

Refuse to proceed if:

Baseline rate is unknown and cannot be estimated
Traffic is insufficient to detect the MDE
Primary metric is undefined
Multiple variables are changed without proper design
Hypothesis cannot be clearly stated

Explain why and recommend next steps.
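
The refusal conditions can be collected into a single check that returns every reason that applies, so the explanation to the user is complete. This is a sketch under assumptions: the plan is a dictionary with hypothetical keys, and "insufficient traffic" is interpreted as needing more than a maximum acceptable duration, with a 28-day default that is an illustrative choice and not a value from the skill.

```python
import math


def refusal_reasons(plan):
    """Return the refusal conditions that apply to a draft test plan."""
    reasons = []
    if not plan.get("hypothesis"):
        reasons.append("Hypothesis cannot be clearly stated")
    if not plan.get("primary_metric"):
        reasons.append("Primary metric is undefined")
    if plan.get("baseline_rate") is None:
        reasons.append("Baseline rate is unknown and cannot be estimated")
    if plan.get("variables_changed", 1) > 1 and not plan.get("multivariate_design", False):
        reasons.append("Multiple variables are changed without proper design")
    needed = plan.get("required_sample_per_variant")
    daily = plan.get("daily_eligible_users")
    if needed is not None and daily:
        days = math.ceil(needed * plan.get("variants", 2) / daily)
        # Assumed threshold: a test longer than max_duration_days counts as infeasible.
        if days > plan.get("max_duration_days", 28):
            reasons.append("Traffic is insufficient to detect the MDE")
    return reasons


# Illustrative plan: well defined, but traffic is too low for the required sample.
plan = {
    "hypothesis": "Shortening the signup form will increase completed signups",
    "primary_metric": "signup_completion_rate",
    "baseline_rate": 0.10,
    "variables_changed": 1,
    "required_sample_per_variant": 15000,
    "daily_eligible_users": 300,
}
reasons = refusal_reasons(plan)
```

An empty list means none of the refusal conditions were detected; it does not replace the Execution Readiness Gate.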

---

Key Principles (Non-Negotiable)

One hypothesis per test
One primary metric
Commit before launch
No peeking
Learning over winning
Statistical rigor first

---

Final Reminder

A/B testing is not about proving ideas right. It is about learning the truth with confidence.

If you feel tempted to rush, simplify, or “just try it” — that is the signal to slow down and re-check the design.

When to Use

Use this skill to execute the workflow or actions described in the overview.

Limitations

Use this skill only when the task clearly matches the scope described above.
Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.

How it compares

Pick ab-test-setup over generic analytics skills when experiment statistical validity and pre-code gating matter more than dashboard setup or SQL query writing.

FAQ

What gates must pass before ab-test-setup allows implementation?

ab-test-setup requires hypothesis lock, frozen primary metric, calculated sample size, defined test duration, guardrail metrics, and verified tracking across five checklist items. Missing any gate stops the workflow until resolved.

Does ab-test-setup allow stopping a test early for good results?

ab-test-setup explicitly forbids stopping early due to good-looking results, changing variants mid-test, or redefining success criteria during the run. The skill treats premature stops as invalid experiment design.

Is Ab Test Setup safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
