Experiment Designer

Name: Experiment Designer
Author: alirezarezvani

alirezarezvani/claude-skills

580 installs
23.5k repo stars
Updated July 17, 2026

experiment-designer is a planning skill that structures rigorous A/B, multivariate, and holdout experiments with primary metrics, guardrails, and pre-registered stopping rules for developers who must validate product or pricing changes before shipping.

About

experiment-designer is an Experiment Playbook skill in alirezarezvani/claude-skills for designing statistically disciplined tests before launch. The skill distinguishes A/B tests for directional control-versus-variant decisions, multivariate tests for interaction effects that need larger traffic, and holdout tests that keep a percentage unexposed to measure incremental lift. Metric design separates one primary ship-or-no-ship metric aligned to user value, guardrail metrics such as error rate, latency, churn proxy, or support contacts to block harmful wins, and diagnostic metrics that explain mechanism without becoming ad-hoc gates. Stopping rules must be fixed before launch, covering per-group sample size, minimum run duration for weekday and weekend coverage, and guardrail breach handling. Developers reach for experiment-designer when scoping feature flags, pricing experiments, or growth tests and need pre-registration discipline instead of post-hoc metric shopping.

Experiment types: A/B, multivariate, and holdout tests with when-to-use guidance
Primary, guardrail, and diagnostic metric roles with explicit decision gates
Pre-launch checklist including If/Then/Because hypothesis framing
Stopping rules: fixed sample size, minimum duration, guardrail pause criteria
Novelty and primacy effect mitigations for trustworthy ship decisions

Experiment Designer by the numbers

580 all-time installs (skills.sh)
Ranked #413 of 2,065 Data Science & ML skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

npx skills add https://github.com/alirezarezvani/claude-skills --skill experiment-designer

Installs	580
Repo stars	23.5k
Security audit	3 / 3 scanners passed
Last updated	July 17, 2026
Repository	alirezarezvani/claude-skills

How do you design an A/B test with guardrail metrics?

Design rigorous A/B or multivariate experiments with primary metrics, guardrails, and pre-registered stopping rules before you ship product or pricing changes.

Who is it for?

Product engineers and data-aware developers planning feature or pricing experiments who need pre-registered metrics and stopping rules before running tests.

Skip if: You only need SQL dashboard queries, or you are analyzing an already-finished experiment and have no upfront design work to do.

When should I use this skill?

The user asks to design an A/B test, define guardrail metrics, choose multivariate versus holdout design, or set stopping rules before a launch.

What you get

Experiment brief with primary metric, guardrail list, diagnostic metrics, and documented stopping rules before launch.

Experiment design brief
Metric and guardrail specification
Stopping rule checklist

Files

SKILL.md (Markdown)

Experiment Designer

Design, prioritize, and evaluate product experiments with clear hypotheses and defensible decisions.

When To Use

Use this skill for:

A/B and multivariate experiment planning
Hypothesis writing and success criteria definition
Sample size and minimum detectable effect planning
Experiment prioritization with ICE scoring
Reading statistical output for product decisions

Core Workflow

1. Write hypothesis in If/Then/Because format

If we change [intervention]
Then [metric] will change by [expected direction/magnitude]
Because [behavioral mechanism]
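The template above can be kept as structured data so that every experiment brief renders it the same way. This is a minimal sketch: the `Hypothesis` class and the example wording are hypothetical and are not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for the three If/Then/Because parts."""

    intervention: str      # what is being changed
    expected_change: str   # metric plus expected direction/magnitude
    mechanism: str         # behavioral reason the change should work

    def render(self) -> str:
        # Emit the three lines in the same order as the template.
        return (
            f"If we change {self.intervention}\n"
            f"Then {self.expected_change}\n"
            f"Because {self.mechanism}"
        )


# Illustrative values only.
h = Hypothesis(
    intervention="the checkout button label from 'Buy' to 'Buy now'",
    expected_change="checkout conversion will increase by at least 1 percentage point",
    mechanism="a more specific call to action reduces hesitation",
)
print(h.render())
```

Keeping the expected magnitude inside the hypothesis makes it easy to reuse the same number later as the minimum detectable effect.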

2. Define metrics before running test

Primary metric: single decision metric
Guardrail metrics: quality/risk protection
Secondary metrics: diagnostics only
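One way to enforce the three roles is to record them before launch and reject plans where a metric holds more than one role. The sketch below is an assumption about how that could look; `MetricPlan` and the metric names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetricPlan:
    """Hypothetical pre-registration record for metric roles."""

    primary: str                                          # the single decision metric
    guardrails: List[str] = field(default_factory=list)   # quality/risk protection
    secondary: List[str] = field(default_factory=list)    # diagnostics only

    def validate(self) -> None:
        if not self.primary.strip():
            raise ValueError("exactly one primary metric is required")
        names = [self.primary, *self.guardrails, *self.secondary]
        # A metric listed twice would blur which role drives the decision.
        if len(names) != len(set(names)):
            raise ValueError("a metric may hold only one role")


# Illustrative metric names only.
plan = MetricPlan(
    primary="checkout_conversion",
    guardrails=["error_rate", "p95_latency_ms"],
    secondary=["add_to_cart_rate"],
)
plan.validate()
```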

3. Estimate sample size

Baseline conversion or baseline mean
Minimum detectable effect (MDE)
Significance level (alpha) and power

Use:

python3 scripts/sample_size_calculator.py --baseline-rate 0.12 --mde 0.02 --mde-type absolute
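
For reference, the script reproduced under Tooling uses a pooled-variance normal approximation for a two-sided two-proportion test. Written out, with $p_1$ the baseline rate and $p_2$ the target rate, and assuming the script's defaults of alpha = 0.05 and power = 0.8 for the worked numbers:

```latex
\[
n_{\text{per group}}
  = \left\lceil
      \frac{2\,\bar{p}\,(1-\bar{p})\,\left(z_{1-\alpha/2} + z_{\text{power}}\right)^{2}}{\delta^{2}}
    \right\rceil,
\qquad
\bar{p} = \frac{p_1 + p_2}{2},
\quad
\delta = \lvert p_2 - p_1 \rvert
\]

% Worked example for the command above: baseline 0.12, absolute MDE 0.02.
\[
\bar{p} = 0.13,\quad \delta = 0.02,\quad
z_{0.975} \approx 1.960,\quad z_{0.80} \approx 0.842
\;\Rightarrow\;
n \approx \frac{2 \times 0.1131 \times 7.849}{0.0004} \approx 4438.5
\;\to\; 4439 \text{ per group}
\]
```

Because $\delta$ is squared in the denominator, halving the MDE roughly quadruples the required sample size.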

4. Prioritize experiments with ICE

Impact: potential upside
Confidence: evidence quality
Ease: cost/speed/complexity

ICE Score = (Impact × Confidence × Ease) / 10
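
A short sketch of ranking a backlog by ICE, assuming the three factors are multiplied and each is scored on a 1 to 10 scale (the scale is an assumption; the source does not state one). The backlog entries are made up for illustration.

```python
def ice_score(impact: float, confidence: float, ease: float) -> float:
    """ICE Score = (Impact * Confidence * Ease) / 10."""
    return (impact * confidence * ease) / 10


# Hypothetical backlog: name -> (impact, confidence, ease).
backlog = {
    "simplify signup form": (8, 6, 7),
    "new pricing page": (9, 4, 3),
    "reorder onboarding steps": (5, 7, 9),
}

# Highest score first.
ranked = sorted(backlog, key=lambda name: ice_score(*backlog[name]), reverse=True)
for name in ranked:
    print(f"{ice_score(*backlog[name]):5.1f}  {name}")
```

Since the factors multiply, one very low score (for example, low confidence) pulls an idea down sharply even when its potential impact is high.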

5. Launch with stopping rules

Decide fixed sample size or fixed duration in advance
Avoid repeated peeking without proper method
Monitor guardrails continuously
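
The rules above can be written down as a small check that is fixed before launch and evaluated on each monitoring pass. This is a sketch under stated assumptions: `StoppingRule`, `decide`, and the three outcome labels are hypothetical, and the sketch requires both the sample size and the minimum duration to be met before analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoppingRule:
    """Hypothetical pre-registered stopping rule (frozen so it cannot change mid-test)."""

    n_per_group: int  # fixed sample size per group, decided in advance
    min_days: int     # minimum run duration, decided in advance


def decide(
    rule: StoppingRule,
    n_control: int,
    n_variant: int,
    days_run: int,
    guardrail_breached: bool,
) -> str:
    # Guardrails are monitored continuously and take priority.
    if guardrail_breached:
        return "pause"
    # Only analyze once both pre-registered conditions hold; no early peeking.
    if min(n_control, n_variant) >= rule.n_per_group and days_run >= rule.min_days:
        return "analyze"
    return "keep-running"


rule = StoppingRule(n_per_group=4439, min_days=14)
print(decide(rule, n_control=5000, n_variant=5010, days_run=9, guardrail_breached=False))
```

In the usage line the sample size is already reached, but the test keeps running because the minimum duration has not elapsed.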

6. Interpret results

Statistical significance is not business significance
Compare point estimate + confidence interval to decision threshold
Investigate novelty effects and segment heterogeneity
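
To compare a point estimate and confidence interval against a decision threshold, the interval has to be computed first. The sketch below uses an unpooled normal (Wald) interval for the difference of two conversion rates, which is one common choice and not necessarily what the skill's references prescribe. The function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def diff_ci(conv_c: int, n_c: int, conv_v: int, n_v: int, alpha: float = 0.05):
    """Return (difference, lower, upper) for variant rate minus control rate."""
    p_c = conv_c / n_c
    p_v = conv_v / n_v
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_v - p_c
    return diff, diff - z * se, diff + z * se


# Made-up counts: 480/4000 control conversions vs 560/4000 variant conversions.
diff, low, high = diff_ci(480, 4000, 560, 4000)
threshold = 0.01  # hypothetical practical-significance threshold (absolute)
print(f"diff={diff:.4f}  95% CI=({low:.4f}, {high:.4f})  threshold={threshold}")
```

With these counts the interval excludes zero but its lower bound sits below the 0.01 threshold, which illustrates why statistical significance alone does not settle the business decision.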

Hypothesis Quality Checklist

[ ] Contains explicit intervention and audience
[ ] Specifies measurable metric change
[ ] States plausible causal reason
[ ] Includes expected minimum effect
[ ] Defines failure condition

Common Experiment Pitfalls

Underpowered tests leading to false negatives
Running too many simultaneous changes without isolation
Changing targeting or implementation mid-test
Stopping early on random spikes
Ignoring sample ratio mismatch and instrumentation drift
Declaring success from p-value without effect-size context
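
Sample ratio mismatch can be screened with a chi-square goodness-of-fit test on the group counts. The sketch below uses only the standard library and relies on the fact that, with one degree of freedom, the chi-square tail probability equals `erfc(sqrt(chi2 / 2))`. The function name and the 0.001 alert threshold are assumptions, not something the skill specifies.

```python
import math


def srm_p_value(n_control: int, n_variant: int, expected_control_share: float = 0.5) -> float:
    """p-value for observed group sizes against the planned split (1 degree of freedom)."""
    total = n_control + n_variant
    exp_c = total * expected_control_share
    exp_v = total - exp_c
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_variant - exp_v) ** 2 / exp_v
    # Survival function of chi-square with 1 dof.
    return math.erfc(math.sqrt(chi2 / 2))


ALERT_THRESHOLD = 0.001  # assumed; a strict cutoff keeps false alarms rare

p = srm_p_value(5000, 5400)  # made-up counts for a planned 50/50 split
if p < ALERT_THRESHOLD:
    print(f"possible sample ratio mismatch (p={p:.2e}); check assignment and logging")
```

A flagged mismatch is a reason to investigate assignment and instrumentation before reading any metric results.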

Statistical Interpretation Guardrails

p-value < alpha indicates evidence against null, not guaranteed truth.
Confidence interval crossing zero/no-effect means uncertain directional claim.
Wide intervals imply low precision even when significant.
Use practical significance thresholds tied to business impact.
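
These guardrails can be turned into an explicit readout rule. The sketch below is one possible mapping from a confidence interval and a practical threshold to a label; the function, the labels, and the assumption that a higher metric value is better are all hypothetical.

```python
def classify(ci_low: float, ci_high: float, practical_threshold: float) -> str:
    """Map a CI for (variant - control) to a readout label; assumes higher is better."""
    if ci_low >= practical_threshold:
        # The whole interval clears the business threshold.
        return "ship"
    if ci_high <= 0:
        # The whole interval is at or below no-effect.
        return "do-not-ship"
    if ci_low > 0:
        # Direction is clear, but the effect may be too small to matter.
        return "positive-but-below-threshold"
    # The interval crosses zero: no directional claim.
    return "inconclusive"


print(classify(0.005, 0.035, practical_threshold=0.01))
```

Writing the mapping down before launch removes the temptation to reinterpret a borderline interval after the results arrive.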

See:

references/experiment-playbook.md
references/statistics-reference.md

Tooling

`scripts/sample_size_calculator.py`

Computes required sample size (per variant and total) from:

baseline rate
MDE (absolute or relative)
significance level (alpha)
statistical power

Example:

python3 scripts/sample_size_calculator.py \
  --baseline-rate 0.10 \
  --mde 0.015 \
  --mde-type absolute \
  --alpha 0.05 \
  --power 0.8
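
The same `--mde` number means different things depending on `--mde-type`, and the script's default is `relative`. The helper below mirrors how the script resolves the target rate; `resolve_target` is a hypothetical name used only for this illustration.

```python
def resolve_target(baseline: float, mde: float, mde_type: str) -> float:
    """Resolve the target rate the same way the calculator's main() does."""
    if mde_type == "absolute":
        return baseline + mde        # MDE given in absolute points
    if mde_type == "relative":
        return baseline * (1 + mde)  # MDE given as relative uplift
    raise ValueError("mde_type must be 'absolute' or 'relative'")


# From a 10% baseline, the same 0.015 resolves to very different targets.
print(resolve_target(0.10, 0.015, "absolute"))
print(resolve_target(0.10, 0.015, "relative"))
```

An absolute MDE of 0.015 targets roughly 11.5%, while a relative MDE of 0.015 targets roughly 10.15%, which needs a far larger sample to detect.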

#!/usr/bin/env python3
"""Calculate sample size for two-proportion A/B tests."""

import argparse
import math
import statistics


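def clamp_rate(value: float, name: str) -> float:
    """Validate that a rate lies strictly between 0 and 1; the value is not altered."""
    if value <= 0 or value >= 1:
        raise ValueError(f"{name} must be between 0 and 1 (exclusive).")
    return value


def required_sample_size_per_group(
    baseline_rate: float,
    target_rate: float,
    alpha: float,
    power: float,
) -> int:
    """Per-group n for a two-sided two-proportion test (normal approximation)."""
    delta = abs(target_rate - baseline_rate)
    if delta <= 0:
        raise ValueError("MDE resolves to zero; target and baseline must differ.")

    # Critical values for the two-sided significance level and the desired power.
    z_alpha = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = statistics.NormalDist().inv_cdf(power)
    # Variance is approximated using the average of the two rates.
    pooled = (baseline_rate + target_rate) / 2

    numerator = 2 * pooled * (1 - pooled) * (z_alpha + z_beta) ** 2
    n = numerator / (delta ** 2)
    # Round up so the computed power is not undershot.
    return math.ceil(n)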


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Compute sample size for two-proportion product experiments."
    )
    parser.add_argument("--baseline-rate", type=float, required=True)
    parser.add_argument(
        "--mde",
        type=float,
        required=True,
        help="Minimum detectable effect. Absolute points when --mde-type absolute, otherwise relative uplift.",
    )
    parser.add_argument("--mde-type", choices=["absolute", "relative"], default="relative")
    parser.add_argument("--alpha", type=float, default=0.05)
    parser.add_argument("--power", type=float, default=0.8)
    parser.add_argument(
        "--daily-samples",
        type=int,
        default=0,
        help="Optional total daily samples to estimate runtime in days.",
    )
    return parser.parse_args()


def main() -> int:
    args = parse_args()
    baseline = clamp_rate(args.baseline_rate, "baseline-rate")

    if args.mde <= 0:
        raise ValueError("mde must be > 0")
    if args.alpha <= 0 or args.alpha >= 1:
        raise ValueError("alpha must be between 0 and 1")
    if args.power <= 0 or args.power >= 1:
        raise ValueError("power must be between 0 and 1")

    if args.mde_type == "absolute":
        target = baseline + args.mde
    else:
        target = baseline * (1 + args.mde)

    target = clamp_rate(target, "target-rate")

    n_per_group = required_sample_size_per_group(
        baseline_rate=baseline,
        target_rate=target,
        alpha=args.alpha,
        power=args.power,
    )
    total_n = n_per_group * 2

    print("A/B Test Sample Size Estimate")
    print(f"baseline_rate: {baseline:.6f}")
    print(f"target_rate: {target:.6f}")
    print(f"mde_type: {args.mde_type}")
    print(f"alpha: {args.alpha}")
    print(f"power: {args.power}")
    print(f"n_per_group: {n_per_group}")
    print(f"n_total: {total_n}")

    if args.daily_samples > 0:
        days = math.ceil(total_n / args.daily_samples)
        print(f"estimated_days_at_daily_samples_{args.daily_samples}: {days}")

    return 0


if __name__ == "__main__":
    raise SystemExit(main())

How it compares

Use experiment-designer before writing experiment code or dashboards when the team needs metric pre-registration rather than ad-hoc analysis after results arrive.

FAQ

What experiment types does experiment-designer describe?

experiment-designer covers A/B tests for control-versus-variant decisions, multivariate tests for factor interactions with larger traffic needs, and holdout tests that leave a slice unexposed to measure incremental lift.

What stopping rules should be defined before launch?

experiment-designer requires pre-launch stopping rules including fixed sample size per group, minimum run duration to capture weekday and weekend behavior, and guardrail breach handling before interpreting results.

Is Experiment Designer safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
