
Experiment Designer
Design rigorous A/B or multivariate experiments with primary metrics, guardrails, and pre-registered stopping rules before you ship product or pricing changes.
Install
npx skills add https://github.com/alirezarezvani/claude-skills --skill experiment-designerWhat is this skill?
- Experiment types: A/B, multivariate, and holdout tests with when-to-use guidance
- Primary, guardrail, and diagnostic metric roles with explicit decision gates
- Pre-launch checklist including If/Then/Because hypothesis framing
- Stopping rules: fixed sample size, minimum duration, guardrail pause criteria
- Novelty and primacy effect mitigations for trustworthy ship decisions
Adoption & trust: 545 installs on skills.sh; 17.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
Recommended Skills
Journey fit
Grow analytics is the canonical shelf because the playbook centers on measuring lift, guardrails, and ship/no-ship decisions after you have traffic or users to experiment on. Analytics subphase matches metric design, cohort checks, and inference discipline rather than raw frontend implementation or ad-hoc brainstorming.
Common Questions / FAQ
Is Experiment Designer safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
READMESKILL.md - Experiment Designer
# Experiment Playbook ## Experiment Types ### A/B Test - Compare one control versus one variant. - Best for high-confidence directional decisions. ### Multivariate Test - Test combinations of multiple factors. - Useful for interaction effects, requires larger traffic. ### Holdout Test - Keep a percentage unexposed to intervention. - Useful for measuring incremental lift over broader changes. ## Metric Design ### Primary Metric - One metric that decides ship/no-ship. - Must align with user value and business objective. ### Guardrail Metrics - Prevent local optimization damage. - Examples: error rate, latency, churn proxy, support contacts. ### Diagnostic Metrics - Explain why change happened. - Do not use as decision gate unless pre-specified. ## Stopping Rules Define before launch: - Fixed sample size per group - Minimum run duration (to capture weekday/weekend behavior) - Guardrail breach thresholds (pause criteria) Avoid: - Continuous peeking with fixed-horizon inference - Changing success metric mid-test - Retroactive segmentation without correction ## Novelty and Primacy Effects - Novelty effect: short-term spike due to newness, not durable value. - Primacy effect: early exposure creates bias in user behavior. Mitigation: - Run long enough for behavior stabilization. - Check returning users and delayed cohorts separately. - Re-run key tests when stakes are high. ## Pre-Launch Checklist - [ ] Hypothesis complete (If/Then/Because) - [ ] Metric definitions frozen - [ ] Instrumentation validated - [ ] Randomization and assignment verified - [ ] Sample size and duration approved - [ ] Rollback plan documented ## Post-Test Readout Template 1. Hypothesis and scope 2. Experiment setup and quality checks 3. Primary metric effect size + confidence interval 4. Guardrail status 5. Segment-level observations (pre-registered only) 6. Decision: ship, iterate, or reject 7. Follow-up experiments # Statistics Reference for Product Managers ## p-value The p-value is the probability of observing data at least as extreme as yours if there were no true effect. - Small p-value means data is less consistent with "no effect". - It does not tell you the probability that the variant is best. ## Confidence Interval (CI) A CI gives a plausible range for the true effect size. - Narrow interval: more precise estimate. - Wide interval: uncertain estimate. - If CI includes zero (or no-effect), directional confidence is weak. ## Minimum Detectable Effect (MDE) The smallest effect worth detecting. - Set MDE by business value threshold, not wishful optimism. - Smaller MDE requires larger sample size. ## Statistical Power Power is the probability of detecting a true effect of at least MDE. - Common target: 80% (0.8) - Higher power increases sample requirements. ## Type I and Type II Errors - Type I (false positive): claim effect when none exists (controlled by alpha). - Type II (false negative): miss a real effect (controlled by power). ## Practical Significance An effect can be statistically significant but too small to matter. Always ask: - Does the effect clear implementation cost? - Does it move strategic KPIs materially? ## Power Analysis Inputs For conversion experiments (two proportions): - Baseline conversion rate - MDE (absolute points or relative uplift) - Alpha (e.g., 0.05) - Power (e.g., 0.8) Output: - Required sample size per variant - Total sample size - Approximate runtime based on traffic volume #!/usr/bin/env python3 """Calculate sample size for two-proportion A/B tests.""" import argparse import math import statistics def clamp_rate(value: float, name: str) -> float: if value <= 0 or value >= 1: raise ValueError(f"{name} must be between 0 and 1 (exclusive).") return value def required_sample_size_per_group( baseline_rate: float, target_rate: float, alpha: float, power: float, ) -> int: delta = abs(target_rate - baseline_rate) if delta <= 0: raise V