
Ab Test Analysis
Turn raw experiment exports into statistically defensible ship, extend, or stop decisions without guessing at significance.
Overview
ab-test-analysis is an agent skill, most often used in Grow (also Validate and Launch), that evaluates A/B test results with statistical rigor and returns a ship, extend, or stop recommendation.
Install
npx skills add https://github.com/phuryn/pm-skills --skill ab-test-analysis
What is this skill?
- Validates sample size, duration, SRM, and novelty effects before trusting lifts
- Computes conversion rates, confidence intervals, and significance with Python when needed
- Reads CSV, Excel, or analytics exports directly from user-supplied files
- Frames primary and guardrail metrics with explicit ship, extend, or stop recommendations
- Structured flow: hypothesis → setup validation → statistics → product decision
- Applies a sample-size formula and flags tests with less than 80% power as underpowered
- Checks that the test ran for at least 1–2 full business cycles
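The sample-size check can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own script. It uses the formula quoted in the SKILL.md, n = Z²(α/2) × 2 × p × (1 − p) / MDE², and assumes that MDE is an absolute difference in conversion rate and that n is the size of each arm. The names `required_n_per_arm` and `is_underpowered` and the optional `power` argument are hypothetical. The formula as quoted has no power term, so the `power` branch adds the z_β term from the common textbook two-proportion approximation to target 80% power.

```python
import math
from statistics import NormalDist


def required_n_per_arm(baseline_rate, mde_abs, alpha=0.05, power=None):
    """Approximate users needed per arm for a two-sided test on proportions.

    Assumptions (not from the skill): mde_abs is an absolute difference,
    e.g. 0.01 for a one-percentage-point change, and both arms are equal size.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # power=None reproduces the formula as quoted; power=0.8 adds z_beta.
    z_beta = NormalDist().inv_cdf(power) if power is not None else 0.0
    variance_term = 2 * baseline_rate * (1 - baseline_rate)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_term / mde_abs ** 2)


def is_underpowered(n_per_arm, baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Flag a test whose arm size is below what the target power requires."""
    return n_per_arm < required_n_per_arm(baseline_rate, mde_abs, alpha, power)


if __name__ == "__main__":
    print(required_n_per_arm(0.05, 0.01))             # formula as quoted
    print(required_n_per_arm(0.05, 0.01, power=0.8))  # with the power term
    print(is_underpowered(4000, 0.05, 0.01))
```

For a 5% baseline and a one-point absolute MDE, the quoted formula asks for about 3,650 users per arm, and adding the 80% power term roughly doubles that, so a test with 4,000 users per arm would be flagged.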
Adoption & trust: 1.1k installs on skills.sh; 12.3k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You finished an experiment but cannot tell if the lift is real, powered, or safe to ship on your primary and guardrail metrics.
Who is it for?
Solo builders with CSV or analytics exports who need a disciplined read before rolling out UI, pricing, or funnel changes.
Skip if: You have no primary metric, no traffic-split metadata, or a test too short to wash out novelty effects that you cannot extend.
When should I use this skill?
Evaluating experiment results, checking if a test reached significance, interpreting split test data, or deciding whether to ship a variant.
What do I get? / Deliverables
You get validated significance, interval estimates, and a documented ship, extend, or stop call grounded in sample size and test-health checks.
- Statistical summary with significance and confidence intervals
- Ship, extend, or stop recommendation with rationale
- Optional Python analysis script for reproducible calculations
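As a rough idea of what such a script might contain, the sketch below computes the statistical summary for one conversion metric with a two-tailed z-test, using only the Python standard library. The function name `two_proportion_summary` and its return keys are hypothetical, and the choice of a pooled standard error for the test and an unpooled one for the interval is a common convention assumed here, not something the skill specifies.

```python
import math
from statistics import NormalDist


def two_proportion_summary(conv_c, n_c, conv_v, n_v, alpha=0.05):
    """Two-tailed z-test and confidence interval for variant minus control.

    Assumes both arms have at least one conversion and one non-conversion;
    otherwise the standard error is zero and the test is undefined.
    """
    rate_c = conv_c / n_c
    rate_v = conv_v / n_v
    diff = rate_v - rate_c

    # Pooled standard error for the test statistic (null: equal rates).
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the interval around the observed difference.
    se_diff = math.sqrt(rate_c * (1 - rate_c) / n_c + rate_v * (1 - rate_v) / n_v)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    return {
        "control_rate": rate_c,
        "variant_rate": rate_v,
        "relative_lift_pct": diff / rate_c * 100,
        "p_value": p_value,
        "ci": (diff - z_crit * se_diff, diff + z_crit * se_diff),
        "significant": p_value < alpha,
    }


if __name__ == "__main__":
    # Made-up counts: 500/10,000 control conversions vs 600/10,000 variant.
    print(two_proportion_summary(500, 10_000, 600, 10_000))
```

The interval is for the absolute difference in conversion rate; an interval that excludes zero agrees with a significant two-tailed test at the same alpha in most cases, though the two can disagree near the boundary because they use different standard errors.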
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Grow because A/B conclusions compound user and revenue decisions after you have traffic, even though you may run tests earlier. Analytics is where split-test metrics, guardrails, and rollout calls belong in the Prism journey.
Where it fits
Decide whether an onboarding tweak beat control on activation without breaching revenue guardrails.
Judge a pricing-page A/B after validating power and business-cycle duration before committing list prices.
Compare headline variants on a launch landing page once traffic supports significance and SRM checks pass.
How it compares
Use instead of eyeballing dashboard deltas or spreadsheet percentages without power and SRM checks.
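To make the SRM part concrete, here is a minimal sketch of a sample ratio mismatch check for a two-arm test, written as a chi-squared goodness-of-fit test against the intended split. The function name `srm_check`, the default 50/50 split, and the 0.001 alert threshold are assumptions chosen for illustration; the skill does not state which threshold it uses.

```python
import math


def srm_check(n_control, n_variant, expected_control_share=0.5, threshold=0.001):
    """Return (chi2, p_value, mismatch_flag) for a two-arm traffic split."""
    total = n_control + n_variant
    expected_c = total * expected_control_share
    expected_v = total - expected_c
    chi2 = ((n_control - expected_c) ** 2 / expected_c
            + (n_variant - expected_v) ** 2 / expected_v)
    # Two arms give one degree of freedom, where the chi-squared survival
    # function has the closed form erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


if __name__ == "__main__":
    print(srm_check(5000, 5050))  # small imbalance, consistent with 50/50
    print(srm_check(5000, 5500))  # large imbalance, flagged as SRM
```

A dashboard showing 5,000 versus 5,500 users can look close enough by eye, yet under an intended 50/50 split that gap is very unlikely to arise by chance, which is why the lift from such a test should not be trusted until the assignment problem is understood.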
Common Questions / FAQ
Who is ab-test-analysis for?
Indie and solo product builders who run experiments themselves and need statistical guardrails before changing production.
When should I use ab-test-analysis?
Use it in Grow analytics when evaluating results; in Validate when testing pricing or scope hypotheses; and in Launch when judging distribution or GEO copy variants after enough traffic.
Is ab-test-analysis safe to install?
Review the Security Audits panel on this Prism page and restrict file access to exports you intend to analyze; generated Python should run in a trusted environment.
SKILL.md
## A/B Test Analysis

Evaluate A/B test results with statistical rigor and translate findings into clear product decisions.

### Context

You are analyzing A/B test results for **$ARGUMENTS**.

If the user provides data files (CSV, Excel, or analytics exports), read and analyze them directly. Generate Python scripts for statistical calculations when needed.

### Instructions

1. **Understand the experiment**:
   - What was the hypothesis?
   - What was changed (the variant)?
   - What is the primary metric? Any guardrail metrics?
   - How long did the test run?
   - What is the traffic split?

2. **Validate the test setup**:
   - **Sample size**: Is the sample large enough for the expected effect size?
     - Use the formula: n = (Z²(α/2) × 2 × p × (1 − p)) / MDE²
     - Flag if the test is underpowered (<80% power)
   - **Duration**: Did the test run for at least 1-2 full business cycles?
   - **Randomization**: Any evidence of sample ratio mismatch (SRM)?
   - **Novelty/primacy effects**: Was there enough time to wash out initial behavior changes?

3. **Calculate statistical significance**:
   - **Conversion rate** for control and variant
   - **Relative lift**: (variant - control) / control × 100
   - **p-value**: Using a two-tailed z-test or chi-squared test
   - **Confidence interval**: 95% CI for the difference
   - **Statistical significance**: Is p < 0.05?
   - **Practical significance**: Is the lift meaningful for the business?

   If the user provides raw data, generate and run a Python script to calculate these.

4. **Check guardrail metrics**:
   - Did any guardrail metrics (revenue, engagement, page load time) degrade?
   - A winning primary metric with degraded guardrails may not be a true win

5. **Interpret results**:

   | Outcome | Recommendation |
   |---|---|
   | Significant positive lift, no guardrail issues | **Ship it**: roll out to 100% |
   | Significant positive lift, guardrail concerns | **Investigate**: understand trade-offs before shipping |
   | Not significant, positive trend | **Extend the test**: need more data or larger effect |
   | Not significant, flat | **Stop the test**: no meaningful difference detected |
   | Significant negative lift | **Don't ship**: revert to control, analyze why |

6. **Provide the analysis summary**:

   ```
   ## A/B Test Results: [Test Name]

   **Hypothesis**: [What we expected]
   **Duration**: [X days] | **Sample**: [N control / M variant]

   | Metric | Control | Variant | Lift | p-value | Significant? |
   |---|---|---|---|---|---|
   | [Primary] | X% | Y% | +Z% | 0.0X | Yes/No |
   | [Guardrail] | ... | ... | ... | ... | ... |

   **Recommendation**: [Ship / Extend / Stop / Investigate]
   **Reasoning**: [Why]
   **Next steps**: [What to do]
   ```

Think step by step. Save as markdown. Generate Python scripts for calculations if raw data is provided.

---

### Further Reading

- [A/B Testing 101 + Examples](https://www.productcompass.pm/p/ab-testing-101-for-pms)
- [Testing Product Ideas: The Ultimate Validation Experiments Library](https://www.productcompass.pm/p/the-ultimate-experiments-library)
- [Are You Tracking the Right Metrics?](https://www.productcompass.pm/p/are-you-tracking-the-right-metrics)
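The outcome table in step 5 can be read as a small decision function. The sketch below is one possible encoding, not part of the skill itself: the name `recommend` and the `flat_band_pct` parameter are hypothetical. The table does not define where a "positive trend" ends and "flat" begins, so the 1% relative-lift band is an arbitrary placeholder, and a non-significant negative trend (which the table does not cover) is treated as "Stop the test" by assumption.

```python
def recommend(p_value, relative_lift_pct, guardrails_ok,
              alpha=0.05, flat_band_pct=1.0):
    """Map test results onto the five outcomes in the interpretation table."""
    significant = p_value < alpha

    if significant and relative_lift_pct > 0:
        return "Ship it" if guardrails_ok else "Investigate"
    if significant and relative_lift_pct < 0:
        return "Don't ship"

    # Not significant from here on.
    if relative_lift_pct > flat_band_pct:
        return "Extend the test"
    # Assumption: flat and non-significant negative results both stop the test.
    return "Stop the test"


if __name__ == "__main__":
    print(recommend(0.002, 20.0, guardrails_ok=True))
    print(recommend(0.002, 20.0, guardrails_ok=False))
    print(recommend(0.20, 4.0, guardrails_ok=True))
```

Practical significance from step 3 is deliberately left out of this function: whether a statistically significant lift is large enough to matter is a business judgment that a fixed threshold would only approximate.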