
Ab Testing
Design rigorous A/B or multivariate tests with hypotheses, metrics, and duration rules instead of guessing copy or CTA winners.
Overview
ab-testing is an agent skill that designs A/B experiments with hypotheses, metrics, sample-size reasoning, and peeking guardrails. It is most often used in the Grow phase and also fits Validate and Launch.
Install
npx skills add https://github.com/coreyhaines31/marketingskills --skill ab-testing
What is this skill?
- Builds hypotheses with observation, belief, expected outcome, and measurable metric
- Distinguishes A/B from multivariate test types and produces structured test-plan output
- Ties sample-size thinking to monthly visitors and baseline conversion rates
- Defines primary, secondary, and guardrail metrics for each experiment
- Warns against the peeking problem and recommends fixed test duration
- Eval scenarios reference ~15,000 monthly visitors and a 3.2% signup baseline for sample-size reasoning
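The sample-size reasoning in the list above can be sketched with the standard normal-approximation formula for comparing two proportions. This is a rough illustration, not the skill's own calculation: the function names, the 20% relative lift, the 5% significance level, and the 80% power are assumptions chosen for the example. Only the 15,000 monthly visitors and the 3.2% baseline come from the eval scenarios.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)        # rate we hope the variant reaches
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)   # unpooled variance of the difference
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_needed(n_per_variant, variants, monthly_visitors):
    """Whole weeks required to reach the sample size, assuming even daily traffic."""
    daily_visitors = monthly_visitors / 30
    days = n_per_variant * variants / daily_visitors
    return math.ceil(days / 7)                 # round up to cover full weekly cycles


# Assumed scenario: 3.2% baseline, hoping to detect a 20% relative lift.
n = sample_size_per_variant(baseline=0.032, relative_lift=0.20)
print(n, weeks_needed(n, variants=2, monthly_visitors=15_000))
```

Under these assumptions the estimate is roughly 13,000 visitors per variant, or about eight weeks of traffic at 15,000 visitors per month. A smaller expected lift raises the requirement sharply, because the sample size scales with the inverse square of the absolute difference between the two rates.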
Adoption & trust: 14.5k installs on skills.sh; 32.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want to test copy or UI changes but lack a hypothesis, a metric plan, or a sense of the sample size you need, so you risk declaring false winners by peeking at results early.
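Why peeking produces false winners can be illustrated with a small A/A simulation, where both arms share the same true conversion rate and any "significant" result is a false positive. This is a sketch under stated assumptions, not part of the skill: the function names, the 14-day duration, the 400 simulated runs, and the daily-check stopping rule are all hypothetical. The 250 visitors per arm per day corresponds to 15,000 monthly visitors split across two variants.

```python
import math
import random
from statistics import NormalDist


def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic; returns 0.0 when it is undefined."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (conv_b / n_b - conv_a / n_a) / se


def simulate_aa_tests(runs=400, days=14, daily_per_arm=250,
                      rate=0.032, alpha=0.05, seed=7):
    """Return (false-positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _day in range(days):
            # Both arms convert at the same true rate: there is no real winner.
            conv_a += sum(rng.random() < rate for _ in range(daily_per_arm))
            conv_b += sum(rng.random() < rate for _ in range(daily_per_arm))
            n += daily_per_arm
            if abs(z_stat(conv_a, n, conv_b, n)) > z_crit:
                ever_significant = True    # a daily peeker would have stopped here
        peeking_hits += ever_significant
        fixed_hits += abs(z_stat(conv_a, n, conv_b, n)) > z_crit
    return peeking_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"peeking daily: {peeking_rate:.1%}, single final look: {fixed_rate:.1%}")
```

Because the final day counts as one of the looks, the peeking rate can never be lower than the fixed-duration rate, and in practice it is typically several times the nominal 5%.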
Who is it for?
Solo builders with meaningful web traffic optimizing signup, pricing, or homepage conversion with disciplined experiments.
Skip if: Pre-traffic idea validation with no baseline rate, one-off brand storytelling with no measurable outcome, or teams forbidding any live experimentation.
When should I use this skill?
Planning A/B or multivariate experiments on pages, copy, or CTAs when you have traffic and a measurable baseline conversion metric.
What do I get? / Deliverables
You leave with a structured test plan: hypothesis, variants, primary and guardrail metrics, duration guidance, and execution notes tied to your traffic baseline.
- Structured experiment plan with hypothesis and variant definitions
- Primary, secondary, and guardrail metric list
- Duration and peeking guidance aligned to traffic
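One way to picture the deliverables listed above is as a small data structure. The sketch below is illustrative only: the class names, field names, observation text, variant headline, and secondary metrics are hypothetical and do not reflect the skill's actual output format. The sentence template follows the hypothesis framework quoted in the skill's eval scenarios.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    observation: str
    change: str
    outcome: str
    metric: str

    def statement(self) -> str:
        """Render the hypothesis in the 'Because ..., we believe ...' form."""
        return (f"Because {self.observation}, we believe {self.change} "
                f"will cause {self.outcome}, which we'll measure by {self.metric}.")


@dataclass
class ExperimentPlan:
    hypothesis: Hypothesis
    variants: list[str]
    primary_metric: str
    secondary_metrics: list[str] = field(default_factory=list)
    guardrail_metrics: list[str] = field(default_factory=list)
    duration_weeks: int = 2      # placeholder; fix this before the test starts

    def test_type(self) -> str:
        """Two variants is a plain A/B test; more than two is A/B/n."""
        return "A/B" if len(self.variants) == 2 else "A/B/n"


# Hypothetical plan for the homepage headline scenario.
plan = ExperimentPlan(
    hypothesis=Hypothesis(
        observation="visitors leave the homepage without scrolling",
        change="a benefit-focused headline",
        outcome="more visitors to start signup",
        metric="signup rate",
    ),
    variants=["The All-in-One Project Management Tool", "benefit-focused headline"],
    primary_metric="signup rate",
    secondary_metrics=["CTA click-through rate"],
    guardrail_metrics=["bounce rate"],
    duration_weeks=8,
)
print(plan.test_type())
print(plan.hypothesis.statement())
```

Keeping the duration as an explicit field makes the pre-test commitment visible, which is the discipline the peeking guidance depends on.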
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Experimentation compounds once you have traffic and baselines, so the canonical shelf is Grow analytics even though tests can start earlier. The skill centers on measurement design, sample size, and metric guardrails, not on building the page itself or buying ads.
Where it fits
Draft a headline A/B plan before scaling ad spend on a waitlist page with a stated signup baseline.
Compare benefit-focused hero copy against feature-led copy for a launch landing page with guardrails on bounce rate.
Run an A/B/n CTA color test on the pricing page, with checkout starts as the primary metric and a fixed duration.
Test onboarding email subject lines with secondary metrics on activation, not just open rate.
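Tests with more than two cells, whether several variants of one element or combinations of several elements, split the same traffic more ways and take correspondingly longer. The sketch below enumerates the cells of a multivariate test and estimates the duration. The helper names, the two options per element, and the 13,000 visitors per cell are placeholder assumptions for illustration, not figures from the skill.

```python
import math
from itertools import product


def mvt_cells(elements):
    """List every combination of options as a dict of element -> chosen option."""
    names = list(elements)
    return [dict(zip(names, combo))
            for combo in product(*(elements[name] for name in names))]


def weeks_to_fill(cells, visitors_per_cell, monthly_visitors):
    """Whole weeks needed to give every cell its sample, assuming even traffic."""
    daily_visitors = monthly_visitors / 30
    return math.ceil(cells * visitors_per_cell / daily_visitors / 7)


# Hypothetical landing-page test: two options for each of three elements.
cells = mvt_cells({
    "headline": ["feature-led", "benefit-focused"],
    "hero_image": ["product screenshot", "customer photo"],
    "cta_button": ["Start free trial", "Get started"],
})
print(len(cells))                                  # 2 * 2 * 2 combinations
print(weeks_to_fill(2, 13_000, 15_000))            # plain A/B test
print(weeks_to_fill(len(cells), 13_000, 15_000))   # full multivariate test
```

With these placeholder numbers, the eight-cell test needs about 30 weeks against about 8 for a two-cell test, which is why reducing the number of variants is often the practical recommendation at modest traffic levels.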
How it compares
Use it for experiment design and statistical discipline, not for generating ad creative alone or implementing feature flags in code.
Common Questions / FAQ
Who is ab-testing for?
Indie SaaS and content founders who run their own growth loops and need agent help drafting statistically sane A/B or multivariate tests.
When should I use ab-testing?
In Validate when testing landing value props, at Launch when tuning signup flows, and in Grow when optimizing headlines, CTAs, or pricing pages with real traffic baselines.
Is ab-testing safe to install?
Review the Security Audits panel on this Prism page. The skill may reference your marketing docs, so avoid pasting credentials or private revenue data into prompts.
SKILL.md
{
  "skill_name": "ab-testing",
  "evals": [
    {
      "id": 1,
      "prompt": "I want to A/B test our homepage headline. We currently say 'The All-in-One Project Management Tool' and want to test something benefit-focused. We get about 15,000 visitors/month and our current signup rate is 3.2%.",
      "expected_output": "Should check for product-marketing.md first. Should build a proper hypothesis using the framework: 'Because [observation], we believe [change] will cause [outcome], which we'll measure by [metric].' Should identify this as an A/B test (two variants). Should calculate or reference sample size needs based on 15,000 monthly visitors and 3.2% baseline. Should define primary metric (signup rate), secondary metrics, and guardrail metrics. Should warn about the peeking problem and recommend a fixed test duration. Should provide the test plan in the structured output format.",
      "assertions": [
        "Checks for product-marketing.md",
        "Uses the hypothesis framework with observation, belief, outcome, and metric",
        "Identifies as A/B test type",
        "Addresses sample size calculation based on traffic and baseline rate",
        "Defines primary metric (signup rate)",
        "Defines secondary and guardrail metrics",
        "Warns about the peeking problem",
        "Provides structured test plan output"
      ],
      "files": []
    },
    {
      "id": 2,
      "prompt": "we want to test like 4 different CTA button colors on our pricing page. is that a good idea?",
      "expected_output": "Should trigger on casual phrasing. Should identify this as an A/B/n test (multiple variants). Should caution that testing 4 variants requires significantly more traffic than a simple A/B test. Should reference the sample size quick reference showing traffic multipliers for multiple variants. Should question whether button color alone is likely to produce meaningful lift vs testing CTA copy, placement, or surrounding context. Should recommend either reducing to 2 variants or ensuring sufficient traffic. Should still provide hypothesis framework and test setup if proceeding.",
      "assertions": [
        "Triggers on casual phrasing",
        "Identifies as A/B/n test (multiple variants)",
        "Cautions about increased traffic needs for 4 variants",
        "References sample size requirements",
        "Questions whether button color alone is high-impact",
        "Suggests alternative higher-impact elements to test",
        "Provides hypothesis framework"
      ],
      "files": []
    },
    {
      "id": 3,
      "prompt": "Our test has been running for 3 days and Variant B is winning with 95% confidence. Should we call it?",
      "expected_output": "Should immediately address the peeking problem. Should explain that checking results early inflates false positive rates. Should recommend running for the full pre-calculated duration regardless of early results. Should explain why early significance can be misleading (regression to the mean, day-of-week effects, audience mix shifts). Should provide guidance on when it IS appropriate to stop early (sequential testing methods). Should recommend the pre-test commitment to duration.",
      "assertions": [
        "Addresses the peeking problem directly",
        "Explains why early significance is misleading",
        "Recommends running for full pre-calculated duration",
        "Mentions day-of-week effects or audience mix shifts",
        "Explains false positive rate inflation from peeking",
        "Mentions sequential testing as alternative approach"
      ],
      "files": []
    },
    {
      "id": 4,
      "prompt": "Help me set up a multivariate test on our landing page. I want to test the headline, hero image, and CTA button simultaneously.",
      "expected_output": "Should identify this as a Multivariate Test (MVT). Should explain that MVT tests combinations of elements and requires much more traffic than A/B tests. Should calculate