
Statistical Analyst
Choose and interpret the right frequentist tests for A/B and conversion experiments so solo builders do not misread p-values or sample size.
Overview
Statistical Analyst is an agent skill most often used in Grow (also Validate/pricing, Ship/testing) that applies frequentist hypothesis tests and clear p-value interpretation for conversion and experiment analysis.
Install
npx skills add https://github.com/alirezarezvani/claude-skills --skill statistical-analyst
What is this skill?
- Frequentist framework: null/alternative hypotheses, p-values, and pre-set α significance
- Type I/Type II error table with typical α=0.05 and power=80% (β=0.20) guidance
- Two-proportion z-test path for binary conversion comparisons with stated assumptions
- Reference depth on p-value interpretation misconceptions for agent-grounded answers
- Pairs with lean SKILL.md plus extended statistical concepts reference document
- Typical α significance level 0.05 documented
- Typical β false-negative rate 0.20 (80% power) documented
- Two-proportion z-test includes n×p ≥ 5 assumption guidance
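As a rough illustration of the z-test path and the n×p ≥ 5 check listed above, here is a minimal Python sketch. It uses the pooled-proportion statistic from the skill's reference; the function name `two_proportion_z_test` and the conversion counts are hypothetical and not part of the skill itself.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with a pooled standard error.

    x1, x2 are conversion counts; n1, n2 are group sizes (control, treatment).
    """
    # Normal-approximation check: n*p >= 5 and n*(1-p) >= 5 in both groups.
    # With the observed rate p = x/n, these reduce to x >= 5 and n - x >= 5.
    for x, n in ((x1, n1), (x2, n2)):
        if x < 5 or n - x < 5:
            raise ValueError("normal approximation not valid for these counts")

    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: 100/1000 control conversions vs 130/1000 treatment.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

The p-value printed here is the probability of data at least this extreme if there were no difference, not the probability that the effect is real.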
Adoption & trust: 512 installs on skills.sh; 17.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You ran an A/B test or compared two conversion rates but are unsure which test applies or what the p-value actually means.
Who is it for?
Indie SaaS founders analyzing signup, click, or paywall experiments with moderate sample sizes who want agent help on z-tests and error tradeoffs.
Skip if: Bayesian-only workflows, clinical trials requiring formal biostat review, or teams that need automated experiment platforms instead of interpretive guidance.
When should I use this skill?
Use it when you compare experiment arms or conversion rates, or when you need help choosing and interpreting a statistical test in a frequentist framework.
What do I get? / Deliverables
You get statistically framed test selection, assumption checks, and interpretation aligned to frequentist practice so decisions are not based on common p-value myths.
- Recommended test (e.g. two-proportion z-test) with assumptions stated
- Null/alternative hypothesis framing and error-type context
- Plain-language interpretation guardrails for p-values and decisions
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Experiment analysis and metric interpretation are central once you ship and measure; Grow/analytics is the canonical shelf for ongoing conversion and product metrics work. The Analytics subphase covers hypothesis tests, power, and error types applied to user-facing experiments, not one-off unit tests in CI.
Where it fits
Compare onboarding variant conversion with a two-proportion z-test and correct p-value wording for a launch retrospective.
Frame a paywall copy experiment with α and power tradeoffs before committing to a price change.
Sanity-check whether a rollout metric difference meets z-test assumptions before calling a win.
Interpret early campaign landing-page tests without overclaiming certainty from small samples.
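For the power-tradeoff and small-sample cases above, a minimal Python sketch of two helpers may be useful. It assumes the proportions sample-size formula and the Wilson score interval from the skill's statistical reference; the function names, default values, and example rates are hypothetical, and the sample size returned is per arm.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided comparison of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_beta, power = 1 - beta
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a single proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    margin = z / denom * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
    # Clamp only to absorb floating-point rounding at the 0 and 1 edges.
    return max(0.0, center - margin), min(1.0, center + margin)

# Hypothetical plan: detect a lift from 10% to 12% at alpha = 0.05, 80% power.
print(sample_size_per_arm(0.10, 0.12))
# Hypothetical early read: 5 conversions out of 50 visitors.
print(wilson_interval(5, 50))
```

A wide Wilson interval on an early read is a direct signal that the sample is still too small to call a result.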
How it compares
Use it as an interpretive statistical reference for agents, not as a one-click analytics product or a generic code-review skill.
Common Questions / FAQ
Who is statistical-analyst for?
Solo builders and small teams who measure product experiments and want an agent grounded in frequentist testing concepts when discussing results or implementation.
When should I use statistical-analyst?
In Grow/analytics when reviewing live experiment results; in Validate/pricing when comparing offer variants; in Ship/testing when validating rollout metrics—whenever you compare two proportions or need H₀/H₁ framing.
Is statistical-analyst safe to install?
The skill is reference-oriented and does not imply audited medical-grade tooling; review the Security Audits panel on this Prism page and avoid pasting sensitive user-level datasets into chats you do not control.
SKILL.md
# Statistical Testing Concepts Reference

Deep-dive reference for the Statistical Analyst skill. Keeps SKILL.md lean while preserving the theory.

---

## The Frequentist Framework

All tests in this skill operate in the **frequentist framework**: we define a null hypothesis (H₀) and an alternative (H₁), then ask "how often would we see data this extreme if H₀ were true?"

- **H₀ (null):** No difference exists between control and treatment
- **H₁ (alternative):** A difference exists (two-tailed)
- **p-value:** P(observing this result or more extreme | H₀ is true)
- **α (significance level):** The threshold we set in advance. Reject H₀ if p < α.

### The p-value misconception

A p-value of 0.03 does **not** mean "there is a 97% chance the effect is real." It means: "If there were no effect, we would see data this extreme only 3% of the time."

---

## Type I and Type II Errors

| | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | **Type I Error (α)**: False Positive | Correct (Power = 1−β) |
| Fail to reject H₀ | Correct | **Type II Error (β)**: False Negative |

- **α** (false positive rate): Typically 0.05. Reduce it when false positives are costly (medical trials, irreversible changes).
- **β** (false negative rate): Typically 0.20 (power = 80%). Reduce it when missing real effects is costly.

---

## Two-Proportion Z-Test

**When:** Comparing two binary conversion rates (e.g. clicked/not, signed up/not).

**Assumptions:**

- Independent samples
- n×p ≥ 5 and n×(1−p) ≥ 5 for both groups (normal approximation valid)
- No interference between units (SUTVA)

**Formula:**

```
z = (p̂₂ − p̂₁) / √[p̄(1−p̄)(1/n₁ + 1/n₂)]
where p̄ = (x₁ + x₂) / (n₁ + n₂)   (pooled proportion)
```

**Effect size (Cohen's h):**

```
h = 2 arcsin(√p₂) − 2 arcsin(√p₁)
```

The arcsine transformation stabilizes variance across different baseline rates.

---

## Welch's Two-Sample t-Test

**When:** Comparing means of a continuous metric between two groups (revenue, latency, session length).
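
A minimal sketch of such a comparison of means, assuming two plain Python lists of per-user values. It computes only Welch's t statistic and the Welch–Satterthwaite degrees of freedom, following the formulas given in this section; the function name `welch_t` and the sample values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(control, treatment):
    """Return (t, df) for Welch's two-sample t-test.

    Uses sample variances (n - 1 denominator) and does not assume
    equal variances between the two groups.
    """
    n1, n2 = len(control), len(treatment)
    # Per-group variance of the sample mean: s^2 / n.
    v1 = variance(control) / n1
    v2 = variance(treatment) / n2
    t = (mean(treatment) - mean(control)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical session lengths (minutes) for two small groups.
control = [10, 12, 11, 13, 9]
treatment = [14, 15, 13, 16, 17]
t, df = welch_t(control, treatment)
print(f"t = {t:.3f}, df = {df:.2f}")
```

Turning t and df into a p-value needs the t distribution, which the Python standard library does not provide; SciPy's `scipy.stats.ttest_ind(treatment, control, equal_var=False)` performs the full Welch test if that dependency is acceptable.
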

**Why Welch's (not Student's):** Welch's t-test does not assume equal variances; it is strictly more general and loses little power when variances are equal. Always prefer it.

**Formula:**

```
t = (x̄₂ − x̄₁) / √(s₁²/n₁ + s₂²/n₂)

Welch–Satterthwaite df:
df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]
```

**Effect size (Cohen's d):**

```
d = (x̄₂ − x̄₁) / s_pooled
s_pooled = √[((n₁−1)s₁² + (n₂−1)s₂²) / (n₁+n₂−2)]
```

**Warning for heavy-tailed metrics (revenue, LTV):** Mean tests are sensitive to outliers. If the distribution has heavy tails, consider:

1. Winsorizing at 99th percentile before testing
2. Log-transforming (if values are positive)
3. Using a non-parametric test (Mann-Whitney U) and flagging for human review

---

## Chi-Square Test

**When:** Comparing categorical distributions (e.g. which plan users selected, which error type occurred).

**Assumptions:**

- Expected count ≥ 5 per cell (otherwise, combine categories or use Fisher's exact)
- Independent observations

**Formula:**

```
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
df = k − 1            (goodness-of-fit)
df = (r−1)(c−1)       (contingency table, r rows, c columns)
```

**Effect size (Cramér's V):**

```
V = √[χ² / (n × (min(r,c) − 1))]
```

---

## Wilson Score Interval

The standard confidence interval formula for proportions (`p̂ ± z√(p̂(1−p̂)/n)`) can produce impossible values (< 0 or > 1) for small n or extreme p. The Wilson score interval fixes this:

```
center = (p̂ + z²/(2n)) / (1 + z²/n)
margin = z/(1 + z²/n) × √(p̂(1−p̂)/n + z²/(4n²))
CI = [center − margin, center + margin]
```

Always use Wilson (or Clopper-Pearson) for proportions. The normal approximation is a historical artifact.

---

## Sample Size & Power

**Power:** The probability of correctly detecting a real effect of size δ.

```
n = (z_α/2 + z_β)² × (σ₁² + σ₂²) / δ²                          [means]
n = (z_α/2 + z_β)² × (p₁(1−p₁) + p₂(1−p₂)) / (p₂−p₁)²          [proportions]
```

**Key levers:**

- Increase n → more power (or detect smaller effects)
- Increase MDE → smaller