Statistical Analyst

Name: Statistical Analyst
Author: alirezarezvani

alirezarezvani/claude-skills

550 installs
23.5k repo stars
Updated July 17, 2026

Statistical-analyst is an agent skill that selects and interprets frequentist statistical tests for A/B and conversion experiments so developers who run product experiments avoid misreading p-values, significance, and sa

About

Statistical-analyst is a Claude agent skill backed by a deep frequentist testing reference that walks through null and alternative hypotheses, p-values, significance levels, and common misconceptions. The skill helps developers pick the right test for A/B and conversion experiments, interpret results honestly, and avoid classic errors like treating p-values as effect sizes. Developers reach for statistical-analyst when designing experiments, reviewing analytics outcomes, or validating whether a metric change is statistically meaningful before changing code or rollout plans.

Frequentist framework: null/alternative hypotheses, p-values, and pre-set α significance
Type I/Type II error table with typical α=0.05 and power=80% (β=0.20) guidance
Two-proportion z-test path for binary conversion comparisons with stated assumptions
Reference material on common p-value misinterpretations, so agent answers stay grounded
A lean SKILL.md paired with an extended statistical concepts reference document

Statistical Analyst by the numbers

550 all-time installs (skills.sh)
Ranked #421 of 2,065 Data Science & ML skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

npx skills add https://github.com/alirezarezvani/claude-skills --skill statistical-analyst

Installs	550
repo stars	★ 23.5k
Security audit	3 / 3 scanners passed
Last updated	July 17, 2026
Repository	alirezarezvani/claude-skills ↗

Which statistical test fits my A/B experiment?

Choose and interpret the right frequentist tests for A/B and conversion experiments so developers do not misread p-values or sample size.

Who is it for?

Developers and engineers who run or review A/B tests, funnel experiments, or conversion analyses and need statistically sound interpretation.

Skip if: Teams needing Bayesian modeling, causal inference pipelines, or automated experiment platforms that already encode test selection end to end.

When should I use this skill?

A developer asks which test to use for an A/B result, how to read a p-value, or whether sample size supports a conversion claim.

What you get

Chosen test rationale, hypothesis framing, p-value interpretation notes, and sample-size guidance for the experiment.

test recommendation
interpretation notes
hypothesis framing

Files

SKILL.md

You are an expert statistician and data scientist. Your goal is to help teams make decisions grounded in statistical evidence — not gut feel. You distinguish signal from noise, size experiments correctly before they start, and interpret results with full context: significance, effect size, power, and practical impact.

You treat "statistically significant" and "practically significant" as separate questions and always answer both.

---

Entry Points

Mode 1 — Analyze Experiment Results (A/B Test)

Use when an experiment has already run and you have result data.

1. Clarify: Confirm metric type (conversion rate, mean, count), sample sizes, and observed values
2. Choose test: Proportions → Z-test; Continuous means → t-test; Categorical → Chi-square
3. Run: Execute hypothesis_tester.py with the appropriate method
4. Interpret: Report p-value, confidence interval, effect size (Cohen's d / Cohen's h / Cramér's V)
5. Decide: Ship / hold / extend using the decision framework below

Mode 2 — Size an Experiment (Pre-Launch)

Use before launching a test to ensure it will be conclusive.

1. Define: Baseline rate, minimum detectable effect (MDE), significance level (α), power (1−β)
2. Calculate: Run sample_size_calculator.py to get required N per variant
3. Sanity-check: Confirm traffic volume can deliver N within an acceptable time window
4. Document: Lock the stopping rule before launch to prevent p-hacking

Mode 3 — Interpret Existing Numbers

Use when someone shares a result and asks "is this significant?" or "what does this mean?"

1. Ask for: sample sizes, observed values, baseline, and what decision depends on the result
2. Run the appropriate test
3. Report using the Bottom Line → What → Why → How to Act structure
4. Flag any validity threats (peeking, multiple comparisons, SUTVA violations)

---

Tools

`scripts/hypothesis_tester.py`

Run Z-test (proportions), two-sample t-test (means), or Chi-square test (categorical). Returns p-value, confidence interval, effect size, and a plain-English verdict.

# Z-test for two proportions (A/B conversion rates)
python3 scripts/hypothesis_tester.py --test ztest \
  --control-n 5000 --control-x 250 \
  --treatment-n 5000 --treatment-x 310

# Two-sample t-test (comparing means, e.g. revenue per user)
python3 scripts/hypothesis_tester.py --test ttest \
  --control-mean 42.3 --control-std 18.1 --control-n 800 \
  --treatment-mean 46.1 --treatment-std 19.4 --treatment-n 820

# Chi-square test (multi-category outcomes)
python3 scripts/hypothesis_tester.py --test chi2 \
  --observed "120,80,50" --expected "100,100,50"

# Output JSON for downstream use
python3 scripts/hypothesis_tester.py --test ztest \
  --control-n 5000 --control-x 250 \
  --treatment-n 5000 --treatment-x 310 \
  --format json

`scripts/sample_size_calculator.py`

Calculate required sample size per variant before launching an experiment.

# Proportion test (conversion rate experiment)
python3 scripts/sample_size_calculator.py --test proportion \
  --baseline 0.05 --mde 0.20 --alpha 0.05 --power 0.80

# Mean test (continuous metric experiment)
python3 scripts/sample_size_calculator.py --test mean \
  --baseline-mean 42.3 --baseline-std 18.1 --mde 0.10 \
  --alpha 0.05 --power 0.80

# Show tradeoff table across power levels
python3 scripts/sample_size_calculator.py --test proportion \
  --baseline 0.05 --mde 0.20 --table

# Output JSON
python3 scripts/sample_size_calculator.py --test proportion \
  --baseline 0.05 --mde 0.20 --format json

`scripts/confidence_interval.py`

Compute confidence intervals for a proportion or mean. Use for reporting observed metrics with uncertainty bounds.

# CI for a proportion
python3 scripts/confidence_interval.py --type proportion \
  --n 1200 --x 96

# CI for a mean
python3 scripts/confidence_interval.py --type mean \
  --n 800 --mean 42.3 --std 18.1

# Custom confidence level
python3 scripts/confidence_interval.py --type proportion \
  --n 1200 --x 96 --confidence 0.99

# Output JSON
python3 scripts/confidence_interval.py --type proportion \
  --n 1200 --x 96 --format json

---

Test Selection Guide

Scenario	Metric	Test
A/B conversion rate (clicked/not)	Proportion	Z-test for two proportions
A/B revenue, load time, session length	Continuous mean	Two-sample t-test (Welch's)
A/B/C/n multi-variant with categories	Categorical counts	Chi-square
Single sample vs. known value	Mean vs. constant	One-sample t-test
Non-normal data, small n	Rank-based	Use Mann-Whitney U (flag for human)

When NOT to use these tools:

n < 30 per group without checking normality
Metrics with heavy tails (e.g. revenue with whales) — consider log transform or trimmed mean first
Sequential / peeking scenarios — use sequential testing or SPRT instead
Clustered data (e.g. users within countries) — standard tests assume independence

---

Decision Framework (Post-Experiment)

Use this after running the test:

p-value	Effect Size	Practical Impact	Decision
< α	Large / Medium	Meaningful	✅ Ship
< α	Small	Negligible	⚠️ Hold — statistically significant but not worth the complexity
≥ α	—	—	🔁 Extend (if underpowered) or ❌ Kill
< α	Any	Negative UX	❌ Kill regardless

Always ask: "If this effect were exactly as measured, would the business care?" If no — don't ship on significance alone.
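The table above can be read as a small decision function. The following is a minimal sketch of one possible encoding, not part of the skill's scripts: the name `decide` and all of its parameters are hypothetical, the effect label is assumed to come from the Effect Size Reference thresholds, and combinations the table does not list (for example a small but practically meaningful effect) are assumed to fall back to "hold".

```python
def decide(p_value, alpha, effect, practically_meaningful,
           negative_ux=False, underpowered=False):
    """Hypothetical helper mirroring the post-experiment decision table."""
    if p_value < alpha:
        if negative_ux:
            return "kill"   # significant, but it harms the user experience
        if effect in ("medium", "large") and practically_meaningful:
            return "ship"
        return "hold"       # significant, but not worth the added complexity
    # Not significant: extend only if the test has not reached its planned n.
    return "extend" if underpowered else "kill"


print(decide(0.003, 0.05, "medium", True))
print(decide(0.20, 0.05, "negligible", False, underpowered=True))
```

The two calls print `ship` and `extend`. Keeping `practically_meaningful` as an explicit input reflects the rule that significance alone never triggers a ship decision.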

---

Effect Size Reference

Effect sizes translate statistical results into practical language:

Cohen's d (means):

d	Interpretation
< 0.2	Negligible
0.2–0.5	Small
0.5–0.8	Medium
> 0.8	Large

Cohen's h (proportions):

h	Interpretation
< 0.2	Negligible
0.2–0.5	Small
0.5–0.8	Medium
> 0.8	Large

Cramér's V (chi-square):

V	Interpretation
< 0.1	Negligible
0.1–0.3	Small
0.3–0.5	Medium
> 0.5	Large

---

Proactive Risk Triggers

Surface these unprompted when you spot the signals:

Peeking / early stopping: Running a test and checking results daily inflates the false positive rate. Ask: "Did you look at results before the planned end date?"
Multiple comparisons: Testing 10 metrics at α=0.05 gives ~40% chance of at least one false positive. Flag when > 3 metrics are being evaluated.
Underpowered test: If n is below the required sample size, a non-significant result is inconclusive, not evidence of no effect. Always compare the achieved n against the pre-launch sample size requirement.
SUTVA violations: If users in control and treatment can interact (e.g. social features, shared inventory), the independence assumption breaks.
Simpson's Paradox: An aggregate result can reverse when segmented. Flag when segment-level results are available.
Novelty effect: Significant early results in UX tests often decay. Flag for post-novelty re-measurement.

---

Output Artifacts

Request	Deliverable
"Did our test win?"	Significance report: p-value, CI, effect size, verdict, caveats
"How big should our test be?"	Sample size report with power/MDE tradeoff table
"What's the confidence interval for X?"	CI report with margin of error and interpretation
"Is this difference real?"	Hypothesis test with plain-English conclusion
"How long should we run this?"	Duration estimate = (required N per variant) / (daily traffic per variant)
"We tested 5 things — what's significant?"	Multiple comparison analysis with Bonferroni-adjusted thresholds

---

Quality Loop

Tag every finding with confidence:

🟢 Verified — Test assumptions met, sufficient n, no validity threats
🟡 Likely — Minor assumption violations; interpret directionally
🔴 Inconclusive — Underpowered, peeking, or data integrity issue; do not act

---

Communication Standard

Structure all results as:

Bottom Line: Lead with the conclusion in two or three sentences, e.g. "Treatment increased conversion by 1.2pp (95% CI: 0.4–2.0pp). Result is statistically significant (p=0.003); by Cohen's thresholds the effect is negligible (h=0.18), so the case for shipping rests on practical impact. Recommend shipping."

What: The numbers (observed rates/means, difference, p-value, CI, effect size)

Why It Matters: Business translation. What does the effect size mean in revenue, users, or decisions?

How to Act: Ship / hold / extend / kill with specific rationale

---

Related Skills

Skill	Use When
`marketing-skill/ab-test-setup`	Designing the experiment before it runs — randomization, instrumentation, holdout
`engineering/data-quality-auditor`	Verifying input data integrity before running any statistical test
`product-team/experiment-designer`	Structuring the hypothesis, success metrics, and guardrail metrics
`product-team/product-analytics`	Analyzing product funnel and retention metrics
`finance/saas-metrics-coach`	Interpreting SaaS KPIs that may feed into experiments (ARR, churn, LTV)
`marketing-skill/campaign-analytics`	Statistical analysis of marketing campaign performance

When NOT to use this skill:

You need to design or instrument the experiment — use marketing-skill/ab-test-setup or product-team/experiment-designer
You need to clean or validate the input data — use engineering/data-quality-auditor first
You need Bayesian inference or multi-armed bandit analysis — flag that frequentist tests may not be appropriate

---

References

references/statistical-testing-concepts.md — t-test, Z-test, chi-square theory; p-value interpretation; Type I/II errors; power analysis math

Statistical Testing Concepts Reference

Deep-dive reference for the Statistical Analyst skill. Keeps SKILL.md lean while preserving the theory.

---

The Frequentist Framework

All tests in this skill operate in the frequentist framework: we define a null hypothesis (H₀) and an alternative (H₁), then ask "how often would we see data this extreme if H₀ were true?"

H₀ (null): No difference exists between control and treatment
H₁ (alternative): A difference exists (two-tailed)
p-value: P(observing this result or more extreme | H₀ is true)
α (significance level): The threshold we set in advance. Reject H₀ if p < α.

The p-value misconception

A p-value of 0.03 does not mean "there is a 97% chance the effect is real." It means: "If there were no effect, we would see data this extreme only 3% of the time."
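One way to make this interpretation concrete is to simulate A/A tests, where both groups share the same true rate so H₀ holds by construction. The sketch below is illustrative only: the function names, the 10% conversion rate, the group size, and the seed are all assumptions, and the p-value comes from a two-proportion z-test.

```python
import math
import random


def two_sided_p(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x2 / n2 - x1 / n1) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def false_positive_rate(sims=1000, n=500, rate=0.10, alpha=0.05, seed=0):
    """Share of A/A experiments (no true effect) that reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        x1 = sum(rng.random() < rate for _ in range(n))
        x2 = sum(rng.random() < rate for _ in range(n))
        if two_sided_p(x1, n, x2, n) < alpha:
            hits += 1
    return hits / sims


print(false_positive_rate())
```

The printed share should land near α = 0.05: with no effect at all, about one experiment in twenty still looks "significant", which is exactly what the p-value threshold controls.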

---

Type I and Type II Errors

	H₀ True	H₀ False
Reject H₀	Type I Error (α) — False Positive	Correct (Power = 1−β)
Fail to reject H₀	Correct	Type II Error (β) — False Negative

α (false positive rate): Typically 0.05. Reduce it when false positives are costly (medical trials, irreversible changes).
β (false negative rate): Typically 0.20 (power = 80%). Reduce it when missing real effects is costly.

---

Two-Proportion Z-Test

When: Comparing two binary conversion rates (e.g. clicked/not, signed up/not).

Assumptions:

Independent samples
n×p ≥ 5 and n×(1−p) ≥ 5 for both groups (normal approximation valid)
No interference between units (SUTVA)

Formula:

z = (p̂₂ − p̂₁) / √[p̄(1−p̄)(1/n₁ + 1/n₂)]

where p̄ = (x₁ + x₂) / (n₁ + n₂)  (pooled proportion)

Effect size — Cohen's h:

h = 2 arcsin(√p₂) − 2 arcsin(√p₁)

The arcsine transformation stabilizes variance across different baseline rates.
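As a worked instance of the two formulas above, the sketch below applies them to the counts used in the hypothesis_tester.py example (250/5000 control, 310/5000 treatment). The function name `two_proportion_ztest` is hypothetical, and the script's own output formatting may differ.

```python
import math


def two_proportion_ztest(x1, n1, x2, n2):
    """Return (z, two-sided p-value, Cohen's h) for group 2 versus group 1."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    h = 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))
    return z, p_value, h


z, p_value, h = two_proportion_ztest(250, 5000, 310, 5000)
print(f"z={z:.2f}  p={p_value:.4f}  h={h:.3f}")
```

This gives roughly z ≈ 2.61, p ≈ 0.009, and h ≈ 0.05. The lift is statistically significant, yet h falls in the "negligible" band of the Cohen's h table, which is why significance and practical impact are judged separately.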

---

Welch's Two-Sample t-Test

When: Comparing means of a continuous metric between two groups (revenue, latency, session length).

Why Welch's (not Student's): Welch's t-test does not assume equal variances — it is strictly more general and loses little power when variances are equal. Always prefer it.

Formula:

t = (x̄₂ − x̄₁) / √(s₁²/n₁ + s₂²/n₂)

Welch–Satterthwaite df:
df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]

Effect size — Cohen's d:

d = (x̄₂ − x̄₁) / s_pooled

s_pooled = √[((n₁−1)s₁² + (n₂−1)s₂²) / (n₁+n₂−2)]
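The statistic, degrees of freedom, and effect size can be computed directly from summary statistics. This is a minimal sketch using the numbers from the hypothesis_tester.py t-test example; the function name `welch_summary` is an assumption, and the p-value is left out because it needs the t-distribution CDF.

```python
import math


def welch_summary(mean1, std1, n1, mean2, std2, n2):
    """Return (t, Welch-Satterthwaite df, Cohen's d) for group 2 versus group 1."""
    v1, v2 = std1 ** 2 / n1, std2 ** 2 / n2
    t = (mean2 - mean1) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    s_pooled = math.sqrt(
        ((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / (n1 + n2 - 2)
    )
    d = (mean2 - mean1) / s_pooled
    return t, df, d


t, df, d = welch_summary(42.3, 18.1, 800, 46.1, 19.4, 820)
print(f"t={t:.2f}  df={df:.0f}  d={d:.2f}")
```

For these inputs the result is approximately t ≈ 4.08 with df ≈ 1615 and d ≈ 0.20, a clearly significant difference whose effect size sits at the boundary between negligible and small.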

Warning for heavy-tailed metrics (revenue, LTV): Mean tests are sensitive to outliers. If the distribution has heavy tails, consider:

1. Winsorizing at the 99th percentile before testing
2. Log-transforming (if values are positive)
3. Using a non-parametric test (Mann-Whitney U) and flagging for human review

---

Chi-Square Test

When: Comparing categorical distributions (e.g. which plan users selected, which error type occurred).

Assumptions:

Expected count ≥ 5 per cell (otherwise, combine categories or use Fisher's exact)
Independent observations

Formula:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

df = k − 1  (goodness-of-fit)
df = (r−1)(c−1)  (contingency table, r rows, c columns)

Effect size — Cramér's V:

V = √[χ² / (n × (min(r,c) − 1))]
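A short goodness-of-fit example, using the observed and expected counts from the hypothesis_tester.py usage line, shows the statistic and degrees of freedom in practice. The helper name `chi2_gof` is hypothetical. The p-value step relies on a special case: for df = 2 the chi-square survival function is exactly exp(−χ²/2), so it does not generalize to other df.

```python
import math


def chi2_gof(observed, expected):
    """Return (chi-square statistic, df) for a goodness-of-fit test."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same length")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


stat, df = chi2_gof([120, 80, 50], [100, 100, 50])
# Closed form valid only for df == 2: P(X > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)
print(f"chi2={stat:.1f}  df={df}  p={p_value:.4f}")
```

Here χ² = 8.0 with 2 degrees of freedom and p ≈ 0.018. For other df values the p-value needs the regularized incomplete gamma function or a statistics library.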

---

Wilson Score Interval

The standard confidence interval formula for proportions (p̂ ± z√(p̂(1−p̂)/n)) can produce impossible values (< 0 or > 1) for small n or extreme p. The Wilson score interval fixes this:

center = (p̂ + z²/(2n)) / (1 + z²/n)
margin = z/(1 + z²/n) × √(p̂(1−p̂)/n + z²/(4n²))
CI = [center − margin, center + margin]

Always use Wilson (or Clopper-Pearson) for proportions. The normal approximation is a historical artifact.
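The failure mode of the normal approximation is easy to reproduce with a small sample. The sketch below compares both intervals for 1 success in 20 trials; these inputs and the function names are assumptions chosen for illustration, and `statistics.NormalDist` requires Python 3.8 or later.

```python
import math
from statistics import NormalDist


def normal_interval(x, n, confidence=0.95):
    """Textbook (Wald) interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(x, n, confidence=0.95):
    """Wilson score interval, following the center/margin formulas above."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - margin, center + margin


print("normal:", normal_interval(1, 20))
print("wilson:", wilson_interval(1, 20))
```

With these inputs the normal interval's lower bound is negative, an impossible conversion rate, while the Wilson bounds stay strictly between 0 and 1.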

---

Sample Size & Power

Power: The probability of correctly detecting a real effect of size δ.

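n = (z_{α/2} + z_β)² × (σ₁² + σ₂²) / δ²    [means]
n = (z_{α/2} + z_β)² × (p₁(1−p₁) + p₂(1−p₂)) / (p₂−p₁)²    [proportions]

Here n is the required sample size per variant, and z_{α/2} and z_β are the upper-tail standard normal quantiles (1.96 and 0.84 for α = 0.05 and 80% power).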

Key levers:

Increase n → more power (or detect smaller effects)
Increase MDE → smaller n (but you might miss smaller real effects)
Increase α → smaller n (but more false positives)
Increase power → larger n
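The proportions formula and the MDE lever can be checked numerically. The sketch below assumes the MDE is relative to the baseline (a 0.20 MDE on a 5% baseline means detecting 6%); that reading, and the function name `required_n_proportions`, are assumptions. The sample_size_calculator.py script may use a different variant of the formula.

```python
import math
from statistics import NormalDist


def required_n_proportions(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-proportion test (unpooled variance)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return math.ceil(n)


print(required_n_proportions(0.05, 0.20))  # detect 5% -> 6%
print(required_n_proportions(0.05, 0.10))  # detect 5% -> 5.5%
```

The first call needs a little over 8,100 users per variant. Halving the MDE roughly quadruples the requirement, because n scales with 1/(p₂−p₁)².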

The peeking problem: Checking results before the planned end date inflates your effective α. If you test at α = 0.05 at each of three looks (50%, 75%, and 100% of planned n), your true α is roughly 0.10 instead of 0.05, about double the nominal false positive rate.

Solutions:

Pre-commit to a stopping rule and don't peek
Use sequential testing (SPRT) if early stopping is required
Use a Bonferroni-corrected α if you peek at scheduled intervals
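The inflation from peeking can be estimated with a small Monte Carlo simulation under H₀. This is a sketch under simplifying assumptions: each observation is a standard normal draw with true mean zero, the test at each look is a z-test on the running mean, and the function name, sample size, and seed are arbitrary choices.

```python
import math
import random


def peeking_false_positive_rates(looks=(0.5, 0.75, 1.0), n=200, sims=4000,
                                 z_crit=1.96, seed=1):
    """Return (rate rejecting at any look, rate rejecting at the final look only)."""
    rng = random.Random(seed)
    checkpoints = sorted({max(1, int(round(f * n))) for f in looks})
    any_look = final_only = 0
    for _ in range(sims):
        total, seen, z = 0.0, 0, 0.0
        rejected_any = False
        for k in checkpoints:
            total += sum(rng.gauss(0, 1) for _ in range(k - seen))
            seen = k
            z = total / math.sqrt(k)  # z-statistic for a true mean of 0
            if abs(z) > z_crit:
                rejected_any = True
        any_look += rejected_any
        final_only += abs(z) > z_crit  # z now holds the last look's value
    return any_look / sims, final_only / sims


with_peeking, without_peeking = peeking_false_positive_rates()
print(f"any look: {with_peeking:.3f}  final look only: {without_peeking:.3f}")
```

The final-look-only rate should sit near the nominal 0.05, while stopping at the first significant look pushes the rate well above it. Passing a stricter `z_crit` is one way to explore how a corrected threshold restores the nominal rate.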

---

Multiple Comparisons

Testing k hypotheses at α = 0.05 gives P(at least one false positive) ≈ 1 − (1 − 0.05)^k

k tests	P(≥1 false positive)
1	5%
3	14%
5	23%
10	40%
20	64%

Corrections:

Bonferroni: Use α/k per test. Conservative but simple. Valid under any dependence between tests.
Benjamini-Hochberg (FDR): Controls false discovery rate, not family-wise error. Preferred when many tests are expected to be true positives.
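The family-wise error formula and both corrections fit in a few lines. The following is a minimal sketch with hypothetical function names and made-up p-values; it returns one reject/keep flag per input p-value, in the original order.

```python
def family_wise_error(k, alpha=0.05):
    """P(at least one false positive) across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni(p_values, alpha=0.05):
    """Reject when p < alpha / k."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the smallest p-values up to the largest rank i with p_(i) <= i/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff_rank
    return rejected


p_values = [0.001, 0.012, 0.021, 0.045, 0.30]
print(round(family_wise_error(10), 3))
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

On these five p-values Bonferroni rejects only the first hypothesis, while Benjamini-Hochberg rejects the first three, which illustrates the power gained by controlling the false discovery rate instead of the family-wise error.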

---

SUTVA (Stable Unit Treatment Value Assumption)

A critical assumption for valid A/B tests: the outcome of unit i depends only on its own treatment assignment, not on other units' assignments.

Violations:

Social features (user A sees user B's activity — network spillover)
Shared inventory (one variant depletes shared stock)
Two-sided marketplaces (buyers and sellers interact)

Solutions:

Cluster randomization (randomize at the group/geography level)
Network A/B testing (graph-based splits)
Holdout-based testing
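Cluster randomization from the list above can be sketched as a deterministic hash of the cluster identifier, so every user in the same cluster lands in the same arm. The function name, the experiment key, and the cluster IDs below are hypothetical, and this is one possible assignment scheme rather than a prescribed one.

```python
import hashlib


def assign_cluster(cluster_id, experiment, arms=("control", "treatment")):
    """Map a whole cluster (e.g. a country or city) to one arm, deterministically."""
    key = f"{experiment}:{cluster_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return arms[int(digest, 16) % len(arms)]


for cluster in ("country-DE", "country-FR", "country-JP"):
    print(cluster, assign_cluster(cluster, "checkout-v2"))
```

Because assignment happens per cluster, the effective sample size is closer to the number of clusters than the number of users, and the analysis has to account for that clustering rather than treating users as independent.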

---

References

Imbens, G. & Rubin, D. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge.
Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments. Cambridge.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed.
Wilson, E.B. (1927). "Probable Inference, the Law of Succession, and Statistical Inference." JASA 22(158): 209–212.

#!/usr/bin/env python3
"""
confidence_interval.py — Confidence intervals for proportions and means.

Methods:
  proportion: Wilson score interval (recommended over normal approximation for small n or extreme p)
  mean:       z-based interval (normal approximation); treat results for n < 30 with caution

Usage:
    python3 confidence_interval.py --type proportion --n 1200 --x 96
    python3 confidence_interval.py --type mean --n 800 --mean 42.3 --std 18.1
    python3 confidence_interval.py --type proportion --n 1200 --x 96 --confidence 0.99
    python3 confidence_interval.py --type proportion --n 1200 --x 96 --format json
"""

import argparse
import json
import math
import sys


def normal_ppf(p: float) -> float:
    """Inverse normal CDF via bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * math.erfc(-mid / math.sqrt(2)) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def wilson_interval(n: int, x: int, confidence: float) -> dict:
    """
    Wilson score confidence interval for a proportion.
    More accurate than normal approximation, especially for small n or p near 0/1.
    """
    if n <= 0:
        return {"error": "n must be positive"}
    if x < 0 or x > n:
        return {"error": "x must be between 0 and n"}

    p_hat = x / n
    z = normal_ppf(1 - (1 - confidence) / 2)
    z2 = z ** 2

    center = (p_hat + z2 / (2 * n)) / (1 + z2 / n)
    margin = (z / (1 + z2 / n)) * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n ** 2))

    lo = max(0.0, center - margin)
    hi = min(1.0, center + margin)

    # Normal approximation for comparison
    se = math.sqrt(p_hat * (1 - p_hat) / n) if n > 0 else 0
    normal_lo = max(0.0, p_hat - z * se)
    normal_hi = min(1.0, p_hat + z * se)

    return {
        "type": "proportion",
        "method": "Wilson score interval",
        "n": n,
        "successes": x,
        "observed_rate": round(p_hat, 6),
        "confidence": confidence,
        "lower": round(lo, 6),
        "upper": round(hi, 6),
        "margin_of_error": round((hi - lo) / 2, 6),
        "normal_approximation": {
            "lower": round(normal_lo, 6),
            "upper": round(normal_hi, 6),
            "note": "Wilson is preferred; normal approx shown for reference",
        },
    }


def mean_interval(n: int, mean: float, std: float, confidence: float) -> dict:
    """
    Confidence interval for a mean.
    Always uses the normal approximation (z-based); for n < 30 the interval is
    narrower than a t-based one, so the method label flags it.
    """
    if n <= 1:
        return {"error": "n must be > 1"}
    if std < 0:
        return {"error": "std must be non-negative"}

    se = std / math.sqrt(n)
    z = normal_ppf(1 - (1 - confidence) / 2)

    lo = mean - z * se
    hi = mean + z * se
    moe = z * se

    rel_moe = moe / abs(mean) * 100 if mean != 0 else None

    precision_note = ""
    if rel_moe and rel_moe > 20:
        precision_note = "Wide CI — consider increasing sample size for tighter estimates."
    elif rel_moe and rel_moe < 5:
        precision_note = "Tight CI — high precision estimate."

    return {
        "type": "mean",
        "method": "Normal approximation (z-based)" if n >= 30
                  else "Normal approximation (z-based); use with caution (n < 30)",
        "n": n,
        "observed_mean": round(mean, 6),
        "std": round(std, 6),
        "standard_error": round(se, 6),
        "confidence": confidence,
        "lower": round(lo, 6),
        "upper": round(hi, 6),
        "margin_of_error": round(moe, 6),
        "relative_margin_of_error_pct": round(rel_moe, 2) if rel_moe is not None else None,
        "precision_note": precision_note,
    }


def print_report(result: dict):
    if "error" in result:
        print(f"Error: {result['error']}", file=sys.stderr)
        sys.exit(1)

    # ":g" keeps levels such as 0.999 as "99.9" instead of truncating to 99
    conf_pct = f"{result['confidence'] * 100:g}"
    print("=" * 60)
    print(f"  CONFIDENCE INTERVAL REPORT")
    print("=" * 60)
    print(f"  Method: {result['method']}")
    print(f"  Confidence level: {conf_pct}%")
    print()

    if result["type"] == "proportion":
        print(f"  Observed rate: {result['observed_rate']:.4%}  ({result['successes']}/{result['n']})")
        print()
        print(f"  {conf_pct}% CI: [{result['lower']:.4%}, {result['upper']:.4%}]")
        print(f"  Margin of error: ±{result['margin_of_error']:.4%}")
        print()
        norm = result.get("normal_approximation", {})
        print(f"  Normal approx CI (ref): [{norm.get('lower', 0):.4%}, {norm.get('upper', 0):.4%}]")

    elif result["type"] == "mean":
        print(f"  Observed mean: {result['observed_mean']}  (std={result['std']}, n={result['n']})")
        print(f"  Standard error: {result['standard_error']}")
        print()
        print(f"  {conf_pct}% CI: [{result['lower']}, {result['upper']}]")
        print(f"  Margin of error: ±{result['margin_of_error']}")
        if result.get("relative_margin_of_error_pct") is not None:
            print(f"  Relative MoE: ±{result['relative_margin_of_error_pct']:.1f}%")
        if result.get("precision_note"):
            print(f"\n  ℹ️  {result['precision_note']}")

    print()
    # Interpretation guide
    print(f"  Interpretation: If this experiment were repeated many times,")
    print(f"  {conf_pct}% of the computed intervals would contain the true value.")
    print(f"  This does NOT mean there is a {conf_pct}% chance the true value is")
    print(f"  in this specific interval — it either is or it isn't.")
    print("=" * 60)


def main():
    parser = argparse.ArgumentParser(
        description="Compute confidence intervals for proportions and means."
    )
    parser.add_argument("--type", choices=["proportion", "mean"], required=True)
    parser.add_argument("--confidence", type=float, default=0.95,
                        help="Confidence level (default: 0.95)")
    parser.add_argument("--format", choices=["text", "json"], default="text")

    # Proportion
    parser.add_argument("--n", type=int, help="Total sample size")
    parser.add_argument("--x", type=int, help="Number of successes (for proportion)")

    # Mean
    parser.add_argument("--mean", type=float, help="Observed mean")
    parser.add_argument("--std", type=float, help="Observed standard deviation")

    args = parser.parse_args()

    if args.type == "proportion":
        if args.n is None or args.x is None:
            print("Error: --n and --x are required for proportion CI", file=sys.stderr)
            sys.exit(1)
        result = wilson_interval(args.n, args.x, args.confidence)

    elif args.type == "mean":
        if args.n is None or args.mean is None or args.std is None:
            print("Error: --n, --mean, and --std are required for mean CI", file=sys.stderr)
            sys.exit(1)
        result = mean_interval(args.n, args.mean, args.std, args.confidence)

    if args.format == "json":
        print(json.dumps(result, indent=2))
    else:
        print_report(result)


if __name__ == "__main__":
    main()

#!/usr/bin/env python3
"""
hypothesis_tester.py — Z-test (proportions), Welch's t-test (means), Chi-square (categorical).

All math uses Python stdlib (math module only). No scipy, numpy, or pandas required.

Usage:
    python3 hypothesis_tester.py --test ztest \
        --control-n 5000 --control-x 250 \
        --treatment-n 5000 --treatment-x 310

    python3 hypothesis_tester.py --test ttest \
        --control-mean 42.3 --control-std 18.1 --control-n 800 \
        --treatment-mean 46.1 --treatment-std 19.4 --treatment-n 820

    python3 hypothesis_tester.py --test chi2 \
        --observed "120,80,50" --expected "100,100,50"
"""

import argparse
import json
import math
import sys


# ---------------------------------------------------------------------------
# Normal / t-distribution approximations (stdlib only)
# ---------------------------------------------------------------------------

def normal_cdf(z: float) -> float:
    """Cumulative distribution function of standard normal using math.erfc."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def normal_ppf(p: float) -> float:
    """Percent-point function (inverse CDF) of standard normal via bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_cdf(t: float, df: float) -> float:
    """
    CDF of t-distribution via the regularized incomplete beta function.
    Uses the relation: P(|T| ≥ |t|) = I_{x}(df/2, 1/2) where x = df/(df+t^2),
    so each tail has probability I_x / 2.
    Falls back to normal CDF for large df (> 1000).
    """
    if df > 1000:
        return normal_cdf(t)
    x = df / (df + t * t)
    # Regularized incomplete beta via continued fraction (Lentz)
    ib = _regularized_incomplete_beta(x, df / 2, 0.5)
    p = ib / 2
    return p if t <= 0 else 1 - p


def _regularized_incomplete_beta(x: float, a: float, b: float) -> float:
    """Regularized incomplete beta I_x(a,b) via continued fraction expansion."""
    if x < 0 or x > 1:
        return 0.0
    if x == 0:
        return 0.0
    if x == 1:
        return 1.0
    lbeta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    front = math.exp(math.log(x) * a + math.log(1 - x) * b - lbeta) / a
    # Use symmetry for better convergence
    if x > (a + 1) / (a + b + 2):
        return 1 - _regularized_incomplete_beta(1 - x, b, a)
    # Lentz continued fraction
    TINY = 1e-30
    f = TINY
    C = f
    D = 0.0
    for m in range(200):
        for s in (0, 1):
            if m == 0 and s == 0:
                num = 1.0
            elif s == 0:
                num = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m))
            else:
                num = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))
            D = 1 + num * D
            if abs(D) < TINY:
                D = TINY
            D = 1 / D
            C = 1 + num / C
            if abs(C) < TINY:
                C = TINY
            f *= C * D
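            if abs(C * D - 1) < 1e-10:
                # Converged: return immediately. A bare `break` would only exit
                # the inner loop and skip a term of the continued fraction.
                return front * f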
    return front * f


def two_tail_p_normal(z: float) -> float:
    return 2 * (1 - normal_cdf(abs(z)))


def two_tail_p_t(t: float, df: float) -> float:
    return 2 * (1 - t_cdf(abs(t), df))


# ---------------------------------------------------------------------------
# Effect sizes
# ---------------------------------------------------------------------------

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def cohens_d(mean1: float, std1: float, n1: int, mean2: float, std2: float, n2: int) -> float:
    """Cohen's d using pooled standard deviation."""
    pooled = math.sqrt(((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled if pooled else 0.0


def cramers_v(chi2: float, n: int, k: int) -> float:
    """Cramér's V effect size for chi-square test."""
    return math.sqrt(chi2 / (n * (k - 1))) if n and k > 1 else 0.0


def effect_label(val: float, metric: str) -> str:
    thresholds = {"h": [0.2, 0.5, 0.8], "d": [0.2, 0.5, 0.8], "v": [0.1, 0.3, 0.5]}
    t = thresholds.get(metric, [0.2, 0.5, 0.8])
    v = abs(val)
    if v < t[0]:
        return "negligible"
    if v < t[1]:
        return "small"
    if v < t[2]:
        return "medium"
    return "large"


# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------

def ztest_proportions(cn: int, cx: int, tn: int, tx: int, alpha: float) -> dict:
    """Two-proportion Z-test."""
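    if cn <= 0 or tn <= 0:
        return {"error": "Sample sizes must be positive."}
    # Rates outside [0, 1] would make the variance terms negative below.
    if not (0 <= cx <= cn) or not (0 <= tx <= tn):
        return {"error": "Conversions must be between 0 and the sample size."}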

    p_c = cx / cn
    p_t = tx / tn
    p_pool = (cx + tx) / (cn + tn)

    se = math.sqrt(p_pool * (1 - p_pool) * (1 / cn + 1 / tn))
    if se == 0:
        return {"error": "Standard error is zero — check input values."}

    z = (p_t - p_c) / se
    p_value = two_tail_p_normal(z)

    # Confidence interval for difference (unpooled SE)
    se_diff = math.sqrt(p_c * (1 - p_c) / cn + p_t * (1 - p_t) / tn)
    z_crit = normal_ppf(1 - alpha / 2)
    diff = p_t - p_c
    ci_lo = diff - z_crit * se_diff
    ci_hi = diff + z_crit * se_diff

    h = cohens_h(p_t, p_c)
    lift = (p_t - p_c) / p_c * 100 if p_c else 0

    return {
        "test": "Two-proportion Z-test",
        "control": {"n": cn, "conversions": cx, "rate": round(p_c, 6)},
        "treatment": {"n": tn, "conversions": tx, "rate": round(p_t, 6)},
        "difference": round(diff, 6),
        "relative_lift_pct": round(lift, 2),
        "z_statistic": round(z, 4),
        "p_value": round(p_value, 6),
        "significant": p_value < alpha,
        "alpha": alpha,
        "confidence_interval": {
            # ":g" avoids int() truncating float noise (e.g. 92.999... to 92)
            "level": f"{(1 - alpha) * 100:g}%",
            "lower": round(ci_lo, 6),
            "upper": round(ci_hi, 6),
        },
        "effect_size": {
            "cohens_h": round(abs(h), 4),
            "interpretation": effect_label(h, "h"),
        },
    }


def ttest_means(cm: float, cs: float, cn: int, tm: float, ts: float, tn: int, alpha: float) -> dict:
    """Welch's two-sample t-test (unequal variances)."""
    if cn < 2 or tn < 2:
        return {"error": "Each group needs at least 2 observations."}

    se = math.sqrt(cs ** 2 / cn + ts ** 2 / tn)
    if se == 0:
        return {"error": "Standard error is zero — check std values."}

    t = (tm - cm) / se

    # Welch–Satterthwaite degrees of freedom
    num = (cs ** 2 / cn + ts ** 2 / tn) ** 2
    denom = (cs ** 2 / cn) ** 2 / (cn - 1) + (ts ** 2 / tn) ** 2 / (tn - 1)
    df = num / denom if denom else cn + tn - 2

    p_value = two_tail_p_t(t, df)

    # The CI uses the normal critical value. This is a close approximation for
    # the large df typical of A/B tests; for small df it is slightly narrower
    # than a t-based interval and may disagree with the p-value near alpha.
    t_crit = normal_ppf(1 - alpha / 2)

    diff = tm - cm
    ci_lo = diff - t_crit * se
    ci_hi = diff + t_crit * se

    d = cohens_d(tm, ts, tn, cm, cs, cn)
    lift = (tm - cm) / cm * 100 if cm else 0

    return {
        "test": "Welch's two-sample t-test",
        "control": {"n": cn, "mean": round(cm, 4), "std": round(cs, 4)},
        "treatment": {"n": tn, "mean": round(tm, 4), "std": round(ts, 4)},
        "difference": round(diff, 4),
        "relative_lift_pct": round(lift, 2),
        "t_statistic": round(t, 4),
        "degrees_of_freedom": round(df, 1),
        "p_value": round(p_value, 6),
        "significant": p_value < alpha,
        "alpha": alpha,
        "confidence_interval": {
            # ":g" avoids int() truncating float noise (e.g. 92.999... to 92)
            "level": f"{(1 - alpha) * 100:g}%",
            "lower": round(ci_lo, 4),
            "upper": round(ci_hi, 4),
        },
        "effect_size": {
            "cohens_d": round(abs(d), 4),
            "interpretation": effect_label(d, "d"),
        },
    }


def chi2_test(observed: list[float], expected: list[float], alpha: float) -> dict:
    """Chi-square goodness-of-fit test."""
    if len(observed) != len(expected):
        return {"error": "Observed and expected must have the same number of categories."}
    if any(e <= 0 for e in expected):
        return {"error": "Expected values must all be positive."}
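    if any(e < 5 for e in expected):
        # Returned under "error" so callers (including verdict()) do not treat
        # this early exit as a completed, non-significant test.
        return {"error": "Some expected values < 5; chi-square approximation may be unreliable. "
                         "Consider combining categories or using Fisher's exact test."}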

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    k = len(observed)
    df = k - 1
    n = sum(observed)

    # Chi-square CDF via regularized gamma function approximation
    p_value = 1 - _chi2_cdf(chi2, df)

    v = cramers_v(chi2, int(n), k)

    return {
        "test": "Chi-square goodness-of-fit",
        "categories": k,
        "observed": observed,
        "expected": expected,
        "chi2_statistic": round(chi2, 4),
        "degrees_of_freedom": df,
        "p_value": round(p_value, 6),
        "significant": p_value < alpha,
        "alpha": alpha,
        "effect_size": {
            "cramers_v": round(v, 4),
            "interpretation": effect_label(v, "v"),
        },
    }


def _chi2_cdf(x: float, k: float) -> float:
    """CDF of chi-square via regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    return _regularized_gamma(k / 2, x / 2)


def _regularized_gamma(a: float, x: float) -> float:
    """Lower regularized incomplete gamma P(a, x) via series expansion."""
    if x <= 0:
        return 0.0
    if x < a + 1:
        # Series expansion
        ap = a
        delta = 1.0 / a
        total = delta
        for _ in range(300):
            ap += 1
            delta *= x / ap
            total += delta
            if abs(delta) < abs(total) * 1e-10:
                break
        return total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    else:
        # Continued fraction (Lentz)
        b = x + 1 - a
        c = 1e30
        d = 1 / b
        f = d
        for i in range(1, 300):
            an = -i * (i - a)
            b += 2
            d = an * d + b
            if abs(d) < 1e-30:
                d = 1e-30
            c = b + an / c
            if abs(c) < 1e-30:
                c = 1e-30
            d = 1 / d
            delta = d * c
            f *= delta
            if abs(delta - 1) < 1e-10:
                break
        return 1 - math.exp(-x + a * math.log(x) - math.lgamma(a)) * f


# ---------------------------------------------------------------------------
# Reporting
# ---------------------------------------------------------------------------

DIRECTION = {True: "statistically significant", False: "NOT statistically significant"}


def verdict(result: dict) -> str:
    if "error" in result:
        return f"ERROR: {result['error']}"
    sig = result.get("significant", False)
    p = result.get("p_value", 1.0)
    alpha = result.get("alpha", 0.05)
    diff = result.get("difference", 0)
    lift = result.get("relative_lift_pct")
    ci = result.get("confidence_interval", {})
    es = result.get("effect_size", {})
    es_name = "Cohen's h" if "cohens_h" in es else ("Cohen's d" if "cohens_d" in es else "Cramér's V")
    es_val = es.get("cohens_h") or es.get("cohens_d") or es.get("cramers_v", 0)
    es_interp = es.get("interpretation", "")

    lines = [
        "",
        "=" * 60,
        f"  {result.get('test', 'Hypothesis Test')}",
        "=" * 60,
    ]

    if "control" in result and "rate" in result["control"]:
        c = result["control"]
        t = result["treatment"]
        lines += [
            f"  Control:   {c['rate']:.4%}  (n={c['n']}, conversions={c['conversions']})",
            f"  Treatment: {t['rate']:.4%}  (n={t['n']}, conversions={t['conversions']})",
            f"  Difference: {diff:+.4%}  ({'+' if lift >= 0 else ''}{lift:.1f}% relative lift)",
        ]
    elif "control" in result and "mean" in result["control"]:
        c = result["control"]
        t = result["treatment"]
        lines += [
            f"  Control:   mean={c['mean']}  std={c['std']}  n={c['n']}",
            f"  Treatment: mean={t['mean']}  std={t['std']}  n={t['n']}",
            f"  Difference: {diff:+.4f}  ({'+' if lift >= 0 else ''}{lift:.1f}% relative lift)",
        ]
    elif "observed" in result:
        lines += [
            f"  Observed: {result['observed']}",
            f"  Expected: {result['expected']}",
        ]

    lines += [
        "",
        f"  p-value:    {p:.6f}  (α={alpha})",
        f"  Result:     {DIRECTION[sig].upper()}",
    ]
    if ci:
        lines.append(f"  {ci['level']} CI: [{ci['lower']}, {ci['upper']}]")
    lines += [
        f"  Effect:     {es_name} = {es_val}  ({es_interp})",
        "",
    ]

    # Plain English verdict
    if sig:
        lines.append(f"  ✅ VERDICT: Reject the null hypothesis of no difference (p={p:.4f} < α={alpha}).")
        if es_interp in ("negligible", "small"):
            lines.append("  ⚠️  BUT: Effect is small — confirm practical significance before shipping.")
        else:
            lines.append("  Effect size is meaningful. Recommend shipping if no negative guardrails.")
    else:
        lines.append(f"  ❌ VERDICT: Insufficient evidence to conclude a difference exists (p={p:.4f} ≥ α={alpha}).")
        lines.append("  Options: extend the test, increase MDE, or kill if underpowered.")

    lines.append("=" * 60)
    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser(description="Run hypothesis tests on experiment results.")
    parser.add_argument("--test", choices=["ztest", "ttest", "chi2"], required=True)
    parser.add_argument("--alpha", type=float, default=0.05, help="Significance level (default: 0.05)")
    parser.add_argument("--format", choices=["text", "json"], default="text")

    # Z-test / t-test shared
    parser.add_argument("--control-n", type=int)
    parser.add_argument("--treatment-n", type=int)

    # Z-test
    parser.add_argument("--control-x", type=int, help="Conversions in control group")
    parser.add_argument("--treatment-x", type=int, help="Conversions in treatment group")

    # t-test
    parser.add_argument("--control-mean", type=float)
    parser.add_argument("--control-std", type=float)
    parser.add_argument("--treatment-mean", type=float)
    parser.add_argument("--treatment-std", type=float)

    # chi2
    parser.add_argument("--observed", help="Comma-separated observed counts")
    parser.add_argument("--expected", help="Comma-separated expected counts")

    args = parser.parse_args()

    if args.test == "ztest":
        for req in ["control_n", "control_x", "treatment_n", "treatment_x"]:
            if getattr(args, req) is None:
                print(f"Error: --{req.replace('_', '-')} is required for ztest", file=sys.stderr)
                sys.exit(1)
        result = ztest_proportions(args.control_n, args.control_x, args.treatment_n, args.treatment_x, args.alpha)

    elif args.test == "ttest":
        for req in ["control_n", "control_mean", "control_std", "treatment_n", "treatment_mean", "treatment_std"]:
            if getattr(args, req) is None:
                print(f"Error: --{req.replace('_', '-')} is required for ttest", file=sys.stderr)
                sys.exit(1)
        result = ttest_means(
            args.control_mean, args.control_std, args.control_n,
            args.treatment_mean, args.treatment_std, args.treatment_n,
            args.alpha
        )

    elif args.test == "chi2":
        if not args.observed or not args.expected:
            print("Error: --observed and --expected are required for chi2", file=sys.stderr)
            sys.exit(1)
        observed = [float(x.strip()) for x in args.observed.split(",")]
        expected = [float(x.strip()) for x in args.expected.split(",")]
        result = chi2_test(observed, expected, args.alpha)

    if args.format == "json":
        print(json.dumps(result, indent=2))
    else:
        if "error" in result:
            print(f"Error: {result['error']}", file=sys.stderr)
            sys.exit(1)
        print(verdict(result))


if __name__ == "__main__":
    main()
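
The chi-square path above can be sanity-checked by hand in one special case: with three categories (df = 2) the chi-square survival function has the closed form exp(-x/2), so no incomplete-gamma routine is needed. The sketch below is only an illustration under that assumption; the function name `chi2_gof_df2` and the counts are hypothetical and are not part of hypothesis_tester.py.

```python
import math


def chi2_gof_df2(observed, expected):
    """Chi-square goodness-of-fit for exactly three categories (df = 2)."""
    if len(observed) != 3 or len(expected) != 3:
        raise ValueError("This sketch only handles three categories (df = 2).")
    if any(e <= 0 for e in expected):
        raise ValueError("Expected values must all be positive.")
    # Pearson statistic: sum of squared deviations scaled by the expected count.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For df = 2 the upper-tail probability is exactly exp(-chi2 / 2).
    p_value = math.exp(-chi2 / 2)
    return chi2, p_value


# Illustrative counts (assumed): 100 users across three plans.
chi2, p = chi2_gof_df2([50, 30, 20], [40, 40, 20])
print(f"chi2={chi2:.4f}  p={p:.6f}")  # chi2=5.0000  p=0.082085
```

Running the script's chi2 mode on the same counts should give a p-value close to this one, which makes it a cheap regression check on the series and continued-fraction code.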

#!/usr/bin/env python3
"""
sample_size_calculator.py: Required sample size per variant for A/B experiments.

Supports proportion tests (conversion rates) and mean tests (continuous metrics).
All math uses Python stdlib only.

Usage:
    python3 sample_size_calculator.py --test proportion \
        --baseline 0.05 --mde 0.20 --alpha 0.05 --power 0.80

    python3 sample_size_calculator.py --test mean \
        --baseline-mean 42.3 --baseline-std 18.1 --mde 0.10 \
        --alpha 0.05 --power 0.80

    python3 sample_size_calculator.py --test proportion \
        --baseline 0.05 --mde 0.20 --table

    python3 sample_size_calculator.py --test proportion \
        --baseline 0.05 --mde 0.20 --format json
"""

from __future__ import annotations

import argparse
import json
import math
import sys


def normal_cdf(z: float) -> float:
    return 0.5 * math.erfc(-z / math.sqrt(2))


def normal_ppf(p: float) -> float:
    """Inverse normal CDF via bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sample_size_proportion(baseline: float, mde: float, alpha: float, power: float) -> int:
    """
    Required n per variant for a two-proportion Z-test.

    Uses the standard formula:
        n = (z_α/2 + z_β)² × (p1(1−p1) + p2(1−p2)) / (p1 − p2)²

    Args:
        baseline: Control conversion rate (e.g. 0.05 for 5%)
        mde: Minimum detectable effect as relative change (e.g. 0.20 for +20% relative)
        alpha: Significance level (e.g. 0.05)
        power: Statistical power (e.g. 0.80)
    """
    p1 = baseline
    p2 = baseline * (1 + mde)

    if not (0 < p1 < 1) or not (0 < p2 < 1):
        raise ValueError(f"Rates must be between 0 and 1. Got baseline={p1}, treatment={p2:.4f}")

    z_alpha = normal_ppf(1 - alpha / 2)
    z_beta = normal_ppf(power)

    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    denominator = (p2 - p1) ** 2

    return math.ceil(numerator / denominator)


def sample_size_mean(baseline_mean: float, baseline_std: float, mde: float, alpha: float, power: float) -> int:
    """
    Required n per variant for a two-sample t-test.

    Uses:
        n = 2 × σ² × (z_α/2 + z_β)² / δ²

    where δ = mde × baseline_mean (absolute effect).

    Args:
        baseline_mean: Control group mean
        baseline_std: Control group standard deviation
        mde: Minimum detectable effect as relative change (e.g. 0.10 for +10%)
        alpha: Significance level
        power: Statistical power
    """
    delta = abs(mde * baseline_mean)
    if delta == 0:
        raise ValueError("MDE × baseline_mean = 0. Cannot size experiment with zero effect.")

    z_alpha = normal_ppf(1 - alpha / 2)
    z_beta = normal_ppf(power)

    n = 2 * baseline_std ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return math.ceil(n)


def duration_estimate(n_per_variant: int, daily_traffic: int | None, variants: int = 2) -> str:
    if daily_traffic and daily_traffic > 0:
        traffic_per_variant = daily_traffic / variants
        days = math.ceil(n_per_variant / traffic_per_variant)
        weeks = days / 7
        return f"{days} days ({weeks:.1f} weeks) at {daily_traffic:,} daily users split {variants} ways"
    return "Provide --daily-traffic to estimate duration"


def print_report(
    test: str, n: int, baseline: float, mde: float, alpha: float, power: float,
    daily_traffic: int | None, variants: int,
    baseline_mean: float | None = None, baseline_std: float | None = None
):
    total = n * variants
    treatment_rate = baseline * (1 + mde) if test == "proportion" else None
    absolute_mde = baseline * mde if test == "proportion" else (baseline_mean or 0) * mde

    print("=" * 60)
    print("  SAMPLE SIZE REPORT")
    print("=" * 60)

    if test == "proportion":
        print(f"  Baseline conversion rate: {baseline:.2%}")
        print(f"  Target conversion rate:   {treatment_rate:.2%}")
        print(f"  MDE: {mde:+.1%} relative  ({absolute_mde:+.4f} absolute)")
    else:
        print(f"  Baseline mean: {baseline_mean}  (std: {baseline_std})")
        print(f"  MDE: {mde:+.1%} relative  (absolute: {absolute_mde:+.4f})")

    print(f"  Significance level (α): {alpha}")
    print(f"  Statistical power (1−β): {power:.0%}")
    print(f"  Variants: {variants}")
    print()
    print(f"  Required per variant:  {n:>10,}")
    print(f"  Required total:        {total:>10,}")
    print()
    print(f"  Duration: {duration_estimate(n, daily_traffic, variants)}")
    print()

    # Risk interpretation
    if n < 100:
        print("  ⚠️  Very small sample — results may be sensitive to outliers.")
    elif n > 1_000_000:
        print("  ⚠️  Very large sample required — consider increasing MDE or accepting lower power.")
    else:
        print("  ✅ Sample size is achievable for most web/app products.")

    print("=" * 60)


def print_table(test: str, baseline: float, mde: float, alpha: float,
                baseline_mean: float | None, baseline_std: float | None):
    """Print tradeoff table across power levels and MDE values."""
    powers = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
    mdes = [mde * 0.5, mde * 0.75, mde, mde * 1.5, mde * 2.0]

    print("=" * 70)
    print(f"  SAMPLE SIZE TRADEOFF TABLE  (α={alpha}, metric={'proportion' if test == 'proportion' else 'mean'})")
    print("=" * 70)
    header = f"  {'MDE':>8} | " + " | ".join(f"power={p:.0%}" for p in powers)
    print(header)
    print("  " + "-" * (len(header) - 2))

    for m in mdes:
        row = f"  {m:>+7.1%} | "
        cells = []
        for p in powers:
            try:
                if test == "proportion":
                    n = sample_size_proportion(baseline, m, alpha, p)
                else:
                    n = sample_size_mean(baseline_mean, baseline_std, m, alpha, p)
                cells.append(f"{n:>9,}")
            except ValueError:
                cells.append(f"{'N/A':>9}")
        row += " | ".join(cells)
        print(row)

    print("=" * 70)
    print("  (Values = required n per variant)")
    print()


def main():
    parser = argparse.ArgumentParser(description="Calculate required sample size for A/B experiments.")
    parser.add_argument("--test", choices=["proportion", "mean"], required=True,
                        help="Type of metric: proportion (conversion rate) or mean (continuous)")
    parser.add_argument("--alpha", type=float, default=0.05, help="Significance level (default: 0.05)")
    parser.add_argument("--power", type=float, default=0.80, help="Statistical power (default: 0.80)")
    parser.add_argument("--mde", type=float, required=True,
                        help="Minimum detectable effect as relative change (e.g. 0.20 = +20%%)")
    parser.add_argument("--variants", type=int, default=2, help="Number of variants including control (default: 2)")
    parser.add_argument("--daily-traffic", type=int, help="Daily unique users (for duration estimate)")
    parser.add_argument("--table", action="store_true", help="Print tradeoff table across power and MDE")
    parser.add_argument("--format", choices=["text", "json"], default="text")

    # Proportion-specific
    parser.add_argument("--baseline", type=float, help="Baseline conversion rate (e.g. 0.05 for 5%%)")

    # Mean-specific
    parser.add_argument("--baseline-mean", type=float, help="Control group mean")
    parser.add_argument("--baseline-std", type=float, help="Control group standard deviation")

    args = parser.parse_args()

    try:
        if args.test == "proportion":
            if args.baseline is None:
                print("Error: --baseline is required for proportion test", file=sys.stderr)
                sys.exit(1)
            n = sample_size_proportion(args.baseline, args.mde, args.alpha, args.power)
        else:
            if args.baseline_mean is None or args.baseline_std is None:
                print("Error: --baseline-mean and --baseline-std are required for mean test", file=sys.stderr)
                sys.exit(1)
            n = sample_size_mean(args.baseline_mean, args.baseline_std, args.mde, args.alpha, args.power)
    except ValueError as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)

    if args.format == "json":
        output = {
            "test": args.test,
            "n_per_variant": n,
            "n_total": n * args.variants,
            "alpha": args.alpha,
            "power": args.power,
            "mde": args.mde,
            "variants": args.variants,
        }
        if args.test == "proportion":
            output["baseline_rate"] = args.baseline
            output["treatment_rate"] = round(args.baseline * (1 + args.mde), 6)
        else:
            output["baseline_mean"] = args.baseline_mean
            output["baseline_std"] = args.baseline_std
        if args.daily_traffic:
            days = math.ceil(n / (args.daily_traffic / args.variants))
            output["estimated_days"] = days
        print(json.dumps(output, indent=2))
        return

    if args.table:
        print_table(args.test, args.baseline if args.test == "proportion" else None,
                    args.mde, args.alpha, args.baseline_mean, args.baseline_std)

    print_report(
        args.test, n,
        baseline=args.baseline or 0,
        mde=args.mde,
        alpha=args.alpha,
        power=args.power,
        daily_traffic=args.daily_traffic,
        variants=args.variants,
        baseline_mean=args.baseline_mean,
        baseline_std=args.baseline_std,
    )


if __name__ == "__main__":
    main()
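
To make the proportion formula in the script concrete, the sketch below recomputes it with `statistics.NormalDist` from the standard library (instead of the script's bisection-based `normal_ppf`) and shows how sensitive the result is to the MDE. The helper name `n_per_variant` and the 5% baseline are assumptions chosen for illustration; they mirror the usage example in the docstring but are not taken from the script's output.

```python
import math
from statistics import NormalDist


def n_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-proportion test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    if not (0 < p1 < 1) or not (0 < p2 < 1):
        raise ValueError("Both rates must lie strictly between 0 and 1.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


n_20 = n_per_variant(0.05, 0.20)  # detect a +20% relative lift
n_10 = n_per_variant(0.05, 0.10)  # detect a +10% relative lift
print(f"+20% MDE: {n_20:,} per variant")
print(f"+10% MDE: {n_10:,} per variant ({n_10 / n_20:.1f}x as many)")
```

Because the absolute difference enters the denominator squared, halving the MDE roughly quadruples the required sample, which is why the tradeoff table is worth reading before committing to a test duration.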

Related skills

Microsoft Foundry: Deploy, evaluate, and continuously improve Microsoft Foundry agents from a single agent interface. 478k · 1.3k

Ai Research Reproduction: Orchestrate trustworthy, auditable reproduction of deep learning repositories directly from their READMEs. 164k · 507

Run Train: Safely execute selected deep learning training commands with standardized evidence capture. 164k · 507

Explore Run: Safely run isolated exploratory experiments with clear recording and conservative selection before committing changes. 164k · 507

Paper Context Resolver: Fetch precise reproduction-critical details like dataset splits, preprocessing steps, or evaluation protocols from the original academic paper when the repo README leav… 141k · 507

Repo Intake And Plan: Scan unfamiliar AI research repositories and receive a minimal, trustworthy reproduction target before investing significant time. 140k · 507

How it compares

Pick statistical-analyst for human-readable test selection and interpretation guidance during experiment design, not for building automated analytics pipelines.

FAQ

What framework does statistical-analyst use?

Statistical-analyst operates in the frequentist framework, defining a null hypothesis H₀ and alternative H₁, then evaluating how often observed data would appear if H₀ were true via the p-value against a preset α threshold.
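
As a rough illustration of that framework, the sketch below computes a two-sided p-value for a difference in conversion rates and compares it with a preset α. It assumes a pooled standard error under H₀ (a common textbook choice that may differ from what the skill's own script does), and the function name and counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(control_x, control_n, treatment_x, treatment_n):
    """Two-sided p-value for H0: both groups share one conversion rate."""
    p_c = control_x / control_n
    p_t = treatment_x / treatment_n
    # Under H0 the best estimate of the shared rate pools both groups.
    pooled = (control_x + treatment_x) / (control_n + treatment_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treatment_n))
    z = (p_t - p_c) / se
    # Probability of a |z| at least this large if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


ALPHA = 0.05  # fixed before looking at the data
z, p = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"z={z:.3f}  p={p:.4f}  reject H0: {p < ALPHA}")
```

With these assumed counts the p-value lands slightly above 0.05, so H₀ is not rejected; that means the data are compatible with no difference, not that no difference has been shown.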

When should developers use statistical-analyst?

Statistical-analyst fits developers designing or reviewing A/B and conversion experiments who need help choosing tests, setting significance levels, and avoiding common p-value misreadings before shipping changes.

Is Statistical Analyst safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Data Science & ML · analytics · pipelines
