Backtest Expert

Name: Backtest Expert
Author: tradermonty

tradermonty/claude-trading-skills

1.4k installs
2.5k repo stars
Updated July 26, 2026

backtest-expert is an agent skill for systematically backtesting and stress-testing quantitative trading strategies before live use.

About

The backtest-expert skill is designed for systematically backtesting and stress-testing quantitative trading strategies before live use. It takes a systematic approach based on professional methodology that prioritizes robustness over optimistic results. Core philosophy: find strategies that "break the least", not strategies that "profit the most" on paper. Invoke it when the user develops, tests, or stress-tests quantitative trading strategy backtests.

Developing or validating systematic trading strategies.
Evaluating whether a trading idea is robust enough for live implementation.
Troubleshooting why a backtest might be misleading.
Learning proper backtesting methodology.
Avoiding common pitfalls (curve-fitting, look-ahead bias, survivorship bias).

Backtest Expert by the numbers

1,414 all-time installs (skills.sh)
+64 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #95 of 1,136 Finance & Trading skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

backtest-expert capabilities & compatibility

Capabilities: developing or validating systematic trading strategies · evaluating whether a trading idea is robust enough for live implementation · troubleshooting why a backtest might be misleading · learning proper backtesting methodology

From the docs

What backtest-expert says it does

Expert guidance for systematic backtesting of trading strategies. Use when developing, testing, stress-testing, or validating quantitative trading strategies. Covers "beating ideas

SKILL.md

npx skills add https://github.com/tradermonty/claude-trading-skills --skill backtest-expert

Installs	1.4k
repo stars	★ 2.5k
Security audit	3 / 3 scanners passed
Last updated	July 26, 2026
Repository	tradermonty/claude-trading-skills ↗

How do I systematically backtest and stress-test quantitative trading strategies before live use?

Systematically backtest and stress-test quantitative trading strategies before live use.

Who is it for?

Quant traders validating strategies with rigorous backtesting methodology.

Skip if: You want discretionary trading advice without a systematic backtest scope.

When should I use this skill?

User develops, tests, or stress-tests quantitative trading strategy backtests.

What you get

Completed backtest-expert workflow with documented commands, files, and expected deliverables.

failure case study writeup
red-flags checklist
documented anti-patterns

Files

SKILL.md (Markdown)

Backtest Expert

Systematic approach to backtesting trading strategies based on professional methodology that prioritizes robustness over optimistic results.

Core Philosophy

Goal: Find strategies that "break the least", not strategies that "profit the most" on paper.

Principle: Add friction, stress test assumptions, and see what survives. If a strategy holds up under pessimistic conditions, it's more likely to work in live trading.

When to Use This Skill

Use this skill when:

Developing or validating systematic trading strategies
Evaluating whether a trading idea is robust enough for live implementation
Troubleshooting why a backtest might be misleading
Learning proper backtesting methodology
Avoiding common pitfalls (curve-fitting, look-ahead bias, survivorship bias)
Assessing parameter sensitivity and regime dependence
Setting realistic expectations for slippage and execution costs

Prerequisites

Python 3.9+ (for evaluation script)
No API keys required
No external data dependencies — metrics are user-provided

Workflow

1. State the Hypothesis

Define the edge in one sentence.

Example: "Stocks that gap up >3% on earnings and pull back to previous day's close within first hour provide mean-reversion opportunity."

If you can't articulate the edge clearly, don't proceed to testing.

2. Codify Rules with Zero Discretion

Define with complete specificity:

Entry: Exact conditions, timing, price type
Exit: Stop loss, profit target, time-based exit
Position sizing: Fixed dollar amount, % of portfolio, or volatility-adjusted
Filters: Market cap, volume, sector, volatility conditions
Universe: What instruments are eligible

Critical: No subjective judgment allowed. Every decision must be rule-based and unambiguous.

3. Run Initial Backtest

Test over:

Minimum 5 years (preferably 10+)
Multiple market regimes (bull, bear, high/low volatility)
Realistic costs: Commissions + conservative slippage

Examine initial results for basic viability. If fundamentally broken, iterate on hypothesis.

4. Stress Test the Strategy

This is where 80% of testing time should be spent.

Parameter sensitivity:

Test stop loss at 50%, 75%, 100%, 125%, 150% of baseline
Test profit target at 80%, 90%, 100%, 110%, 120% of baseline
Vary entry/exit timing by ±15-30 minutes
Look for "plateaus" of stable performance, not narrow spikes

Execution friction:

Increase slippage to 1.5-2x typical estimates
Model worst-case fills (buy at ask+1 tick, sell at bid-1 tick)
Add realistic order rejection scenarios
Test with pessimistic commission structures

Time robustness:

Analyze year-by-year performance
Require positive expectancy in majority of years
Ensure strategy doesn't rely on 1-2 exceptional periods
Test in different market regimes separately

Sample size:

Absolute minimum: 30 trades
Preferred: 100+ trades
High confidence: 200+ trades

5. Out-of-Sample Validation

Walk-forward analysis:

1. Optimize on training period (e.g., Year 1-3)
2. Test on validation period (Year 4)
3. Roll forward and repeat
4. Compare in-sample vs out-of-sample performance

Warning signs:

Out-of-sample <50% of in-sample performance
Need frequent parameter re-optimization
Parameters change dramatically between periods
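
The walk-forward loop and the first warning sign can be sketched in a few lines. This is a minimal illustration, not part of the skill: the names `walk_forward_windows` and `oos_ratio_ok` are hypothetical, the 3-year train / 1-year test split simply mirrors the example in step 1, and "performance" is assumed to be a single positive-is-better number such as annual return.

```python
from typing import Iterator


def walk_forward_windows(
    years: list[int], train_len: int = 3, test_len: int = 1
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (training_years, validation_years) pairs, rolling forward by test_len."""
    start = 0
    while start + train_len + test_len <= len(years):
        train = years[start : start + train_len]
        test = years[start + train_len : start + train_len + test_len]
        yield train, test
        start += test_len


def oos_ratio_ok(in_sample: float, out_of_sample: float, floor: float = 0.5) -> bool:
    """Return False when out-of-sample performance is below `floor` of in-sample."""
    if in_sample <= 0:
        return False  # nothing to compare against if in-sample is not profitable
    return out_of_sample / in_sample >= floor


windows = list(walk_forward_windows(list(range(2015, 2021))))
for train, test in windows:
    print(f"optimize on {train[0]}-{train[-1]}, validate on {test[0]}")
```

Each window would be optimized and validated separately, and `oos_ratio_ok` applied to every in-sample/out-of-sample pair rather than to the average.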

6. Evaluate Results

Questions to answer:

Does edge survive pessimistic assumptions?
Is performance stable across parameter variations?
Does strategy work in multiple market regimes?
Is sample size sufficient for statistical confidence?
Are results realistic, not "too good to be true"?

Decision criteria:

✅ Deploy: Survives all stress tests with acceptable performance
🔄 Refine: Core logic sound but needs parameter adjustment
❌ Abandon: Fails stress tests or relies on fragile assumptions

Use the evaluation script for a structured, quantitative assessment:

python3 skills/backtest-expert/scripts/evaluate_backtest.py \
  --total-trades 150 \
  --win-rate 62 \
  --avg-win-pct 1.8 \
  --avg-loss-pct 1.2 \
  --max-drawdown-pct 15 \
  --years-tested 8 \
  --num-parameters 3 \
  --slippage-tested \
  --output-dir reports/

The script scores across 5 dimensions (Sample Size, Expectancy, Risk Management, Robustness, Execution Realism), detects red flags, and outputs a Deploy/Refine/Abandon verdict.
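
For intuition, the expectancy and profit factor implied by the example inputs above can be worked out by hand. The sketch below applies the same formulas as the script's `calc_expectancy` and `calc_profit_factor`; the variable names are illustrative only, and it does not reproduce the script's scoring or verdict.

```python
# Inputs taken from the example command (percent values).
win_rate = 62.0
avg_win_pct = 1.8
avg_loss_pct = 1.2

wr = win_rate / 100.0
gross_win = wr * avg_win_pct           # average contribution of winners per trade
gross_loss = (1 - wr) * avg_loss_pct   # average contribution of losers per trade

expectancy = gross_win - gross_loss    # percent per trade
profit_factor = gross_win / gross_loss

print(f"expectancy = {expectancy:.2f}% per trade")  # about 0.66
print(f"profit factor = {profit_factor:.2f}")       # about 2.45
```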

Key Testing Principles

Punish the Strategy

Add friction everywhere:

Commissions higher than reality
Slippage 1.5-2x typical
Worst-case fills
Order rejections
Partial fills

Rationale: Strategies that survive pessimistic assumptions often outperform in live trading.

Seek Plateaus, Not Peaks

Look for parameter ranges where performance is stable, not optimal values that create performance spikes.

Good: Strategy profitable with stop loss anywhere from 1.5% to 3.0%
Bad: Strategy only works with stop loss at exactly 2.13%

Stable performance indicates genuine edge; narrow optima suggest curve-fitting.

Test All Cases, Not Cherry-Picked Examples

Wrong approach: Study hand-picked "market leaders" that worked
Right approach: Test every stock that met criteria, including those that failed

Selective examples create survivorship bias and overestimate strategy quality.

Separate Idea Generation from Validation

Intuition: Useful for generating hypotheses
Validation: Must be purely data-driven

Never let attachment to an idea influence interpretation of test results.

Common Failure Patterns

Recognize these patterns early to save time:

1. Parameter sensitivity: Only works with exact parameter values
2. Regime-specific: Great in some years, terrible in others
3. Slippage sensitivity: Unprofitable when realistic costs added
4. Small sample: Too few trades for statistical confidence
5. Look-ahead bias: "Too good to be true" results
6. Over-optimization: Many parameters, poor out-of-sample results

See references/failed_tests.md for detailed examples and diagnostic framework.

Output

reports/backtest_eval_<timestamp>.json — structured evaluation with per-dimension scores, red flags, and verdict
reports/backtest_eval_<timestamp>.md — human-readable report with dimension table, key metrics, and red flag details

Resources

Methodology Reference

File: references/methodology.md

When to read: For detailed guidance on specific testing techniques.

Contents:

Stress testing methods
Parameter sensitivity analysis
Slippage and friction modeling
Sample size requirements
Market regime classification
Common biases and pitfalls (survivorship, look-ahead, curve-fitting, etc.)

Failed Tests Reference

File: references/failed_tests.md

When to read: When strategy fails tests, or learning from past mistakes.

Contents:

Why failures are valuable
Common failure patterns with examples
Case study documentation framework
Red flags checklist for evaluating backtests

Critical Reminders

Time allocation: Spend 20% generating ideas, 80% trying to break them.

Context-free requirement: If strategy requires "perfect context" to work, it's not robust enough for systematic trading.

Red flag: If backtest results look too good (>90% win rate, minimal drawdowns, perfect timing), audit carefully for look-ahead bias or data issues.

Tool limitations: Understand your backtesting platform's quirks (interpolation methods, handling of low liquidity, data alignment issues).

Statistical significance: Small edges require large sample sizes to prove; a 5% edge per trade needs 100+ trades to distinguish it from luck.

Discretionary vs Systematic Differences

This skill focuses on systematic/quantitative backtesting where:

All rules are codified in advance
No discretion or "feel" in execution
Testing happens on all historical examples, not cherry-picked cases
Context (news, macro) is deliberately stripped out

Discretionary traders study differently—this skill may not apply to setups requiring subjective judgment.

Learning from Failed Backtests

1. Why Failed Ideas Are Valuable
2. Common Failure Patterns
3. Case Study Framework
4. Red Flags Checklist

1. Why Failed Ideas Are Valuable

The Value of Failures

Key insights:

Failed tests save capital by preventing live implementation
Failure patterns reveal which assumptions don't hold
Understanding what doesn't work narrows the search space
Failed tests build experience in recognizing fragile strategies

Documentation Discipline

Record for each failed idea:

The hypothesis being tested
Why you thought it would work
What the data showed
Specific breaking points
Lessons learned

Purpose: Build a library of "anti-patterns" to avoid repeating mistakes.

2. Common Failure Patterns

Pattern 1: Parameter Sensitivity

Symptom: Strategy only works with very specific parameter values.

Example scenario:

Strategy profitable with stop loss at exactly 2.5%
Increasing to 3% or decreasing to 2% causes significant performance drop
No "plateau" of stable performance

Why it fails: Real markets have noise; if small changes break the strategy, it likely captured noise, not signal.

Lesson: Seek strategies with stable performance across parameter ranges.

Pattern 2: Regime-Specific Performance

Symptom: Strategy works brilliantly in some years, terribly in others.

Example scenario:

Great performance in 2017-2019 (low volatility bull market)
Catastrophic losses in 2020 (high volatility)
Poor performance in 2022 (downtrend)

Why it fails: Strategy dependent on specific market conditions, not robust enough for diverse environments.

Lesson: Require acceptable (not necessarily best) performance across all regimes.

Pattern 3: Slippage Sensitivity

Symptom: Strategy becomes unprofitable when realistic trading costs added.

Example scenario:

Backtest shows 0.5% average gain per trade
Adding 0.1% slippage per side (0.2% round-trip) cuts that to 0.3%, and 2x stress slippage leaves only 0.1% before commissions
Strategy requires unrealistic fills to be profitable

Why it fails: Edge too small to survive real-world friction.

Lesson: Edge must be large enough to survive pessimistic assumptions about costs.
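
A quick way to see how thin the margin in this scenario is: subtract round-trip slippage from the gross edge at several stress multipliers. This is a sketch under simplifying assumptions (slippage is the only cost, commissions are ignored, and `net_edge_pct` is a hypothetical helper, not part of the evaluation script).

```python
def net_edge_pct(
    gross_edge_pct: float, slippage_per_side_pct: float, multiplier: float = 1.0
) -> float:
    """Average gain per trade after paying slippage on both entry and exit."""
    round_trip = 2 * slippage_per_side_pct * multiplier
    return gross_edge_pct - round_trip


# Figures from the scenario above: 0.5% gross gain, 0.1% slippage per side.
for m in (1.0, 1.5, 2.0):
    print(f"{m:.1f}x slippage -> net edge {net_edge_pct(0.5, 0.1, m):.2f}%")
```

Whatever remains at the 2x level still has to cover commissions and worst-case fills.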

Pattern 4: Sample Size Issues

Symptom: Strong results based on small number of trades.

Example scenario:

Backtest shows 80% win rate
Only 15 total trades in 5 years
A few different outcomes would dramatically change results

Why it fails: Insufficient data to distinguish edge from luck.

Lesson: Require minimum 100 trades for meaningful conclusions, preferably 200+.
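
To make the sample-size point concrete, a confidence interval on the win rate shows how little 15 trades can prove. The sketch below uses the Wilson score interval, which is a choice made here for illustration (the skill does not prescribe a specific interval), and assumes trades are independent.

```python
import math


def wilson_interval(wins: int, trades: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 is roughly 95% confidence)."""
    if trades <= 0:
        raise ValueError("trades must be positive")
    p = wins / trades
    denom = 1 + z**2 / trades
    center = (p + z**2 / (2 * trades)) / denom
    half = z * math.sqrt(p * (1 - p) / trades + z**2 / (4 * trades**2)) / denom
    return center - half, center + half


low, high = wilson_interval(wins=12, trades=15)    # 80% win rate on 15 trades
print(f"15 trades:  {low:.0%} to {high:.0%}")
low, high = wilson_interval(wins=160, trades=200)  # same win rate on 200 trades
print(f"200 trades: {low:.0%} to {high:.0%}")
```

With 12 wins in 15 trades the interval runs from roughly 55% to 93%; the same 80% win rate over 200 trades narrows to roughly 74% to 85%.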

Pattern 5: Look-Ahead Bias

Symptom: Perfect or near-perfect backtest results.

Example scenario:

Strategy shows 95%+ win rate
Unrealistically good entry/exit timing
Performance too good to be realistic

Why it fails: Likely using information not available at time of trade.

Lesson: Be suspicious of "too good to be true" results; audit data alignment carefully.

Pattern 6: Over-Optimization (Curve Fitting)

Symptom: Complex strategy with many parameters shows excellent in-sample results but poor out-of-sample.

Example scenario:

Strategy uses 8-10 different indicators with specific thresholds
In-sample performance: 40% annual return
Out-of-sample performance: -5% annual return
Parameters needed constant re-optimization

Why it fails: Fitted to historical noise rather than genuine market structure.

Lesson: Prefer simple strategies with fewer parameters; demand strong out-of-sample results.

3. Case Study Framework

Template for Documenting Failed Ideas

Use this framework when a backtest fails:

1. Initial Hypothesis

What edge were you trying to capture?
Why did you think this would work?
What was the logical basis?

2. Implementation Details

Entry rules (specific and complete)
Exit rules (stop loss, profit target, time-based)
Position sizing
Filters or conditions

3. Test Results

Basic metrics:
Total trades
Win rate
Average win/loss
Max drawdown
Annual returns by year

Parameter sensitivity:
How results changed with parameter variations
Whether "plateau" of stable performance existed

Regime analysis:
Performance in different market conditions
Which regimes caused problems

4. Breaking Points

What specifically caused the strategy to fail?
Slippage too high?
Parameter sensitivity?
Regime-specific?
Insufficient sample size?

5. Lessons Learned

What assumptions were wrong?
What would you test differently next time?
Are there salvageable elements?

Example: Failed Momentum Reversal Strategy

1. Initial Hypothesis

Tried to capture mean reversion after strong momentum moves. Hypothesis: Stocks that gap up 5%+ on earnings often pull back 2-3% before continuing, providing short-term reversal opportunity.

2. Implementation

Entry: Short when stock gaps up 5%+ on earnings at market open
Exit: Cover at 2% profit or 3% stop loss
Holding period: Maximum 3 days
Filters: Market cap >$2B, average volume >500K shares

3. Test Results

67 trades over 5 years
Win rate: 58%
Avg win: 2.1%, Avg loss: 3.2%
Max drawdown: 18%
2019-2021: Profitable
2022-2023: Significant losses
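
The win/loss figures above already hint at the outcome. A short check of the break-even win rate, using only the case-study numbers and ignoring slippage and commissions, is sketched below; both helper names are hypothetical.

```python
def breakeven_win_rate(avg_win_pct: float, avg_loss_pct: float) -> float:
    """Win rate (as a fraction) at which expectancy is exactly zero."""
    return avg_loss_pct / (avg_win_pct + avg_loss_pct)


def expectancy_pct(win_rate: float, avg_win_pct: float, avg_loss_pct: float) -> float:
    """Expectancy per trade in percent; win_rate is a fraction between 0 and 1."""
    return win_rate * avg_win_pct - (1 - win_rate) * avg_loss_pct


needed = breakeven_win_rate(2.1, 3.2)
actual = expectancy_pct(0.58, 2.1, 3.2)
print(f"break-even win rate: {needed:.1%}")       # about 60.4%
print(f"expectancy at 58% wins: {actual:.3f}%")   # about -0.126%
```

A 58% win rate sits below the roughly 60% needed to break even, so expectancy is negative before any friction is added.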

4. Breaking Points

Strategy failed during strong momentum environments (2021 meme stocks)
Stop losses hit frequently during continued upward momentum
Gap-ups that continued higher immediately caused outsized losses
Small sample size (67 trades) provided low statistical confidence
Slippage on short entries during high volatility eliminated thin edge

5. Lessons Learned

Mean reversion strategies vulnerable during momentum regimes
Need regime filter (e.g., only trade during high VIX or weak market)
5-year test insufficient for momentum strategies; need 10+ years
Edge too small (2% target vs 3% stop) to survive slippage
Better approach: Wait for actual pullback, then enter, rather than fade immediately

4. Red Flags Checklist

Use this checklist when evaluating any backtest:

Data Quality Issues

[ ] Has survivorship bias been addressed?
[ ] Are delisted stocks included in test?
[ ] Is data alignment correct (no look-ahead bias)?
[ ] Are corporate actions (splits, dividends) handled correctly?

Sample Size Concerns

[ ] At least 100 trades? (Preferably 200+)
[ ] At least 5 years of data? (Preferably 10+)
[ ] Includes full market cycle?
[ ] Tested across multiple market regimes?

Parameter Robustness

[ ] Does strategy work with nearby parameter values?
[ ] Are there "plateaus" of stable performance?
[ ] Minimal parameters (ideally <5)?
[ ] Parameters based on logical reasoning, not pure optimization?

Execution Realism

[ ] Realistic commissions included?
[ ] Slippage modeled conservatively (1.5-2x typical)?
[ ] Worst-case fills considered?
[ ] Order rejection/partial fills addressed?

Performance Characteristics

[ ] Positive expectancy in majority of years?
[ ] Acceptable performance in all major regimes?
[ ] No catastrophic drawdowns (>50%)?
[ ] Edge large enough to survive friction?

Bias Prevention

[ ] Strategy defined before testing?
[ ] Hypothesis has economic logic?
[ ] Results aren't "too good to be true"?
[ ] Out-of-sample testing performed?
[ ] No cherry-picking of examples?

Tool Limitations

[ ] Aware of testing platform's interpolation methods?
[ ] Understand how platform handles low-liquidity situations?
[ ] Know quirks specific to data provider?

If more than 2-3 items aren't checked, the backtest requires additional work before considering live implementation.

Backtesting Methodology Reference

1. Core Testing Techniques
2. Stress Testing Methods
3. Parameter Sensitivity Analysis
4. Slippage and Friction Modeling
5. Sample Size Guidelines
6. Market Regime Analysis
7. Common Pitfalls and Biases

1. Core Testing Techniques

"Beat Ideas to Death" Approach

Core principle: Add friction and punishment to find strategies that break the least, not those that profit the most on paper.

Key techniques:

Multiple stop loss variations
Different profit targets
Realistic + exaggerated commissions
Worst-case fills
Extended time periods
Multiple market regimes

The 80/20 Rule for R&D Time

20% generating and codifying ideas
80% stress testing and trying to break them

2. Stress Testing Methods

Execution Friction Tests

Required friction additions:

Realistic commissions (actual broker rates)
Pessimistic slippage (1.5-2x typical)
Worst-case entry fills (ask + 1-2 ticks)
Worst-case exit fills (bid - 1-2 ticks)
Order rejection scenarios
Partial fills

Parameter Robustness Tests

Test across multiple configurations:

Entry timing variations (±15-30 minutes)
Stop loss distances (50%, 75%, 100%, 125%, 150% of baseline)
Profit targets (80%, 90%, 100%, 110%, 120% of baseline)
Position sizing rules
Filter thresholds

Goal: Find "plateau" performance where small parameter changes don't drastically alter results.

Time-Based Robustness

Minimum requirements:

Test across at least 5-10 years
Include multiple market regimes:
Bull markets
Bear markets
High volatility periods
Low volatility periods
Trending markets
Range-bound markets

Year-by-year analysis: Strategy should show positive expectancy in majority of years, not rely on 1-2 exceptional years.
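
The year-by-year check can be expressed as two small tests: are most years positive, and does the total survive removing the best one or two years? The sketch below is illustrative only; the function names and the sample returns are hypothetical, and yearly returns are summed rather than compounded for simplicity.

```python
def majority_positive(yearly_returns: dict[int, float]) -> bool:
    """True when more than half of the tested years have a positive return."""
    positive = sum(1 for r in yearly_returns.values() if r > 0)
    return positive > len(yearly_returns) / 2


def total_without_best(yearly_returns: dict[int, float], drop: int = 2) -> float:
    """Sum of yearly returns after removing the `drop` best years."""
    ordered = sorted(yearly_returns.values(), reverse=True)
    return sum(ordered[drop:])


# Hypothetical yearly returns in percent, for illustration only.
returns = {2018: 4.0, 2019: 35.0, 2020: -6.0, 2021: 28.0, 2022: -3.0, 2023: 2.0}
print(majority_positive(returns))   # True: 4 of 6 years are positive
print(total_without_best(returns))  # -3.0: the result depends on 2019 and 2021
```

A strategy can pass the majority test and still fail the second one, which is why both checks are worth running.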

3. Parameter Sensitivity Analysis

Heat Map Analysis

Create 2D heat maps varying two parameters simultaneously:

Profit target (rows) × Stop loss (columns)
Entry time (rows) × Exit time (columns)
Volatility filter (rows) × Volume filter (columns)

Interpretation:

Robust strategies show "plateaus" of consistent performance
Fragile strategies show "spikes" or narrow optimal ranges
Avoid strategies with performance cliffs at parameter boundaries
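
One way to turn the plateau-versus-spike reading into a number is to compare a cell of the heat map with its worst neighbor. The metric below is an assumption made for illustration, not something the skill defines, and the grids hold hypothetical profit factors.

```python
def neighbor_stability(grid: list[list[float]], row: int, col: int) -> float:
    """Ratio of the worst neighboring cell to the cell at (row, col).

    Values near 1.0 suggest a plateau; values far below 1.0 suggest a spike.
    """
    center = grid[row][col]
    if center <= 0:
        return 0.0
    neighbors = [
        grid[r][c]
        for r in range(max(0, row - 1), min(len(grid), row + 2))
        for c in range(max(0, col - 1), min(len(grid[0]), col + 2))
        if (r, c) != (row, col)
    ]
    if not neighbors:
        raise ValueError("grid needs at least two cells")
    return min(neighbors) / center


# Hypothetical profit factors: rows = profit target, columns = stop loss.
plateau = [[1.4, 1.45, 1.4], [1.45, 1.5, 1.45], [1.4, 1.45, 1.4]]
spike = [[0.8, 0.9, 0.8], [0.9, 2.4, 0.9], [0.8, 0.9, 0.8]]
print(round(neighbor_stability(plateau, 1, 1), 2))  # 0.93
print(round(neighbor_stability(spike, 1, 1), 2))    # 0.33
```

The spike grid has the higher peak, yet its neighbors are all unprofitable, which is the pattern the interpretation notes above warn against.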

Walk-Forward Analysis

1. Optimize parameters on training period (e.g., Year 1-2)
2. Test with those parameters on validation period (Year 3)
3. Roll forward and repeat
4. Compare in-sample vs out-of-sample performance

Warning signs:

Out-of-sample performance <50% of in-sample
Frequent need to re-optimize parameters
Parameters that change dramatically between periods

Profit Factor Scoring in Evaluation Script

The evaluation script scores profit factor (PF) as one component of the Risk Management dimension (0-8 points out of 20). The mapping uses continuous linear interpolation with integer truncation:

PF < 1.0 → 0 points (unprofitable)
PF 1.0 to 3.0 → linear 0 to 8 points: int((PF - 1.0) / 2.0 * 8)
PF >= 3.0 → 8 points (capped)

The int() truncation creates discrete 1-point steps (e.g., PF 1.25→1 pt, PF 1.50→2 pt). This is intentional — integer scoring avoids false precision from fractional differences in profit factor.

4. Slippage and Friction Modeling

Realistic Slippage Assumptions

By market capitalization:

Mega cap (>$200B): 0.01-0.02%
Large cap ($10B-$200B): 0.02-0.05%
Mid cap ($2B-$10B): 0.05-0.10%
Small cap ($300M-$2B): 0.10-0.20%
Micro cap (<$300M): 0.20-0.50%+

By order type:

Market orders: Higher slippage
Limit orders: Lower slippage but potential non-fills
Stop orders: Significant slippage in volatile conditions

Conservative Testing Approach

Use 1.5-2x typical slippage estimates for stress testing:

If typical slippage is 0.05%, test with 0.075-0.10%
If typical is 0.10%, test with 0.15-0.20%

Rationale: Strategies that survive pessimistic assumptions often perform better in practice than in backtests.
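
The 1.5-2x rule is easy to apply mechanically across the market-cap tiers. The sketch below copies the typical ranges listed above into a lookup table; the tier keys and the `stress_slippage_range` helper are hypothetical names, and the micro-cap upper bound is open-ended in the source ("0.50%+"), so 0.50 is only a floor there.

```python
# Typical slippage ranges in percent, copied from the tiers above.
TYPICAL_SLIPPAGE_PCT = {
    "mega": (0.01, 0.02),
    "large": (0.02, 0.05),
    "mid": (0.05, 0.10),
    "small": (0.10, 0.20),
    "micro": (0.20, 0.50),  # source says "0.50%+", so this is a lower bound
}


def stress_slippage_range(typical_pct: float) -> tuple[float, float]:
    """Return the 1.5x and 2x stress-test slippage levels for a typical estimate."""
    return 1.5 * typical_pct, 2.0 * typical_pct


for tier, (_, high) in TYPICAL_SLIPPAGE_PCT.items():
    stressed_low, stressed_high = stress_slippage_range(high)
    print(f"{tier:>5}: typical up to {high:.2f}%, stress at {stressed_low:.3f}-{stressed_high:.3f}%")
```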

5. Sample Size Guidelines

Minimum Trade Requirements

Statistical significance thresholds:

Absolute minimum: 30 trades
Preferred minimum: 100 trades
High confidence: 200+ trades

Why large samples matter:

Reduces impact of outliers
Provides statistical confidence
Reveals true edge vs luck

Time Period Considerations

Minimum testing period: 5 years
Preferred testing period: 10+ years

Must include:

At least one full market cycle
Multiple volatility regimes
Different Federal Reserve policy environments

6. Market Regime Analysis

Regime Classification

Volatility-based regimes:

Low volatility: VIX <15
Normal volatility: VIX 15-25
High volatility: VIX 25-35
Extreme volatility: VIX >35

Trend-based regimes:

Strong uptrend: Market +10%+ over 6 months
Moderate uptrend: Market +5% to +10% over 6 months
Sideways: Market -5% to +5% over 6 months
Downtrend: Market <-5% over 6 months
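
These thresholds translate directly into two classifier functions, sketched below. The function names are hypothetical, and the handling of exact boundary values is an assumption: the source ranges overlap at VIX 25 and at a +5% return, so 25 is treated as high volatility and +5% as sideways here.

```python
def volatility_regime(vix: float) -> str:
    """Classify a VIX level using the thresholds above."""
    if vix < 15:
        return "low"
    if vix < 25:
        return "normal"
    if vix <= 35:  # assumption: exactly 25 counts as high; extreme is strictly > 35
        return "high"
    return "extreme"


def trend_regime(six_month_return_pct: float) -> str:
    """Classify a 6-month market return (in percent) using the thresholds above."""
    if six_month_return_pct >= 10:
        return "strong uptrend"
    if six_month_return_pct > 5:  # assumption: exactly +5% counts as sideways
        return "moderate uptrend"
    if six_month_return_pct >= -5:
        return "sideways"
    return "downtrend"


print(volatility_regime(18.0), "/", trend_regime(7.5))
```

Tagging every trade with both labels at entry makes it possible to report expectancy per regime instead of a single blended figure.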

Performance Requirements by Regime

Robust strategy characteristics:

Positive expectancy in majority of regimes
Acceptable (not necessarily best) in all regimes
No catastrophic failures in any single regime
Understanding of which regime causes weakness

7. Common Pitfalls and Biases

Survivorship Bias

Issue: Testing only on currently-trading stocks ignores delisted/bankrupt companies.

Solution: Use survivorship-bias-free datasets that include historical delistings.

Look-Ahead Bias

Issue: Using information not available at the time of trade.

Examples:

Using EOD data for intraday decisions
Using next-day's open for today's close decisions
Calculating indicators with future data points

Prevention: Strict timestamp control and data alignment checks.
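
The effect of a one-bar alignment mistake is easy to demonstrate on toy data. In the sketch below a signal derived from each day's own return is applied once incorrectly (to the same day) and once correctly (to the next day); the function names and the return series are hypothetical.

```python
def lookahead_returns(signals: list[int], asset_returns: list[float]) -> list[float]:
    """BUGGY on purpose: applies each signal to the period it was computed from."""
    return [s * r for s, r in zip(signals, asset_returns)]


def lagged_returns(signals: list[int], asset_returns: list[float]) -> list[float]:
    """Applies each signal to the NEXT period's return, as live trading would."""
    if len(signals) != len(asset_returns):
        raise ValueError("signals and asset_returns must have the same length")
    return [s * r for s, r in zip(signals[:-1], asset_returns[1:])]


daily = [1.0, -2.0, 0.5, -1.0, 1.5]            # hypothetical daily returns in percent
signals = [1 if r > 0 else -1 for r in daily]  # derived from the day's own close

print(sum(lookahead_returns(signals, daily)))  # 6.0: every trade "wins"
print(sum(lagged_returns(signals, daily)))     # -5.0: the edge disappears
```

A 100% win rate in the first line is exactly the kind of result that should trigger a data alignment audit.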

Curve-Fitting (Over-Optimization)

Warning signs:

Too many parameters (>=7 triggers over-optimization flag; <=4 is ideal, 5-6 acceptable)
Highly specific parameter values (e.g., RSI = 37.3)
Perfect backtest results
Large performance drop in validation period

Prevention techniques:

Limit parameters to essential ones only
Use round numbers when possible
Require out-of-sample testing
Analyze parameter sensitivity

Sample Selection Bias

Issue: Testing only on hand-picked examples (e.g., known market leaders).

Problem: Ignoring all stocks that met criteria but failed creates false impression of strategy quality.

Solution: Test on ALL historical examples meeting the criteria, not just successful outcomes.

Hindsight Bias

Issue: Using outcome knowledge to influence decisions.

Prevention for systematic trading:

Define all rules in advance
No manual intervention based on hindsight
Test rules across all cases, not cherry-picked examples

Data Mining Bias

Issue: Testing hundreds of strategies until finding one that "works" by random chance.

Risk: With enough attempts, random data will produce seemingly profitable patterns.

Mitigation:

Have hypothesis before testing
Require economic logic for the edge
Use Bonferroni correction for multiple comparisons
Demand higher significance thresholds (p < 0.01 instead of p < 0.05)
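
The Bonferroni correction itself is a one-line adjustment: divide the significance level by the number of strategies tested. The sketch below assumes a p-value is already available for each strategy variant; the helper names and the p-values are hypothetical.

```python
def bonferroni_alpha(alpha: float, num_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at `alpha`."""
    if num_tests < 1:
        raise ValueError("num_tests must be >= 1")
    return alpha / num_tests


def survivors(p_values: dict[str, float], alpha: float = 0.05) -> list[str]:
    """Names of strategies whose p-value passes the Bonferroni-adjusted threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [name for name, p in p_values.items() if p < threshold]


# Hypothetical p-values for five strategy variants tested on the same data.
p_values = {"A": 0.04, "B": 0.008, "C": 0.20, "D": 0.03, "E": 0.012}
print(survivors(p_values))  # ['B']
```

Three of the five variants would pass an unadjusted p < 0.05 test, but only one survives once the number of attempts is taken into account.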

#!/usr/bin/env python3
"""Evaluate backtest quality using a 5-dimension scoring framework.

Dimensions (each 20 points, total 100):
  1. Sample Size   — total trades
  2. Expectancy    — win rate * avg win vs loss rate * avg loss
  3. Risk Mgmt     — max drawdown and profit factor
  4. Robustness    — years tested and parameter count
  5. Exec Realism  — slippage/friction tested flag

Based on methodology from skills/backtest-expert/references/methodology.md
and red-flag checklist from skills/backtest-expert/references/failed_tests.md.
"""

from __future__ import annotations

import argparse
import json
from datetime import datetime
from pathlib import Path

# ---------------------------------------------------------------------------
# Scoring functions (each returns 0-20)
# ---------------------------------------------------------------------------


def score_sample_size(total_trades: int) -> int:
    """Score based on number of trades.

    <30  -> 0
    30   -> 8
    100  -> 15
    200+ -> 20
    """
    if total_trades < 30:
        return 0
    if total_trades < 100:
        # Linear interpolation 8..14 for 30..99
        return 8 + int((total_trades - 30) / 70 * 7)
    if total_trades < 200:
        return 15 + int((total_trades - 100) / 100 * 5)
    return 20


def calc_profit_factor(win_rate: float, avg_win_pct: float, avg_loss_pct: float) -> float:
    """Calculate profit factor: (win_rate * avg_win) / (loss_rate * avg_loss).

    Returns float('inf') when loss component is zero.
    """
    wr = win_rate / 100.0
    loss_component = (1 - wr) * avg_loss_pct
    if loss_component == 0:
        return float("inf")
    return (wr * avg_win_pct) / loss_component


def calc_expectancy(win_rate: float, avg_win_pct: float, avg_loss_pct: float) -> float:
    """Calculate expectancy per trade in percent.

    E = win_rate * avg_win - loss_rate * avg_loss
    """
    wr = win_rate / 100.0
    return wr * avg_win_pct - (1 - wr) * avg_loss_pct


def score_expectancy(win_rate: float, avg_win_pct: float, avg_loss_pct: float) -> int:
    """Score based on expectancy value.

    <=0       -> 0
    0..0.5    -> 5..9   (linear, truncated to int)
    0.5..1.5  -> 10..17 (linear, truncated to int)
    >=1.5     -> 20
    """
    exp = calc_expectancy(win_rate, avg_win_pct, avg_loss_pct)
    if exp <= 0:
        return 0
    if exp < 0.5:
        return 5 + int(exp / 0.5 * 5)
    if exp < 1.5:
        return 10 + int((exp - 0.5) / 1.0 * 8)
    return 20


def score_risk_management(
    max_drawdown_pct: float,
    win_rate: float,
    avg_win_pct: float,
    avg_loss_pct: float,
) -> int:
    """Score based on max drawdown and profit factor.

    Drawdown component (0-12):
      <20%  -> 12
      20-50% -> linear 12..0
      >=50% -> whole dimension scores 0 (catastrophic override)

    Profit factor component (0-8):
      <1.0  -> 0
      1.0-3.0 -> linear 0..8
      3.0+  -> 8
    """
    # Drawdown component (0-12)
    # 50%+ drawdown is catastrophic — override total to 0
    if max_drawdown_pct >= 50:
        return 0
    if max_drawdown_pct < 20:
        dd_score = 12
    else:
        dd_score = int(12 * (50 - max_drawdown_pct) / 30)

    # Profit factor component (0-8)
    # Continuous: PF 1.0→3.0 maps linearly to 0→8, capped at 8 for PF≥3.0
    pf = calc_profit_factor(win_rate, avg_win_pct, avg_loss_pct)
    if pf < 1.0:
        pf_score = 0
    elif pf >= 3.0:
        pf_score = 8
    else:
        pf_score = int((pf - 1.0) / 2.0 * 8)

    total = dd_score + pf_score
    return min(20, total)


def score_robustness(years_tested: int, num_parameters: int) -> int:
    """Score based on test duration and parameter count.

    Years component (0-15):
      <5   -> 0
      5-9  -> linear 5..13 (2 points per year)
      10+  -> 15

    Parameter component (0-5):
      <=4  -> 5
      5-6  -> 3
      7    -> 1
      8+   -> 0
    """
    # Years component (0-15)
    if years_tested < 5:
        years_score = 0
    elif years_tested >= 10:
        years_score = 15
    else:
        years_score = 5 + int((years_tested - 5) / 5 * 10)

    # Parameter component (0-5)
    if num_parameters <= 4:
        param_score = 5
    elif num_parameters <= 6:
        param_score = 3
    elif num_parameters == 7:
        param_score = 1
    else:
        param_score = 0

    return min(20, years_score + param_score)


def score_execution_realism(slippage_tested: bool) -> int:
    """Score based on whether slippage/friction was tested.

    Tested   -> 20
    Untested -> 0
    """
    return 20 if slippage_tested else 0


# ---------------------------------------------------------------------------
# Verdict
# ---------------------------------------------------------------------------


def get_verdict(total_score: int) -> str:
    """Map total score to Deploy / Refine / Abandon."""
    if total_score >= 70:
        return "Deploy"
    if total_score >= 40:
        return "Refine"
    return "Abandon"


# ---------------------------------------------------------------------------
# Red flags
# ---------------------------------------------------------------------------


def detect_red_flags(
    total_trades: int,
    win_rate: float,
    avg_win_pct: float,
    avg_loss_pct: float,
    max_drawdown_pct: float,
    years_tested: int,
    num_parameters: int,
    slippage_tested: bool,
) -> list[dict]:
    """Detect red flags based on methodology checklist."""
    flags: list[dict] = []

    if total_trades < 30:
        flags.append(
            {
                "id": "small_sample",
                "severity": "high",
                "message": f"Only {total_trades} trades — minimum 30 required for statistical confidence.",
            }
        )

    if not slippage_tested:
        flags.append(
            {
                "id": "no_slippage_test",
                "severity": "high",
                "message": "Slippage/friction not tested — results may not survive real-world execution.",
            }
        )

    if max_drawdown_pct > 50:
        flags.append(
            {
                "id": "excessive_drawdown",
                "severity": "high",
                "message": f"Max drawdown {max_drawdown_pct}% exceeds 50% threshold — catastrophic risk.",
            }
        )

    if num_parameters >= 7:
        flags.append(
            {
                "id": "over_optimized",
                "severity": "medium",
                "message": f"{num_parameters} parameters suggests over-optimization / curve-fitting risk.",
            }
        )

    if years_tested < 5:
        flags.append(
            {
                "id": "short_test_period",
                "severity": "medium",
                "message": f"Only {years_tested} years tested — may miss regime changes (minimum 5 recommended).",
            }
        )

    exp = calc_expectancy(win_rate, avg_win_pct, avg_loss_pct)
    if exp < 0:
        flags.append(
            {
                "id": "negative_expectancy",
                "severity": "high",
                "message": f"Negative expectancy ({exp:.3f}%) — strategy loses money on average.",
            }
        )

    if win_rate > 90 and max_drawdown_pct < 5:
        flags.append(
            {
                "id": "too_good",
                "severity": "medium",
                "message": "Results look too good — audit for look-ahead bias or data issues.",
            }
        )

    return flags


# ---------------------------------------------------------------------------
# Main evaluation
# ---------------------------------------------------------------------------


def validate_inputs(
    total_trades: int,
    win_rate: float,
    avg_win_pct: float,
    avg_loss_pct: float,
    max_drawdown_pct: float,
    years_tested: int,
    num_parameters: int,
) -> None:
    """Validate evaluation inputs at system boundary. Raises ValueError."""
    if total_trades < 0:
        raise ValueError("total_trades must be >= 0")
    if not (0 <= win_rate <= 100):
        raise ValueError("win_rate must be between 0 and 100")
    if avg_win_pct < 0:
        raise ValueError("avg_win_pct must be >= 0")
    if avg_loss_pct < 0:
        raise ValueError("avg_loss_pct must be >= 0")
    if max_drawdown_pct < 0:
        raise ValueError("max_drawdown_pct must be >= 0")
    if years_tested < 0:
        raise ValueError("years_tested must be >= 0")
    if num_parameters < 0:
        raise ValueError("num_parameters must be >= 0")


def evaluate(
    total_trades: int,
    win_rate: float,
    avg_win_pct: float,
    avg_loss_pct: float,
    max_drawdown_pct: float,
    years_tested: int,
    num_parameters: int,
    slippage_tested: bool,
) -> dict:
    """Run full 5-dimension evaluation and return structured result."""
    validate_inputs(
        total_trades,
        win_rate,
        avg_win_pct,
        avg_loss_pct,
        max_drawdown_pct,
        years_tested,
        num_parameters,
    )
    d1 = score_sample_size(total_trades)
    d2 = score_expectancy(win_rate, avg_win_pct, avg_loss_pct)
    d3 = score_risk_management(max_drawdown_pct, win_rate, avg_win_pct, avg_loss_pct)
    d4 = score_robustness(years_tested, num_parameters)
    d5 = score_execution_realism(slippage_tested)

    total = d1 + d2 + d3 + d4 + d5
    total = max(0, min(100, total))

    return {
        "total_score": total,
        "verdict": get_verdict(total),
        "dimensions": [
            {"name": "Sample Size", "score": d1, "max_score": 20},
            {"name": "Expectancy", "score": d2, "max_score": 20},
            {"name": "Risk Management", "score": d3, "max_score": 20},
            {"name": "Robustness", "score": d4, "max_score": 20},
            {"name": "Execution Realism", "score": d5, "max_score": 20},
        ],
        "red_flags": detect_red_flags(
            total_trades,
            win_rate,
            avg_win_pct,
            avg_loss_pct,
            max_drawdown_pct,
            years_tested,
            num_parameters,
            slippage_tested,
        ),
        "profit_factor": calc_profit_factor(win_rate, avg_win_pct, avg_loss_pct),
        "expectancy": calc_expectancy(win_rate, avg_win_pct, avg_loss_pct),
        "inputs": {
            "total_trades": total_trades,
            "win_rate": win_rate,
            "avg_win_pct": avg_win_pct,
            "avg_loss_pct": avg_loss_pct,
            "max_drawdown_pct": max_drawdown_pct,
            "years_tested": years_tested,
            "num_parameters": num_parameters,
            "slippage_tested": slippage_tested,
        },
    }


# ---------------------------------------------------------------------------
# Output writers
# ---------------------------------------------------------------------------


def to_markdown(result: dict) -> str:
    """Render evaluation result as markdown report."""
    lines = [
        "# Backtest Evaluation Report",
        "",
        f"**Generated**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        "",
        f"## Verdict: {result['verdict']}",
        "",
        f"**Total Score: {result['total_score']} / 100**",
        "",
        "## Dimension Scores",
        "",
        "| Dimension | Score | Max |",
        "|-----------|------:|----:|",
    ]
    for dim in result["dimensions"]:
        lines.append(f"| {dim['name']} | {dim['score']} | {dim['max_score']} |")

    # Build the profit-factor line up front so the list literal stays readable.
    pf = result["profit_factor"]
    pf_line = (
        "- **Profit Factor**: Inf (no losing trades)"
        if pf == float("inf")
        else f"- **Profit Factor**: {pf:.2f}"
    )
    lines.extend(
        [
            "",
            "## Key Metrics",
            "",
            pf_line,
            f"- **Expectancy**: {result['expectancy']:.3f}% per trade",
        ]
    )

    if result["red_flags"]:
        lines.extend(["", "## Red Flags", ""])
        for flag in result["red_flags"]:
            icon = "🔴" if flag["severity"] == "high" else "🟡"
            lines.append(f"- {icon} **{flag['id']}**: {flag['message']}")
    else:
        lines.extend(["", "## Red Flags", "", "No red flags detected."])

    lines.extend(
        [
            "",
            "## Input Parameters",
            "",
        ]
    )
    for key, value in result["inputs"].items():
        lines.append(f"- **{key}**: {value}")

    lines.append("")
    return "\n".join(lines)


def write_outputs(result: dict, output_dir: Path) -> tuple[Path, Path]:
    """Write JSON and Markdown reports to output_dir. Returns (json_path, md_path)."""
    output_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    stem = f"backtest_eval_{timestamp}"

    json_path = output_dir / f"{stem}.json"
    md_path = output_dir / f"{stem}.md"

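    # json.dumps would emit the non-standard token `Infinity` for an infinite
    # profit factor; write it as a string so the file stays valid JSON.
    payload = dict(result)
    if payload.get("profit_factor") == float("inf"):
        payload["profit_factor"] = str(payload["profit_factor"])
    json_path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2, default=str),
        encoding="utf-8",
    )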
    md_path.write_text(to_markdown(result), encoding="utf-8")

    return json_path, md_path


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Evaluate backtest quality using a 5-dimension scoring framework."
    )
    parser.add_argument(
        "--total-trades", type=int, required=True, help="Number of trades in backtest"
    )
    parser.add_argument(
        "--win-rate", type=float, required=True, help="Win rate in percent (e.g. 58)"
    )
    parser.add_argument(
        "--avg-win-pct", type=float, required=True, help="Average winning trade in percent"
    )
    parser.add_argument(
        "--avg-loss-pct",
        type=float,
        required=True,
        help="Average losing trade in percent (positive number)",
    )
    parser.add_argument(
        "--max-drawdown-pct", type=float, required=True, help="Maximum drawdown in percent"
    )
    parser.add_argument(
        "--years-tested", type=int, required=True, help="Number of years in backtest period"
    )
    parser.add_argument(
        "--num-parameters", type=int, required=True, help="Number of tunable parameters in strategy"
    )
    parser.add_argument(
        "--slippage-tested", action="store_true", help="Whether slippage/friction was modeled"
    )
    parser.add_argument(
        "--output-dir", default="reports/", help="Output directory (default: reports/)"
    )
    return parser.parse_args()


def main() -> int:
    args = parse_args()

    result = evaluate(
        total_trades=args.total_trades,
        win_rate=args.win_rate,
        avg_win_pct=args.avg_win_pct,
        avg_loss_pct=args.avg_loss_pct,
        max_drawdown_pct=args.max_drawdown_pct,
        years_tested=args.years_tested,
        num_parameters=args.num_parameters,
        slippage_tested=args.slippage_tested,
    )

    output_dir = Path(args.output_dir)
    json_path, md_path = write_outputs(result, output_dir)

    print(f"Score: {result['total_score']}/100 — Verdict: {result['verdict']}")
    if result["red_flags"]:
        print(f"Red flags: {len(result['red_flags'])}")
        for flag in result["red_flags"]:
            print(f"  [{flag['severity'].upper()}] {flag['message']}")
    print(f"JSON: {json_path}")
    print(f"Markdown: {md_path}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

"""Test fixtures for backtest evaluator."""

from __future__ import annotations

import importlib.util
import sys
from pathlib import Path

import pytest


@pytest.fixture(scope="module")
def evaluator_module():
    """Load evaluate_backtest.py as a module for unit tests."""
    script_path = Path(__file__).resolve().parents[1] / "evaluate_backtest.py"
    spec = importlib.util.spec_from_file_location("evaluate_backtest", script_path)
    if spec is None or spec.loader is None:
        raise RuntimeError("Failed to load evaluate_backtest.py")
    module = importlib.util.module_from_spec(spec)
    sys.modules[spec.name] = module
    spec.loader.exec_module(module)
    return module

"""Tests for evaluate_backtest.py — backtest quality evaluation tool."""

from __future__ import annotations

import json
from pathlib import Path

import pytest

# ---------------------------------------------------------------------------
# 0. Input validation
# ---------------------------------------------------------------------------


class TestInputValidation:
    def test_win_rate_above_100(self, evaluator_module):
        """win_rate > 100 raises ValueError."""
        with pytest.raises(ValueError, match="win_rate"):
            evaluator_module.evaluate(
                total_trades=100,
                win_rate=110,
                avg_win_pct=2.0,
                avg_loss_pct=1.0,
                max_drawdown_pct=15,
                years_tested=5,
                num_parameters=3,
                slippage_tested=True,
            )

    def test_win_rate_negative(self, evaluator_module):
        """win_rate < 0 raises ValueError."""
        with pytest.raises(ValueError, match="win_rate"):
            evaluator_module.evaluate(
                total_trades=100,
                win_rate=-5,
                avg_win_pct=2.0,
                avg_loss_pct=1.0,
                max_drawdown_pct=15,
                years_tested=5,
                num_parameters=3,
                slippage_tested=True,
            )

    def test_negative_avg_win(self, evaluator_module):
        """Negative avg_win_pct raises ValueError."""
        with pytest.raises(ValueError, match="avg_win_pct"):
            evaluator_module.evaluate(
                total_trades=100,
                win_rate=60,
                avg_win_pct=-1.0,
                avg_loss_pct=1.0,
                max_drawdown_pct=15,
                years_tested=5,
                num_parameters=3,
                slippage_tested=True,
            )

    def test_negative_avg_loss(self, evaluator_module):
        """Negative avg_loss_pct raises ValueError."""
        with pytest.raises(ValueError, match="avg_loss_pct"):
            evaluator_module.evaluate(
                total_trades=100,
                win_rate=60,
                avg_win_pct=2.0,
                avg_loss_pct=-1.0,
                max_drawdown_pct=15,
                years_tested=5,
                num_parameters=3,
                slippage_tested=True,
            )

    def test_negative_total_trades(self, evaluator_module):
        """Negative total_trades raises ValueError."""
        with pytest.raises(ValueError, match="total_trades"):
            evaluator_module.evaluate(
                total_trades=-10,
                win_rate=60,
                avg_win_pct=2.0,
                avg_loss_pct=1.0,
                max_drawdown_pct=15,
                years_tested=5,
                num_parameters=3,
                slippage_tested=True,
            )

    def test_negative_max_drawdown(self, evaluator_module):
        """Negative max_drawdown_pct raises ValueError."""
        with pytest.raises(ValueError, match="max_drawdown_pct"):
            evaluator_module.evaluate(
                total_trades=100,
                win_rate=60,
                avg_win_pct=2.0,
                avg_loss_pct=1.0,
                max_drawdown_pct=-5,
                years_tested=5,
                num_parameters=3,
                slippage_tested=True,
            )

    def test_negative_years_tested(self, evaluator_module):
        """Negative years_tested raises ValueError."""
        with pytest.raises(ValueError, match="years_tested"):
            evaluator_module.evaluate(
                total_trades=100,
                win_rate=60,
                avg_win_pct=2.0,
                avg_loss_pct=1.0,
                max_drawdown_pct=15,
                years_tested=-1,
                num_parameters=3,
                slippage_tested=True,
            )

    def test_negative_num_parameters(self, evaluator_module):
        """Negative num_parameters raises ValueError."""
        with pytest.raises(ValueError, match="num_parameters"):
            evaluator_module.evaluate(
                total_trades=100,
                win_rate=60,
                avg_win_pct=2.0,
                avg_loss_pct=1.0,
                max_drawdown_pct=15,
                years_tested=5,
                num_parameters=-2,
                slippage_tested=True,
            )

    def test_boundary_win_rate_zero(self, evaluator_module):
        """win_rate=0 is valid (all losses)."""
        result = evaluator_module.evaluate(
            total_trades=100,
            win_rate=0,
            avg_win_pct=2.0,
            avg_loss_pct=1.0,
            max_drawdown_pct=15,
            years_tested=5,
            num_parameters=3,
            slippage_tested=True,
        )
        assert result["total_score"] >= 0

    def test_boundary_win_rate_100(self, evaluator_module):
        """win_rate=100 is valid (all wins)."""
        result = evaluator_module.evaluate(
            total_trades=100,
            win_rate=100,
            avg_win_pct=2.0,
            avg_loss_pct=1.0,
            max_drawdown_pct=15,
            years_tested=5,
            num_parameters=3,
            slippage_tested=True,
        )
        assert result["total_score"] >= 0


# ---------------------------------------------------------------------------
# 1. Sample Size scoring
# ---------------------------------------------------------------------------


class TestSampleSizeScoring:
    def test_below_minimum(self, evaluator_module):
        """<30 trades -> 0 points."""
        assert evaluator_module.score_sample_size(20) == 0

    def test_at_minimum(self, evaluator_module):
        """30 trades -> partial credit."""
        score = evaluator_module.score_sample_size(30)
        assert 0 < score < 15

    def test_good_sample(self, evaluator_module):
        """100 trades -> 15 points."""
        assert evaluator_module.score_sample_size(100) == 15

    def test_excellent_sample(self, evaluator_module):
        """200+ trades -> full 20 points."""
        assert evaluator_module.score_sample_size(200) == 20
        assert evaluator_module.score_sample_size(500) == 20


# ---------------------------------------------------------------------------
# 2. Expectancy calculation
# ---------------------------------------------------------------------------


class TestExpectancyCalculation:
    def test_positive_expectancy(self, evaluator_module):
        """Positive expectancy -> positive score."""
        score = evaluator_module.score_expectancy(win_rate=60, avg_win_pct=2.0, avg_loss_pct=1.0)
        assert score > 0

    def test_negative_expectancy(self, evaluator_module):
        """Negative expectancy -> 0 points."""
        score = evaluator_module.score_expectancy(win_rate=30, avg_win_pct=1.0, avg_loss_pct=2.0)
        assert score == 0

    def test_zero_expectancy(self, evaluator_module):
        """Break-even -> 0 points."""
        # win_rate=50, avg_win=1, avg_loss=1 => expectancy = 0
        score = evaluator_module.score_expectancy(win_rate=50, avg_win_pct=1.0, avg_loss_pct=1.0)
        assert score == 0

    def test_high_expectancy_capped(self, evaluator_module):
        """Very high expectancy still capped at 20."""
        score = evaluator_module.score_expectancy(win_rate=90, avg_win_pct=5.0, avg_loss_pct=0.5)
        assert score == 20


# ---------------------------------------------------------------------------
# 3. Risk Management scoring
# ---------------------------------------------------------------------------


class TestRiskManagementScoring:
    def test_low_drawdown(self, evaluator_module):
        """<20% drawdown + high PF -> full points."""
        # PF = (0.75 * 2.0) / (0.25 * 1.0) = 6.0 (well above 3.0 threshold)
        score = evaluator_module.score_risk_management(
            max_drawdown_pct=10, win_rate=75, avg_win_pct=2.0, avg_loss_pct=1.0
        )
        assert score == 20

    def test_moderate_drawdown(self, evaluator_module):
        """30% drawdown -> partial points."""
        score = evaluator_module.score_risk_management(
            max_drawdown_pct=30, win_rate=60, avg_win_pct=2.0, avg_loss_pct=1.0
        )
        assert 0 < score < 20

    def test_extreme_drawdown(self, evaluator_module):
        """50%+ drawdown -> 0 points."""
        score = evaluator_module.score_risk_management(
            max_drawdown_pct=55, win_rate=60, avg_win_pct=2.0, avg_loss_pct=1.0
        )
        assert score == 0


# ---------------------------------------------------------------------------
# 4. Robustness scoring — years tested
# ---------------------------------------------------------------------------


class TestRobustnessYears:
    def test_short_period(self, evaluator_module):
        """<5 years -> 0 for years component."""
        score = evaluator_module.score_robustness(years_tested=3, num_parameters=3)
        assert score < 15  # years component is 0

    def test_minimum_years(self, evaluator_module):
        """5 years -> partial credit."""
        score = evaluator_module.score_robustness(years_tested=5, num_parameters=3)
        assert score > 0

    def test_long_period(self, evaluator_module):
        """10+ years -> full years component (15 pts)."""
        score = evaluator_module.score_robustness(years_tested=10, num_parameters=3)
        assert score >= 15


# ---------------------------------------------------------------------------
# 5. Robustness scoring — parameters
# ---------------------------------------------------------------------------


class TestRobustnessParameters:
    def test_few_parameters(self, evaluator_module):
        """3 parameters -> full parameter component (5 pts)."""
        score = evaluator_module.score_robustness(years_tested=10, num_parameters=3)
        assert score == 20  # 15 (years) + 5 (params)

    def test_moderate_parameters(self, evaluator_module):
        """5 parameters -> partial deduction."""
        score = evaluator_module.score_robustness(years_tested=10, num_parameters=5)
        assert 15 <= score < 20

    def test_many_parameters(self, evaluator_module):
        """8+ parameters -> heavy deduction."""
        score = evaluator_module.score_robustness(years_tested=10, num_parameters=8)
        assert score < 18


# ---------------------------------------------------------------------------
# 6. Slippage flag
# ---------------------------------------------------------------------------


class TestSlippageFlag:
    def test_slippage_tested(self, evaluator_module):
        """Slippage tested -> full 20 points."""
        score = evaluator_module.score_execution_realism(slippage_tested=True)
        assert score == 20

    def test_slippage_not_tested(self, evaluator_module):
        """Slippage not tested -> 0 points."""
        score = evaluator_module.score_execution_realism(slippage_tested=False)
        assert score == 0


# ---------------------------------------------------------------------------
# 7. Overall verdict
# ---------------------------------------------------------------------------


class TestOverallVerdict:
    def test_deploy_verdict(self, evaluator_module):
        """Score >= 70 -> Deploy."""
        assert evaluator_module.get_verdict(75) == "Deploy"
        assert evaluator_module.get_verdict(100) == "Deploy"

    def test_refine_verdict(self, evaluator_module):
        """40 <= score < 70 -> Refine."""
        assert evaluator_module.get_verdict(50) == "Refine"
        assert evaluator_module.get_verdict(69) == "Refine"

    def test_abandon_verdict(self, evaluator_module):
        """Score < 40 -> Abandon."""
        assert evaluator_module.get_verdict(30) == "Abandon"
        assert evaluator_module.get_verdict(0) == "Abandon"


# ---------------------------------------------------------------------------
# 8. Profit factor calculation
# ---------------------------------------------------------------------------


class TestProfitFactor:
    def test_positive_profit_factor(self, evaluator_module):
        """win_rate=60, avg_win=2, avg_loss=1 -> PF = 3.0."""
        pf = evaluator_module.calc_profit_factor(win_rate=60, avg_win_pct=2.0, avg_loss_pct=1.0)
        assert abs(pf - 3.0) < 0.01

    def test_breakeven_profit_factor(self, evaluator_module):
        """win_rate=50, avg_win=1, avg_loss=1 -> PF = 1.0."""
        pf = evaluator_module.calc_profit_factor(win_rate=50, avg_win_pct=1.0, avg_loss_pct=1.0)
        assert abs(pf - 1.0) < 0.01

    def test_zero_loss_profit_factor(self, evaluator_module):
        """100% win rate -> PF = inf (no losing trades, not capped)."""
        pf = evaluator_module.calc_profit_factor(win_rate=100, avg_win_pct=2.0, avg_loss_pct=1.0)
        assert pf == float("inf")


# ---------------------------------------------------------------------------
# 8b. PF score boundary smoothness
# ---------------------------------------------------------------------------


class TestProfitFactorScoreSmoothness:
    def test_pf_boundary_no_large_jump(self, evaluator_module):
        """PF score should not jump more than 2 points at PF=2.0 boundary."""
        # Use drawdown <20% so dd_score is constant at 12
        score_below = evaluator_module.score_risk_management(
            max_drawdown_pct=10, win_rate=50, avg_win_pct=3.98, avg_loss_pct=2.0
        )  # PF ~= 1.99
        score_at = evaluator_module.score_risk_management(
            max_drawdown_pct=10, win_rate=50, avg_win_pct=4.0, avg_loss_pct=2.0
        )  # PF = 2.0
        # The jump should be at most 2 points (not the previous 4-point gap)
        assert abs(score_at - score_below) <= 2

    def test_pf_monotonically_increasing(self, evaluator_module):
        """Higher PF should give equal or higher risk management score."""
        # All with same low drawdown to isolate PF component
        scores = []
        for avg_win in [1.0, 1.5, 2.0, 3.0, 5.0]:
            s = evaluator_module.score_risk_management(
                max_drawdown_pct=10, win_rate=60, avg_win_pct=avg_win, avg_loss_pct=1.0
            )
            scores.append(s)
        for i in range(len(scores) - 1):
            assert scores[i] <= scores[i + 1], (
                f"Score decreased: PF step {i} gave {scores[i]} "
                f"but step {i + 1} gave {scores[i + 1]}"
            )


# ---------------------------------------------------------------------------
# 9. Output JSON structure
# ---------------------------------------------------------------------------


class TestOutputJsonStructure:
    def test_evaluate_returns_all_keys(self, evaluator_module):
        """evaluate() result must contain all required keys."""
        result = evaluator_module.evaluate(
            total_trades=150,
            win_rate=62,
            avg_win_pct=1.8,
            avg_loss_pct=1.2,
            max_drawdown_pct=15,
            years_tested=8,
            num_parameters=3,
            slippage_tested=True,
        )
        required_keys = {
            "total_score",
            "verdict",
            "dimensions",
            "red_flags",
            "profit_factor",
            "expectancy",
        }
        assert required_keys.issubset(result.keys())

    def test_dimensions_structure(self, evaluator_module):
        """Each dimension must have name, score, max_score."""
        result = evaluator_module.evaluate(
            total_trades=100,
            win_rate=55,
            avg_win_pct=1.5,
            avg_loss_pct=1.0,
            max_drawdown_pct=20,
            years_tested=5,
            num_parameters=4,
            slippage_tested=True,
        )
        for dim in result["dimensions"]:
            assert "name" in dim
            assert "score" in dim
            assert "max_score" in dim
            assert dim["max_score"] == 20

    def test_total_score_range(self, evaluator_module):
        """Total score must be 0-100."""
        result = evaluator_module.evaluate(
            total_trades=200,
            win_rate=65,
            avg_win_pct=2.5,
            avg_loss_pct=1.0,
            max_drawdown_pct=12,
            years_tested=10,
            num_parameters=3,
            slippage_tested=True,
        )
        assert 0 <= result["total_score"] <= 100


# ---------------------------------------------------------------------------
# 10. Red flags detection
# ---------------------------------------------------------------------------


class TestRedFlagsDetection:
    def test_small_sample_flag(self, evaluator_module):
        """<30 trades triggers red flag."""
        result = evaluator_module.evaluate(
            total_trades=20,
            win_rate=80,
            avg_win_pct=3.0,
            avg_loss_pct=1.0,
            max_drawdown_pct=10,
            years_tested=10,
            num_parameters=3,
            slippage_tested=True,
        )
        flags = [f["id"] for f in result["red_flags"]]
        assert "small_sample" in flags

    def test_no_slippage_flag(self, evaluator_module):
        """Slippage not tested triggers red flag."""
        result = evaluator_module.evaluate(
            total_trades=200,
            win_rate=60,
            avg_win_pct=2.0,
            avg_loss_pct=1.0,
            max_drawdown_pct=15,
            years_tested=10,
            num_parameters=3,
            slippage_tested=False,
        )
        flags = [f["id"] for f in result["red_flags"]]
        assert "no_slippage_test" in flags

    def test_excessive_drawdown_flag(self, evaluator_module):
        """>50% drawdown triggers red flag."""
        result = evaluator_module.evaluate(
            total_trades=200,
            win_rate=60,
            avg_win_pct=2.0,
            avg_loss_pct=1.0,
            max_drawdown_pct=55,
            years_tested=10,
            num_parameters=3,
            slippage_tested=True,
        )
        flags = [f["id"] for f in result["red_flags"]]
        assert "excessive_drawdown" in flags

    def test_over_optimized_flag_at_8(self, evaluator_module):
        """8 parameters triggers red flag."""
        result = evaluator_module.evaluate(
            total_trades=200,
            win_rate=60,
            avg_win_pct=2.0,
            avg_loss_pct=1.0,
            max_drawdown_pct=15,
            years_tested=10,
            num_parameters=8,
            slippage_tested=True,
        )
        flags = [f["id"] for f in result["red_flags"]]
        assert "over_optimized" in flags

    def test_over_optimized_flag_at_7(self, evaluator_module):
        """7 parameters also triggers red flag (already penalized in scoring)."""
        result = evaluator_module.evaluate(
            total_trades=200,
            win_rate=60,
            avg_win_pct=2.0,
            avg_loss_pct=1.0,
            max_drawdown_pct=15,
            years_tested=10,
            num_parameters=7,
            slippage_tested=True,
        )
        flags = [f["id"] for f in result["red_flags"]]
        assert "over_optimized" in flags

    def test_no_over_optimized_flag_at_6(self, evaluator_module):
        """6 parameters does NOT trigger over_optimized flag."""
        result = evaluator_module.evaluate(
            total_trades=200,
            win_rate=60,
            avg_win_pct=2.0,
            avg_loss_pct=1.0,
            max_drawdown_pct=15,
            years_tested=10,
            num_parameters=6,
            slippage_tested=True,
        )
        flags = [f["id"] for f in result["red_flags"]]
        assert "over_optimized" not in flags

    def test_short_test_period_flag(self, evaluator_module):
        """<5 years triggers red flag."""
        result = evaluator_module.evaluate(
            total_trades=200,
            win_rate=60,
            avg_win_pct=2.0,
            avg_loss_pct=1.0,
            max_drawdown_pct=15,
            years_tested=3,
            num_parameters=3,
            slippage_tested=True,
        )
        flags = [f["id"] for f in result["red_flags"]]
        assert "short_test_period" in flags

    def test_clean_backtest_no_flags(self, evaluator_module):
        """Well-constructed backtest has no red flags."""
        result = evaluator_module.evaluate(
            total_trades=200,
            win_rate=60,
            avg_win_pct=2.0,
            avg_loss_pct=1.0,
            max_drawdown_pct=15,
            years_tested=10,
            num_parameters=3,
            slippage_tested=True,
        )
        assert result["red_flags"] == []


# ---------------------------------------------------------------------------
# 11. File output (JSON + Markdown)
# ---------------------------------------------------------------------------


class TestFileOutput:
    def test_write_outputs(self, evaluator_module, tmp_path: Path):
        """write_outputs creates .json and .md files."""
        result = evaluator_module.evaluate(
            total_trades=150,
            win_rate=62,
            avg_win_pct=1.8,
            avg_loss_pct=1.2,
            max_drawdown_pct=15,
            years_tested=8,
            num_parameters=3,
            slippage_tested=True,
        )
        json_path, md_path = evaluator_module.write_outputs(result, tmp_path)

        assert json_path.exists()
        assert md_path.exists()

        data = json.loads(json_path.read_text(encoding="utf-8"))
        assert data["total_score"] == result["total_score"]

        md_text = md_path.read_text(encoding="utf-8")
        assert "# Backtest Evaluation Report" in md_text
        assert result["verdict"] in md_text
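
The profit-factor and expectancy tests above pin down the arithmetic through their comments (for example, `PF = (0.75 * 2.0) / (0.25 * 1.0) = 6.0`). The following is a minimal sketch of formulas consistent with those comments. It is not the module's own `calc_profit_factor` or `calc_expectancy`, whose bodies are not shown here; the names `profit_factor_sketch` and `expectancy_sketch` are hypothetical, and the zero-loss handling is an assumption based on `test_zero_loss_profit_factor`.

```python
"""Sketch of the metrics implied by the test comments (assumed formulas)."""

from __future__ import annotations


def profit_factor_sketch(win_rate: float, avg_win_pct: float, avg_loss_pct: float) -> float:
    """Expected gross win divided by expected gross loss.

    win_rate is a percent (0-100); avg_win_pct and avg_loss_pct are
    non-negative percents, mirroring the evaluator's inputs.
    """
    p_win = win_rate / 100.0
    gross_win = p_win * avg_win_pct
    gross_loss = (1.0 - p_win) * avg_loss_pct
    if gross_loss == 0:
        # Assumption: with no expected losses the ratio is reported as infinity.
        return float("inf")
    return gross_win / gross_loss


def expectancy_sketch(win_rate: float, avg_win_pct: float, avg_loss_pct: float) -> float:
    """Average result per trade in percent: wins minus losses, weighted by frequency."""
    p_win = win_rate / 100.0
    return p_win * avg_win_pct - (1.0 - p_win) * avg_loss_pct


if __name__ == "__main__":
    # Same inputs as test_positive_profit_factor.
    print(profit_factor_sketch(60, 2.0, 1.0))
    print(expectancy_sketch(60, 2.0, 1.0))
```

With a 60% win rate, a 2% average win, and a 1% average loss, the sketch gives a profit factor of about 3.0 and an expectancy of about 0.8% per trade. Floating-point rounding is why the tests compare against a tolerance rather than an exact value.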

Related skills

Wind Mcp Skill: Give their AI coding agent instant access to comprehensive Chinese and global equity, fund, bond, and macroeconomic data from Wind. (56.4k · 67)

Upgrade Stripe: Safely upgrade their Stripe API version and SDKs without introducing breaking changes. (51.9k · 1.7k)

Recipe Create Expense Tracker: Instantly create a ready-to-use Google Sheets expense tracker with headers, sample data, and sharing permissions. (23.9k · 30k)

Backtesting Frameworks: Create realistic event-driven backtesters that simulate order execution, slippage, commissions, and position tracking before committing capital to a trading strategy. (13.2k · 38.2k)

Grimoire Polymarket: Query live Polymarket odds, search events, and manage conditional order flow directly from Claude or Cursor without leaving the IDE. (11.9k · 6)

Coinglass: Give their AI coding agents real-time access to crypto derivatives market data such as funding rates, open interest, long/short ratios, liquidations, and futures OHLC h (11.3k · 18)

Forks & variants (1)

Backtest Expert has 1 known copy in the catalog, totaling 270 installs. It canonicalizes to this original listing.

wind-information-co-ltd - 270 installs

How it compares

Use backtest-expert after backtests fail when you need structured postmortems and anti-patterns, not when you need payoff chart prototypes or live trading connectors.

FAQ

What does backtest-expert do?

It helps you systematically backtest and stress-test quantitative trading strategies before live use.

When should I use backtest-expert?

Use it when you develop, test, or stress-test quantitative trading strategy backtests.

Is backtest-expert safe to install?

Review the Security Audits panel on this page before installing in production.

