Evaluating Code Models

Name: Evaluating Code Models
Author: orchestra-research

orchestra-research/ai-research-skills

392 installs
11.2k repo stars
Updated June 16, 2026

evaluating-code-models is an agent skill that runs BigCode Evaluation Harness benchmarks such as HumanEval and HumanEval+ and reports pass@k metrics, so developers can compare code LLMs before committing to a codegen or agent stack.

About

evaluating-code-models is an agent skill that wraps the BigCode Evaluation Harness to benchmark code-generation models across 15+ standardized suites before teams adopt a codegen or agent stack. It documents HumanEval with 164 Python problems, MBPP with 500 entry-level tasks, HumanEval+ with stricter test expansion, and MultiPL-E spanning 18 languages, all scored with pass@k at k=1, 10, and 100. Workflows cover accelerate launch commands, multi-language evaluation, instruction-tuned model runs, and head-to-head model comparisons with configurable temperature, n_samples, and max_length_generation. Developers reach for evaluating-code-models when they need reproducible functional-correctness numbers comparable to HuggingFace leaderboards instead of anecdotal code samples, including optional Docker-isolated code execution for untrusted model output.

Documents HumanEval (164 problems) and HumanEval+ with pass@k and recommended temperature/n_samples settings
Covers code-generation benchmarks that execute generated code against unit tests via --allow_code_execution
Includes accelerate launch CLI patterns for batch_size, n_samples, and max_length_generation tuning
Maps dataset IDs on HuggingFace (e.g. openai_humaneval, evalplus/humanevalplus) to harness task names
Oriented to functional correctness metrics, not subjective chat quality

Evaluating Code Models by the numbers

392 all-time installs (skills.sh)
+35 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #515 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill evaluating-code-models


Installs	392
repo stars	★ 11.2k
Security audit	1 / 3 scanners passed
Last updated	June 16, 2026
Repository	orchestra-research/ai-research-skills ↗

How do you benchmark code LLMs with HumanEval pass@k?

Run BigCode Evaluation Harness benchmarks (HumanEval, HumanEval+) and compare code LLMs on pass@k before you commit to an agent or codegen stack.

Who is it for?

ML engineers and agent builders comparing code-generation models with industry-standard HumanEval, MBPP, and MultiPL-E benchmarks before production adoption.

Skip if: Teams needing general text LLM benchmarks without code execution, where lm-evaluation-harness text tasks are the better fit.

When should I use this skill?

Trigger evaluating-code-models when selecting a code LLM for an agent, fine-tuning a codegen model, or preparing leaderboard-comparable benchmark numbers.

What you get

pass@k benchmark scores, per-task generation logs, multi-benchmark comparison tables, and reproducible evaluation configs for HumanEval and MBPP suites.

pass@k score tables
model generation logs
benchmark comparison report

By the numbers

Covers 15+ BigCode Evaluation Harness code benchmarks
Includes HumanEval with 164 Python programming problems
Supports MultiPL-E evaluation across 18 programming languages


BigCode Evaluation Harness - Code Model Benchmarking

Quick Start

BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).

Installation:

git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config

Evaluate on HumanEval:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations

View available tasks:

python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"

Common Workflows

Workflow 1: Standard Code Benchmark Evaluation

Evaluate model on core code benchmarks (HumanEval, MBPP, HumanEval+).

Checklist:

Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results

Step 1: Choose benchmark suite

Python code generation (most common):

HumanEval: 164 handwritten problems, function completion
HumanEval+: Same 164 problems with 80× more tests (stricter)
MBPP: 500 crowd-sourced problems, entry-level difficulty
MBPP+: 399 curated problems with 35× more tests

Multi-language (18 languages):

MultiPL-E: HumanEval/MBPP translated to C++, Java, JavaScript, Go, Rust, etc.

Advanced:

APPS: 10,000 problems (introductory/interview/competition)
DS-1000: 1,000 data science problems across 7 libraries

Step 2: Configure model and generation

# Standard HuggingFace model
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution

# Quantized model (4-bit)
accelerate launch main.py \
  --model codellama/CodeLlama-34b-hf \
  --tasks humaneval \
  --load_in_4bit \
  --max_length_generation 512 \
  --allow_code_execution

# Custom/private model
accelerate launch main.py \
  --model /path/to/my-code-model \
  --tasks humaneval \
  --trust_remote_code \
  --use_auth_token \
  --allow_code_execution

Step 3: Run evaluation

# Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.8 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path results/starcoder2-humaneval.json

Step 4: Analyze results

Results in results/starcoder2-humaneval.json:

{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}
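
To read these numbers programmatically, a short script is enough. The following is a minimal sketch, assuming the metrics file has the layout shown above (task name mapping to pass@k scores); the helper name summarize is hypothetical and not part of the harness.

```python
import json


def summarize(metrics: dict, task: str) -> list[str]:
    """Return one 'pass@k: xx.x%' line per score, ordered by k."""
    scores = metrics[task]
    # Sort numerically by k so pass@10 does not land before pass@2, etc.
    ordered = sorted(scores.items(), key=lambda kv: int(kv[0].split("@")[1]))
    return [f"{name}: {value:.1%}" for name, value in ordered]


# Values copied from the example above. In practice, load the file instead:
#   with open("results/starcoder2-humaneval.json") as f: metrics = json.load(f)
example = json.loads(
    '{"humaneval": {"pass@1": 0.354, "pass@10": 0.521, "pass@100": 0.689}}'
)

for line in summarize(example, "humaneval"):
    print(line)
```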

Workflow 2: Multi-Language Evaluation (MultiPL-E)

Evaluate code generation across 18 programming languages.

Checklist:

Multi-Language Evaluation:
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages

Step 1: Generate solutions on host

# Generate without execution (safe)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --max_length_generation 650 \
  --temperature 0.8 \
  --n_samples 50 \
  --batch_size 50 \
  --generation_only \
  --save_generations \
  --save_generations_path generations_multi.json

Step 2: Evaluate in Docker container

# Pull the MultiPL-E Docker image
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

# Run evaluation inside container
# Reference the image by the same name it was pulled under
docker run -v "$(pwd)/generations_multi.json:/app/generations.json:ro" \
  -it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --load_generations_path /app/generations.json \
  --allow_code_execution \
  --n_samples 50

Supported languages: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
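
The checklist's third step, comparing across languages, can be done with a few lines of Python once the metrics are available. This is a sketch under assumptions: it presumes the metrics are keyed by task name (multiple-py, multiple-js, ...) with a "pass@1" entry, as in the Workflow 1 result file. The function name rank_languages is hypothetical and the scores below are placeholders, not measured results.

```python
def rank_languages(metrics: dict, prefix: str = "multiple-") -> list[tuple[str, float]]:
    """Return (language, pass@1) pairs sorted from best to worst."""
    rows = [
        (task[len(prefix):], scores["pass@1"])
        for task, scores in metrics.items()
        # Skip non-task entries such as "config"
        if task.startswith(prefix) and "pass@1" in scores
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder values for illustration only (assumption, not real results)
demo = {
    "multiple-py": {"pass@1": 0.30},
    "multiple-js": {"pass@1": 0.25},
    "config": {"model": "example-model"},
}

for lang, score in rank_languages(demo):
    print(f"{lang}: {score:.1%}")
```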

Workflow 3: Instruction-Tuned Model Evaluation

Evaluate chat/instruction models with proper formatting.

Checklist:

Instruction Model Evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation

Step 1: Choose instruction tasks

instruct-humaneval: HumanEval with instruction prompts
humanevalsynthesize-{lang}: HumanEvalPack synthesis tasks

Step 2: Configure instruction tokens

# For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks instruct-humaneval \
  --instruction_tokens "<s>[INST],</s>,[/INST]" \
  --max_length_generation 512 \
  --allow_code_execution

Step 3: HumanEvalPack for instruction models

# Test code synthesis (HumanEvalPack covers 6 languages; Python and JavaScript shown here)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks humanevalsynthesize-python,humanevalsynthesize-js \
  --prompt instruct \
  --max_length_generation 512 \
  --allow_code_execution

Workflow 4: Compare Multiple Models

Benchmark suite for model comparison.

Step 1: Create evaluation script

#!/bin/bash
# eval_models.sh

MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

mkdir -p results  # make sure the output directory exists

for model in "${MODELS[@]}"; do
  # Replace "/" so the model ID can be used as a file name
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"

  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done

Step 2: Generate comparison table

import json
import pandas as pd

models = ["bigcode-starcoder2-7b", "codellama-CodeLlama-7b-hf", "deepseek-ai-deepseek-coder-6.7b-base"]
results = []

for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
        results.append({
            "Model": model,
            "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
            "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}"
        })

df = pd.DataFrame(results)
# to_markdown() needs the optional "tabulate" package (pip install tabulate)
print(df.to_markdown(index=False))

When to Use vs Alternatives

Use BigCode Evaluation Harness when:

Evaluating code generation models specifically
Need multi-language evaluation (18 languages via MultiPL-E)
Testing functional correctness with unit tests (pass@k)
Benchmarking for BigCode/HuggingFace leaderboards
Evaluating fill-in-the-middle (FIM) capabilities

Use alternatives instead:

lm-evaluation-harness: General LLM benchmarks (MMLU, GSM8K, HellaSwag)
EvalPlus: Stricter HumanEval+/MBPP+ with more test cases
SWE-bench: Real-world GitHub issue resolution
LiveCodeBench: Contamination-free, continuously updated problems
CodeXGLUE: Code understanding tasks (clone detection, defect prediction)

Supported Benchmarks

Benchmark	Problems	Languages	Metric	Use Case
HumanEval	164	Python	pass@k	Standard code completion
HumanEval+	164	Python	pass@k	Stricter evaluation (80× tests)
MBPP	500	Python	pass@k	Entry-level problems
MBPP+	399	Python	pass@k	Stricter evaluation (35× tests)
MultiPL-E	164×18	18 languages	pass@k	Multi-language evaluation
APPS	10,000	Python	pass@k	Competition-level
DS-1000	1,000	Python	pass@k	Data science (pandas, numpy, etc.)
HumanEvalPack	164×3×6	6 languages	pass@k	Synthesis/fix/explain
Mercury	1,889	Python	Efficiency	Computational efficiency

Common Issues

Issue: Different results than reported in papers

Check these factors:

# 1. Verify n_samples (papers commonly sample 200 per problem for pass@k)
--n_samples 200

# 2. Check temperature (0.2 is typical for pass@1, 0.8 for pass@10/pass@100)
--temperature 0.8

# 3. Verify task name matches exactly
--tasks humaneval  # Not "human_eval" or "HumanEval"

# 4. Check max_length_generation
--max_length_generation 512  # Increase for longer problems

Issue: CUDA out of memory

# Use quantization
--load_in_8bit
# OR
--load_in_4bit

# Reduce batch size
--batch_size 1

# Set memory limit
--max_memory_per_gpu "20GiB"

Issue: Code execution hangs or times out

Use Docker for safe execution:

# Generate on host (no execution)
--generation_only --save_generations

# Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...

Issue: Low scores on instruction models

Ensure proper instruction formatting:

# Use instruction-specific tasks
--tasks instruct-humaneval

# Set instruction tokens for your model
--instruction_tokens "<s>[INST],</s>,[/INST]"

Issue: MultiPL-E language failures

Use the dedicated Docker image:

docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

Command Reference

Argument	Default	Description
`--model`	-	HuggingFace model ID or local path
`--tasks`	-	Comma-separated task names
`--n_samples`	1	Samples per problem (200 for pass@k)
`--temperature`	0.2	Sampling temperature
`--max_length_generation`	512	Max tokens (prompt + generation)
`--batch_size`	1	Batch size per GPU
`--allow_code_execution`	False	Enable code execution (required for tasks that run unit tests)
`--generation_only`	False	Generate without evaluation
`--load_generations_path`	-	Load pre-generated solutions
`--save_generations`	False	Save generated code
`--metric_output_path`	evaluation_results.json	Output file for metrics
`--load_in_8bit`	False	8-bit quantization
`--load_in_4bit`	False	4-bit quantization
`--trust_remote_code`	False	Allow custom model code
`--precision`	fp32	Model precision (fp32/fp16/bf16)

Hardware Requirements

Model Size	VRAM (fp16)	VRAM (4-bit)	Time (HumanEval, n=200)
7B	14GB	6GB	~30 min (A100)
13B	26GB	10GB	~1 hour (A100)
34B	68GB	20GB	~2 hours (A100)
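
The fp16 column appears to follow a simple rule of thumb: number of parameters times two bytes. The sketch below reproduces that arithmetic. It counts weights only (with 1 GB taken as 10^9 bytes), so it ignores activations, the KV cache, and quantization overhead, which is presumably why the 4-bit figures in the table are higher than the raw weight size. The function name is hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone: parameters x bytes per parameter."""
    return params_billions * bits_per_param / 8


for size in (7, 13, 34):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B: fp16 weights ~{fp16:.1f} GB, 4-bit weights ~{int4:.1f} GB")
```

Treat the result as a lower bound when choosing a GPU; actual usage also depends on batch_size and max_length_generation.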

Resources

GitHub: https://github.com/bigcode-project/bigcode-evaluation-harness
Documentation: https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs
BigCode Leaderboard: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
HumanEval Dataset: https://huggingface.co/datasets/openai/openai_humaneval
MultiPL-E: https://github.com/nuprl/MultiPL-E

BigCode Evaluation Harness - Benchmark Guide

Comprehensive guide to all benchmarks supported by BigCode Evaluation Harness.

Code Generation with Unit Tests

These benchmarks test functional correctness by executing generated code against unit tests.

HumanEval

Overview: 164 handwritten Python programming problems created by OpenAI.

Dataset: openai_humaneval on HuggingFace
Metric: pass@k (k=1, 10, 100)
Problems: Function completion with docstrings

Example problem structure:

from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers closer to each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution

Recommended settings:

temperature: 0.2 for pass@1 (near-greedy), 0.8 for pass@10 and pass@100 with large n_samples
n_samples: 200 for accurate pass@k estimation
max_length_generation: 512 (sufficient for most problems)
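
These settings determine how much generation work a run involves. The sketch below estimates it for planning purposes. The total number of completions is simply problems times n_samples. The batch count additionally assumes that each batch holds samples for a single problem, which is an assumption about the harness's batching rather than something stated here. The function name is hypothetical.

```python
import math


def generation_workload(num_problems: int, n_samples: int, batch_size: int) -> tuple[int, int]:
    """Return (total completions, number of batches) for one benchmark run."""
    total = num_problems * n_samples
    # Assumption: a batch contains samples for one problem only
    batches = num_problems * math.ceil(n_samples / batch_size)
    return total, batches


total, batches = generation_workload(num_problems=164, n_samples=200, batch_size=50)
print(f"{total} completions in {batches} batches")
```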

HumanEval+

Overview: Extended HumanEval with 80× more test cases per problem.

Dataset: evalplus/humanevalplus on HuggingFace
Why use it: Catches solutions that pass original tests but fail on edge cases

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humanevalplus \
  --temperature 0.2 \
  --n_samples 200 \
  --allow_code_execution

Note: Execution takes longer due to additional tests. Timeout may need adjustment.

MBPP (Mostly Basic Python Problems)

Overview: 974 crowd-sourced Python problems (often rounded to 1,000) designed for entry-level programmers.

Dataset: mbpp on HuggingFace
Test split: 500 problems (task IDs 11-510)
Metric: pass@k

Problem structure:

Task description in English
3 automated test cases per problem
Code solution (ground truth)

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks mbpp \
  --temperature 0.2 \
  --n_samples 200 \
  --allow_code_execution

MBPP+

Overview: 399 curated MBPP problems with 35× more test cases.

Dataset: evalplus/mbppplus on HuggingFace

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks mbppplus \
  --allow_code_execution

MultiPL-E (18 Languages)

Overview: HumanEval and MBPP translated to 18 programming languages.

Languages: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket

Task naming: multiple-{lang} where lang is file extension:

multiple-py (Python)
multiple-js (JavaScript)
multiple-java (Java)
multiple-cpp (C++)
multiple-go (Go)
multiple-rs (Rust)
multiple-ts (TypeScript)
multiple-cs (C#)
multiple-php (PHP)
multiple-rb (Ruby)
multiple-swift (Swift)
multiple-kt (Kotlin)
multiple-scala (Scala)
multiple-pl (Perl)
multiple-jl (Julia)
multiple-lua (Lua)
multiple-r (R)
multiple-rkt (Racket)

Usage with Docker (recommended for safe execution):

# Step 1: Generate on host
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-js,multiple-java,multiple-cpp \
  --generation_only \
  --save_generations \
  --save_generations_path generations.json

# Step 2: Evaluate in Docker
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
# Reference the image by the same name it was pulled under
docker run -v "$(pwd)/generations.json:/app/generations.json:ro" \
  -it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
  --tasks multiple-js,multiple-java,multiple-cpp \
  --load_generations_path /app/generations.json \
  --allow_code_execution

APPS

Overview: 10,000 Python problems across three difficulty levels.

Difficulty levels:

Introductory: Basic programming
Interview: Technical interview level
Competition: Competitive programming

Tasks:

apps-introductory
apps-interview
apps-competition

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks apps-introductory \
  --max_length_generation 1024 \
  --allow_code_execution

DS-1000

Overview: 1,000 data science problems across 7 Python libraries.

Libraries: NumPy, Pandas, SciPy, Scikit-learn, PyTorch, TensorFlow, Matplotlib

Requirements:

Python 3.7.10 specifically
pip install -e ".[ds1000]"
PyTorch 1.12.1

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks ds1000-all-completion \
  --allow_code_execution

Mercury

Overview: 1,889 tasks for evaluating computational efficiency of generated code.

Requirements: pip install lctk sortedcontainers

Metric: Beyond@k (efficiency-based)

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks mercury \
  --allow_code_execution

Code Generation Without Unit Tests

These benchmarks use text-based metrics (BLEU, Exact Match).

SantaCoder-FIM (Fill-in-the-Middle)

Overview: 4,792 fill-in-the-middle tasks for Python, JavaScript, Java.

Metric: Exact Match
Use case: Evaluating FIM/infilling capabilities

Tasks:

santacoder_fim
starcoder_fim

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks santacoder_fim \
  --n_samples 1 \
  --batch_size 1

CoNaLa

Overview: Natural language to Python code generation.

Metric: BLEU score
Setting: Two-shot

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks conala \
  --do_sample False \
  --n_samples 1

Concode

Overview: Natural language to Java code generation.

Metric: BLEU score

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks concode \
  --do_sample False \
  --n_samples 1

Instruction-Tuned Model Evaluation

InstructHumanEval

Overview: HumanEval reformatted for instruction-following models.

Usage:

accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks instruct-humaneval \
  --instruction_tokens "<s>[INST],</s>,[/INST]" \
  --allow_code_execution

HumanEvalPack

Overview: Extends HumanEval to 3 scenarios across 6 languages.

Scenarios:

Synthesize: Generate code from docstring
Fix: Fix buggy code
Explain: Generate docstring from code

Languages: Python, JavaScript, Java, Go, C++, Rust

Tasks:

humanevalsynthesize-{lang}
humanevalfix-{lang}
humanevalexplain-{lang}

Usage:

accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks humanevalsynthesize-python,humanevalfix-python \
  --prompt instruct \
  --allow_code_execution

Math and Reasoning

PAL (Program-Aided Language Models)

Overview: Solve math problems by generating Python code.

Datasets: GSM8K, GSM-HARD

Tasks:

pal-gsm8k-greedy: Greedy decoding
pal-gsm8k-majority_voting: k=40 majority voting
pal-gsmhard-greedy
pal-gsmhard-majority_voting
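
Majority voting means sampling several programs per problem, executing each, and keeping the most frequent answer. The sketch below illustrates only the voting step. It is not the harness's implementation, and the function name and the convention of using None for a failed execution are assumptions.

```python
from collections import Counter
from typing import Optional


def majority_vote(answers: list[Optional[float]]) -> Optional[float]:
    """Return the most common answer, ignoring failed runs (None)."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None  # every sampled program failed to produce an answer
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first
    return Counter(valid).most_common(1)[0][0]


# Five sampled programs: three agree on 42, one disagrees, one crashed
print(majority_vote([42, 41, 42, None, 42]))
```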

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks pal-gsm8k-greedy \
  --max_length_generation 2048 \
  --do_sample False \
  --allow_code_execution

Note: Requires max_length_generation >= 2048 due to 8-shot prompts (~1500 tokens).
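
Because max_length_generation counts the prompt as well as the completion, the room left for generated code is the difference between the two. A small sketch of that arithmetic, using the approximate prompt size quoted above (the function name is hypothetical):

```python
def completion_budget(max_length_generation: int, prompt_tokens: int) -> int:
    """Tokens left for the model's output once the prompt is accounted for."""
    return max(0, max_length_generation - prompt_tokens)


# ~1500-token 8-shot prompt, as noted above
print(completion_budget(2048, 1500))  # room left with the recommended setting
print(completion_budget(512, 1500))   # the default leaves nothing for the answer
```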

Documentation Generation

CodeXGLUE Code-to-Text

Overview: Generate documentation from code.

Languages: Python, Go, Ruby, Java, JavaScript, PHP

Tasks: codexglue_code_to_text-{lang}

Usage:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks codexglue_code_to_text-python \
  --do_sample False \
  --n_samples 1 \
  --batch_size 1

Classification Tasks

Java Complexity Prediction

Task: java-complexity

Code Equivalence Detection

Task: java-clone-detection

C Defect Prediction

Task: c-defect-detection

Benchmark Selection Guide

Goal	Recommended Benchmarks
Quick sanity check	HumanEval (n_samples=20)
Standard evaluation	HumanEval + MBPP
Rigorous evaluation	HumanEval+ + MBPP+
Multi-language	MultiPL-E
Instruction models	InstructHumanEval, HumanEvalPack
FIM/Infilling	SantaCoder-FIM, StarCoder-FIM
Data science	DS-1000
Competition-level	APPS
Efficiency	Mercury
Math reasoning	PAL-GSM8K

pass@k Calculation

pass@k estimates probability that at least one of k samples passes all tests:

pass@k = E[1 - C(n-c, k) / C(n, k)]

Where:

n = total samples generated
c = samples that pass all tests
k = number of samples allowed

Recommended n_samples by k:

pass@1: n >= 20
pass@10: n >= 100
pass@100: n >= 200

Temperature recommendations:

pass@1: temperature = 0.2 (near-greedy)
pass@10, pass@100: temperature = 0.8 (more diverse sampling)
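
The estimator above can be computed directly from per-problem counts. The following is a minimal sketch using only the standard library. It follows the formula given in this section, with the expectation taken as the mean over problems. The function names are hypothetical and this is not the harness's own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over the benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# One problem, 20 samples, 5 of which pass all tests
print(pass_at_k(n=20, c=5, k=1))
print(pass_at_k(n=20, c=5, k=10))
```

For k=1 the estimate reduces to c/n, which is why a modest n is enough for pass@1 while larger k needs many more samples per problem.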

Creating Custom Tasks in BigCode Evaluation Harness

Guide to implementing custom evaluation tasks for code generation models.

Task Architecture

All tasks inherit from a base Task class and implement standard methods:

class Task:
    DATASET_PATH: str  # HuggingFace dataset ID
    DATASET_NAME: str  # Dataset configuration (or None)

    def __init__(self, stop_words, requires_execution):
        """Initialize task with stop words and execution flag."""

    def get_dataset(self):
        """Return the evaluation dataset."""

    def get_prompt(self, doc):
        """Format document into model prompt."""

    def get_reference(self, doc):
        """Extract reference solution from document."""

    def postprocess_generation(self, generation, idx):
        """Clean up model output."""

    def process_results(self, generations, references):
        """Evaluate and return metrics."""

Step-by-Step Implementation

Step 1: Create Task File

Copy template to bigcode_eval/tasks/<task_name>.py:

"""
<Paper Title>
<Paper URL>

<Task Description>

Homepage: <Homepage URL>
"""

import json
from evaluate import load
from bigcode_eval.base import Task

class MyCustomTask(Task):
    """Custom code evaluation task."""

    DATASET_PATH = "username/dataset-name"  # HuggingFace dataset
    DATASET_NAME = None  # or specific config name

    def __init__(self):
        super().__init__(
            stop_words=["\nclass", "\ndef", "\n#", "\nif", "\nprint"],
            requires_execution=True,  # Set True if running unit tests
        )

    def get_dataset(self):
        """Load evaluation split."""
        from datasets import load_dataset
        return load_dataset(
            self.DATASET_PATH,
            self.DATASET_NAME,
            split="test"
        )

    def get_prompt(self, doc):
        """Format problem into prompt for model."""
        return doc["prompt"]

    def get_reference(self, doc):
        """Return test cases or reference solution."""
        return doc["test"]

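    def postprocess_generation(self, generation, idx):
        """Clean model output (remove extra text after solution)."""
        # The harness typically passes prompt + completion; split off the prompt
        # first so a stop word inside the prompt does not truncate everything.
        prompt = self.get_prompt(self.get_dataset()[idx])
        has_prompt = generation.startswith(prompt)
        completion = generation[len(prompt):] if has_prompt else generation
        # Stop at the first occurrence of any stop word
        for stop_word in self.stop_words:
            if stop_word in completion:
                completion = completion[:completion.index(stop_word)]
        return prompt + completion if has_prompt else completion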

    def process_results(self, generations, references):
        """Execute tests and compute pass@k."""
        code_metric = load("code_eval")
        results, _ = code_metric.compute(
            references=references,
            predictions=generations,
            k=[1, 10, 100]
        )
        return results

Step 2: Register Task

Add to bigcode_eval/tasks/__init__.py:

from bigcode_eval.tasks import my_custom_task

TASK_REGISTRY = {
    # ... existing tasks ...
    "my-custom-task": my_custom_task.MyCustomTask,
}

Step 3: Test Task

# Verify task loads correctly
python -c "from bigcode_eval.tasks import get_task; t = get_task('my-custom-task'); print(t)"

# Run small evaluation
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks my-custom-task \
  --limit 5 \
  --allow_code_execution

Implementation Patterns

Pattern 1: Code Execution with Unit Tests

For benchmarks that verify functional correctness:

class CodeExecutionTask(Task):
    def __init__(self):
        super().__init__(
            stop_words=["\nclass", "\ndef", "\n#"],
            requires_execution=True,  # CRITICAL: Enable execution
        )

    def get_reference(self, doc):
        """Return test code to execute."""
        return f"\n{doc['test']}\ncheck({doc['entry_point']})"

    def process_results(self, generations, references):
        code_metric = load("code_eval")
        results, details = code_metric.compute(
            references=references,
            predictions=generations,
            k=[1, 10, 100],
            timeout=10.0,  # Seconds per test
        )
        return results

Pattern 2: BLEU Score Evaluation

For benchmarks without executable tests:

class BLEUTask(Task):
    def __init__(self):
        super().__init__(
            stop_words=["\n\n"],
            requires_execution=False,  # No code execution
        )

    def get_reference(self, doc):
        """Return reference code string."""
        return doc["canonical_solution"]

    def process_results(self, generations, references):
        from evaluate import load
        bleu = load("bleu")

        # Flatten generations (one per problem for BLEU)
        predictions = [g[0] for g in generations]

        results = bleu.compute(
            predictions=predictions,
            references=[[r] for r in references]
        )
        return {"bleu": results["bleu"]}

Pattern 3: Few-Shot Prompting

For tasks requiring in-context examples:

class FewShotTask(Task):
    def __init__(self):
        super().__init__(stop_words=["\n\n"], requires_execution=True)
        self.examples = self._load_examples()

    def _load_examples(self):
        """Load few-shot examples from JSON."""
        import os
        path = os.path.join(
            os.path.dirname(__file__),
            "few_shot_examples",
            "my_task_examples.json"
        )
        with open(path) as f:
            return json.load(f)

    def get_prompt(self, doc):
        """Build few-shot prompt."""
        prompt = ""
        for ex in self.examples[:3]:  # 3-shot
            prompt += f"Problem: {ex['problem']}\nSolution: {ex['solution']}\n\n"
        prompt += f"Problem: {doc['problem']}\nSolution:"
        return prompt

Pattern 4: Fill-in-the-Middle (FIM)

For infilling tasks:

class FIMTask(Task):
    FIM_PREFIX = "<fim_prefix>"
    FIM_MIDDLE = "<fim_middle>"
    FIM_SUFFIX = "<fim_suffix>"

    def __init__(self):
        super().__init__(
            stop_words=["<|endoftext|>", self.FIM_MIDDLE],
            requires_execution=False,
        )

    def get_prompt(self, doc):
        """Format as FIM prompt."""
        prefix = doc["prefix"]
        suffix = doc["suffix"]
        return f"{self.FIM_PREFIX}{prefix}{self.FIM_SUFFIX}{suffix}{self.FIM_MIDDLE}"

    def postprocess_generation(self, generation, idx):
        """Extract middle portion."""
        if self.FIM_MIDDLE in generation:
            generation = generation.split(self.FIM_MIDDLE)[0]
        return generation.strip()

Pattern 5: Instruction-Tuned Models

For chat/instruction models:

class InstructTask(Task):
    def __init__(self):
        super().__init__(
            stop_words=["</s>", "[/INST]", "```\n"],
            requires_execution=True,
        )

    def get_prompt(self, doc):
        """Format as instruction prompt."""
        instruction = f"""Write a Python function that {doc['description']}.

Function signature: {doc['signature']}

Examples:
{doc['examples']}

Write only the function implementation:"""
        return instruction

Dataset Format Requirements

For HuggingFace Datasets

Your dataset should include:

{
    "prompt": "def function_name(args):\n    '''Docstring'''",
    "canonical_solution": "    return result",
    "test": "assert function_name(input) == expected",
    "entry_point": "function_name"
}
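
Before uploading a dataset, it can help to confirm that each record's fields actually fit together: prompt plus a completion plus the test code should form a runnable program. The sketch below checks this for one hypothetical record by executing the assembled program. The helper names are assumptions, and exec should only be used this way on trusted reference solutions, never on model output.

```python
def build_program(record: dict, completion: str) -> str:
    """Concatenate prompt, completion, and test code into one script."""
    return f"{record['prompt']}\n{completion}\n\n{record['test']}\n"


def passes(record: dict, completion: str) -> bool:
    """Run the assembled program; any exception (including a failed assert) counts as failure."""
    namespace: dict = {}
    try:
        exec(build_program(record, completion), namespace)
    except Exception:
        return False
    # The entry point should be defined once the program has run
    return record["entry_point"] in namespace


# Hypothetical record following the schema above
record = {
    "prompt": "def add(a, b):\n    '''Return the sum of a and b.'''",
    "canonical_solution": "    return a + b",
    "test": "assert add(2, 3) == 5",
    "entry_point": "add",
}

print(passes(record, record["canonical_solution"]))  # reference solution
print(passes(record, "    return a - b"))            # deliberately wrong
```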

Creating Dataset Factories

For tasks with multiple configurations:

def create_all_tasks():
    """Create task variants for all languages."""
    tasks = {}
    for lang in ["python", "javascript", "java", "cpp"]:
        tasks[f"my-task-{lang}"] = create_task_class(lang)
    return tasks

def create_task_class(language):
    class LanguageTask(Task):
        DATASET_PATH = "username/dataset"
        DATASET_NAME = language
        # ... implementation
    return LanguageTask

# In __init__.py:
TASK_REGISTRY = {
    **my_module.create_all_tasks(),
}

Testing Your Task

Unit Tests

Create tests/test_my_task.py:

import pytest
from bigcode_eval.tasks import get_task

def test_task_loads():
    task = get_task("my-custom-task")
    assert task is not None

def test_dataset_loads():
    task = get_task("my-custom-task")
    dataset = task.get_dataset()
    assert len(dataset) > 0

def test_prompt_format():
    task = get_task("my-custom-task")
    dataset = task.get_dataset()
    prompt = task.get_prompt(dataset[0])
    assert isinstance(prompt, str)
    assert len(prompt) > 0

def test_postprocess():
    task = get_task("my-custom-task")
    raw = "def foo():\n    return 1\n\nclass Bar:"
    processed = task.postprocess_generation(raw, 0)
    assert "class Bar" not in processed

Run tests:

pytest tests/test_my_task.py -v

Integration Test

# Small-scale evaluation
accelerate launch main.py \
  --model bigcode/santacoder \
  --tasks my-custom-task \
  --limit 10 \
  --n_samples 5 \
  --allow_code_execution \
  --save_generations

Common Pitfalls

1. Missing `requires_execution=True`

If your task uses unit tests, you MUST set:

super().__init__(stop_words=["\nclass", "\ndef"], requires_execution=True)

2. Incorrect Stop Words

Stop words should match your programming language:

# Python
stop_words=["\nclass", "\ndef", "\n#", "\nif __name__"]

# JavaScript
stop_words=["\nfunction", "\nconst", "\nlet", "\n//"]

# Java
stop_words=["\npublic", "\nprivate", "\nclass", "\n//"]

3. Not Handling Edge Cases in Postprocessing

def postprocess_generation(self, generation, idx):
    # Handle empty generation
    if not generation or not generation.strip():
        return ""

    # Handle multiple stop words
    for sw in self.stop_words:
        if sw in generation:
            generation = generation[:generation.index(sw)]

    # Remove trailing whitespace
    return generation.rstrip()

4. Timeout Issues

For complex tests, increase timeout:

results, _ = code_metric.compute(
    references=references,
    predictions=generations,
    timeout=30.0,  # Increase from default
)
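To see why the timeout value matters, the sketch below runs a candidate program in a separate Python process and classifies the outcome. It is an illustration only: `run_with_timeout` is a hypothetical helper, this is not how the harness executes code, and a plain subprocess is not a sandbox for untrusted model output.

```python
import subprocess
import sys


def run_with_timeout(program: str, timeout: float) -> str:
    """Run `program` in a fresh Python process; return 'passed', 'failed', or 'timed out'."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,  # the child is killed once this many seconds elapse
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    return "passed" if completed.returncode == 0 else "failed"


print(run_with_timeout("assert 1 + 1 == 2", timeout=5.0))
print(run_with_timeout("assert 1 + 1 == 3", timeout=5.0))
print(run_with_timeout("while True: pass", timeout=1.0))
```

A correct but slow solution is indistinguishable from an infinite loop once the limit is hit, so a timeout that is too short counts valid generations as failures.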

Contributing Your Task

1. Fork the repository
2. Create a feature branch
3. Implement the task following the patterns above
4. Add tests
5. Update documentation
6. Submit a PR with:

Task description
Example usage
Expected results range

Common Issues and Troubleshooting

Solutions to frequently encountered problems with BigCode Evaluation Harness.

Installation Issues

Issue: PyTorch Version Conflicts

Symptom: Import errors or CUDA incompatibility after installation.

Solution: Install PyTorch separately BEFORE installing the harness:

# Check your CUDA version
nvidia-smi

# Install matching PyTorch (example for CUDA 11.8)
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Then install harness
pip install -e .

Issue: DS-1000 Specific Requirements

Symptom: Errors when running DS-1000 benchmark.

Solution: DS-1000 requires Python 3.7.10 specifically:

# Create conda environment
conda create -n ds1000 python=3.7.10
conda activate ds1000

# Install specific dependencies
pip install -e ".[ds1000]"
pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

# Set environment variables
export TF_CPP_MIN_LOG_LEVEL=3
export TF_FORCE_GPU_ALLOW_GROWTH=true

Issue: HuggingFace Authentication

Symptom: 401 Unauthorized when accessing gated models/datasets.

Solution:

# Login to HuggingFace
huggingface-cli login

# Use auth token in command
accelerate launch main.py \
  --model meta-llama/CodeLlama-7b-hf \
  --use_auth_token \
  ...

Memory Issues

Issue: CUDA Out of Memory

Symptom: torch.cuda.OutOfMemoryError: CUDA out of memory

Solutions:

1. Use quantization:

# 8-bit quantization (roughly halves weight memory compared with fp16)
accelerate launch main.py \
  --model bigcode/starcoder2-15b \
  --load_in_8bit \
  ...

# 4-bit quantization (roughly a quarter of the fp16 weight memory)
accelerate launch main.py \
  --model bigcode/starcoder2-15b \
  --load_in_4bit \
  ...

2. Reduce batch size:

--batch_size 1

3. Set memory limits:

--max_memory_per_gpu "20GiB"
# OR
--max_memory_per_gpu auto

4. Use half precision:

--precision fp16
# OR
--precision bf16
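A rough back-of-the-envelope estimate helps decide which of these options is needed. The sketch below computes memory for the weights alone, assuming a 15B-parameter model; it ignores activations, the KV cache, and quantization overhead, so real usage will be higher. `weight_memory_gb` is a hypothetical helper, not a harness function.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_15B = 15e9  # assumed parameter count for a 15B model
for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(PARAMS_15B, bits):.1f} GB")
```

Under these assumptions the weights take about 30 GB in half precision, 15 GB in 8-bit, and 7.5 GB in 4-bit, which is where the savings quoted for quantization (relative to 16-bit weights) come from.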

Issue: Running Out of RAM During Evaluation

Symptom: Process killed, system becomes unresponsive.

Solution: Reduce number of samples being held in memory:

# Save intermediate results
--save_every_k_tasks 10

# Evaluate subset at a time
--limit 50 --limit_start 0
# Then
--limit 50 --limit_start 50
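The chunked runs above can be planned with a small helper. This sketch prints the `--limit` / `--limit_start` pair for each chunk of a 164-problem benchmark such as HumanEval; `limit_windows` is a hypothetical name introduced for illustration.

```python
from typing import Iterator, Tuple


def limit_windows(total_problems: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (limit_start, limit) pairs that together cover every problem exactly once."""
    for start in range(0, total_problems, chunk_size):
        yield start, min(chunk_size, total_problems - start)


# HumanEval has 164 problems; evaluate them 50 at a time
for limit_start, limit in limit_windows(164, 50):
    print(f"--limit {limit} --limit_start {limit_start}")
```

When running chunks separately, it is probably safest to give each one its own `--save_generations_path` so a later chunk does not overwrite an earlier one.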

Execution Issues

Issue: Code Execution Not Allowed

Symptom: Error about code execution being disabled.

Solution: Add the execution flag:

accelerate launch main.py \
  --model ... \
  --tasks humaneval \
  --allow_code_execution  # Required for unit test benchmarks

Issue: Execution Timeout/Hang

Symptom: Evaluation hangs indefinitely or times out.

Solutions:

1. Use Docker for isolation:

# Generate without execution
accelerate launch main.py \
  --model ... \
  --tasks humaneval \
  --generation_only \
  --save_generations \
  --save_generations_path generations.json

# Evaluate in Docker
docker run -v $(pwd)/generations.json:/app/generations.json:ro \
  -it evaluation-harness python3 main.py \
  --tasks humaneval \
  --load_generations_path /app/generations.json \
  --allow_code_execution

2. Use subsets for debugging:

--limit 10  # Only evaluate first 10 problems

Issue: MultiPL-E Language Runtime Errors

Symptom: Errors executing code in non-Python languages.

Solution: Use the MultiPL-E specific Docker image:

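docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
# Tag the pulled image so the short name used below resolves
docker tag ghcr.io/bigcode-project/evaluation-harness-multiple evaluation-harness-multiple
docker run -it evaluation-harness-multiple ...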

Result Discrepancies

Issue: Results Don't Match Paper/Leaderboard

Symptom: Your pass@k scores differ from reported values.

Common causes and fixes:

1. Wrong n_samples:

# For accurate pass@k estimation, use n_samples >= 200
--n_samples 200

2. Wrong temperature:

# Papers often use different temperatures
# For pass@1: temperature 0.2 (near-greedy)
# For pass@10, pass@100: temperature 0.8 (more sampling)
--temperature 0.8

3. Task name mismatch:

# Use exact task names
--tasks humaneval      # Correct
--tasks human_eval     # Wrong
--tasks HumanEval      # Wrong

4. Prompting differences:

# Some models need instruction formatting
--instruction_tokens "<s>[INST],</s>,[/INST]"

# Or specific prompt types for HumanEvalPack
--prompt instruct

5. Postprocessing differences:

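# Postprocessing is on by default. In harness versions where --postprocess
# is a store_false switch, it takes no True/False value and passing it turns
# postprocessing OFF. Confirm with: python main.py --help
--postprocess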
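Cause 1 above follows directly from how pass@k is estimated. The sketch below implements the unbiased per-problem estimator introduced with HumanEval (Chen et al., 2021), 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number that pass. It is an illustration, not the harness's own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if k > n:
        raise ValueError(f"pass@{k} needs at least {k} samples, got n={n}")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# 200 samples with 20 passing: estimates for k = 1, 10, 100
for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(200, 20, k):.4f}")
```

The estimator is undefined when k exceeds n, which is why a run with `--n_samples 10` cannot report pass@100, and why leaderboard-style numbers typically use 200 samples per problem.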

Issue: Inconsistent Results Across Runs

Symptom: Different scores each time you run.

Solution: For reproducibility:

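# Use greedy decoding for deterministic results
# (every sample is then identical, so only pass@1 is meaningful
# and a single sample per problem is enough)
--do_sample False
--temperature 0.0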

# OR set seeds (if using sampling)
# Note: Sampling inherently has variance
# Use high n_samples to reduce noise
--n_samples 200
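How much noise to expect from sampling can be estimated with a binomial model. The sketch below assumes each sample for a problem passes independently with probability p and reports the standard error of the per-problem pass@1 estimate c/n; `pass_at_1_standard_error` is a hypothetical helper.

```python
import math


def pass_at_1_standard_error(p: float, n_samples: int) -> float:
    """Standard error of c/n for one problem when each sample passes with probability p."""
    return math.sqrt(p * (1.0 - p) / n_samples)


# Worst case for variance is p = 0.5
for n in (1, 10, 200):
    print(f"n_samples={n:>3}: +/- {pass_at_1_standard_error(0.5, n):.3f}")
```

Under this assumption the per-problem error falls from about 0.5 with one sample to about 0.035 with 200. Averaging over all problems in a benchmark reduces the error of the final score further.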

Model Loading Issues

Issue: Model with Custom Code

Symptom: ValueError: ... requires you to execute the configuration file

Solution:

--trust_remote_code

Issue: Private/Gated Model Access

Symptom: 401 Unauthorized or 403 Forbidden

Solution:

# First login
huggingface-cli login

# Then use auth token
--use_auth_token

Issue: PEFT/LoRA Adapter Loading

Symptom: Can't load fine-tuned adapter.

Solution:

--model base-model-name \
--peft_model path/to/adapter

Issue: Seq2Seq Model Not Generating

Symptom: Empty or truncated outputs with encoder-decoder models.

Solution:

--modeltype seq2seq

Task-Specific Issues

Issue: Low MBPP Scores with Instruction Models

Symptom: Instruction-tuned models score poorly on MBPP.

Solution: MBPP prompts are plain text, not instruction format. Consider:

1. Using instruct-humaneval for instruction models
2. Creating custom instruction-formatted prompts

Issue: APPS Taking Too Long

Symptom: APPS evaluation runs for hours.

Solutions:

# Use subset
--limit 100

# Reduce samples
--n_samples 10

# Use introductory level only
--tasks apps-introductory

Issue: GSM8K Wrong max_length

Symptom: Truncated outputs, low scores on math tasks.

Solution: GSM8K needs longer context for 8-shot prompts:

--max_length_generation 2048  # Not default 512
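The arithmetic behind this setting is simple once you note that the length limit covers the prompt plus the generated tokens. The sketch below assumes that behaviour and uses a made-up prompt length of 900 tokens for an 8-shot prompt; measure the real value with your model's tokenizer. `generation_budget` is a hypothetical helper.

```python
def generation_budget(prompt_tokens: int, max_length_generation: int) -> int:
    """Tokens left for the model's answer once the prompt is counted against the limit."""
    return max(0, max_length_generation - prompt_tokens)


# Hypothetical token count for an 8-shot prompt (an assumption, not a measured value)
EIGHT_SHOT_PROMPT_TOKENS = 900
for max_len in (512, 2048):
    budget = generation_budget(EIGHT_SHOT_PROMPT_TOKENS, max_len)
    print(f"max_length_generation={max_len}: {budget} tokens left for the answer")
```

With a prompt longer than the limit, nothing is left for the answer, which shows up as truncated outputs and near-zero scores.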

Docker Issues

Issue: Docker Image Pull Fails

Symptom: Error response from daemon: manifest unknown

Solution: Build locally:

# Clone repo
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness

# Build image
sudo make DOCKERFILE=Dockerfile all

# For MultiPL-E
sudo make DOCKERFILE=Dockerfile-multiple all

Issue: Docker Can't Access GPU

Symptom: No GPU available inside container.

Solution: Pass --gpus all to docker run (this requires the NVIDIA Container Toolkit on the host):

docker run --gpus all -it evaluation-harness ...

Debugging Tips

Enable Verbose Output

# Check what's being generated
--save_generations
--save_references

# Inspect a few samples
--limit 5

Test Reference Solutions

# Verify test cases pass with ground truth
--check_references

Inspect Intermediate Results

# Save progress periodically
--save_every_k_tasks 10
--save_generations_path intermediate_generations.json

Common Debug Workflow

# 1. Test with tiny subset
accelerate launch main.py \
  --model your-model \
  --tasks humaneval \
  --limit 3 \
  --n_samples 1 \
  --save_generations \
  --allow_code_execution

# 2. Inspect generations
# (depending on harness version, the file may be saved as generations_<task>.json)
python -m json.tool generations.json | head -100

# 3. If looks good, scale up
accelerate launch main.py \
  --model your-model \
  --tasks humaneval \
  --n_samples 200 \
  --allow_code_execution
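For step 2, a quick summary is often more useful than reading raw JSON. The sketch below assumes the saved file is a JSON list with one inner list of generated strings per problem; `summarize_generations` is a hypothetical helper, and the demo writes a small temporary file so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def summarize_generations(path: str) -> dict:
    """Count problems, samples per problem, and empty samples in a generations file."""
    generations = json.loads(Path(path).read_text())
    samples_per_problem = {len(samples) for samples in generations}
    empty = sum(1 for samples in generations for s in samples if not s.strip())
    return {
        "problems": len(generations),
        "samples_per_problem": sorted(samples_per_problem),
        "empty_samples": empty,
    }


# Demo on a tiny stand-in file: two problems, two samples each, two of them blank
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "generations.json"
    demo.write_text(json.dumps([["def f():\n    return 1", ""],
                                ["def g():\n    return 2", "   "]]))
    print(summarize_generations(str(demo)))
```

A high count of empty samples usually points to a stop-word or max-length problem rather than a weak model, so it is worth checking before scaling up.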

Getting Help

1. Check existing issues: https://github.com/bigcode-project/bigcode-evaluation-harness/issues
2. Search closed issues, which often contain solutions
3. Open a new issue with:

Full command used
Error message
Environment details (Python version, PyTorch version, GPU)
Model being evaluated


How it compares

Pick evaluating-code-models over lm-evaluation-harness skills when the task is executable code correctness with pass@k, not general language-model perplexity or QA benchmarks.

FAQ

Which benchmarks does evaluating-code-models support?

evaluating-code-models supports 15+ BigCode Evaluation Harness benchmarks including HumanEval with 164 Python problems, MBPP with 500 tasks, HumanEval+ with expanded tests, and MultiPL-E across 18 languages. Metrics use pass@k at k values of 1, 10, and 100.

How do you run a HumanEval benchmark with evaluating-code-models?

evaluating-code-models guides running accelerate launch main.py with --model, --tasks humaneval, and --allow_code_execution flags through the BigCode Evaluation Harness. Configure n_samples near 200, temperature, and max_length_generation for reproducible pass@k scores.

When should developers pick evaluating-code-models?

Developers should pick evaluating-code-models when comparing code-generation models for agents or codegen pipelines and needing leaderboard-comparable pass@k numbers. The skill focuses on executable code benchmarks rather than general text LLM evaluation harnesses.

Is Evaluating Code Models safe to install?

skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
