
Evaluating Code Models
Run BigCode Evaluation Harness benchmarks (HumanEval, HumanEval+, pass@k) to compare code LLMs before you commit an agent or codegen stack.
Overview
evaluating-code-models is an agent skill that guides BigCode Evaluation Harness runs for code LLM benchmarks. It is most often used in the Validate phase and also fits Build (agent-tooling) and Ship (testing).
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill evaluating-code-models
What is this skill?
- Documents HumanEval (164 problems) and HumanEval+ with pass@k and recommended temperature/n_samples settings
- Covers code-generation benchmarks that execute generated code against unit tests via --allow_code_execution
- Includes accelerate launch CLI patterns for batch_size, n_samples, and max_length_generation tuning
- Maps dataset IDs on HuggingFace (e.g. openai_humaneval, evalplus/humanevalplus) to harness task names
- Oriented to functional correctness metrics, not subjective chat quality
- HumanEval covers 164 handwritten Python programming problems
- HumanEval+ adds roughly 80× more test cases per problem than the original suite
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need to pick or trust a code model but only have anecdotes, not comparable pass@k results on standard programming benchmarks.
Who is it for?
Solo builders evaluating open-weights or hosted code models with GPU access and willingness to run accelerated batch jobs.
Skip if: you only need lint or unit tests on your own repo without LLM benchmarking, or you cannot run model-generated code in a sandbox for harness runs.
When should I use this skill?
You need benchmark names, CLI flags, or pass@k methodology for BigCode Evaluation Harness code-model evaluation.
What do I get? / Deliverables
You get runnable harness commands, dataset/task names, and metric interpretation so you can document model choice before integrating codegen into your agent workflow.
- Documented harness command lines with tasks, sampling, and batch settings
- Interpreted pass@k results tied to named benchmarks (e.g. humaneval, humanevalplus)
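When interpreting pass@k numbers, it can help to see how the metric is usually estimated. The sketch below uses the standard unbiased estimator from the original HumanEval paper, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that pass the unit tests. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and are not part of the harness API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With `n_samples` set to 200, the same set of generations supports estimates for k = 1, 10, and 100, since the estimator only requires k <= n.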
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
The canonical shelf is Validate because you prove model quality on standard coding tasks before scaling build and ship workflows. It also fits prototyping, where the goal is empirical model comparison on executable unit-test benchmarks rather than production deployment.
Where it fits
Benchmark two 7B code models on humaneval before paying for a larger hosted endpoint.
Document default model selection in your agent repo with reproducible harness commands.
Re-run HumanEval+ after a model version bump to catch regressions on edge-case tests.
How it compares
Use for standardized LLM codegen benchmarks instead of judging models from a handful of manual chat prompts.
Common Questions / FAQ
Who is evaluating-code-models for?
Indie developers and small teams choosing or validating code LLMs for agents, CLIs, or internal codegen who want harness-backed pass@k numbers.
When should I use evaluating-code-models?
During Validate when prototyping model stacks, during Build when tuning agent-tooling defaults, and during Ship when regression-testing a model upgrade against HumanEval-style suites.
Is evaluating-code-models safe to install?
Review the Security Audits panel on this Prism page. The harness executes model-generated code when --allow_code_execution is enabled, so treat every run as untrusted code and execute it in an isolated environment.
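To illustrate why isolation matters, here is a minimal sketch of running one generated program in a separate process with a timeout. It is an assumption-laden illustration, not the harness's own execution code, and the helper name `run_candidate` is hypothetical. A child process with a timeout is not a security sandbox; a container or VM is still the safer choice for real runs.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, timeout_s: float = 5.0) -> bool:
    """Return True if the code exits with status 0 within the timeout.

    Hypothetical helper: it limits runtime only, not file or network access.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I runs Python in isolated mode (ignores env vars, user site).
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops and very slow solutions count as failures.
            return False
        return proc.returncode == 0
```

A failing assertion in the candidate makes the interpreter exit with a non-zero status, so appending a benchmark's unit tests to the generated function is enough to get a pass or fail signal for that sample.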
SKILL.md - Evaluating Code Models
# BigCode Evaluation Harness - Benchmark Guide

Comprehensive guide to all benchmarks supported by BigCode Evaluation Harness.

## Code Generation with Unit Tests

These benchmarks test functional correctness by executing generated code against unit tests.

### HumanEval

**Overview**: 164 handwritten Python programming problems created by OpenAI.

**Dataset**: `openai_humaneval` on HuggingFace

**Metric**: pass@k (k=1, 10, 100)

**Problems**: Function completion with docstrings

**Example problem structure**:

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
```

**Usage**:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution
```

**Recommended settings**:
- `temperature`: 0.2 for pass@1; 0.8 for pass@k at larger k (pass@10, pass@100)
- `n_samples`: 200 for accurate pass@k estimation
- `max_length_generation`: 512 (sufficient for most problems)

### HumanEval+

**Overview**: Extended HumanEval with 80× more test cases per problem.

**Dataset**: `evalplus/humanevalplus` on HuggingFace

**Why use it**: Catches solutions that pass the original tests but fail on edge cases

**Usage**:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humanevalplus \
  --temperature 0.2 \
  --n_samples 200 \
  --allow_code_execution
```

**Note**: Execution takes longer because of the additional tests. The timeout may need adjustment.

### MBPP (Mostly Basic Python Problems)

**Overview**: Roughly 1,000 crowd-sourced Python problems (974 in the released dataset) designed for entry-level programmers.

**Dataset**: `mbpp` on HuggingFace

**Test split**: 500 problems (task IDs 11-510)

**Metric**: pass@k

**Problem structure**:
- Task description in English
- 3 automated test cases per problem
- Code solution (ground truth)

**Usage**:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks mbpp \
  --temperature 0.2 \
  --n_samples 200 \
  --allow_code_execution
```

### MBPP+

**Overview**: 399 curated MBPP problems with 35× more test cases.

**Dataset**: `evalplus/mbppplus` on HuggingFace

**Usage**:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks mbppplus \
  --allow_code_execution
```

### MultiPL-E (18 Languages)

**Overview**: HumanEval and MBPP translated to 18 programming languages.

**Languages**: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket

**Task naming**: `multiple-{lang}`, where `{lang}` is the language's file extension:
- `multiple-py` (Python)
- `multiple-js` (JavaScript)
- `multiple-java` (Java)
- `multiple-cpp` (C++)
- `multiple-go` (Go)
- `multiple-rs` (Rust)
- `multiple-ts` (TypeScript)
- `multiple-cs` (C#)
- `multiple-php` (PHP)
- `multiple-rb` (Ruby)
- `multiple-swift` (Swift)
- `multiple-kt` (Kotlin)
- `multiple-scala` (Scala)
- `multiple-pl` (Perl)
- `multiple-jl` (Julia)
- `multiple-lua` (Lua)
- `multiple-r` (R)
- `multiple-rkt` (Racket)

**Usage with Docker** (recommended for safe execution):

```bash
# Step 1: Generate on host (no code is executed here)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-js,multiple-java,multiple-cpp \
  --generation_only \
  --save_generations \
  --save_generations_path generations.json

# Step 2: Evaluate in Docker, mounting the generations read-only
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
# Run the image by the same name it was pulled under
docker run -v "$(pwd)/generations.json:/app/generations.json:ro" \
  -it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
  --tasks multiple-js,multiple-java,multiple-cpp \
  --load_generations_path /app/generations.json \
  --allow_code_execution
```

### APPS

**Overview**: 10,000 Python problems across three difficulty levels.

**Difficulty levels**:
-