
Evaluating Llms Harness
Benchmark OpenAI, Anthropic, and OpenAI-compatible API models with lm-evaluation-harness tasks before you pick a model for your agent or product.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill evaluating-llms-harness
What is this skill?
- Unified TemplateAPI path for OpenAI completions/chat and Anthropic completions/chat
- Documents which request types work per provider (generate_until vs loglikelihood)
- Logprobs availability matrix—chat APIs often cannot run perplexity tasks
- Environment setup with OPENAI_API_KEY and lm_eval CLI examples
- Supports local OpenAI-compatible servers for comparing closed and open models
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
The canonical shelf is Ship/testing because harness runs gate model quality before launch and support ongoing regression checks afterwards. Task suites such as HellaSwag and loglikelihood checks are QA-style evaluation, not initial ideation or marketing distribution.
Common Questions / FAQ
Is Evaluating Llms Harness safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# API Evaluation

Guide to evaluating OpenAI, Anthropic, and other API-based language models.

## Overview

The lm-evaluation-harness supports evaluating API-based models through a unified `TemplateAPI` interface. This allows benchmarking of:

- OpenAI models (GPT-4, GPT-3.5, etc.)
- Anthropic models (Claude 3, Claude 2, etc.)
- Local OpenAI-compatible APIs
- Custom API endpoints

**Why evaluate API models**:

- Benchmark closed-source models
- Compare API models to open models
- Validate API performance
- Track model updates over time

## Supported API Models

| Provider | Model Type | Request Types | Logprobs |
|----------|------------|---------------|----------|
| OpenAI (completions) | `openai-completions` | All | ✅ Yes |
| OpenAI (chat) | `openai-chat-completions` | `generate_until` only | ❌ No |
| Anthropic (completions) | `anthropic-completions` | All | ❌ No |
| Anthropic (chat) | `anthropic-chat` | `generate_until` only | ❌ No |
| Local (OpenAI-compatible) | `local-completions` | Depends on server | Varies |

**Note**: Models without logprobs can only be evaluated on generation tasks, not perplexity or loglikelihood tasks.

## OpenAI Models

### Setup

```bash
export OPENAI_API_KEY=sk-...
```

### Completion Models (Legacy)

**Available models**: `davinci-002`, `babbage-002`

```bash
lm_eval --model openai-completions \
  --model_args model=davinci-002 \
  --tasks lambada_openai,hellaswag \
  --batch_size auto
```

**Supports**:

- `generate_until`: ✅
- `loglikelihood`: ✅
- `loglikelihood_rolling`: ✅

### Chat Models

**Available models**: `gpt-4`, `gpt-4-turbo`, `gpt-3.5-turbo`

```bash
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4-turbo \
  --tasks mmlu,gsm8k,humaneval \
  --num_fewshot 5 \
  --batch_size auto
```

**Supports**:

- `generate_until`: ✅
- `loglikelihood`: ❌ (no logprobs)
- `loglikelihood_rolling`: ❌

**Important**: Chat models don't provide logprobs, so they can only be used with generation tasks (MMLU, GSM8K, HumanEval), not perplexity tasks.
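
The **Supports** lists above can be turned into a small pre-flight check, so an incompatible run fails before any tokens are billed. The following is a minimal sketch, not part of lm-evaluation-harness: `SUPPORTED_REQUESTS` and `unsupported_requests` are hypothetical names, and the mapping only restates the two OpenAI model types documented in this section.

```python
from __future__ import annotations

# Request-type names as used in this guide.
GENERATION = "generate_until"
LOGLIKELIHOOD = "loglikelihood"
ROLLING = "loglikelihood_rolling"

# Hypothetical lookup table: restates the "Supports" lists for the two
# OpenAI model types described above. It is not read from lm_eval itself.
SUPPORTED_REQUESTS: dict[str, set[str]] = {
    "openai-completions": {GENERATION, LOGLIKELIHOOD, ROLLING},
    "openai-chat-completions": {GENERATION},
}


def unsupported_requests(model_type: str, required: set[str]) -> set[str]:
    """Return the request types in `required` that `model_type` cannot serve."""
    if model_type not in SUPPORTED_REQUESTS:
        raise ValueError(f"unknown model type: {model_type!r}")
    return set(required) - SUPPORTED_REQUESTS[model_type]


if __name__ == "__main__":
    # A perplexity-style task needs rolling loglikelihoods, which the
    # chat endpoint does not expose.
    missing = unsupported_requests("openai-chat-completions", {ROLLING})
    print(sorted(missing))  # prints ['loglikelihood_rolling']
```

An empty result means every required request type is available. Which request types a given task actually needs is an assumption the caller has to supply, for example by reading the task's configuration.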
### Configuration Options

```bash
# Continuation lines after --model_args are deliberately not indented:
# leading spaces would split the comma-separated value into separate arguments.
lm_eval --model openai-chat-completions \
  --model_args \
model=gpt-4-turbo,\
base_url=https://api.openai.com/v1,\
num_concurrent=5,\
max_retries=3,\
timeout=60,\
batch_size=auto
```

**Parameters**:

- `model`: Model identifier (required)
- `base_url`: API endpoint (default: OpenAI)
- `num_concurrent`: Concurrent requests (default: 5)
- `max_retries`: Retry failed requests (default: 3)
- `timeout`: Request timeout in seconds (default: 60)
- `tokenizer`: Tokenizer to use (default: matches model)
- `tokenizer_backend`: `"tiktoken"` or `"huggingface"`

### Cost Management

OpenAI charges per token. Estimate costs before running:

```python
# Rough estimate: total tokens (in thousands) times the per-1k-token rate
num_samples = 1000
avg_tokens_per_sample = 500  # input + output
cost_per_1k_tokens = 0.01  # GPT-3.5 Turbo
total_cost = (num_samples * avg_tokens_per_sample / 1000) * cost_per_1k_tokens
print(f"Estimated cost: ${total_cost:.2f}")
```

**Cost-saving tips**:

- Use `--limit N` for testing
- Start with `gpt-3.5-turbo` before `gpt-4`
- Set `max_gen_toks` to the minimum needed
- Use `num_fewshot=0` for zero-shot when possible

## Anthropic Models

### Setup

```bash
export ANTHROPIC_API_KEY=sk-ant-...
```

### Completion Models (Legacy)

```bash
lm_eval --model anthropic-completions \
  --model_args model=claude-2.1 \
  --tasks lambada_openai,hellaswag \
  --batch_size auto
```

### Chat Models (Recommended)

**Available models**: `claude-3-5-sonnet-20241022`, `claude-3-opus-20240229`, `claude-3-sonnet-20240229`, `claude-3-haiku-20240307`

```bash
lm_eval --model anthropic-chat \
  --model_args model=claude-3-5-sonnet-20241022 \
  --tasks mmlu,gsm8k,humaneval \
  --num_fewshot 5 \
  --batch_size auto
```

**Aliases**: `anthropic-chat-completions` (same as `anthropic-chat`)

### Configuration Options

```bash
lm_eval --model anthropic-chat \
  --model_args \
model=claude-3-5-sonnet-20241022,\
base_url=https://api.anthropic.com,\
num