
Sf Eval
Benchmark whether Salesforce skills improve Apex and config output by scoring with-vs-without skill context against a Salesforce rubric.
Overview
sf-eval is an agent skill most often used in Ship (also Build) that benchmarks Salesforce code with vs without skill context and scores results on a Salesforce-specific quality rubric.
Install
npx skills add https://github.com/clientell-ai/salesforce-skills --skill sf-eval
What is this skill?
- Compares AI-generated Salesforce code with vs without skill context
- Scores against a Salesforce-specific rubric: security, governor limits, bulkification, patterns, completeness
- Mode 1: run benchmark task(s) from `evals/benchmarks/tasks.json` via `/sf-eval` or task id
- Baseline generation deliberately omits skill knowledge to surface typical LLM gaps
- Optional Salesforce CLI for static analysis; Apache-2.0 skill with fork context
Adoption & trust: 1 install on skills.sh; 7 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
You cannot tell if your Salesforce agent skills improve output or if the model still ships governor-limit violations and insecure patterns.
Who is it for?
Salesforce skill maintainers and solo consultants running repeatable eval tasks to prove skill ROI and Apex quality.
Skip if: your team does no Salesforce work and only needs generic JavaScript unit tests unrelated to Apex rubrics.
When should I use this skill?
Use it when the user mentions "evaluate skills", "benchmark", "skill quality", "run eval", or "compare with/without skills", or invokes `/sf-eval` with an optional task id.
What do I get? / Deliverables
You get a comparison report with rubric scores for baseline vs skill-augmented generations so you can verify skill value before production merges.
- Comparison report: baseline vs skill-augmented generations
- Per-dimension rubric scores for Salesforce best practices
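As a rough illustration of how the per-dimension scores roll up into the report's totals and improvement line, here is a minimal Python sketch. The function names and the sample scores are made up for illustration, and computing the percentage relative to the baseline total is an assumption; the skill itself does not state which denominator it uses.

```python
# The five rubric dimensions named on this page; each is scored 0-5 (25 total).
CATEGORIES = ("Security", "Governor Limits", "Bulkification", "Patterns", "Completeness")
MAX_PER_CATEGORY = 5


def total(scores: dict[str, int]) -> int:
    """Sum per-category scores after checking each one is present and in range."""
    for name in CATEGORIES:
        value = scores[name]  # raises KeyError if a category is missing
        if not 0 <= value <= MAX_PER_CATEGORY:
            raise ValueError(f"{name} score {value} is outside 0-{MAX_PER_CATEGORY}")
    return sum(scores[name] for name in CATEGORIES)


def improvement(baseline: dict[str, int], with_skills: dict[str, int]) -> str:
    """Format the delta line. ASSUMPTION: percent is relative to the baseline total."""
    base, skilled = total(baseline), total(with_skills)
    delta = skilled - base
    if base == 0:
        # A zero baseline has no meaningful relative change.
        return f"{delta:+d} points (n/a)"
    return f"{delta:+d} points ({delta / base * 100:+.0f}%)"


# Made-up scores, purely to show the arithmetic.
baseline = {"Security": 1, "Governor Limits": 3, "Bulkification": 2,
            "Patterns": 1, "Completeness": 3}
skilled = {"Security": 5, "Governor Limits": 4, "Bulkification": 5,
           "Patterns": 4, "Completeness": 4}
print(f"Baseline: {total(baseline)}/25, With Skills: {total(skilled)}/25")
print("Improvement:", improvement(baseline, skilled))
```

Validating the range up front keeps a stray score such as 6/5 from silently inflating a total.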
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Ship/testing is the canonical shelf because the skill's purpose is evaluation, benchmarking, and quality verification before you trust generated Salesforce code. The Testing subphase covers running evals, comparing modes, and rubric scoring rather than net-new feature implementation.
Where it fits
Run `/sf-eval` on a benchmark task before merging AI-written triggers into a release branch.
After editing a Salesforce SKILL.md, re-benchmark to see if rubric scores on security and bulkification improved.
Periodic eval runs when platform API versions change to catch skill drift.
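For repeated runs like these, per-task totals can be rolled into a summary table so that a drop in the average is easy to spot. The sketch below is one possible way to do that in Python; the function name, the task ids, and the scores are hypothetical, and the percentage is again assumed to be relative to the baseline average.

```python
def summary_table(results: dict[str, tuple[int, int]]) -> str:
    """Render a Markdown summary.

    `results` maps a task id to (baseline_total, with_skills_total), each out of 25.
    """
    if not results:
        raise ValueError("no results to summarize")
    lines = [
        "| Task | Baseline | With Skills | Delta |",
        "|------|----------|-------------|-------|",
    ]
    for task_id, (base, skilled) in results.items():
        lines.append(f"| {task_id} | {base}/25 | {skilled}/25 | {skilled - base:+d} |")

    count = len(results)
    avg_base = sum(base for base, _ in results.values()) / count
    avg_skilled = sum(skilled for _, skilled in results.values()) / count
    delta = avg_skilled - avg_base
    # ASSUMPTION: percentage change is measured against the baseline average.
    pct = f"{delta / avg_base * 100:+.0f}%" if avg_base else "n/a"
    lines.append(
        f"| **Average** | **{avg_base:.1f}/25** | **{avg_skilled:.1f}/25** "
        f"| **{delta:+.1f} ({pct})** |"
    )
    return "\n".join(lines)


# Hypothetical task ids and scores, for illustration only.
print(summary_table({"task-a": (10, 22), "task-b": (14, 20)}))
```

Keeping the output in the same column layout across runs makes successive reports easy to diff.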
How it compares
Skill-package benchmark harness with Salesforce rubric, not a generic pass/fail linter with no with/without comparison.
Common Questions / FAQ
Who is sf-eval for?
Solo builders and small teams authoring or adopting Salesforce skills who need measurable before/after quality on Apex and platform patterns.
When should I use sf-eval?
In Ship/testing before trusting generated Apex for release; in Build/agent-tooling when iterating on SKILL.md content; whenever the user mentions evaluate skills, benchmark, or compare with/without skills.
Is sf-eval safe to install?
Check the Security Audits panel on this page. The skill allows Read, Write, Edit, and Bash, so run evals only in repos and orgs you control.
SKILL.md
# Salesforce Skills Evaluator

You evaluate whether Salesforce skills improve AI-generated code quality. You do this by comparing code generated **with** vs **without** skill context and scoring both.

## Eval Modes

### Mode 1: Run Benchmark Task(s)

When user says `/sf-eval` or `/sf-eval <task-id>`:

1. Read available tasks from `evals/benchmarks/tasks.json`
2. For each task (or the specified one):

   **Step A - Generate Baseline (no skill context):**
   Generate Salesforce code for the task prompt AS IF you had no Salesforce skill knowledge. Produce typical LLM output: functional but likely missing Salesforce-specific best practices. Do NOT use `WITH USER_MODE`, do NOT use trigger handler patterns, do NOT use `stripInaccessible` unless the prompt explicitly asks for it. Write code the way a generic AI would.

   **Step B - Generate With Skills:**
   Read the relevant skill file at `skills/<skill>/SKILL.md` and its references. Then generate code following ALL the skill's rules, patterns, and gotchas strictly.

   **Step C - Score Both:**
   Read the rubric at `evals/benchmarks/rubric.md` and the judge prompt at `evals/benchmarks/judge-prompt.md`. Score each output on 5 categories (0-5 each):

   | Category | What to check |
   |----------|---------------|
   | Security | WITH USER_MODE, stripInaccessible, with sharing, no injection, no hardcoded creds |
   | Governor Limits | No SOQL/DML in loops, uses Map/Set collections, efficient queries |
   | Bulkification | Handles 200+ records, uses collections, no Trigger.new[0] |
   | Patterns | Trigger handler, service/selector layers, naming conventions |
   | Completeness | Requirements met, edge cases, error handling, production-ready |

   **Step D - Output Report:**
   Format as a comparison table:

   ```
   ## Task: <task-id>
   **Prompt**: <prompt text>

   ### Baseline (No Skills) - X/25
   | Category | Score | Reason |
   |----------|-------|--------|
   | Security | X/5 | ... |
   | Governor Limits | X/5 | ... |
   | Bulkification | X/5 | ... |
   | Patterns | X/5 | ... |
   | Completeness | X/5 | ... |

   ### With Skills - X/25
   | Category | Score | Reason |
   |----------|-------|--------|
   | Security | X/5 | ... |
   | Governor Limits | X/5 | ... |
   | Bulkification | X/5 | ... |
   | Patterns | X/5 | ... |
   | Completeness | X/5 | ... |

   ### Improvement: +X points (+XX%)
   ```

3. If running all tasks, produce a summary table at the end:

   ```
   ## Summary
   | Task | Baseline | With Skills | Delta |
   |------|----------|-------------|-------|
   | ... | X/25 | X/25 | +X |
   | **Average** | **X/25** | **X/25** | **+X (+XX%)** |
   ```

4. Save the full report to `evals/benchmarks/results/BENCHMARK.md`

### Mode 2: Static Check

When user says `/sf-eval --check <file>` or `/sf-eval check <file>`:

Run `bash evals/checks/static-checks.sh <file>` and show the results.

### Mode 3: Score Custom Code

When user provides their own code and asks to evaluate it:

Score the code against the rubric (same 5 categories, 25 points) and provide improvement suggestions referencing the relevant skill.

## Available Benchmark Tasks

Read `evals/benchmarks/tasks.json` for the