Sf Eval

Name: Sf Eval
Author: clientell-ai

clientell-ai/salesforce-skills·Apache-2.0

Benchmark whether Salesforce skills improve Apex and config output by scoring with-vs-without skill context against a Salesforce rubric.

Overview

sf-eval is an agent skill, most often used in the Ship phase (and also in Build), that benchmarks Salesforce code generated with and without skill context and scores the results against a Salesforce-specific quality rubric.

Install

npx skills add https://github.com/clientell-ai/salesforce-skills --skill sf-eval

What is this skill?

Compares AI-generated Salesforce code with vs without skill context
Scores against a Salesforce-specific rubric: security, governor limits, bulkification, patterns, completeness
Mode 1: run benchmark task(s) from `evals/benchmarks/tasks.json` via `/sf-eval` or task id
Baseline generation deliberately omits skill knowledge to surface typical LLM gaps
Optional Salesforce CLI for static analysis; Apache-2.0 skill with fork context

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 1 install on skills.sh; 7 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).

What problem does it solve?

You cannot tell if your Salesforce agent skills improve output or if the model still ships governor-limit violations and insecure patterns.

Who is it for?

Salesforce skill maintainers and solo consultants running repeatable eval tasks to prove skill ROI and Apex quality.

Skip if: your team does no Salesforce work and only needs generic JavaScript unit tests unrelated to Apex rubrics.

When should I use this skill?

Use it when the user mentions "evaluate skills", "benchmark", "skill quality", "run eval", or "compare with/without skills", or invokes `/sf-eval` with an optional task id.

What do I get? / Deliverables

You get a comparison report with rubric scores for baseline vs skill-augmented generations so you can verify skill value before production merges.

Comparison report: baseline vs skill-augmented generations
Per-dimension rubric scores for Salesforce best practices

Recommended Skills

Find Skills (vercel-labs/skills)

Find Skills is a meta agent skill from the Vercel Labs skills package that helps solo builders discover and install modu…
2M installs · 21.7k stars

Skill Creator (anthropics/skills)

Skill-creator is an Anthropic-originated meta skill aimed at solo and indie builders who want durable agent capabilities…
258k installs · 148k stars

Lark Skill Maker (larksuite/cli)

Meta-skill for packaging Feishu/Lark API operations into installable lark-cli Skills.
207k installs · 13.7k stars

Skills Cli (xixu-me/skills)

skills-cli is a procedural agent skill that teaches assistants how to operate the open Agent Skills CLI—the package mana…
200k installs · 61 stars

Write A Skill (mattpocock/skills)

End-to-end guide for authoring new agent skills with proper metadata, folder layout, progressive disclosure, and user va…
181k installs · 121k stars

Using Superpowers (obra/superpowers)

Using Superpowers is a journey-wide meta skill for solo and indie builders who run Claude Code, Codex, Cursor, or simila…
134k installs · 221k stars

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Ship/testing is the canonical shelf because the skill's purpose is evaluation, benchmarking, and quality verification before you trust generated Salesforce code. The Testing subphase covers running evals, comparing modes, and rubric scoring rather than net-new feature implementation.

Also useful

Build: Agent skills & templates

Where it fits

Ship (Testing & QA): Run `/sf-eval` on a benchmark task before merging AI-written triggers into a release branch.

Build (Agent skills & templates): After editing a Salesforce SKILL.md, re-benchmark to see if rubric scores on security and bulkification improved.

Operate (Iteration & experiments): Periodic eval runs when platform API versions change to catch skill drift.

How it compares

A skill-package benchmark harness with a Salesforce rubric, not a generic pass/fail linter that lacks a with/without comparison.

Common Questions / FAQ

Who is sf-eval for?

Solo builders and small teams authoring or adopting Salesforce skills who need measurable before/after quality on Apex and platform patterns.

When should I use sf-eval?

In Ship/testing before trusting generated Apex for release; in Build/agent-tooling when iterating on SKILL.md content; whenever the user mentions evaluate skills, benchmark, or compare with/without skills.

Is sf-eval safe to install?

Check the Security Audits panel on this page. The skill allows Read, Write, Edit, and Bash, so run evals only in repos and orgs you control.

SKILL.md


# Salesforce Skills Evaluator

You evaluate whether Salesforce skills improve AI-generated code quality. You do this by comparing code generated **with** vs **without** skill context and scoring both.

## Eval Modes

### Mode 1: Run Benchmark Task(s)
When the user says `/sf-eval` or `/sf-eval <task-id>`:

1. Read available tasks from `evals/benchmarks/tasks.json`
2. For each task (or the specified one):

   **Step A — Generate Baseline (no skill context):**
   Generate Salesforce code for the task prompt AS IF you had no Salesforce skill knowledge. Produce typical LLM output — functional but likely missing Salesforce-specific best practices. Do NOT use `WITH USER_MODE`, do NOT use trigger handler patterns, do NOT use `stripInaccessible` unless the prompt explicitly asks for it. Write code the way a generic AI would.

   **Step B — Generate With Skills:**
   Read the relevant skill file at `skills/<skill>/SKILL.md` and its references. Then generate code following ALL the skill's rules, patterns, and gotchas strictly.

   **Step C — Score Both:**
   Read the rubric at `evals/benchmarks/rubric.md` and the judge prompt at `evals/benchmarks/judge-prompt.md`. Score each output on 5 categories (0-5 each):

   | Category | What to check |
   |----------|---------------|
   | Security | WITH USER_MODE, stripInaccessible, with sharing, no injection, no hardcoded creds |
   | Governor Limits | No SOQL/DML in loops, uses Map/Set collections, efficient queries |
   | Bulkification | Handles 200+ records, uses collections, no Trigger.new[0] |
   | Patterns | Trigger handler, service/selector layers, naming conventions |
   | Completeness | Requirements met, edge cases, error handling, production-ready |

   **Step D — Output Report:**
   Format as a comparison table:

   ```
   ## Task: <task-id>
   **Prompt**: <prompt text>

   ### Baseline (No Skills) — X/25
   | Category | Score | Reason |
   |----------|-------|--------|
   | Security | X/5 | ... |
   | Governor Limits | X/5 | ... |
   | Bulkification | X/5 | ... |
   | Patterns | X/5 | ... |
   | Completeness | X/5 | ... |

   ### With Skills — X/25
   | Category | Score | Reason |
   |----------|-------|--------|
   | Security | X/5 | ... |
   | Governor Limits | X/5 | ... |
   | Bulkification | X/5 | ... |
   | Patterns | X/5 | ... |
   | Completeness | X/5 | ... |

   ### Improvement: +X points (+XX%)
   ```

3. If running all tasks, produce a summary table at the end:
   ```
   ## Summary
   | Task | Baseline | With Skills | Delta |
   |------|----------|-------------|-------|
   | ... | X/25 | X/25 | +X |
   | **Average** | **X/25** | **X/25** | **+X (+XX%)** |
   ```

4. Save the full report to `evals/benchmarks/results/BENCHMARK.md`
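
The scoring arithmetic behind Steps C and D can be sketched in Python. This is a minimal illustration under stated assumptions, not the skill's own implementation: the function names (`total_score`, `improvement`, `summary_table`), the dict-based score format, and the task ids in the usage note are hypothetical. Only the five category names, the 0-5 range, and the /25 total come from the rubric above. The source does not say how the percentage is computed, so this sketch assumes it is relative to the baseline total.

```python
CATEGORIES = ["Security", "Governor Limits", "Bulkification", "Patterns", "Completeness"]
MAX_PER_CATEGORY = 5


def total_score(scores):
    """Validate one output's per-category scores and return the total out of 25."""
    for category in CATEGORIES:
        value = scores.get(category)
        if not isinstance(value, int) or not 0 <= value <= MAX_PER_CATEGORY:
            raise ValueError(f"{category}: expected an integer 0-5, got {value!r}")
    return sum(scores[category] for category in CATEGORIES)


def improvement(baseline_total, skilled_total):
    """Return (delta, percent).

    Percent is relative to the baseline (an assumption); it is None when the
    baseline is 0, because the ratio is undefined in that case.
    """
    delta = skilled_total - baseline_total
    percent = None if baseline_total == 0 else round(100 * delta / baseline_total)
    return delta, percent


def summary_table(results):
    """Render the summary table from {task_id: (baseline_total, skilled_total)}."""
    lines = [
        "## Summary",
        "| Task | Baseline | With Skills | Delta |",
        "|------|----------|-------------|-------|",
    ]
    for task_id, (base, skilled) in results.items():
        lines.append(f"| {task_id} | {base}/25 | {skilled}/25 | {skilled - base:+d} |")
    if results:
        count = len(results)
        avg_base = sum(base for base, _ in results.values()) / count
        avg_skilled = sum(skilled for _, skilled in results.values()) / count
        _, percent = improvement(avg_base, avg_skilled)
        pct = "n/a" if percent is None else f"{percent:+d}%"
        lines.append(
            f"| **Average** | **{avg_base:.1f}/25** | **{avg_skilled:.1f}/25** "
            f"| **{avg_skilled - avg_base:+.1f} ({pct})** |"
        )
    return "\n".join(lines)
```

For example, `print(summary_table({"t1": (10, 20), "t2": (15, 20)}))` renders two task rows and an average row. Rejecting out-of-range scores up front keeps a judge's typo from silently inflating a total.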

### Mode 2: Static Check
When the user says `/sf-eval --check <file>` or `/sf-eval check <file>`:

Run `bash evals/checks/static-checks.sh <file>` and show the results.
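
One way to picture how the invocation forms for Modes 1 and 2 map to actions is the Python sketch below. It is an illustration only: the skill is driven by the agent reading this file, and the helpers `parse_invocation`, `static_check_command`, and `run_static_check` are hypothetical names, as are the mode labels `"benchmark"` and `"check"`. Only the `/sf-eval` forms and the `bash evals/checks/static-checks.sh <file>` command come from the text above.

```python
import shlex
import subprocess


def parse_invocation(text):
    """Map a `/sf-eval ...` invocation to a (mode, argument) pair."""
    parts = shlex.split(text)
    if not parts or parts[0] != "/sf-eval":
        raise ValueError("not an /sf-eval invocation")
    args = parts[1:]
    if not args:
        return ("benchmark", None)       # Mode 1: run all benchmark tasks
    if args[0] in ("--check", "check"):
        if len(args) != 2:
            raise ValueError("usage: /sf-eval --check <file>")
        return ("check", args[1])        # Mode 2: static check on one file
    if len(args) == 1:
        return ("benchmark", args[0])    # Mode 1: run a single task id
    raise ValueError("unrecognized /sf-eval arguments")


def static_check_command(path):
    """Build the Mode 2 command as an argument list (no shell interpolation)."""
    return ["bash", "evals/checks/static-checks.sh", path]


def run_static_check(path):
    """Run the static check script and return (exit_code, combined_output)."""
    result = subprocess.run(static_check_command(path), capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr
```

Passing the command as a list rather than a single shell string means a file path containing spaces or shell metacharacters reaches the script as one argument.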

### Mode 3: Score Custom Code
When the user provides their own code and asks to evaluate it:

Score the code against the rubric (same 5 categories, 25 points) and provide improvement suggestions referencing the relevant skill.
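
As a rough sketch of how improvement suggestions might be prioritized once custom code has been scored, the Python snippet below orders rubric categories by how many points they lost. The helper name `weakest_categories` and the ordering rule are assumptions made for illustration; the source only says that suggestions should reference the relevant skill.

```python
CATEGORIES = ["Security", "Governor Limits", "Bulkification", "Patterns", "Completeness"]


def weakest_categories(scores, max_score=5):
    """Return (category, points_lost) pairs for categories below max_score.

    The largest gap comes first; ties keep rubric order because sorted() is stable.
    """
    gaps = [(c, max_score - scores[c]) for c in CATEGORIES if scores[c] < max_score]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)
```

A category that already scores 5/5 is omitted, so the result lists only the areas where a suggestion is needed.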

## Available Benchmark Tasks

Read `evals/benchmarks/tasks.json` for the


