Agent Benchmark Suite

Name: Agent Benchmark Suite
Author: ruvnet

ruvnet/ruflo

Run structured throughput, latency, scalability, and swarm coordination benchmarks with regression comparison before you ship or tune agent systems.

Overview

agent-benchmark-suite is an agent skill most often used in Ship testing (also Build agent-tooling and Operate monitoring). It runs comprehensive performance benchmarks and regression validation for agent and swarm workloads.

Install

npx skills add https://github.com/ruvnet/ruflo --skill agent-benchmark-suite

What is this skill?

ComprehensiveBenchmarkSuite covers throughput, latency, scalability, and resource usage
Swarm-focused benchmarks: coordination, load balancing, topology, fault tolerance
BenchmarkReporter, PerformanceComparator, and BenchmarkAnalyzer for regression workflows
Custom benchmark manager hook for domain-specific agent workloads
Invoke via $agent-benchmark-suite as a performance optimization agent profile
Benchmark categories include throughput, latency, scalability, resource usage, and four swarm-specific suites

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 633 installs on skills.sh; 58.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).

What problem does it solve?

You changed agent coordination or scaling code but have no standardized benchmark suite to catch latency or throughput regressions before users feel them.

Who is it for?

Builders operating Ruflo-style agent swarms who need repeatable throughput, latency, and fault-tolerance benchmarks after each change.

Skip if: you are building static marketing pages or simple CRUD apps with no agent runtime to measure.

When should I use this skill?

Invoke with $agent-benchmark-suite when you need comprehensive performance benchmarking, regression detection, or validation for agent/swarm systems.

What do I get? / Deliverables

You get a structured benchmark run with reporting and comparison hooks so you can accept or reject releases based on measured agent performance.

Benchmark execution across core and swarm-specific suites
Reported comparison and analysis suitable for release go/no-go
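
As a rough illustration of how a comparison could feed a go/no-go decision, the sketch below flags any metric that moves more than a tolerance in the wrong direction relative to a baseline. The function name `compareToBaseline`, the metric names, and the 5% tolerance are assumptions made for this example; they are not part of the skill's documented API.

```javascript
// Hypothetical helper: compares current metrics to a baseline and flags regressions.
// `higherIsBetter` lists metrics (such as throughput) where a drop is the regression;
// every other metric is treated as lower-is-better (such as latency).
function compareToBaseline(baseline, current, { tolerance = 0.05, higherIsBetter = [] } = {}) {
  const report = [];
  for (const [metric, base] of Object.entries(baseline)) {
    // Skip metrics missing from the current run and zero baselines (no relative change).
    if (!(metric in current) || base === 0) continue;
    const change = (current[metric] - base) / base; // relative change vs. baseline
    const worse = higherIsBetter.includes(metric) ? -change : change;
    report.push({ metric, change, regressed: worse > tolerance });
  }
  return { report, pass: report.every((entry) => !entry.regressed) };
}

// Example values are made up for illustration.
const verdict = compareToBaseline(
  { throughput_rps: 1000, p95_latency_ms: 200 },
  { throughput_rps: 900, p95_latency_ms: 204 },
  { tolerance: 0.05, higherIsBetter: ["throughput_rps"] }
);
console.log(verdict.pass ? "go" : "no-go", verdict.report);
```

With these sample numbers the throughput drop of 10% exceeds the tolerance while the 2% latency increase does not, so the gate reports a failure. A single fixed tolerance is the simplest possible policy; noisy metrics may need per-metric tolerances or repeated runs.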


Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Canonical shelf is Ship testing because the skill’s purpose is performance validation and regression detection before release. Benchmark suites, comparators, and analyzers map directly to QA-style performance testing rather than initial feature coding.

Also useful

Build: Agent skills & templates

Operate: Monitoring & observability

Where it fits

Example use

Build: Agent skills & templates

Compare coordination benchmark baselines while tuning swarm topology in development.

Example use

Ship: Testing & QA

Gate a release by running throughput and fault-tolerance suites against the previous build.

Example use

Operate: Monitoring & observability

Re-run scalability benchmarks after infra changes to confirm resource usage stayed within bounds.

How it compares

This is an agent-embedded benchmarking workflow, not a hosted SaaS load-test dashboard that you operate by hand without an agent.

Common Questions / FAQ

Who is agent-benchmark-suite for?

Solo and indie developers building or operating multi-agent systems who want automated benchmarking and regression detection in their agent toolchain.

When should I use agent-benchmark-suite?

In Ship testing before releases, during Build agent-tooling when designing coordination, and in Operate monitoring when validating that production-like swarms still meet latency and throughput baselines.

Is agent-benchmark-suite safe to install?

Check this page's Security Audits panel for audit results and risk level. Benchmark skills may run load against your own infrastructure, so scope tests to non-production environments when unsure.

SKILL.md


---
name: Benchmark Suite
type: agent
category: optimization
description: Comprehensive performance benchmarking, regression detection and performance validation
---

# Benchmark Suite Agent

## Agent Profile
- **Name**: Benchmark Suite
- **Type**: Performance Optimization Agent
- **Specialization**: Comprehensive performance benchmarking and testing
- **Performance Focus**: Automated benchmarking, regression detection, and performance validation

## Core Capabilities

### 1. Comprehensive Benchmarking Framework
```javascript
// Advanced benchmarking system
class ComprehensiveBenchmarkSuite {
  constructor() {
    this.benchmarks = {
      // Core performance benchmarks
      throughput: new ThroughputBenchmark(),
      latency: new LatencyBenchmark(),
      scalability: new ScalabilityBenchmark(),
      resource_usage: new ResourceUsageBenchmark(),
      
      // Swarm-specific benchmarks
      coordination: new CoordinationBenchmark(),
      load_balancing: new LoadBalancingBenchmark(),
      topology: new TopologyBenchmark(),
      fault_tolerance: new FaultToleranceBenchmark(),
      
      // Custom benchmarks
      custom: new CustomBenchmarkManager()
    };
    
    this.reporter = new BenchmarkReporter();
    this.comparator = new PerformanceComparator();
    this.analyzer = new BenchmarkAnalyzer();
  }
  
  // Execute comprehensive benchmark suite
  async runBenchmarkSuite(config = {}) {
    // Use ?? rather than || so explicit zero values (e.g. warmupTime: 0) are respected
    const suiteConfig = {
      duration: config.duration ?? 300000, // 5 minutes default
      iterations: config.iterations ?? 10,
      warmupTime: config.warmupTime ?? 30000, // 30 seconds
      cooldownTime: config.cooldownTime ?? 10000, // 10 seconds
      parallel: config.parallel ?? false,
      baseline: config.baseline ?? null
    };
    
    const results = {
      summary: {},
      detailed: new Map(),
      baseline_comparison: null,
      recommendations: []
    };
    
    // Warmup phase
    await this.warmup(suiteConfig.warmupTime);
    
    // Execute benchmarks
    if (suiteConfig.parallel) {
      results.detailed = await this.runBenchmarksParallel(suiteConfig);
    } else {
      results.detailed = await this.runBenchmarksSequential(suiteConfig);
    }
    
    // Generate summary
    results.summary = this.generateSummary(results.detailed);
    
    // Compare with baseline if provided
    if (suiteConfig.baseline) {
      results.baseline_comparison = await this.compareWithBaseline(
        results.detailed, 
        suiteConfig.baseline
      );
    }
    
    // Generate recommendations
    results.recommendations = await this.generateRecommendations(results);
    
    // Cooldown phase
    await this.cooldown(suiteConfig.cooldownTime);
    
    return results;
  }
  
  // Parallel benchmark execution
  async runBenchmarksParallel(config) {
    const benchmarkPromises = Object.entries(this.benchmarks).map(
      async ([name, benchmark]) => {
        const result = await this.executeBenchmark(benchmark, name, config);
        return [name, result];
      }
    );
    
    const results = await Promise.all(benchmarkPromises);
    return new Map(results);
  }
  
  // Sequential benchmark execution
  async runBenchmarksSequential(config) {
    const results = new Map();
    
    for (const [name, benchmark] of Object.entries(this.benchmarks)) {
      const result = await this.executeBenchmark(benchmark, name, config);
      results.set(name, result);
      
      // Brief pause between benchmarks
      await this.sleep(1000);
    }
    
    return results;
  }
}
```
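
The class above calls `executeBenchmark`, `warmup`, and `sleep` without showing them. As one possible reading of what a single latency benchmark might do, the following sketch discards a few warmup runs, times the remaining iterations, and reports nearest-rank percentiles. The names `measureLatency`, `summarizeLatencies`, and `warmupIterations` are hypothetical and do not come from the skill; the sketch assumes Node.js 16 or later (for the global `performance`) and a non-empty sample list.

```javascript
// Nearest-rank percentile over an ascending-sorted, non-empty array.
function percentile(sorted, p) {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Reduce raw latency samples (milliseconds) to summary statistics.
function summarizeLatencies(samplesMs) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return {
    count: sorted.length,
    mean,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    max: sorted[sorted.length - 1]
  };
}

// Hypothetical stand-in for a single benchmark run: `task` is any async function.
async function measureLatency(task, { iterations = 10, warmupIterations = 2 } = {}) {
  for (let i = 0; i < warmupIterations; i++) {
    await task(); // warmup runs are executed but not recorded
  }
  const samples = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await task();
    samples.push(performance.now() - start);
  }
  return summarizeLatencies(samples);
}

// Usage: time a task that resolves after roughly 5 ms.
measureLatency(() => new Promise((resolve) => setTimeout(resolve, 5)), { iterations: 5 })
  .then((stats) => console.log(stats));
```

Running iterations one at a time keeps samples from competing with each other for CPU; the same concern applies to the suite's `parallel` option, where concurrent benchmarks can distort each other's timings.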

### 2. Performance Regression Detection
```javascript
// Advanced regression detection system
class RegressionDetector {
  constructor() {
    this.detectors = {
      statistical: new StatisticalRegressionDetector(),
      machine_learning: new MLRegressionDetector(),
      threshold: new ThresholdRegressionDetector()
      // trend: ... (the listing is cut off at this entry)
    };
  }
}
```

