
Agent Benchmark Suite
Run structured throughput, latency, scalability, and swarm coordination benchmarks with regression comparison before you ship or tune agent systems.
Overview
Agent-benchmark-suite is an agent skill most often used in Ship testing (also Build agent-tooling, Operate monitoring) that runs comprehensive performance benchmarks and regression validation for agent and swarm workload
Install
npx skills add https://github.com/ruvnet/ruflo --skill agent-benchmark-suiteWhat is this skill?
- ComprehensiveBenchmarkSuite covers throughput, latency, scalability, and resource usage
- Swarm-focused benchmarks: coordination, load balancing, topology, fault tolerance
- BenchmarkReporter, PerformanceComparator, and BenchmarkAnalyzer for regression workflows
- Custom benchmark manager hook for domain-specific agent workloads
- Invoke via $agent-benchmark-suite as a performance optimization agent profile
- Benchmark categories include throughput, latency, scalability, resource usage, and four swarm-specific suites
Adoption & trust: 633 installs on skills.sh; 58.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You changed agent coordination or scaling code but have no standardized benchmark suite to catch latency or throughput regressions before users feel it.
Who is it for?
Builders operating Ruflo-style agent swarms who need repeatable throughput, latency, and fault-tolerance benchmarks after each change.
Skip if: Static marketing pages or simple CRUD apps with no agent runtime to measure.
When should I use this skill?
Invoke with $agent-benchmark-suite when you need comprehensive performance benchmarking, regression detection, or validation for agent/swarm systems.
What do I get? / Deliverables
You get a structured benchmark run with reporting and comparison hooks so you can accept or reject releases based on measured agent performance.
- Benchmark execution across core and swarm-specific suites
- Reported comparison and analysis suitable for release go/no-go
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Ship testing because the skill’s purpose is performance validation and regression detection before release. Benchmark suites, comparators, and analyzers map directly to QA-style performance testing rather than initial feature coding.
Where it fits
Compare coordination benchmark baselines while tuning swarm topology in development.
Gate a release by running throughput and fault-tolerance suites against the previous build.
Re-run scalability benchmarks after infra changes to confirm resource usage stayed within bounds.
How it compares
Agent-embedded benchmarking workflow, not a hosted SaaS load-test dashboard you click without an agent.
Common Questions / FAQ
Who is agent-benchmark-suite for?
Solo and indie developers building or operating multi-agent systems who want automated benchmarking and regression detection in their agent toolchain.
When should I use agent-benchmark-suite?
In Ship testing before releases, during Build agent-tooling when designing coordination, and in Operate monitoring when validating that production-like swarms still meet latency and throughput baselines.
Is agent-benchmark-suite safe to install?
Check this page’s Security Audits panel for audit results and risk level; benchmark skills may imply running load against your own infrastructure—scope tests to non-production when unsure.
SKILL.md
READMESKILL.md - Agent Benchmark Suite
--- name: Benchmark Suite type: agent category: optimization description: Comprehensive performance benchmarking, regression detection and performance validation --- # Benchmark Suite Agent ## Agent Profile - **Name**: Benchmark Suite - **Type**: Performance Optimization Agent - **Specialization**: Comprehensive performance benchmarking and testing - **Performance Focus**: Automated benchmarking, regression detection, and performance validation ## Core Capabilities ### 1. Comprehensive Benchmarking Framework ```javascript // Advanced benchmarking system class ComprehensiveBenchmarkSuite { constructor() { this.benchmarks = { // Core performance benchmarks throughput: new ThroughputBenchmark(), latency: new LatencyBenchmark(), scalability: new ScalabilityBenchmark(), resource_usage: new ResourceUsageBenchmark(), // Swarm-specific benchmarks coordination: new CoordinationBenchmark(), load_balancing: new LoadBalancingBenchmark(), topology: new TopologyBenchmark(), fault_tolerance: new FaultToleranceBenchmark(), // Custom benchmarks custom: new CustomBenchmarkManager() }; this.reporter = new BenchmarkReporter(); this.comparator = new PerformanceComparator(); this.analyzer = new BenchmarkAnalyzer(); } // Execute comprehensive benchmark suite async runBenchmarkSuite(config = {}) { const suiteConfig = { duration: config.duration || 300000, // 5 minutes default iterations: config.iterations || 10, warmupTime: config.warmupTime || 30000, // 30 seconds cooldownTime: config.cooldownTime || 10000, // 10 seconds parallel: config.parallel || false, baseline: config.baseline || null }; const results = { summary: {}, detailed: new Map(), baseline_comparison: null, recommendations: [] }; // Warmup phase await this.warmup(suiteConfig.warmupTime); // Execute benchmarks if (suiteConfig.parallel) { results.detailed = await this.runBenchmarksParallel(suiteConfig); } else { results.detailed = await this.runBenchmarksSequential(suiteConfig); } // Generate summary results.summary = this.generateSummary(results.detailed); // Compare with baseline if provided if (suiteConfig.baseline) { results.baseline_comparison = await this.compareWithBaseline( results.detailed, suiteConfig.baseline ); } // Generate recommendations results.recommendations = await this.generateRecommendations(results); // Cooldown phase await this.cooldown(suiteConfig.cooldownTime); return results; } // Parallel benchmark execution async runBenchmarksParallel(config) { const benchmarkPromises = Object.entries(this.benchmarks).map( async ([name, benchmark]) => { const result = await this.executeBenchmark(benchmark, name, config); return [name, result]; } ); const results = await Promise.all(benchmarkPromises); return new Map(results); } // Sequential benchmark execution async runBenchmarksSequential(config) { const results = new Map(); for (const [name, benchmark] of Object.entries(this.benchmarks)) { const result = await this.executeBenchmark(benchmark, name, config); results.set(name, result); // Brief pause between benchmarks await this.sleep(1000); } return results; } } ``` ### 2. Performance Regression Detection ```javascript // Advanced regression detection system class RegressionDetector { constructor() { this.detectors = { statistical: new StatisticalRegressionDetector(), machine_learning: new MLRegressionDetector(), threshold: new ThresholdRegressionDetector(), trend