
Agent Designer
Analyze multi-agent execution logs for success rate, cost, latency, errors, and tool efficiency to find bottlenecks before you scale agent workflows.
Overview
Agent Designer is an agent skill used mainly in the Operate phase (and also in Ship) that evaluates multi-agent execution logs for success rate, cost, latency, and bottlenecks.
Install
npx skills add https://github.com/alirezarezvani/claude-skills --skill agent-designer
What is this skill?
- Ingests JSON execution logs with task_id, agent_id, actions, tokens_used, cost_usd, status, and tools_used fields
- Computes success rate, failure and timeout rates, latency distribution, and average cost per task
- Surfaces error patterns, retry behavior, and tool usage efficiency across agents
- Outputs a performance report plus bottleneck analysis and optimization recommendations
- Runs as a Python agent_evaluator-style pipeline with an argparse CLI for batch reports on real runs
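The headline metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own code: it assumes each log is a dict carrying the status, duration_ms, and cost_usd fields from the ExecutionLog schema, and it uses a nearest-rank 95th percentile, which may differ from how the evaluator computes percentile_95_duration_ms. The function names are hypothetical.

```python
import math
import statistics
from typing import Any, Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(logs: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute success/failure/timeout rates, latency stats, and average cost."""
    total = len(logs)
    if total == 0:
        raise ValueError("no execution logs to summarize")
    # Count tasks per status value (success, failure, partial, timeout).
    by_status: Dict[str, int] = {}
    for log in logs:
        by_status[log["status"]] = by_status.get(log["status"], 0) + 1
    durations = [log["duration_ms"] for log in logs]
    total_cost = sum(log["cost_usd"] for log in logs)
    return {
        "total_tasks": total,
        "success_rate": by_status.get("success", 0) / total,
        "failure_rate": by_status.get("failure", 0) / total,
        "timeout_rate": by_status.get("timeout", 0) / total,
        "median_duration_ms": statistics.median(durations),
        "p95_duration_ms": percentile(durations, 95),
        "average_cost_per_task": total_cost / total,
    }
```

Feeding it a few parsed log records gives a quick sanity check before you trust a full report; partial tasks count toward total_tasks but not toward any of the three rates shown.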
Adoption & trust: 542 installs on skills.sh; 17.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your agents complete tasks unpredictably and burn tokens, but you lack a structured report tying failures, latency, and tool usage to concrete fixes.
Who is it for?
Indie builders running repeatable agent workflows who can export execution logs with timing, tokens, cost, and status fields.
Skip if: you are ideating a greenfield agent with no logs yet, or your team only needs prompt copy without operational metrics.
When should I use this skill?
You have agent execution logs in JSON and need success rates, cost and latency stats, error patterns, and bottleneck recommendations.
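Before running any analysis, it helps to confirm the export actually has the fields the evaluator expects. The sketch below assumes the file is a top-level JSON array of records and treats task_id, agent_id, status, duration_ms, cost_usd, and tokens_used as required; both choices are assumptions, since the evaluator's own loading and validation code is not shown here. The helper name load_logs is hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple

# Assumed minimum field set, drawn from the ExecutionLog dataclass.
REQUIRED_FIELDS = ("task_id", "agent_id", "status", "duration_ms", "cost_usd", "tokens_used")
VALID_STATUSES = {"success", "failure", "partial", "timeout"}


def load_logs(text: str) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Parse a JSON array of execution logs; return (valid records, problems)."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array of execution logs")
    valid: List[Dict[str, Any]] = []
    problems: List[str] = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not a JSON object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in rec]
        if missing:
            problems.append(f"record {i}: missing {', '.join(missing)}")
        elif rec["status"] not in VALID_STATUSES:
            problems.append(f"record {i}: unknown status {rec['status']!r}")
        else:
            valid.append(rec)
    return valid, problems
```

Reporting problems instead of raising on the first bad record lets you see how much of an export is usable before you decide whether to fix the exporter or proceed with the valid subset.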
What do I get? / Deliverables
You receive a performance report with bottleneck analysis and optimization recommendations grounded in the parsed JSON execution logs.
- Performance metrics summary
- Bottleneck analysis
- Optimization recommendations report
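A bottleneck pass can be as simple as flagging agents whose failure rate or tail latency crosses a threshold. In this sketch the threshold values, the find_bottlenecks name, and the choice to count timeouts as failures are all hypothetical; the evaluator's real _define_performance_thresholds() values are not visible in this excerpt. The output keys loosely mirror the bottleneck_type and location fields of the BottleneckAnalysis dataclass.

```python
import math
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical thresholds for illustration; tune them to your own SLAs.
MAX_FAILURE_RATE = 0.10
MAX_P95_MS = 10_000


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(0.95 * len(ordered))) - 1]


def find_bottlenecks(logs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flag agents whose failure rate or p95 latency exceeds a threshold."""
    by_agent: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for rec in logs:
        by_agent[rec["agent_id"]].append(rec)

    findings: List[Dict[str, Any]] = []
    for agent_id, recs in sorted(by_agent.items()):
        # Timeouts are counted as failures here (an assumption).
        failed = sum(rec["status"] in ("failure", "timeout") for rec in recs)
        failure_rate = failed / len(recs)
        tail_ms = p95([rec["duration_ms"] for rec in recs])
        if failure_rate > MAX_FAILURE_RATE:
            findings.append({
                "bottleneck_type": "agent",
                "location": agent_id,
                "metric": "failure_rate",
                "value": failure_rate,
                "suggestion": "Review error_details and retry policy for this agent.",
            })
        if tail_ms > MAX_P95_MS:
            findings.append({
                "bottleneck_type": "agent",
                "location": agent_id,
                "metric": "p95_duration_ms",
                "value": tail_ms,
                "suggestion": "Check which tools_used dominate slow runs.",
            })
    return findings
```

Keeping each finding as a plain dict makes it easy to serialize alongside the metrics summary in a single JSON report.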
Journey fit
Spans multiple journey phases: one primary shelf plus alternate fits.
Once agents run in production or in heavy development loops, you need monitoring-grade analytics. The canonical shelf is Operate, even though the same evaluator also helps during Ship hardening. Within Operate, the Monitoring subphase fits best, because the skill produces performance metrics, error patterns, and bottleneck reports derived from execution logs.
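Error patterns, retry behavior, and tool usage can be tallied from the same records. This sketch assumes error_details, when present, holds an error_type key (the real shape of error_details is not documented here) and uses the success rate of tasks that invoked each tool as a rough efficiency signal; that metric is an illustrative choice, not necessarily the evaluator's definition of tool efficiency. The function name is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def error_and_tool_stats(logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Tally error types, retry rate, and per-tool usage and success rate."""
    total = len(logs)
    # Group non-successful tasks by error type; assumes error_details
    # (when present) has an "error_type" key, otherwise "unknown".
    error_types = Counter(
        (rec.get("error_details") or {}).get("error_type", "unknown")
        for rec in logs
        if rec["status"] != "success"
    )
    # Share of tasks that needed at least one retry.
    retried = sum(1 for rec in logs if rec.get("retry_count", 0) > 0)
    # How many times each tool appears across all tools_used lists.
    tool_calls = Counter(tool for rec in logs for tool in rec.get("tools_used", []))
    # Success rate of the tasks that used each tool.
    tool_success_rate: Dict[str, float] = {}
    for tool in tool_calls:
        recs = [rec for rec in logs if tool in rec.get("tools_used", [])]
        tool_success_rate[tool] = sum(rec["status"] == "success" for rec in recs) / len(recs)
    return {
        "error_types": error_types.most_common(),
        "retry_rate": retried / total if total else 0.0,
        "tool_calls": dict(tool_calls),
        "tool_success_rate": tool_success_rate,
    }
```

A tool with high call counts but a low success rate is a natural first candidate when trimming tool sets.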
Where it fits
Weekly batch report of cost_usd and success_rate per agent_id to decide which task types need slimmer tool sets.
After a load test, parse the logs to see where timeout_tasks concentrate before you enable the feature flag.
Compare latency distribution before and after a prompt change to justify keeping a more expensive but reliable agent path.
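All three scenarios above boil down to grouping logs by some key and comparing summaries. In this sketch the grouping key is a caller-supplied function, so the same hypothetical group_summary helper covers per-agent_id weekly reports and before/after comparisons. The percentile method is nearest-rank, an assumption rather than the evaluator's confirmed approach.

```python
import math
import statistics
from collections import defaultdict
from typing import Any, Callable, Dict, List

Log = Dict[str, Any]


def group_summary(logs: List[Log], key: Callable[[Log], str]) -> Dict[str, Dict[str, float]]:
    """Summarize success rate, latency, and cost for each group of logs."""
    groups: Dict[str, List[Log]] = defaultdict(list)
    for rec in logs:
        groups[key(rec)].append(rec)

    summary: Dict[str, Dict[str, float]] = {}
    for name, recs in groups.items():
        durations = sorted(rec["duration_ms"] for rec in recs)
        # Nearest-rank 95th percentile index into the sorted durations.
        p95_index = max(1, math.ceil(0.95 * len(durations))) - 1
        summary[name] = {
            "tasks": len(recs),
            "success_rate": sum(rec["status"] == "success" for rec in recs) / len(recs),
            "median_ms": statistics.median(durations),
            "p95_ms": durations[p95_index],
            "total_cost_usd": sum(rec["cost_usd"] for rec in recs),
        }
    return summary
```

For example, `group_summary(logs, lambda rec: rec["agent_id"])` gives the weekly per-agent view, while `group_summary(logs, lambda rec: rec["metadata"].get("prompt_version", "unknown"))` splits the same logs into before and after buckets. The prompt_version key inside metadata is hypothetical; tag runs however your pipeline allows.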
How it compares
Use for log-driven agent performance analysis, not as a greenfield agent architecture blueprint skill.
Common Questions / FAQ
Who is agent-designer for?
Solo builders operating multi-agent or tool-heavy agents who want data-backed performance and cost reports from execution history.
When should I use agent-designer?
In Operate when monitoring production or staging agent jobs, and in Ship during testing when you stress agent flows before launch.
Is agent-designer safe to install?
Review the Security Audits panel on this Prism page. Also note that logs may contain sensitive task text, API-derived metadata, and cost figures, which you should redact before sharing.
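One way to follow that advice is to redact a copy of the logs before it leaves your machine. The field list below is an assumption drawn from the FAQ answer (task text, metadata, cost figures), and redact is a hypothetical helper; extend the list to match whatever your own logs contain.

```python
import copy
from typing import Any, Dict, Iterable, List

# Assumed sensitive fields, based on the ExecutionLog schema; adjust as needed.
SENSITIVE_FIELDS = ("task_description", "metadata", "cost_usd")


def redact(logs: List[Dict[str, Any]], fields: Iterable[str] = SENSITIVE_FIELDS) -> List[Dict[str, Any]]:
    """Return deep copies of the logs with sensitive fields replaced by a placeholder."""
    fields = tuple(fields)
    cleaned: List[Dict[str, Any]] = []
    for log in logs:
        rec = copy.deepcopy(log)  # never mutate the caller's records
        for field in fields:
            if field in rec:
                rec[field] = "[REDACTED]"
        cleaned.append(rec)
    return cleaned
```

Run your metrics on the original logs first, since replacing cost_usd with a placeholder makes cost analysis impossible on the redacted copy.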
SKILL.md - Agent Designer
#!/usr/bin/env python3
"""
Agent Evaluator - Multi-Agent System Performance Analysis

Takes agent execution logs (task, actions taken, results, time, tokens used)
and evaluates performance: task success rate, average cost per task, latency
distribution, error patterns, tool usage efficiency, identifies bottlenecks
and improvement opportunities.

Input: execution logs JSON
Output: performance report + bottleneck analysis + optimization recommendations
"""

import json
import argparse
import sys
import statistics
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass, asdict
from collections import defaultdict, Counter
from datetime import datetime, timedelta
import re


@dataclass
class ExecutionLog:
    """Single execution log entry"""
    task_id: str
    agent_id: str
    task_type: str
    task_description: str
    start_time: str
    end_time: str
    duration_ms: int
    status: str  # success, failure, partial, timeout
    actions: List[Dict[str, Any]]
    results: Dict[str, Any]
    tokens_used: Dict[str, int]  # input_tokens, output_tokens, total_tokens
    cost_usd: float
    error_details: Optional[Dict[str, Any]]
    tools_used: List[str]
    retry_count: int
    metadata: Dict[str, Any]


@dataclass
class PerformanceMetrics:
    """Performance metrics for an agent or system"""
    total_tasks: int
    successful_tasks: int
    failed_tasks: int
    partial_tasks: int
    timeout_tasks: int
    success_rate: float
    failure_rate: float
    average_duration_ms: float
    median_duration_ms: float
    percentile_95_duration_ms: float
    min_duration_ms: int
    max_duration_ms: int
    total_tokens_used: int
    average_tokens_per_task: float
    total_cost_usd: float
    average_cost_per_task: float
    cost_per_token: float
    throughput_tasks_per_hour: float
    error_rate: float
    retry_rate: float


@dataclass
class ErrorAnalysis:
    """Error pattern analysis"""
    error_type: str
    count: int
    percentage: float
    affected_agents: List[str]
    affected_task_types: List[str]
    common_patterns: List[str]
    suggested_fixes: List[str]
    impact_level: str  # high, medium, low


@dataclass
class BottleneckAnalysis:
    """System bottleneck analysis"""
    bottleneck_type: str  # agent, tool, communication, resource
    location: str
    severity: str  # critical, high, medium, low
    description: str
    impact_on_performance: Dict[str, float]
    affected_workflows: List[str]
    optimization_suggestions: List[str]
    estimated_improvement: Dict[str, float]


@dataclass
class OptimizationRecommendation:
    """Performance optimization recommendation"""
    category: str  # performance, cost, reliability, scalability
    priority: str  # high, medium, low
    title: str
    description: str
    implementation_effort: str  # low, medium, high
    expected_impact: Dict[str, Any]
    estimated_cost_savings: Optional[float]
    estimated_performance_gain: Optional[float]
    implementation_steps: List[str]
    risks: List[str]
    prerequisites: List[str]


@dataclass
class EvaluationReport:
    """Complete evaluation report"""
    summary: Dict[str, Any]
    system_metrics: PerformanceMetrics
    agent_metrics: Dict[str, PerformanceMetrics]
    task_type_metrics: Dict[str, PerformanceMetrics]
    tool_usage_analysis: Dict[str, Any]
    error_analysis: List[ErrorAnalysis]
    bottleneck_analysis: List[BottleneckAnalysis]
    optimization_recommendations: List[OptimizationRecommendation]
    trends_analysis: Dict[str, Any]
    cost_breakdown: Dict[str, Any]
    sla_compliance: Dict[str, Any]
    metadata: Dict[str, Any]


class AgentEvaluator:
    """Evaluate multi-agent system performance from execution logs"""

    def __init__(self):
        self.error_patterns = self._define_error_patterns()
        self.performance_thresholds = self._define_performance_thresholds()
        self.cost_benchmarks = self._define_cost_benchmarks()

    def _define_error_patte