
M10 Performance
Install this when Rust code is slow or allocation-heavy and you want measure-first optimization guidance: profiling, criterion benches, and design trade-offs before micro-optimizing.
Overview
M10-performance is an agent skill most often used in Ship (also Build when structuring hot paths) that guides Rust performance work through profiling, criterion benchmarking, and prioritized optimization design choices.
Install
npx skills add https://github.com/actionbook/rust-skills --skill m10-performance
What is this skill?
- Measure-first gate: profile with flamegraph/perf and benchmark with criterion before changing code
- Priority ladder: algorithm (10x–1000x), data structure (2x–10x), allocation (2x–5x), cache (1.5x–3x)
- Design-to-implementation table: pre-allocate (`with_capacity`), `SmallVec`, `rayon`, `Cow<T>`, zero-copy references
- Thinking prompts on acceptable performance, complexity trade-offs, memory vs CPU, latency vs throughput
- Layer 2 (Design Choices) framing that traces performance questions up to domain-level SLAs
- Documented as Layer 2: Design Choices in the rust-skills performance module
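As a minimal sketch of the pre-allocation entry in the list above, assuming the element count is known before the loop runs. The function names `squares_preallocated` and `count_reallocations` are hypothetical and do not come from the skill. The second function only illustrates how often a `Vec` grows when no capacity is reserved.

```rust
/// Builds the squares of 0..n, reserving the full capacity up front.
/// Hypothetical helper used for illustration only.
pub fn squares_preallocated(n: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(n); // reserve room for all n elements once
    for i in 0..n as u64 {
        out.push(i * i); // stays within the reserved capacity, so no regrowth
    }
    out
}

/// Builds the same data without reserving, and counts how many times
/// the buffer's capacity changes along the way.
pub fn count_reallocations(n: usize) -> usize {
    let mut out: Vec<u64> = Vec::new();
    let mut grows = 0;
    for i in 0..n as u64 {
        let before = out.capacity();
        out.push(i * i);
        if out.capacity() != before {
            grows += 1; // the Vec had to grow to fit this element
        }
    }
    grows
}
```

Whether the saved regrowth matters in a given program is exactly what the measure-first gate is meant to establish, so treat this as a candidate change to benchmark, not a guaranteed win.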
Adoption & trust: 973 installs on skills.sh; 1.2k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your Rust service feels slow, but you do not know whether the cause is algorithmic, allocation-heavy, or cache-unfriendly, and guessing at a fix adds complexity without proof.
Who is it for?
Indie builders optimizing Rust binaries or APIs who already have failing benches or profiler output and want structured next steps.
Skip if: Projects still choosing architecture with no code to profile, or teams seeking non-Rust database/query tuning only.
When should I use this skill?
Use it for performance work. Triggers include performance, optimization, benchmark, profiling, flamegraph, criterion, slow, fast, allocation, cache, SIMD, or equivalent phrasing.
What do I get? / Deliverables
After the skill runs, you have a measured hotspot plan aligned to the goal table (pre-allocate, rayon, Cow, etc.) and explicit trade-offs before merging perf changes.
- Documented bottleneck and measurement evidence
- Optimization plan mapped to the skill’s design-choice table with stated trade-offs
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Ship because the skill is explicitly triggered for performance optimization, benchmarks, and profiling after functionality exists. Perf subphase matches bottleneck analysis, flamegraphs, criterion, and allocation/cache/SIMD decision tables.
Where it fits
Criterion shows a regression; agent walks through flamegraph hotspots then applies `with_capacity` and cache-friendly `Vec` layout.
You design a hot parsing path choosing `Cow<str>` and avoiding redundant clones before first deploy.
Production latency alerts lead to measured fixes rather than speculative SIMD changes.
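The `Cow<str>` scenario above can be sketched as follows. This is an illustrative example under assumptions, not code from the skill: `normalize_tabs` is a hypothetical function, and the "replace tabs with spaces" rule is invented to stand in for whatever normalization a real parser needs.

```rust
use std::borrow::Cow;

/// Replaces tabs with single spaces, allocating only when the input
/// actually contains a tab. Hypothetical helper for illustration.
pub fn normalize_tabs(input: &str) -> Cow<'_, str> {
    if input.contains('\t') {
        // Slow path: build a new String with the replacements applied.
        Cow::Owned(input.replace('\t', " "))
    } else {
        // Fast path: hand back the original slice, no allocation or copy.
        Cow::Borrowed(input)
    }
}
```

Callers can treat the result as a `&str` through deref, so the borrowed and owned cases are handled by the same downstream code. The design pays for one extra scan (`contains`) to avoid an allocation on clean input, which is only worthwhile if clean input is the common case.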
How it compares
A Rust performance methodology skill, not a generic code-review or security audit package.
Common Questions / FAQ
Who is m10-performance for?
Solo and indie builders writing Rust who hit performance triggers and need agent help that insists on profiling and criterion before micro-optimizations.
When should I use m10-performance?
In Ship when benchmarks regress or latency breaches targets; in Build when designing hot loops and choosing `Vec` vs `SmallVec` or rayon parallelism; in Operate when production slowness needs flamegraph-driven fixes.
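Before choosing between two candidate implementations, both can be timed. The sketch below uses only the standard library as a crude stand-in for criterion: it has no warm-up, no statistics, and no outlier handling, so it is suitable for a first look only. The names `time_it`, `contains_linear`, and `contains_sorted` are hypothetical.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the total elapsed wall-clock time.
/// Not a replacement for criterion: no warm-up and no statistical analysis.
pub fn time_it<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f()); // discourage the optimizer from deleting the work
    }
    start.elapsed()
}

/// Candidate A: linear scan over a slice.
pub fn contains_linear(haystack: &[u32], needle: u32) -> bool {
    haystack.iter().any(|&x| x == needle)
}

/// Candidate B: binary search; the slice must already be sorted.
pub fn contains_sorted(haystack: &[u32], needle: u32) -> bool {
    haystack.binary_search(&needle).is_ok()
}
```

A call such as `time_it(100_000, || contains_linear(&data, 63))` gives a rough figure for one candidate. As the skill's own table of common mistakes notes, such numbers are only meaningful from a `--release` build.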
Is m10-performance safe to install?
See the Security Audits panel on this Prism page. Profiling and bench commands run locally and may execute untrusted code paths, so review the agent's shell suggestions before running them in sensitive environments.
SKILL.md
# Performance Optimization

> **Layer 2: Design Choices**

## Core Question

**What's the bottleneck, and is optimization worth it?**

Before optimizing:
- Have you measured? (Don't guess)
- What's the acceptable performance?
- Will optimization add complexity?

---

## Performance Decision → Implementation

| Goal | Design Choice | Implementation |
|------|---------------|----------------|
| Reduce allocations | Pre-allocate, reuse | `with_capacity`, object pools |
| Improve cache | Contiguous data | `Vec`, `SmallVec` |
| Parallelize | Data parallelism | `rayon`, threads |
| Avoid copies | Zero-copy | References, `Cow<T>` |
| Reduce indirection | Inline data | `smallvec`, arrays |

---

## Thinking Prompt

Before optimizing:

1. **Have you measured?**
   - Profile first → flamegraph, perf
   - Benchmark → criterion, cargo bench
   - Identify actual hotspots

2. **What's the priority?**
   - Algorithm (10x-1000x improvement)
   - Data structure (2x-10x)
   - Allocation (2x-5x)
   - Cache (1.5x-3x)

3. **What's the trade-off?**
   - Complexity vs speed
   - Memory vs CPU
   - Latency vs throughput

---

## Trace Up ↑

To domain constraints (Layer 3):

```
"How fast does this need to be?"
    ↑ Ask: What's the performance SLA?
    ↑ Check: domain-* (latency requirements)
    ↑ Check: Business requirements (acceptable response time)
```

| Question | Trace To | Ask |
|----------|----------|-----|
| Latency requirements | domain-* | What's acceptable response time? |
| Throughput needs | domain-* | How many requests per second? |
| Memory constraints | domain-* | What's the memory budget? |

---

## Trace Down ↓

To implementation (Layer 1):

```
"Need to reduce allocations"
    ↓ m01-ownership: Use references, avoid clone
    ↓ m02-resource: Pre-allocate with_capacity

"Need to parallelize"
    ↓ m07-concurrency: Choose rayon or threads
    ↓ m07-concurrency: Consider async for I/O-bound

"Need cache efficiency"
    ↓ Data layout: Prefer Vec over HashMap when possible
    ↓ Access patterns: Sequential over random access
```

---

## Quick Reference

| Tool | Purpose |
|------|---------|
| `cargo bench` | Micro-benchmarks |
| `criterion` | Statistical benchmarks |
| `perf` / `flamegraph` | CPU profiling |
| `heaptrack` | Allocation tracking |
| `valgrind` / `cachegrind` | Cache analysis |

## Optimization Priority

```
1. Algorithm choice (10x - 1000x)
2. Data structure (2x - 10x)
3. Allocation reduction (2x - 5x)
4. Cache optimization (1.5x - 3x)
5. SIMD/Parallelism (2x - 8x)
```

## Common Techniques

| Technique | When | How |
|-----------|------|-----|
| Pre-allocation | Known size | `Vec::with_capacity(n)` |
| Avoid cloning | Hot paths | Use references or `Cow<T>` |
| Batch operations | Many small ops | Collect then process |
| SmallVec | Usually small | `smallvec::SmallVec<[T; N]>` |
| Inline buffers | Fixed-size data | Arrays over Vec |

---

## Common Mistakes

| Mistake | Why Wrong | Better |
|---------|-----------|--------|
| Optimize without profiling | Wrong target | Profile first |
| Benchmark in debug mode | Meaningless | Always `--release` |
| Use LinkedList | Cache unfriendly | `Vec` or `VecDeque` |
| Hidden `.clone()` | Unnecessary allocs | Use references |
| Premature optimization | Wasted effort | Make it work first |

---

## Anti-Patterns

| Anti-Pattern | Why Bad | Better |
|--------------|---------|--------|
| Clone to avoid lifetimes | Performance cost | Proper ownership |
| Box everything | Indirection cost | Stack when possible |
| HashMap for small sets | Overhead | Vec with linear search |
| String concat in loop | O(n^2) | `String::with_capacity` or `format!` |
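
The "String concat in loop" row can be illustrated with a small sketch. It assumes the quadratic cost comes from patterns that rebuild the whole string on each iteration (for example `s = format!("{s}{part}")`). The helper `join_csv` is hypothetical and is not part of this skill.

```rust
/// Joins `parts` with commas into one String sized up front.
/// Hypothetical helper for illustration only.
pub fn join_csv(parts: &[&str]) -> String {
    // Total bytes needed: every part plus one comma between neighbours.
    let total: usize =
        parts.iter().map(|p| p.len()).sum::<usize>() + parts.len().saturating_sub(1);
    let mut out = String::with_capacity(total);
    for (i, part) in parts.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        out.push_str(part); // appends in place; no temporary String per step
    }
    out
}
```

For this specific job the standard library's `parts.join(",")` does the same thing. The hand-written loop is shown only to make the capacity calculation visible.
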
---

## Related Skills

| When | See |
|------|-----|
| Reducing clones