M10 Performance

Name: M10 Performance
Author: zhanghandong

zhanghandong/rust-skills

911 installs
1.3k repo stars
Updated May 24, 2026

This is a copy of m10-performance by actionbook - installs and ranking accrue to the original listing.

m10-performance is a Rust skill that guides developers through CPU and memory profiling, Criterion benchmarking, and cache analysis to optimize speed and memory before shipping.

About

m10-performance is a Rust performance optimization guide from zhanghandong/rust-skills that teaches a profiling-first workflow before rewriting code. The skill documents concrete commands for cargo flamegraph CPU traces, cargo-instruments on macOS, heaptrack on Linux, cargo bench with Criterion, and valgrind --tool=cachegrind for cache behavior. It includes Criterion benchmark scaffolding with criterion_group and criterion_main so developers can compare parse_v1 versus parse_v2 implementations on repeated inputs. Developers reach for m10-performance when a Rust binary or library is functionally correct but too slow, memory-heavy, or cache-unfriendly, and they need a repeatable measurement loop instead of guessing at optimizations.

Profiling-first workflow using flamegraph, heaptrack, cachegrind and cargo bench
Criterion benchmark templates with multiple implementation comparisons
9 common Rust performance patterns, including Cow to avoid unnecessary allocations
Allocation reuse techniques that eliminate repeated Vec and String creation
Hard-gate: always profile before optimizing to avoid premature tuning

M10 Performance by the numbers

911 all-time installs (skills.sh)
+6 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/zhanghandong/rust-skills --skill m10-performance


Installs	911
repo stars	★ 1.3k
Security audit	2 / 3 scanners passed
Last updated	May 24, 2026
Repository	zhanghandong/rust-skills ↗

How do you profile and benchmark Rust code before release?

Systematically profile, benchmark, and optimize Rust code for speed and memory efficiency before shipping.

Who is it for?

Rust developers shipping performance-sensitive binaries or libraries who need systematic profiling before rewriting algorithms or data structures.

Skip if: Teams still defining Rust architecture or correctness requirements who have not yet established baseline functionality tests.

When should I use this skill?

A Rust service is too slow, memory-heavy, or cache-unfriendly and the developer mentions flamegraph, Criterion, valgrind, or pre-ship optimization.

What you get

Flamegraph SVGs, Criterion benchmark reports, heap profiles, cachegrind output, and a prioritized list of Rust hot-path fixes.

Criterion benchmark suites
flamegraph profiles
optimization change list

By the numbers

Documents 5 profiling approaches: flamegraph, cargo-instruments, heaptrack, Criterion benches, and valgrind cachegrind

Files

SKILL.md (Markdown, GitHub ↗)

Performance Optimization

Layer 2: Design Choices

Core Question

What's the bottleneck, and is optimization worth it?

Before optimizing:

Have you measured? (Don't guess)
What's the acceptable performance?
Will optimization add complexity?

---

Performance Decision → Implementation

Goal	Design Choice	Implementation
Reduce allocations	Pre-allocate, reuse	`with_capacity`, object pools
Improve cache	Contiguous data	`Vec`, `SmallVec`
Parallelize	Data parallelism	`rayon`, threads
Avoid copies	Zero-copy	References, `Cow<T>`
Reduce indirection	Inline data	`smallvec`, arrays

---

Thinking Prompt

Before optimizing:

1. Have you measured?

Profile first → flamegraph, perf
Benchmark → criterion, cargo bench
Identify actual hotspots

2. What's the priority?

Algorithm (10x-1000x improvement)
Data structure (2x-10x)
Allocation (2x-5x)
Cache (1.5x-3x)

3. What's the trade-off?

Complexity vs speed
Memory vs CPU
Latency vs throughput

---

Trace Up ↑

To domain constraints (Layer 3):

"How fast does this need to be?"
    ↑ Ask: What's the performance SLA?
    ↑ Check: domain-* (latency requirements)
    ↑ Check: Business requirements (acceptable response time)

Question	Trace To	Ask
Latency requirements	domain-*	What's acceptable response time?
Throughput needs	domain-*	How many requests per second?
Memory constraints	domain-*	What's the memory budget?

---

Trace Down ↓

To implementation (Layer 1):

"Need to reduce allocations"
    ↓ m01-ownership: Use references, avoid clone
    ↓ m02-resource: Pre-allocate with_capacity

"Need to parallelize"
    ↓ m07-concurrency: Choose rayon or threads
    ↓ m07-concurrency: Consider async for I/O-bound

"Need cache efficiency"
    ↓ Data layout: Prefer Vec over HashMap when possible
    ↓ Access patterns: Sequential over random access

---

Quick Reference

Tool	Purpose
`cargo bench`	Micro-benchmarks
`criterion`	Statistical benchmarks
`perf` / `flamegraph`	CPU profiling
`heaptrack`	Allocation tracking
`valgrind` / `cachegrind`	Cache analysis

Optimization Priority

1. Algorithm choice     (10x - 1000x)
2. Data structure       (2x - 10x)
3. Allocation reduction (2x - 5x)
4. Cache optimization   (1.5x - 3x)
5. SIMD/Parallelism     (2x - 8x)
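
The first entry in the list, algorithm choice, is usually where the largest factor comes from. As a minimal sketch (the function names and the duplicate-detection task are illustrative assumptions, not part of the skill), the same question can be answered with a quadratic pairwise scan or with a single pass over a `HashSet`:

```rust
use std::collections::HashSet;

/// Quadratic: compares every pair of elements.
fn has_duplicates_quadratic(items: &[u32]) -> bool {
    for i in 0..items.len() {
        for j in (i + 1)..items.len() {
            if items[i] == items[j] {
                return true;
            }
        }
    }
    false
}

/// Linear on average: a HashSet remembers what has been seen so far.
fn has_duplicates_hashed(items: &[u32]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    // `insert` returns false when the value was already present.
    items.iter().any(|item| !seen.insert(*item))
}

fn main() {
    // 10_000 distinct values followed by one repeat.
    let data: Vec<u32> = (0..10_000).chain(std::iter::once(42)).collect();
    println!("quadratic: {}", has_duplicates_quadratic(&data));
    println!("hashed:    {}", has_duplicates_hashed(&data));
}
```

Both functions return the same answer; only the amount of work differs, which is why it pays to settle the algorithm before tuning allocations or cache behavior.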

Common Techniques

Technique	When	How
Pre-allocation	Known size	`Vec::with_capacity(n)`
Avoid cloning	Hot paths	Use references or `Cow<T>`
Batch operations	Many small ops	Collect then process
SmallVec	Usually small	`smallvec::SmallVec<[T; N]>`
Inline buffers	Fixed-size data	Arrays over Vec

---

Common Mistakes

Mistake	Why Wrong	Better
Optimize without profiling	Wrong target	Profile first
Benchmark in debug mode	Meaningless	Always `--release`
Use LinkedList	Cache unfriendly	`Vec` or `VecDeque`
Hidden `.clone()`	Unnecessary allocs	Use references
Premature optimization	Wasted effort	Make it work first

---

Anti-Patterns

Anti-Pattern	Why Bad	Better
Clone to avoid lifetimes	Performance cost	Proper ownership
Box everything	Indirection cost	Stack when possible
HashMap for small sets	Overhead	Vec with linear search
Rebuilding a String in a loop (`format!`, `clone() + ...`)	O(n^2) copying	`String::with_capacity` + `push_str`, or `join`

---

Related Skills

When	See
Reducing clones	m01-ownership
Concurrency options	m07-concurrency
Smart pointer choice	m02-resource
Domain requirements	domain-*

Rust Performance Optimization Guide

Profiling First

Tools

# CPU profiling
cargo install flamegraph
cargo flamegraph --bin myapp

# Memory profiling
cargo install cargo-instruments  # macOS
heaptrack ./target/release/myapp  # Linux

# Benchmarking
cargo bench  # with criterion

# Cache analysis
valgrind --tool=cachegrind ./target/release/myapp

Criterion Benchmarks

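use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

// parse_v1 and parse_v2 are the functions under test, imported from your crate.
fn benchmark_parse(c: &mut Criterion) {
    let input = "test data".repeat(1000);

    // black_box keeps the optimizer from specializing on the known input
    c.bench_function("parse_v1", |b| {
        b.iter(|| parse_v1(black_box(&input)))
    });

    c.bench_function("parse_v2", |b| {
        b.iter(|| parse_v2(black_box(&input)))
    });
```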
}

criterion_group!(benches, benchmark_parse);
criterion_main!(benches);
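
The benchmark above assumes two parsers already exist. As a rough, std-only sketch of the same comparison, the block below defines hypothetical stand-ins for `parse_v1` and `parse_v2` (word counting with and without an intermediate `Vec`) and a hypothetical `time_it` helper. It has no warm-up, outlier handling, or statistics, so it is only a quick sanity check and not a substitute for Criterion.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical stand-in: collects the words into a Vec, then counts them.
fn parse_v1(input: &str) -> usize {
    input.split_whitespace().collect::<Vec<_>>().len()
}

// Hypothetical stand-in: counts words without the intermediate Vec.
fn parse_v2(input: &str) -> usize {
    input.split_whitespace().count()
}

/// Rough wall-clock timing of `iterations` calls to `f`.
fn time_it(iterations: u32, mut f: impl FnMut() -> usize) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the result from being optimized away.
        black_box(f());
    }
    start.elapsed()
}

fn main() {
    let input = "test data ".repeat(1000);
    let v1 = time_it(1_000, || parse_v1(black_box(input.as_str())));
    let v2 = time_it(1_000, || parse_v2(black_box(input.as_str())));
    println!("parse_v1: {v1:?}, parse_v2: {v2:?}");
}
```

Build with `--release` before reading anything into the numbers; debug timings say little about optimized code.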

---

Common Optimizations

1. Avoid Unnecessary Allocations

// BAD: allocates on every call
fn to_uppercase(s: &str) -> String {
    s.to_uppercase()
}

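// GOOD: return Cow, allocate only if needed
// (conservative check: digits, spaces, and punctuation are not "uppercase",
// so inputs containing them still take the allocating branch)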
use std::borrow::Cow;

fn to_uppercase(s: &str) -> Cow<'_, str> {
    if s.chars().all(|c| c.is_uppercase()) {
        Cow::Borrowed(s)
    } else {
        Cow::Owned(s.to_uppercase())
    }
}
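
To see which branch a given input takes, the sketch below repeats the `Cow` version and adds a hypothetical `describe` helper that reports whether the result borrowed the input or allocated. The helper and the sample inputs are illustrative assumptions.

```rust
use std::borrow::Cow;

fn to_uppercase(s: &str) -> Cow<'_, str> {
    if s.chars().all(|c| c.is_uppercase()) {
        Cow::Borrowed(s)
    } else {
        Cow::Owned(s.to_uppercase())
    }
}

/// Reports whether the value reused the input or holds a new String.
fn describe(value: &Cow<'_, str>) -> &'static str {
    match value {
        Cow::Borrowed(_) => "borrowed",
        Cow::Owned(_) => "owned",
    }
}

fn main() {
    for input in ["HELLO", "hello", "HELLO WORLD"] {
        let upper = to_uppercase(input);
        println!("{input:?} -> {upper:?} ({})", describe(&upper));
    }
}
```

Callers can treat the result as a `&str` through deref and only call `into_owned()` when they actually need a `String`.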

2. Reuse Allocations

// BAD: creates new Vec each iteration
for item in items {
    let mut buffer = Vec::new();
    process(&mut buffer, item);
}

// GOOD: reuse buffer
let mut buffer = Vec::new();
for item in items {
    buffer.clear();
    process(&mut buffer, item);
}
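
Reuse works because `clear` drops the contents but keeps the allocation. A small sketch, assuming a hypothetical `process` that copies each item's bytes into the buffer, records the buffer's capacity after every iteration to make that visible:

```rust
/// Hypothetical per-item work: appends the bytes of `item` to `buffer`.
fn process(buffer: &mut Vec<u8>, item: &str) {
    buffer.extend_from_slice(item.as_bytes());
}

/// Processes every item with one shared buffer and returns the buffer's
/// capacity after each iteration.
fn capacities_with_reuse(items: &[&str]) -> Vec<usize> {
    let mut buffer = Vec::new();
    let mut capacities = Vec::with_capacity(items.len());
    for item in items {
        buffer.clear(); // empties the Vec, keeps its allocation
        process(&mut buffer, item);
        capacities.push(buffer.capacity());
    }
    capacities
}

fn main() {
    let caps = capacities_with_reuse(&["a long first item", "b", "c"]);
    println!("{caps:?}");
}
```

After the first, longest item the capacity never drops, so the later iterations run without touching the allocator.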

3. Use Appropriate Collections

Need	Collection	Notes
Sequential access	`Vec<T>`	Best cache locality
Random access by key	`HashMap<K, V>`	O(1) average lookup
Ordered keys	`BTreeMap<K, V>`	O(log n) lookup
Small sets (roughly <20, measure to confirm)	`Vec<T>` + linear search	Lower overhead
FIFO queue	`VecDeque<T>`	Amortized O(1) push/pop at both ends

4. Pre-allocate Capacity

// BAD: many reallocations
let mut v = Vec::new();
for i in 0..10000 {
    v.push(i);
}

// GOOD: single allocation
let mut v = Vec::with_capacity(10000);
for i in 0..10000 {
    v.push(i);
}
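
One way to observe the difference is to count how often the vector's capacity changes while pushing, which is a proxy for reallocations. The helper name below is an assumption for illustration. `Vec` growth is amortized, so the count for `Vec::new()` is small relative to the number of pushes, but the exact growth strategy is unspecified.

```rust
/// Pushes `n` values into `v` and counts how often its capacity changed,
/// which approximates the number of reallocations.
fn count_capacity_changes(mut v: Vec<u32>, n: u32) -> usize {
    let mut changes = 0;
    let mut last = v.capacity();
    for i in 0..n {
        v.push(i);
        if v.capacity() != last {
            changes += 1;
            last = v.capacity();
        }
    }
    changes
}

fn main() {
    let grown = count_capacity_changes(Vec::new(), 10_000);
    let preallocated = count_capacity_changes(Vec::with_capacity(10_000), 10_000);
    println!("Vec::new: {grown} changes, with_capacity: {preallocated} changes");
}
```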

---

String Optimization

Avoid String Concatenation in Loops

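// BAD: rebuilds the whole string each iteration, O(n²) copying
// (note: `result = result + &s` reuses result's buffer and is not quadratic)
let mut result = String::new();
for s in strings {
    result = format!("{result}{s}");
}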

// GOOD: O(n) with push_str
let mut result = String::new();
for s in strings {
    result.push_str(&s);
}

// BETTER: pre-calculate capacity
let total_len: usize = strings.iter().map(|s| s.len()).sum();
let mut result = String::with_capacity(total_len);
for s in strings {
    result.push_str(&s);
}

// BEST: use join for simple cases
let result = strings.join("");
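
The pre-calculated variant can be wrapped in a function that borrows its input, so the caller keeps the original strings. This is a sketch with a hypothetical function name; for plain concatenation the standard library's `concat` and `join("")` produce the same result.

```rust
/// Concatenates `strings` into one String sized up front.
fn concat_preallocated(strings: &[String]) -> String {
    let total_len: usize = strings.iter().map(|s| s.len()).sum();
    let mut result = String::with_capacity(total_len);
    for s in strings {
        result.push_str(s);
    }
    result
}

fn main() {
    let strings = vec!["alpha".to_string(), "beta".to_string(), "gamma".to_string()];
    println!("{}", concat_preallocated(&strings));
    // The slice method `concat` gives the same result for this simple case.
    println!("{}", strings.concat());
}
```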

Use &str When Possible

// BAD: callers holding a &str must allocate a String to call this
fn greet(name: String) {
    println!("Hello, {}", name);
}

// GOOD: borrows, no allocation
fn greet(name: &str) {
    println!("Hello, {}", name);
}

// Works with both:
greet("world");                    // &str
greet(&String::from("world"));     // &String coerces to &str

---

Iterator Optimization

Use Iterators Over Indexing

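// BAD: indexing may incur a bounds check on each access
// (the optimizer often removes it in simple loops, but not always)
let mut sum = 0;
for i in 0..vec.len() {
    sum += vec[i];
}

// GOOD: the iterator has no index to bounds-check
let sum: i32 = vec.iter().sum();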

// GOOD: when index needed
for (i, item) in vec.iter().enumerate() {
    // ...
}
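
The same idea extends to walking two slices together: `zip` replaces a shared index. The dot-product functions below are an illustrative assumption, not part of the skill.

```rust
/// Index-based dot product: each `a[i]` and `b[i]` may be bounds-checked.
fn dot_indexed(a: &[f64], b: &[f64]) -> f64 {
    let n = a.len().min(b.len());
    let mut sum = 0.0;
    for i in 0..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Iterator-based dot product: `zip` stops at the shorter slice.
fn dot_zipped(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0, 2.0, 3.0];
    let b = [4.0, 5.0, 6.0];
    println!("{} {}", dot_indexed(&a, &b), dot_zipped(&a, &b));
}
```

Whether the indexed version is actually slower depends on what the optimizer can prove, so a benchmark should decide in hot code.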

Lazy Evaluation

// Iterators are lazy - computation happens at collect
let result: Vec<_> = data
    .iter()
    .filter(|x| x.is_valid())
    .map(|x| x.process())
    .take(10)  // stop after 10 items
    .collect();
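
Laziness can be made visible by counting how many elements the filter actually inspects. In this sketch (the function name and the even-number predicate are assumptions for illustration), `take` stops the pipeline as soon as enough items have been produced, so the rest of the input is never touched.

```rust
use std::cell::Cell;

/// Returns the first `limit` even numbers from `data`, plus the number of
/// elements the filter closure was called on.
fn first_evens(data: &[u32], limit: usize) -> (Vec<u32>, usize) {
    let inspected = Cell::new(0usize);
    let evens: Vec<u32> = data
        .iter()
        .filter(|x| {
            inspected.set(inspected.get() + 1);
            **x % 2 == 0
        })
        .copied()
        .take(limit)
        .collect();
    (evens, inspected.get())
}

fn main() {
    let data: Vec<u32> = (1..=100).collect();
    let (evens, inspected) = first_evens(&data, 3);
    // Only the first six elements are inspected to find three even numbers.
    println!("{evens:?} after inspecting {inspected} of {} items", data.len());
}
```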

Avoid Collecting When Not Needed

// BAD: unnecessary intermediate allocation
let filtered: Vec<_> = items.iter().filter(|x| x.valid).collect();
let count = filtered.len();

// GOOD: no allocation
let count = items.iter().filter(|x| x.valid).count();

---

Parallelism with Rayon

use rayon::prelude::*;

// Sequential (i64: the squares and their sum overflow i32)
let sum: i64 = (0..1_000_000i64).map(|x| x * x).sum();

// Parallel (automatic work stealing)
let sum: i64 = (0..1_000_000i64).into_par_iter().map(|x| x * x).sum();

// Parallel with custom chunk size
let results: Vec<_> = data
    .par_chunks(1000)
    .map(|chunk| process_chunk(chunk))
    .collect();
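
For intuition about what chunked data parallelism does, here is a std-only sketch using scoped threads. It is not how rayon is implemented: it spawns one thread per chunk and has no work stealing, so it only makes sense with a handful of large chunks. The function name and chunk sizes are assumptions.

```rust
use std::thread;

/// Sums the squares of `data` by splitting it into chunks of `chunk_size`
/// and giving each chunk to its own scoped thread.
fn parallel_sum_of_squares(data: &[i64], chunk_size: usize) -> i64 {
    thread::scope(|scope| {
        let handles: Vec<_> = data
            .chunks(chunk_size.max(1)) // chunks(0) would panic
            .map(|chunk| scope.spawn(move || chunk.iter().map(|x| x * x).sum::<i64>()))
            .collect();
        // Join every worker and add up the partial sums.
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .sum::<i64>()
    })
}

fn main() {
    let data: Vec<i64> = (0..1_000_000).collect();
    // Four chunks, so four worker threads.
    let total = parallel_sum_of_squares(&data, 250_000);
    println!("{total}");
}
```

Scoped threads can borrow `data` directly, which avoids copying each chunk or wrapping the input in an `Arc`.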

---

Memory Layout

Use Appropriate Integer Sizes

// If values are small, use smaller types
struct Item {
    count: u8,      // 0-255, not u64
    flags: u8,      // small enum
    id: u32,        // if 4 billion is enough
}

Pack Structs Efficiently

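// BAD: 24 bytes under #[repr(C)], which keeps declaration order
#[repr(C)]
struct Bad {
    a: u8,   // 1 byte + 7 padding
    b: u64,  // 8 bytes
    c: u8,   // 1 byte + 7 padding
}

// GOOD: 16 bytes, largest field first
#[repr(C)]
struct Good {
    b: u64,  // 8 bytes
    a: u8,   // 1 byte
    c: u8,   // 1 byte + 6 padding
}

// Note: the default Rust layout may reorder fields on its own, so manual
// ordering matters mainly for #[repr(C)] types. #[repr(packed)] removes
// padding but leaves fields unaligned; avoid it unless a format requires it.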
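
Layout claims like these are easy to check with `std::mem::size_of`. The sketch below uses `#[repr(C)]`, which keeps fields in declaration order; with the default Rust representation the compiler is free to reorder fields, so the "bad" order may cost nothing there. The struct names are illustrative, and the sizes in the comments assume a typical 64-bit target where `u64` is 8-byte aligned.

```rust
use std::mem::size_of;

#[allow(dead_code)]
#[repr(C)] // declaration order is kept, so padding follows the source order
struct BadOrder {
    a: u8,
    b: u64,
    c: u8,
}

#[allow(dead_code)]
#[repr(C)]
struct GoodOrder {
    b: u64,
    a: u8,
    c: u8,
}

#[allow(dead_code)]
// Default representation: the compiler may reorder these fields.
struct DefaultOrder {
    a: u8,
    b: u64,
    c: u8,
}

fn main() {
    println!("repr(C), bad order:  {}", size_of::<BadOrder>()); // 24 on x86_64
    println!("repr(C), good order: {}", size_of::<GoodOrder>()); // 16 on x86_64
    println!("default layout:      {}", size_of::<DefaultOrder>());
}
```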

Box Large Values

// Large enum variants waste space
enum Message {
    Quit,
    Data([u8; 10000]),  // all variants are 10000+ bytes
}

// Better: box the large variant
enum Message {
    Quit,
    Data(Box<[u8; 10000]>),  // variants are pointer-sized
}
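
The effect of boxing can be measured the same way. In this sketch the two enums get distinct, hypothetical names so they can coexist in one file; the exact size of the boxed version is a compiler detail, so the comments only state bounds.

```rust
use std::mem::size_of;

#[allow(dead_code)]
enum InlineMessage {
    Quit,
    Data([u8; 10000]), // stored inline, so every value reserves this space
}

#[allow(dead_code)]
enum BoxedMessage {
    Quit,
    Data(Box<[u8; 10000]>), // the payload lives on the heap
}

fn main() {
    // At least 10000 bytes, even for `Quit`.
    println!("inline: {} bytes", size_of::<InlineMessage>());
    // Around one pointer in practice.
    println!("boxed:  {} bytes", size_of::<BoxedMessage>());
}
```

The trade-off is one heap allocation and one extra indirection whenever a `Data` value is created or read.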

---

Async Performance

Avoid Blocking in Async

use std::time::Duration;

// BAD: blocks the executor
async fn bad() {
    std::thread::sleep(Duration::from_secs(1));  // blocking!
    std::fs::read_to_string("file.txt").unwrap();  // blocking!
}

// GOOD: use async versions
async fn good() {
    tokio::time::sleep(Duration::from_secs(1)).await;
    tokio::fs::read_to_string("file.txt").await.unwrap();
}

// For blocking or CPU-heavy work: spawn_blocking (or a dedicated pool such as rayon)
async fn compute() -> i32 {
    tokio::task::spawn_blocking(|| {
        heavy_computation()
    }).await.unwrap()
}

Buffer Async I/O

use tokio::fs::File;
use tokio::io::{AsyncBufReadExt, AsyncReadExt, BufReader};

// BAD: many small reads, one await per byte
async fn bad(mut file: File) {
    let mut byte = [0u8];
    while file.read(&mut byte).await.unwrap() > 0 {
        process(byte[0]);
    }
}

// GOOD: buffered reading
async fn good(file: File) {
    let reader = BufReader::new(file);
    let mut lines = reader.lines();
    while let Some(line) = lines.next_line().await.unwrap() {
        process(&line);
    }
}

---

Release Build Optimization

Cargo.toml Settings

[profile.release]
lto = true           # Link-time optimization
codegen-units = 1    # Single codegen unit (slower compile, faster code)
panic = "abort"      # Smaller binary, no unwinding
strip = true         # Strip symbols (disable when profiling: flamegraphs need symbols)

[profile.release-fast]
inherits = "release"
opt-level = 3        # Maximum optimization (already the default for release)

[profile.release-small]
inherits = "release"
opt-level = "s"      # Optimize for size

Compile-Time Assertions

// Zero runtime cost
const _: () = assert!(std::mem::size_of::<MyStruct>() <= 64);

---

Checklist

Before optimizing:

[ ] Profile to find actual bottlenecks
[ ] Have benchmarks to measure improvement
[ ] Consider if optimization is worth complexity

Common wins:

[ ] Reduce allocations (Cow, reuse buffers)
[ ] Use appropriate collections
[ ] Pre-allocate with_capacity
[ ] Use iterators instead of indexing
[ ] Enable LTO for release builds
[ ] Use rayon for parallel workloads


How it compares

Choose m10-performance over general Rust coding skills when the bottleneck is measurable latency or memory rather than compiler errors or API design.

FAQ

Which profiling tools does m10-performance recommend for Rust?

m10-performance recommends cargo flamegraph for CPU profiling, cargo-instruments on macOS and heaptrack on Linux for memory, Criterion via cargo bench for micro-benchmarks, and valgrind --tool=cachegrind for cache analysis on release binaries.

When should developers use Criterion in m10-performance?

m10-performance uses Criterion when comparing two Rust implementations on the same input, defining bench_function cases inside criterion_group blocks so parse_v1 and parse_v2 throughput can be measured side by side before choosing an optimization.

Is M10 Performance safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

