Golang Benchmark

Name: Golang Benchmark
Author: samber

samber/cc-skills-golang

33.7k installs
2.8k repo stars
Updated July 27, 2026

golang-benchmark is a Go testing skill that teaches b.Loop() benchmarking, dead code elimination, and statistical analysis.

About

Go benchmarking skill covering Go 1.24+ patterns including b.Loop() usage, dead code elimination awareness, statistical significance with -count flags, and benchstat interpretation. Teaches developers to write reliable performance tests and correctly analyze benchmark results.

Go 1.24 b.Loop() pattern for accurate benchmarking
Dead code elimination awareness in benchmarks
benchstat for statistical significance and comparison

Golang Benchmark by the numbers

33,660 all-time installs (skills.sh)
+488 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #15 of 2,184 Testing & QA skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

golang-benchmark capabilities & compatibility

Capabilities: testing · performance analysis
Use cases: testing

From the docs

What golang-benchmark says it does

Tests whether the model uses b.Loop() (Go 1.24+) instead of the legacy for range b.N pattern

SKILL.md

Recommends -count=10 (or higher) for statistical significance

SKILL.md

npx skills add https://github.com/samber/cc-skills-golang --skill golang-benchmark


Installs	33.7k
repo stars	★ 2.8k
Security audit	3 / 3 scanners passed
Last updated	July 27, 2026
Repository	samber/cc-skills-golang ↗

How do you write Go 1.24 benchmarks with b.Loop?

Developers optimizing Go code need reliable benchmarking patterns and statistical methods to validate performance improvements.

Who is it for?

Performance engineers optimizing Go applications

Skip if: Teams still on Go versions before 1.24 or workflows needing production pprof profiling instead of unit-level benchmarks.

When should I use this skill?

Profiling Go code, validating performance improvements, analyzing benchmark results

What you get

Go test benchmark files using b.Loop(), fixture setup, and anti-dead-code sinks ready for go test -bench.

Go benchmark test files
bench-ready fixture setup

By the numbers

Targets Go 1.24+ b.Loop() benchmark API
Scenario uses 1MB SHA-256 ComputeHash fixture in benchmark setup

Files

SKILL.md (Markdown)

Persona: You are a Go performance measurement engineer. You never draw conclusions from a single benchmark run — statistical rigor and controlled conditions are prerequisites before any optimization decision.

Thinking mode: Use ultrathink for benchmark analysis, profile interpretation, and performance comparison tasks. Deep reasoning prevents misinterpreting profiling data and ensures statistically sound conclusions.

Dependencies:

benchstat: go install golang.org/x/perf/cmd/benchstat@latest

Go Benchmarking & Performance Measurement

Performance improvement does not exist without measurement: if you can measure it, you can improve it.

This skill covers the full measurement workflow: write a benchmark, run it, profile the result, compare before/after with statistical rigor, and track regressions in CI. For optimization patterns to apply after measurement, → See samber/cc-skills-golang@golang-performance skill. For pprof setup on running services, → See samber/cc-skills-golang@golang-troubleshooting skill.

Writing Benchmarks

`b.Loop()` (Go 1.24+) — preferred

For Go 1.24+, prefer b.Loop() for new benchmarks. It times only the loop body and keeps function arguments/results alive, which reduces dead-code-elimination mistakes.

func BenchmarkParse(b *testing.B) {
    data := loadFixture("large.json") // setup — excluded from timing
    for b.Loop() {
        Parse(data)  // compiler cannot eliminate this call
    }
}

Legacy b.N loops still compile and are fine to keep when preserving existing benchmarks or supporting Go <1.24. They are easier to get wrong: setup may need b.ResetTimer(), and results may need a sink if the compiler can eliminate the work. Go 1.26 fixed an earlier b.Loop() inlining limitation — benchmarks on 1.24–1.25 already benefit from b.Loop() but may miss inlining optimizations that 1.26 delivers.
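As a point of comparison, here is a minimal sketch of the legacy pattern with both safeguards in place: b.ResetTimer() after setup and a package-level sink for the result. The benchmark name, the 1 MiB fixture, and the use of sha256.Sum256 as the workload are assumptions for illustration. It runs through testing.Benchmark so it works as a standalone program; in a real *_test.go file the function would be named BenchmarkLegacySum and run with go test -bench.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// sink is package-level so the compiler cannot prove the result is unused
// and drop the hashing work as dead code.
var sink [sha256.Size]byte

// benchmarkLegacySum uses the pre-1.24 b.N loop (hypothetical benchmark).
func benchmarkLegacySum(b *testing.B) {
	data := make([]byte, 1<<20) // 1 MiB fixture (assumed size)
	b.ResetTimer()              // discard the time spent on the setup above
	for i := 0; i < b.N; i++ {
		sink = sha256.Sum256(data)
	}
}

// runLegacy executes the benchmark outside "go test" via testing.Benchmark.
func runLegacy() testing.BenchmarkResult {
	return testing.Benchmark(benchmarkLegacySum)
}

func main() {
	fmt.Println(runLegacy().String())
}
```

With b.Loop() both safeguards become unnecessary for this case, which is why the skill prefers it for new benchmarks.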

Memory tracking

func BenchmarkAlloc(b *testing.B) {
    b.ReportAllocs() // or run with -benchmem flag
    var sink []byte
    for b.Loop() {
        sink = make([]byte, 1024)
    }
    _ = sink
}

b.ReportMetric() adds custom metrics (e.g., throughput):

b.ReportMetric(float64(totalBytes)/b.Elapsed().Seconds(), "bytes/s") // call after the loop; b.Elapsed() returns the measured benchmark time
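A fuller sketch of the custom-metric call in context, assuming Go 1.24+ for b.Loop(). The benchmark name, fixture size, and hashing workload are hypothetical; the metric is reported once, after the loop has finished, and read back from the Extra map of the testing.Benchmark result.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// sink keeps the hash result alive across iterations.
var sink [sha256.Size]byte

// benchmarkThroughput reports a custom bytes/s metric (hypothetical benchmark).
func benchmarkThroughput(b *testing.B) {
	data := make([]byte, 1<<20) // 1 MiB fixture (assumed size)
	var totalBytes int64
	for b.Loop() {
		sink = sha256.Sum256(data)
		totalBytes += int64(len(data))
	}
	// The loop has ended, so b.Elapsed() is the total measured time.
	b.ReportMetric(float64(totalBytes)/b.Elapsed().Seconds(), "bytes/s")
}

// runThroughput runs the benchmark and returns the reported metric.
func runThroughput() float64 {
	res := testing.Benchmark(benchmarkThroughput)
	return res.Extra["bytes/s"]
}

func main() {
	fmt.Printf("%.0f bytes/s\n", runThroughput())
}
```

For plain byte throughput, b.SetBytes is the built-in alternative; ReportMetric is more useful for units the testing package does not know about.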

Sub-benchmarks and table-driven

func BenchmarkEncode(b *testing.B) {
    for _, size := range []int{64, 256, 4096} {
        b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
            data := make([]byte, size)
            for b.Loop() {
                Encode(data)
            }
        })
    }
}

Running Benchmarks

go test -bench=BenchmarkEncode -benchmem -count=10 ./pkg/... | tee bench.txt

Flag	Purpose
`-bench=.`	Run all benchmarks (regexp filter)
`-benchmem`	Report allocations (B/op, allocs/op)
`-count=10`	Run 10 times for statistical significance
`-benchtime=3s`	Minimum time per benchmark (default 1s)
`-cpu=1,2,4`	Run with different GOMAXPROCS values
`-cpuprofile=cpu.prof`	Write CPU profile
`-memprofile=mem.prof`	Write memory profile
`-trace=trace.out`	Write execution trace

Output format: BenchmarkEncode/size=64-8 5000000 230.5 ns/op 128 B/op 2 allocs/op — the -8 suffix is GOMAXPROCS, ns/op is time per operation, B/op is bytes allocated per op, allocs/op is heap allocation count per op.
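To make the field layout concrete, the following sketch splits one result line into its parts. The benchLine type and parseBenchLine function are hypothetical helpers, not part of any Go tool, and the suffix handling is a simplification: any trailing "-digits" is treated as the GOMAXPROCS value.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// benchLine holds the fields of one "go test -bench" result line.
type benchLine struct {
	Name       string             // benchmark name without the GOMAXPROCS suffix
	Procs      int                // GOMAXPROCS; 1 when the line has no -N suffix
	Iterations int64              // number of iterations measured
	Metrics    map[string]float64 // value per unit, e.g. "ns/op" -> 230.5
}

// parseBenchLine parses "name iterations (value unit)+".
func parseBenchLine(line string) (benchLine, error) {
	fields := strings.Fields(line)
	// Expect a name, an iteration count, then one or more value/unit pairs.
	if len(fields) < 4 || len(fields)%2 != 0 {
		return benchLine{}, fmt.Errorf("unexpected field count %d", len(fields))
	}
	res := benchLine{Name: fields[0], Procs: 1, Metrics: map[string]float64{}}
	// Simplification: a trailing "-<digits>" is taken as the GOMAXPROCS suffix.
	if i := strings.LastIndex(res.Name, "-"); i >= 0 {
		if n, err := strconv.Atoi(res.Name[i+1:]); err == nil {
			res.Name, res.Procs = res.Name[:i], n
		}
	}
	iters, err := strconv.ParseInt(fields[1], 10, 64)
	if err != nil {
		return benchLine{}, fmt.Errorf("iterations: %w", err)
	}
	res.Iterations = iters
	for i := 2; i < len(fields); i += 2 {
		v, err := strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return benchLine{}, fmt.Errorf("metric %q: %w", fields[i+1], err)
		}
		res.Metrics[fields[i+1]] = v
	}
	return res, nil
}

func main() {
	l, err := parseBenchLine("BenchmarkEncode/size=64-8 5000000 230.5 ns/op 128 B/op 2 allocs/op")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s procs=%d n=%d metrics=%v\n", l.Name, l.Procs, l.Iterations, l.Metrics)
}
```

For real analysis, prefer benchstat over hand-rolled parsing; the sketch only shows which token means what.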

Documenting Results in Commits

Paste benchstat output in the commit body when the change has a measurable performance impact. This documents _why_ an optimization was made, prevents future readers from reverting it, and lets reviewers verify the claim without re-running benchmarks.

Commit format:

perf(parser): reduce Parse allocations 50% with sync.Pool

Replace per-call []byte allocation with a pooled buffer.

goos: linux / goarch: amd64 / cpu: AMD Ryzen 9 5950X
          │    old     │              new               │
          │  sec/op    │  sec/op     vs base            │
Parse-32    4.592µ ± 2%  3.041µ ± 1%  -33.78% (p=0.000 n=10)

          │   old    │             new              │
          │   B/op   │   B/op     vs base           │
Parse-32   1.024Ki ± 0%  0.512Ki ± 0%  -50.00% (p=0.000 n=10)

          │    old    │            new             │
          │ allocs/op │ allocs/op  vs base         │
Parse-32   12.00 ± 0%   6.000 ± 0%  -50.00% (p=0.000 n=10)

Rules:

Only include benchmarks directly affected by the change — strip unrelated rows
Never paste results with ~ (no statistical significance) — the improvement cannot be claimed
Include the hardware context line (goos/goarch/cpu) so results are reproducible
Use perf(scope): commit type for performance-only changes

Profiling from Benchmarks

Generate profiles directly from benchmark runs — no HTTP server needed:

# CPU profile
go test -bench=BenchmarkParse -cpuprofile=cpu.prof ./pkg/parser
go tool pprof cpu.prof

# Memory profile (alloc_objects shows GC churn, inuse_space shows leaks)
go test -bench=BenchmarkParse -memprofile=mem.prof ./pkg/parser
go tool pprof -alloc_objects mem.prof

# Execution trace
go test -bench=BenchmarkParse -trace=trace.out ./pkg/parser
go tool trace trace.out

For full pprof CLI reference (all commands, non-interactive mode, profile interpretation), see pprof Reference. For execution trace interpretation, see Trace Reference. For statistical comparison, see benchstat Reference.

Reference Files

[pprof Reference](./references/pprof.md) — Interactive and non-interactive analysis of CPU, memory, and goroutine profiles. Full CLI commands, profile types (CPU vs alloc_objects vs inuse_space), web UI navigation, and interpretation patterns. Use this to dive deep into _where_ time and memory are being spent in your code.

[benchstat Reference](./references/benchstat.md) — Statistical comparison of benchmark runs with rigorous confidence intervals and p-value tests. Covers output reading, filtering old benchmarks, interleaving results for visual clarity, and regression detection. Use this when you need to prove a change made a meaningful performance difference, not just a lucky run.

[Trace Reference](./references/trace.md) — Execution tracer for understanding _when_ and _why_ code runs. Visualizes goroutine scheduling, garbage collection phases, network blocking, and custom span annotations. Use this when pprof (which shows _where_ CPU goes) isn't enough — you need to see the timeline of what happened.

[Diagnostic Tools](./references/tools.md) — Quick reference for ancillary tools: fieldalignment (struct padding waste), GODEBUG (runtime logging flags), fgprof (wall-clock profiles covering both on-CPU and off-CPU time), race detector (concurrency bugs), and others. Use this when you have a specific symptom and need a focused diagnostic — don't reach for pprof if a simpler tool already answers your question.

[Compiler Analysis](./references/compiler-analysis.md) — Low-level compiler optimization insights: escape analysis (when values move to the heap), inlining decisions (which function calls are eliminated), SSA dump (intermediate representation), and assembly output. Use this when benchmarks show allocations you didn't expect, or when you want to verify the compiler did what you intended.

[CI Regression Detection](./references/ci-regression.md) — Automated performance regression gating in CI pipelines. Covers three tools (benchdiff for quick PR comparisons, cob for strict threshold-based gating, gobenchdata for long-term trend dashboards), noisy neighbor mitigation strategies (why cloud CI benchmarks vary 5-10% even on quiet machines), and self-hosted runner tuning to make benchmarks reproducible. Use this when you want to ensure pull requests don't silently slow down your codebase — detecting regressions early prevents shipping performance debt.

[Investigation Session](./references/investigation-session.md) — Production performance troubleshooting workflow combining Prometheus runtime metrics (heap size, GC frequency, goroutine counts), PromQL queries to correlate metrics with code changes, runtime configuration flags (GODEBUG env vars to enable GC logging), and cost warnings (when you're hitting performance tax). Use this when benchmarks look good but production traffic behaves differently.

[Prometheus Go Metrics Reference](./references/prometheus-go-metrics.md) — Complete listing of Go runtime metrics actually exposed as Prometheus metrics by prometheus/client_golang. Covers 30 default metrics, 40+ optional metrics (Go 1.17+), process metrics, and common PromQL queries. Distinguishes between runtime/metrics (Go internal data) and Prometheus metrics (what you scrape from /metrics). Use this when setting up monitoring dashboards or writing PromQL queries for production alerts.

Cross-References

→ See samber/cc-skills-golang@golang-performance skill for optimization patterns to apply after measuring ("if X bottleneck, apply Y")
→ See samber/cc-skills-golang@golang-troubleshooting skill for pprof setup on running services (enable, secure, capture), Delve debugger, GODEBUG flags, root cause methodology
→ See samber/cc-skills-golang@golang-observability skill for everyday always-on monitoring, continuous profiling (Pyroscope), distributed tracing (OpenTelemetry)
→ See samber/cc-skills-golang@golang-testing skill for general testing practices
→ See samber/cc-skills@promql-cli skill for querying Prometheus runtime metrics in production to validate benchmark findings

benchstat Reference

benchstat computes statistical summaries and A/B comparisons of Go benchmark results. A single benchmark run tells you nothing about variance — benchstat tells you whether the difference between two runs is real or noise.

Installation

go install golang.org/x/perf/cmd/benchstat@latest

Usage

benchstat [flags] inputs...

Each input is a file containing go test -bench output. Optionally label inputs with label=path syntax.

Basic Workflow

Step 0: Write benchmarks

Use the standard Go benchmark function signature, func BenchmarkXxx(b *testing.B), in a *_test.go file.

Step 1: Measure baseline

Run benchmarks with -count=10 or more. Each run produces one data point — you need at least 10 to compute a meaningful confidence interval:

go test -run='^$' -bench=BenchmarkParse -benchmem -count=10 ./pkg/parser | tee old.txt

-run='^$' skips unit tests so only benchmarks run — avoids wasting time on tests during measurement sessions.

Step 2: Make your change

Edit the code you want to optimize.

Step 3: Measure again

Same command, same flags, same machine, same load conditions:

go test -run='^$' -bench=BenchmarkParse -benchmem -count=10 ./pkg/parser | tee new.txt

Step 4: Compare

benchstat old.txt new.txt

Output:

goos: linux
goarch: amd64
pkg: myapp/pkg/parser
cpu: AMD Ryzen 9 5950X 16-Core Processor
          │   old.txt   │              new.txt               │
          │   sec/op    │   sec/op     vs base               │
Parse-32    4.592µ ± 2%   3.041µ ± 1%  -33.78% (p=0.000 n=10)

          │  old.txt   │             new.txt              │
          │    B/op    │    B/op     vs base              │
Parse-32    1.024Ki ± 0%   0.512Ki ± 0%  -50.00% (p=0.000 n=10)

          │  old.txt  │            new.txt             │
          │ allocs/op │ allocs/op   vs base            │
Parse-32    12.00 ± 0%   6.000 ± 0%  -50.00% (p=0.000 n=10)

Reading the Output

Element	Meaning	What to look for
median (e.g., `4.592µ`)	Central value across runs — more robust than mean because outliers don't skew it	The reference number for this benchmark
± N% (e.g., `± 2%`)	Half-width of the 95% confidence interval as a percentage of the median	Low (≤2%) = stable measurement. High (>5%) = noisy — investigate noise sources before trusting results
vs base (e.g., `-33.78%`)	Percentage change from the first input (base) to subsequent inputs	Negative = faster/smaller. Positive = slower/larger
p=N (e.g., `p=0.000`)	p-value from Mann-Whitney U-test (non-parametric)	<0.05 = statistically significant. ≥0.05 = difference could be noise
n=N (e.g., `n=10`)	Number of samples used in the comparison	Should usually match your `-count`; if it does not, check that each input file contains the same benchmark rows and units
`~`	No statistically significant difference detected	Do NOT claim improvement — the change might be zero
geomean row	Geometric mean of changes across all benchmarks in the table	Overall proportional change; useful when comparing many benchmarks at once
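The arithmetic behind the median and "vs base" columns can be sketched as follows. This is an illustration of the two definitions in the table, not benchstat's implementation; the function names are hypothetical and the input medians are taken from the sample output above.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the middle value of samples, or the mean of the two
// middle values when the sample count is even.
func median(samples []float64) float64 {
	s := append([]float64(nil), samples...) // copy so the caller's slice stays unsorted
	sort.Float64s(s)
	n := len(s)
	if n == 0 {
		return 0
	}
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// percentChange mirrors the "vs base" column: negative means the new
// value is smaller than the base value.
func percentChange(base, next float64) float64 {
	return (next - base) / base * 100
}

func main() {
	// Medians from the sample output: 4.592µs (old) and 3.041µs (new).
	fmt.Printf("%+.2f%%\n", percentChange(4.592, 3.041)) // prints "-33.78%"
}
```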

Unit normalization

benchstat automatically normalizes units for display:

ns/op → displayed as sec/op (with µ, m prefixes) to avoid nonsensical µns/op
MB/s → displayed as B/s (with K, M, G prefixes)

When the `~` symbol appears

Parse-32    4.592µ ± 8%   4.481µ ± 7%  ~ (p=0.089 n=10)

This means benchstat cannot distinguish the difference from random noise. The wide confidence intervals (±8%, ±7%) overlap. Do not claim improvement. Options:

Increase -count to 20+ (narrower CI may reveal a real difference)
Reduce noise sources (close applications, plug in power, use dedicated machine)
Accept that the change has no measurable effect on this benchmark

Flags Reference

Projection flags

These flags control how benchmark results are grouped into tables, rows, and columns.

Flag	Default	Purpose
`-table KEYS`	`.config`	Group results into separate tables by these keys
`-row KEYS`	`.fullname`	Group results into table rows by these keys
`-col KEYS`	`.file`	Compare across columns with different values of these keys
`-ignore KEYS`	(none)	Omit keys from grouping — suppresses "benchmarks vary" warnings

Available keys:

Key	Meaning	Example value
`.name`	Base benchmark name (without sub-benchmark config)	`Parse` from `BenchmarkParse/size=4k-16`
`.fullname`	Full name including sub-benchmark configuration	`Parse/size=4k-16`
`.file`	Input file name or custom label	`old.txt` or `baseline`
`.config`	All file-level configuration keys combined	`goos/goarch/pkg/cpu`
`.unit`	Metric unit name	`sec/op`, `B/op`, `allocs/op`
`/{name-key}`	Per-benchmark sub-name key	`/size` extracts `4k` from `Parse/size=4k`
`/gomaxprocs`	GOMAXPROCS value — recognizes both `/gomaxprocs=N` and the `-N` suffix convention	`16` from `Parse-16`
`goos`	Operating system (from benchmark output header)	`linux`, `darwin`
`goarch`	Architecture (from benchmark output header)	`amd64`, `arm64`
`pkg`	Package path (from benchmark output header)	`myapp/pkg/parser`
`cpu`	CPU model (from benchmark output header)	`AMD Ryzen 9 5950X`
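The name-derived keys can be pictured with a small sketch that splits a full benchmark name the way the table describes. The nameKeys type and splitName function are hypothetical and simplified; benchstat's own parser handles more cases (for example an explicit /gomaxprocs=N part).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nameKeys holds the keys derived from a full benchmark name.
type nameKeys struct {
	Base       string            // ".name": base benchmark name
	Sub        map[string]string // "/key" -> value for each /key=value part
	GoMaxProcs string            // "/gomaxprocs" from the -N suffix, "" if absent
}

// splitName breaks a name such as "Parse/size=4k-16" into its keys.
func splitName(fullname string) nameKeys {
	out := nameKeys{Sub: map[string]string{}}
	// A trailing "-<digits>" is treated as the GOMAXPROCS suffix.
	if i := strings.LastIndex(fullname, "-"); i >= 0 {
		if _, err := strconv.Atoi(fullname[i+1:]); err == nil {
			out.GoMaxProcs, fullname = fullname[i+1:], fullname[:i]
		}
	}
	parts := strings.Split(fullname, "/")
	out.Base = parts[0]
	for _, p := range parts[1:] {
		// Only key=value parts become sub-name keys in this sketch.
		if k, v, ok := strings.Cut(p, "="); ok {
			out.Sub["/"+k] = v
		}
	}
	return out
}

func main() {
	k := splitName("Parse/size=4k-16")
	fmt.Println(k.Base, k.Sub["/size"], k.GoMaxProcs)
}
```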

Sort order modifiers — append to any key:

Modifier	Meaning	Example
`@alpha`	Alphabetic sort	`/format@alpha`
`@num`	Numeric sort (understands prefixes: 2k, 1Mi)	`/size@num`
`@(val1 val2 ...)`	Fixed order + filter (only listed values, in this order)	`/format@(gob json)`

Filter flag

Flag	Purpose
`-filter EXPR`	Filter which benchmarks are processed before grouping and comparison

See Filter Expression Syntax below for full details.

Input labeling

Not a flag but a syntax feature — label input files for clearer column headers:

# Default: file names become column headers
benchstat old.txt new.txt

# Custom labels
benchstat baseline=old.txt optimized=new.txt

# Multiple versions
benchstat v1=v1.txt v2=v2.txt v3=v3.txt

The first input is always the base for comparison. All subsequent inputs are compared against it.

Filter Expression Syntax

Filters select which benchmarks to include before grouping and comparison. An expression combines key:value matchers with the logical operators listed below.

Matching operators

Pattern	Meaning	Example
`key:value`	Exact match	`goos:linux`
`key:"value"`	Exact match with quoted value (allows spaces, special chars)	`pkg:"github.com/user/repo"`
`key:/regexp/`	Regular expression match (Go regexp syntax)	`.name:/Parse\
`key:(val1 OR val2)`	Match any of the listed values	`goos:(linux OR darwin)`
`*`	Match everything (all benchmarks)	`*`

Logical operators

Operator	Meaning	Example
`x y`	AND — both must match (implicit)	`goos:linux goarch:amd64`
`x AND y`	AND — explicit form	`goos:linux AND goarch:amd64`
`x OR y`	OR — either must match	`goos:linux OR goos:darwin`
`-x`	NOT — must not match	`-goos:windows`
`(...)`	Grouping / subexpression	`(goos:linux OR goos:darwin) -pkg:/internal/`

Filter key types

Key	What it matches	Example
`.name`	Base benchmark name	`.name:Parse`
`.fullname`	Full name with sub-benchmark config	`.fullname:/Parse\/size=4k/`
`/{name-key}`	Sub-benchmark parameter	`/size:4k`
`/gomaxprocs`	GOMAXPROCS value	`/gomaxprocs:16`
`.file`	Input file label	`.file:old.txt`
`.unit`	Metric unit	`.unit:sec/op`
`goos`	OS from header	`goos:linux`
`goarch`	Architecture from header	`goarch:amd64`
`pkg`	Package from header	`pkg:/parser/`

Filter examples

# Only Parse benchmarks
benchstat -filter '.name:Parse' old.txt new.txt

# Only benchmarks with size=4096 sub-parameter
benchstat -filter '/size:4096' old.txt new.txt

# Exclude Parallel benchmarks
benchstat -filter '-.name:/Parallel/' old.txt new.txt

# Linux amd64 only
benchstat -filter 'goos:linux goarch:amd64' old.txt new.txt

# Multiple benchmark names
benchstat -filter '.name:(Parse OR Encode OR Decode)' old.txt new.txt

# Complex: Linux or Darwin, not internal packages, only sec/op metric
benchstat -filter '(goos:linux OR goos:darwin) -pkg:/internal/ .unit:sec/op' old.txt new.txt

# Regex: names starting with Bench (.name omits the leading "Benchmark" prefix, as in `Parse` from `BenchmarkParse`)
benchstat -filter '.name:/^Bench/' old.txt new.txt

Projection Examples

Default: before/after file comparison

benchstat old.txt new.txt
# Equivalent to:
benchstat -table .config -row .fullname -col .file old.txt new.txt

Creates one row per benchmark, one column per file.

Compare sub-benchmark parameters within a single file

When a single benchmark file contains multiple sub-benchmarks (e.g., BenchmarkEncode/format=json and BenchmarkEncode/format=gob):

benchstat -col /format bench.txt

Creates columns for each value of /format, comparing them against each other.

Simplify rows to base name only

benchstat -col /format -row .name bench.txt

Strips sub-benchmark configuration from row names, making the table more compact.

Control column order

# Force gob first, then json (instead of alphabetical)
benchstat -col '/format@(gob json)' bench.txt

Group by GOMAXPROCS

benchstat -col /gomaxprocs bench.txt

Compares performance across different GOMAXPROCS values within the same file.

Separate tables per package

benchstat -table pkg old.txt new.txt

Creates one table per package — useful when comparing benchmarks across multiple packages.

Ignore a dimension

# Suppress "benchmarks vary in /gomaxprocs" warning
benchstat -row .name -ignore /gomaxprocs bench.txt

Compare three versions

benchstat v1=v1.txt v2=v2.txt v3=v3.txt

Shows v2 vs v1 and v3 vs v1 (first input is always the base).

Cross-dimensional comparison

# Rows = benchmark name, columns = OS, separate tables per architecture
benchstat -row .name -col goos -table goarch results.txt

Unit Metadata

`assume=exact`

For metrics that should not vary between runs (e.g., binary size, generated code size):

BenchmarkSize 1 42 custom-bytes/op
Unit custom-bytes/op assume=exact

With assume=exact:

Non-parametric statistics are disabled
benchstat warns if measured values vary
Shows comparisons even with a single before/after measurement (no -count needed)

`assume=nothing` (default)

Standard behavior — uses non-parametric statistics (median + Mann-Whitney U-test). Requires multiple samples.

Interleaving Runs

Sequential runs (all old, then all new) are vulnerable to systematic bias — thermal throttling builds up over time, background processes come and go, CPU frequency scaling adapts. Interleaving reduces this:

# Pre-compile both versions to avoid measuring compilation time
go test -c -o old.test ./pkg/parser
# ... make your change ...
go test -c -o new.test ./pkg/parser

# Interleave runs — alternating reduces systematic bias
for i in $(seq 1 10); do
    ./old.test -test.run='^$' -test.bench=BenchmarkParse -test.benchmem >> old.txt
    ./new.test -test.run='^$' -test.bench=BenchmarkParse -test.benchmem >> new.txt
done

benchstat old.txt new.txt

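Pre-compiling with go test -c is critical: it lets both versions exist side by side, which alternating requires, and it keeps build work out of the measurement session. Without it, each go test -bench invocation spends a variable amount of time building the test binary right before measuring, and that extra load can skew the run that follows.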

How Many Runs?

Scenario	Minimum `-count`	Why
Quick local check	6	Enough for a rough confidence interval; fast feedback loop
Pre-merge comparison	10	Standard for detecting moderate (>5%) changes with confidence
Detecting small changes (<5%)	20-30	More samples narrow the CI; needed when signal is small relative to noise
Noisy CI environment	20+	Shared CI runners have higher variance; more runs compensate

Never "retry until significant" — rerunning benchmarks until ~ goes away introduces selection bias (p-hacking). If 10 runs show ~, the change is probably not meaningful. Increase run count once and accept the result.

At α=0.05, expect ~5% of benchmarks to randomly report significance with no real change (false positives). This is normal — don't chase them.
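A short worked example of that false-positive rate, assuming the benchmarks are independent and that none of them actually changed. The suite size of 40 is an arbitrary illustration.

```go
package main

import (
	"fmt"
	"math"
)

// expectedFalsePositives is the mean number of benchmarks flagged by chance.
func expectedFalsePositives(benchmarks int, alpha float64) float64 {
	return float64(benchmarks) * alpha
}

// probAtLeastOne is the chance that at least one benchmark is flagged by
// chance, assuming independent benchmarks with no real change.
func probAtLeastOne(benchmarks int, alpha float64) float64 {
	return 1 - math.Pow(1-alpha, float64(benchmarks))
}

func main() {
	n, alpha := 40, 0.05 // hypothetical suite of 40 benchmarks
	fmt.Printf("expected false positives: %.1f\n", expectedFalsePositives(n, alpha))
	fmt.Printf("P(at least one): %.0f%%\n", 100*probAtLeastOne(n, alpha))
}
```

With 40 unchanged benchmarks, about two are expected to show a "significant" delta, and seeing at least one is far more likely than not.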

Single-File Summary

Analyze variance of a single run without comparison:

benchstat bench.txt

Shows median and confidence interval for each benchmark. Use to:

Check measurement stability before making code changes
Identify noisy benchmarks that need more runs or better isolation
Get a quick summary of current performance

Common Pitfalls

Pitfall	Why it's wrong	Fix
`-count=1`	Single run has no variance information; benchstat can't compute confidence	Always use `-count=6` minimum, prefer `-count=10`
Running on a laptop on battery	CPU throttles to save power; variance explodes	Plug in, disable power saving, or use a desktop/server
Running with browser/IDE open	Background processes steal CPU cycles; adds noise	Close unnecessary applications, or accept wider CIs
Rerunning until `~` disappears	Selection bias (p-hacking) — you're cherry-picking runs that showed improvement	Run once with high `-count`, accept the result
Comparing across machines	Different CPUs, memory, OS = incomparable baselines	Same machine, same conditions, both runs
Not interleaving	Systematic bias from thermal throttling, background load drift	Pre-compile both versions with `go test -c`, alternate runs
Measuring compilation time	`go test -bench` compiles first; startup overhead varies	Pre-compile with `go test -c`, run the binary directly
Ignoring wide CI (± >5%)	Results look significant but variance is too high to be trustworthy	Fix the noise first, then compare; or increase `-count`
Comparing different `-count` values	Unequal sample sizes bias the comparison	Use the same `-count` for all inputs

benchstat in CI

See CI Regression Detection for integrating benchstat comparisons into CI pipelines with benchdiff, cob, and gobenchdata.

CI Benchmark Regression Detection

Run these tools in CI only, not on local machines. Local benchmark results are noisy due to background processes, thermal throttling, and inconsistent CPU frequency — regressions detected locally are unreliable and waste developer time. Even shared CI runners can produce significant variance (5-10%); use statistical methods like benchstat with multiple iterations and relative comparisons to filter noise, or invest in dedicated benchmark runners for critical paths.

benchdiff

Runs Go benchmarks on two git refs and uses benchstat to display deltas. Caches results for non-worktree refs so re-runs are fast. Prevents macOS sleep during benchmarks.

go install filippo.io/mostly-harmless/benchdiff@latest

# Compare current worktree against HEAD (default)
benchdiff -- -benchmem

# Compare two specific refs
benchdiff -base-ref main -head-ref feature-branch

# Compare against a specific commit or tag
benchdiff -base-ref v1.2.0

# Pass extra flags to go test — everything after -- goes to go test
benchdiff -- -benchmem -count=10 -benchtime=3s

# Filter to specific benchmarks
benchdiff -- -benchmem -count=10 -bench=BenchmarkParse

# Target a specific package
benchdiff -- -benchmem -count=10 ./pkg/parser/...

# Clear cached results (useful after rebasing or when cache is stale)
benchdiff -clear-cache

# Combine: compare main with 10 iterations, filtered to critical benchmarks
benchdiff -base-ref main -- -benchmem -count=10 -bench='BenchmarkParse|BenchmarkEncode'

Best for: quick PR-to-base comparisons in git-based workflows. Leverages benchstat for statistical rigor and caches non-worktree refs so re-runs only re-measure the worktree.

cob

Compares benchmarks between HEAD and HEAD~1, failing the CI job if performance degrades beyond a configurable threshold (default 20%).

go install github.com/knqyf263/cob@latest

# Run with default 20% threshold — compares HEAD vs HEAD~1
cob

# Stricter threshold for critical paths (10% regression = failure); the threshold is a ratio, so the 20% default is 0.2
cob -threshold 0.1

# Compare against a specific base commit
cob -base main

# Only report regressions (ignore improvements)
cob -only-degression

# Choose which metrics to compare (default: ns/op,B/op)
cob -compare "ns/op,B/op,allocs/op"

# Custom go test arguments
cob -bench-args "test -run '^$' -bench BenchmarkParse -benchmem ./pkg/parser/..."

# Increase benchmark duration for more stable results
cob -bench-args "test -run '^$' -bench . -benchmem -benchtime=3s ./..."

# Skip cob for a specific commit: include [skip cob] in commit message

Caution: cob uses git reset internally, which can cause data loss if uncommitted changes exist. Always commit your work before running. Additionally, cob requires all benchmarks to pass; it skips CI gating if any benchmark fails. For safety, run only in CI pipelines, not locally. Note that cob compares single runs without benchstat-style statistics, making it more susceptible to noise than benchdiff.

Best for: simple post-commit regression gating in CI where statistical rigor is less critical than fast feedback.

gobenchdata

GitHub Action + CLI that collects benchmark results, publishes to gh-pages as JSON, and visualizes with an interactive web dashboard. Shows performance trends over time.

go install go.bobheadxi.dev/gobenchdata@latest

CLI commands

# Parse go test -bench output to JSON
go test -bench=. -benchmem -count=5 ./... | gobenchdata --json bench.json

# Parse from a file
gobenchdata --json bench.json < bench.txt

# Add a tag to the benchmark run (e.g., git commit)
gobenchdata --json bench.json --tag "$(git rev-parse --short HEAD)" < bench.txt

# Evaluate regression checks against a checks config
gobenchdata checks eval bench.txt --checks-config .gobenchdata-checks.yml

# Generate the web dashboard app (static Vue.js site)
gobenchdata web generate ./dashboard-app

# Serve the dashboard locally for preview
gobenchdata web serve ./dashboard-app

# Merge multiple benchmark JSON files
gobenchdata merge old-bench.json new-bench.json > combined.json

# Prune old entries (keep last 30 runs)
gobenchdata prune --count 30 bench.json

GitHub Action setup

# .github/workflows/benchmark.yml
name: Benchmark
on: [push]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: stable
      - name: Run benchmarks
        run: go test -bench=. -benchmem -count=5 ./... | tee bench.txt
      - uses: bobheadxi/gobenchdata@v1
        with:
          PRUNE_COUNT: 30
          GO_TEST_PKGS: ./...
          BENCHMARKS_OUT: bench.txt
          PUBLISH: true
          PUBLISH_BRANCH: gh-pages
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Regression gating on PRs

- name: Check for regressions
  run: gobenchdata checks eval bench.txt --checks-config .gobenchdata-checks.yml

# .gobenchdata-checks.yml
checks:
  - name: "No major regressions"
    package: ./...
    benchmarks: [".*"]
    thresholds:
      - metric: NsPerOp
        max: 1.2 # fail if >20% slower
      - metric: AllocedBytesPerOp
        max: 1.3 # fail if >30% more bytes allocated
  - name: "Critical path stability"
    package: ./pkg/parser
    benchmarks: ["BenchmarkParse.*"]
    thresholds:
      - metric: NsPerOp
        max: 1.1 # stricter: fail if >10% slower

Dashboard configuration

# gobenchdata-web.yml — configure the Vue.js dashboard
title: "My Project Benchmarks"
description: "Performance tracking dashboard"
chartGroups:
  - name: Parser
    charts:
      - name: Parse Performance
        package: myapp/pkg/parser
        benchmarks: ["BenchmarkParse.*"]
        metrics: [NsPerOp, AllocedBytesPerOp, AllocsPerOp]
  - name: Encoding
    charts:
      - name: Encode/Decode
        package: myapp/pkg/encoding
        benchmarks: ["Benchmark(Encode|Decode).*"]
        metrics: [NsPerOp, MBPerS]

Best for: long-term trend tracking and visualization; complements benchdiff/cob for immediate gating.

Tool Selection Guide

Tool	Statistical rigor	Dashboard	Best for
benchdiff	High (uses benchstat)	No	Local dev + CI PR comparisons
cob	Low (single comparison)	No	Quick CI gate, simple setup
gobenchdata	Medium (configurable checks)	Yes (Vue.js on gh-pages)	Long-term trend tracking
benchstat (raw)	High	No (CSV export)	Maximum control, custom workflows

Noisy Neighbor Mitigation

Cloud CI environments share hardware with other jobs. Expect 5-10% variance even on quiet machines.

Why CI benchmarks are noisy

Shared CPU/memory — other CI jobs compete for resources
Thermal throttling — sustained load reduces clock speed
Different hardware across runs — CI runners may have different specs
Kernel scheduling — context switches add unpredictable latency
Disk I/O contention — shared storage affects I/O-bound benchmarks

Strategies

Statistical rigor — run with -count=10 or more and compare with benchstat. A single run is meaningless. benchstat's p-value test filters out noise-induced false positives.

Relative comparison in same job — run both base and head benchmarks in the same CI job on the same machine, rather than comparing against historical absolute values. This cancels out machine-to-machine variation. Tools like benchdiff do this automatically by checking out both git refs.

Dedicated benchmark runners — for critical path benchmarks, use self-hosted CI runners with no other workloads. This eliminates noisy neighbors entirely but costs more infrastructure.

Conservative thresholds — set regression thresholds higher on shared CI (20%+) than on dedicated runners (10%). Tight thresholds on noisy environments produce false positives that erode trust. GitHub-hosted runners show ~2-3% coefficient of variation in the best case; to guarantee <1% false positive rate, you need a 7%+ performance gate.

Never "retry until pass" — rerunning benchmarks until they pass introduces selection bias. If a benchmark is flaky, fix the noise source (more iterations, dedicated runner, wider threshold) rather than retrying.

System Tuning for Self-Hosted Runners

WARNING: These commands modify kernel and CPU settings. Apply them ONLY on dedicated CI runners, NEVER on developer machines or shared servers.

When you control the CI hardware, these settings dramatically reduce benchmark variance by eliminating the main sources of non-determinism.

Disable CPU frequency scaling

Variable CPU frequency makes benchmark times meaningless — the same code runs at different speeds depending on load and thermals:

# Set all CPUs to "performance" governor (fixed maximum frequency)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Disable Turbo Boost

Turbo Boost temporarily increases clock speed but throttles under sustained load, creating variance between the start and end of a benchmark run:

# Intel
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# AMD
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

Pin benchmarks to specific CPU cores

Prevents the OS from migrating the benchmark process across cores, which causes cache thrashing (L1/L2 caches are per-core):

# Pin to cores 2 and 3 (leave cores 0-1 for OS and other processes)
taskset -c 2,3 go test -bench=. -count=10 ./...

Disable SMT (Hyper-Threading)

SMT shares execution units between logical cores on the same physical core, causing unpredictable contention:

# Disable SMT system-wide
echo off | sudo tee /sys/devices/system/cpu/smt/control

# Or disable individual sibling cores (check /sys/devices/system/cpu/cpu*/topology/thread_siblings_list)
echo 0 | sudo tee /sys/devices/system/cpu/cpu1/online  # if cpu0 and cpu1 are siblings

Combined CI setup script

#!/bin/bash
# benchmark-setup.sh — run on self-hosted CI runner before benchmarks
set -euo pipefail

echo "=== Configuring CPU for stable benchmarks ==="
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null || true
echo off | sudo tee /sys/devices/system/cpu/smt/control 2>/dev/null || true

echo "=== Running benchmarks on isolated cores ==="
taskset -c 2,3 go test -bench=. -benchmem -count=10 ./... | tee bench.txt

Compiler Analysis Reference

The Go compiler provides diagnostic flags that reveal optimization decisions — escape analysis, inlining, SSA intermediate representation, and generated assembly. These are essential for understanding why a function allocates or why the compiler won't inline it.

Use compiler diagnostics when pprof shows a hot function and you need to understand the compiler's decisions about that function. These tools are free (no runtime overhead) — they analyze at compile time.

Escape Analysis

Escape analysis determines whether a variable can live on the stack (cheap — freed when the function returns) or must be allocated on the heap (expensive — requires GC). "Moved to heap" means the compiler decided the variable might outlive the function.

Commands

# Show escape decisions — one line per escaped variable
go build -gcflags="-m" ./... 2>&1 | grep "escapes to heap"
go build -gcflags="-m" ./... 2>&1 | grep "moved to heap"

# Verbose mode — shows the reason for each escape decision
go build -gcflags="-m -m" ./...

# Filter to a specific package
go build -gcflags="-m" ./pkg/parser 2>&1 | grep "escapes"

# Filter to a specific file (build the whole package so sibling files still compile, then grep)
go build -gcflags="-m" ./pkg/parser 2>&1 | grep "parse.go"

# Apply to all dependencies too (usually too noisy, but useful for debugging)
go build -gcflags="all=-m" ./...

# Combine with grep for a specific function
go build -gcflags="-m" ./pkg/parser 2>&1 | grep "Parse"

# Combine with grep to see what stays on the stack (does NOT escape)
go build -gcflags="-m" ./pkg/parser 2>&1 | grep "does not escape"

Reading the output

./pkg/parser/parse.go:15:6: can inline Parse
./pkg/parser/parse.go:42:13: &result escapes to heap
./pkg/parser/parse.go:42:13:   flow: ~r0 = &result:
./pkg/parser/parse.go:42:13:     from &result (address-of) at ./pkg/parser/parse.go:42:13
./pkg/parser/parse.go:42:13:     from return &result (return) at ./pkg/parser/parse.go:42:6

The -m -m (verbose) output shows the escape chain — why the compiler decided the variable escapes. In this example: result has its address taken (&result), and that pointer is returned, so result must survive beyond the function — it escapes to heap.

Common escape causes

Cause	Example	Why it escapes
Returning a pointer to a local	`return &result`	The local must outlive the function call — caller holds a reference
Interface boxing	`var x any = myStruct`	Storing a concrete value in an interface can allocate a copy on the heap, typically when the interface value itself escapes
Closure capturing a local	`go func() { use(localVar) }()`	The goroutine may run after the enclosing function returns
Slice append beyond capacity	`s = append(s, item)` when len == cap	Triggers a new backing array allocation on the heap
Passing pointer to unanalyzable function	`json.Marshal(&data)`	Compiler can't prove the pointer won't be retained across package boundary
Storing in a struct field that escapes	`obj.Field = &local`	If `obj` is heap-allocated, anything it points to must also be on the heap
fmt.Sprintf and friends	`fmt.Sprintf("%d", n)`	Arguments are boxed into `any` (interface boxing) + result string is heap-allocated
Sending pointer on channel	`ch <- &data`	Channel receiver may be a different goroutine with a different lifetime

Not all escapes are problems. Only investigate escapes in functions that pprof identifies as allocation-heavy. A function called once at startup can escape freely.

Inlining Decisions

Inlining replaces a function call with the function body at the call site. This eliminates call overhead and enables further optimizations (escape analysis improves, dead code elimination, constant folding). Functions that aren't inlined in hot paths may benefit from simplification.

Commands

# Show which functions CAN be inlined
go build -gcflags="-m" ./... 2>&1 | grep "can inline"

# Show which functions CANNOT be inlined (with the reason)
go build -gcflags="-m" ./... 2>&1 | grep "cannot inline"

# Show inlining decisions for a specific package
go build -gcflags="-m" ./pkg/handler 2>&1 | grep "inline"

# Show where inlining was actually applied (function was inlined into caller)
go build -gcflags="-m" ./... 2>&1 | grep "inlining call to"

# Verbose mode — shows the cost budget and why inlining was blocked
go build -gcflags="-m -m" ./... 2>&1 | grep "inline"

# Filter to a specific function
go build -gcflags="-m" ./pkg/handler 2>&1 | grep "HandleRequest"

# Show both inlining and escape analysis together (they interact)
go build -gcflags="-m" ./pkg/handler 2>&1 | grep -E "(inline|escape|moved to heap)"

Reading the output

./pkg/handler/handler.go:20:6: can inline validateInput
./pkg/handler/handler.go:35:6: cannot inline HandleRequest: function too complex: cost 120 exceeds budget 80
./pkg/handler/handler.go:42:19: inlining call to validateInput

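The default inline cost budget is 80 nodes in the compiler's cost model. The exact value is an implementation detail that can differ between releases, and profile-guided optimization (PGO) raises it for hot call sites. Functions with higher cost (more AST nodes, complex control flow) are not inlined. Check the actual numbers for your toolchain with -gcflags="-m -m".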

Common inlining blockers

Blocker	Why it prevents inlining	Mitigation
Function too complex	Body cost exceeds budget (80)	Split into smaller functions; extract the cold path
`defer` statement	Adds cleanup code that complicates inlining	Remove `defer` from tiny hot functions; call cleanup directly
`recover()` call	Forces stack frame preservation	Move `recover()` to a wrapper function
`go` statement	Goroutine launch has implicit complexity	Extract goroutine body into a separate function
Type switch / interface method call	Dynamic dispatch can't be resolved at compile time	Use concrete types in hot paths
`select` statement	Complex runtime interaction	Simplify channel patterns in hot functions
Large function body	Many statements add up in cost	Break into smaller functions — the hot inner function may inline

Value receivers vs pointer receivers: Receiver choice can affect copying, aliasing, escape analysis, and inlining, but pointer receiver methods can inline too and value receivers do not guarantee inlining. Check real compiler decisions with -gcflags="-m -m".

SSA Dump

The SSA (Static Single Assignment) dump shows the compiler's intermediate representation after each optimization pass — dead code elimination, bounds check removal, constant folding, register allocation. Use this when you need to understand exactly what the compiler generates.

Commands

# Generate SSA dump for a specific function — creates ssa.html in current directory
GOSSAFUNC=Parse go build ./pkg/parser
# Open ssa.html in browser — shows each optimization pass side by side

# Generate for a method on a type
GOSSAFUNC='(*Parser).Parse' go build ./pkg/parser

# Generate for a function in a specific package (when names collide)
GOSSAFUNC=myapp/pkg/parser.Parse go build ./...

# Combine with a specific output directory
GOSSAFUNC=Parse GOSSADIR=/tmp/ssa go build ./pkg/parser
# Creates /tmp/ssa/ssa.html

Reading ssa.html

The HTML file shows the function's code at each compiler pass:

1. Source: original Go code
2. AST: abstract syntax tree
3. Start: initial SSA form
4. Opt: after optimization passes (dead code, constant prop, bounds check elimination)
5. Lower: architecture-specific lowering
6. Regalloc: after register allocation
7. Genssa: final generated code

Click on a value in any pass to highlight it and its uses across all passes, which shows how the compiler transforms it. Faded-out values and blocks are dead code that has not been eliminated yet.

What to look for:

Bounds checks remaining — IsInBounds or IsSliceInBounds operations that weren't eliminated. Adding explicit bounds checks or using _ = s[n-1] hints can help
Dead code not eliminated — values computed but never used (should be eliminated; if not, check for side effects)
Constant folding — computations on constants should be resolved at compile time
Register spills — values moved to stack because not enough registers; indicates heavy register pressure

Assembly Output

View the actual machine code the compiler generates. Use for verifying SIMD instructions, bounds checks, register allocation, and micro-optimization decisions.

Commands

# Full assembly output for a package (very verbose)
go build -gcflags="-S" ./pkg/parser 2>&1 | head -200

# Assembly for a specific function (recent Go releases print the full package path; older ones printed "".Parse)
go build -gcflags="-S" ./pkg/parser 2>&1 | grep -A 50 'parser\.Parse STEXT'

# Assembly for all packages (including dependencies — very verbose)
go build -gcflags="all=-S" ./... 2>&1 | grep -A 50 'myapp/pkg/parser.Parse'

# Disassemble a compiled binary (alternative to -gcflags="-S")
go build -o myapp ./cmd/server
go tool objdump -s Parse myapp

# Disassemble with source interleaving
go tool objdump -S -s Parse myapp

# Disassemble a specific symbol
go tool objdump -s 'myapp/pkg/parser.Parse' myapp

# Disassemble a specific text range (by address)
go tool objdump -start 0x4a3b00 -end 0x4a3c00 myapp

# List all symbols in a binary
go tool nm myapp | grep Parse

# Cross-compile and inspect assembly for a different architecture
GOARCH=arm64 go build -gcflags="-S" ./pkg/parser 2>&1 | head -200

Reading assembly output

myapp/pkg/parser.Parse STEXT size=240 args=0x18 locals=0x48
    0x0000 MOVQ (TLS), CX           ; goroutine stack check
    0x0009 LEAQ -64(SP), AX
    0x000e CMPQ AX, 16(CX)          ; stack overflow check
    0x0012 JLS  228                  ; jump to stack growth
    0x0018 SUBQ $72, SP             ; allocate stack frame
    0x001c MOVQ BP, 64(SP)          ; save base pointer
    0x0021 LEAQ 64(SP), BP          ; set new base pointer
    ; ... function body ...
    0x00e0 CALL runtime.makeslice(SB) ; heap allocation!

What to look for:

CALL runtime.makeslice or CALL runtime.newobject — heap allocations in the hot path
CALL runtime.growslice — slice capacity exceeded, triggering copy
PCDATA / FUNCDATA — GC metadata (ignore for performance analysis)
Bounds check sequences: CMPQ + JCC before array/slice access — can sometimes be eliminated
SIMD instructions: VMOVDQU, VPSHUFB, VPADDB, etc. — verify auto-vectorization or manual SIMD
CALL runtime.morestack_noctxt — stack growth (normal, but frequent calls indicate deep recursion)

Comparing assembly before/after optimization

# Before your change (-S prints to stderr, so redirect stdout to the file first, then stderr)
go build -gcflags="-S" ./pkg/parser > asm-before.txt 2>&1

# After your change
go build -gcflags="-S" ./pkg/parser > asm-after.txt 2>&1

# Diff the assembly
diff asm-before.txt asm-after.txt

Investigation Session Setup

Tools and techniques for temporary deep-dive performance investigation — not everyday monitoring. These are things you enable for hours or days while debugging a specific issue, then disable.

Setting Up a Session

Before diving into profiles, set up the environment to collect high-resolution data:

1. Reduce Prometheus scrape interval to <=10s on the target instance (normally 15-30s). More data points during a short investigation window reveal patterns that 30s intervals miss. Revert after investigation.

2. Enable pprof via environment variable — no recompile needed:

   kubectl set env deployment/my-service PPROF_ENABLED=true
   kubectl rollout restart deployment/my-service

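3. Enable continuous profiling on the target instance only, not fleet-wide. Pyroscope/Parca on a single instance is manageable; on 50 replicas it overwhelms the backend. Note that kubectl set env on a Deployment changes the pod template for every replica, so when the service runs many replicas, point the command at a dedicated single-replica Deployment: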

   kubectl set env deployment/my-service PYROSCOPE_ENABLED=true
   kubectl rollout restart deployment/my-service

4. Enable debug logging via env var if needed — but only on the target instance. Debug logging has significant throughput impact:

   kubectl set env deployment/my-service LOG_LEVEL=debug
   kubectl rollout restart deployment/my-service

Key principle: all costly debug features (pprof HTTP, continuous profiling, debug log level, trace collection) SHOULD be configurable via environment variables. This allows instant toggle without recompile. Design your application to support this from day one.
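One way to follow this principle is to gate the pprof HTTP endpoints on an environment variable at startup. The sketch below assumes the PPROF_ENABLED variable name used in the kubectl commands above and a localhost:6060 listener; both the helper names and the address are illustrative choices. Authentication and network exposure are out of scope here.

```go
package main

import (
	"log"
	"net/http"
	"net/http/pprof"
	"os"
	"strconv"
)

// pprofEnabled reports whether PPROF_ENABLED (an assumed variable name)
// holds a true value such as "1" or "true". Unset or invalid means off.
func pprofEnabled(lookup func(string) string) bool {
	on, err := strconv.ParseBool(lookup("PPROF_ENABLED"))
	return err == nil && on
}

// newPprofMux registers the standard pprof handlers on a private mux so they
// are not served by the application's main router.
func newPprofMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

func main() {
	if !pprofEnabled(os.Getenv) {
		log.Println("pprof disabled (set PPROF_ENABLED=true to enable)")
		return
	}
	// Bind to localhost only; reach it through a port-forward.
	log.Fatal(http.ListenAndServe("localhost:6060", newPprofMux()))
}
```

In a real service the listener would run in its own goroutine next to the main server. Importing net/http/pprof also registers its handlers on http.DefaultServeMux, so avoid serving the default mux publicly.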

Prometheus Go Runtime Collector

The prometheus/client_golang library automatically registers collectors that expose Go runtime metrics. These are invaluable during investigation sessions — they provide a time-series view of memory, GC, goroutines, and CPU that complements point-in-time profiles.

When using prometheus/client_golang, refer to the library's official documentation to verify collector setup and available options.

Key Series

→ See prometheus-go-metrics.md for the exhaustive reference of all Go runtime metrics (verified from official sources). Note: runtime/metrics list varies by Go version — use metrics.All() at runtime for your specific Go version.

Performance note: the legacy go_memstats_* collection path calls runtime.ReadMemStats(), which triggers a short stop-the-world pause. With Go 1.17+ and a recent client_golang release, collectors.NewGoCollector() reads from runtime/metrics instead, which is cheaper. Prefer the modern collector in high-throughput services:

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/collectors"
)

// Use runtime/metrics-based collector (lower overhead)
reg := prometheus.NewRegistry()
reg.MustRegister(collectors.NewGoCollector(
    collectors.WithGoCollectorRuntimeMetrics(collectors.MetricsAll),
))
reg.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))

PromQL Deep-Dive Queries

Use these during investigation sessions with the reduced scrape interval. Each query includes what to look for and what the result means.

GC pressure

PromQL	What to look for
`rate(go_gc_duration_seconds_count[5m])`	GC cycles/s. >2/s sustained = excessive allocation rate. Reduce allocations per request.
`rate(go_gc_duration_seconds_sum[5m]) / rate(go_gc_duration_seconds_count[5m])`	Average GC pause. Increasing trend = heap growing or too many pointers to scan.
`go_gc_duration_seconds{quantile="1"}`	Worst-case GC pause. Spikes here cause tail latency (P99).

Memory leak detection

PromQL	What to look for
`go_memstats_alloc_bytes`	Should be roughly stable under constant load. Continuous increase = memory leak.
`rate(go_memstats_alloc_bytes_total[5m])`	Allocation rate (bytes/s). Compare before/after deploy — significant increase = new allocation pattern.
`process_resident_memory_bytes - go_memstats_sys_bytes`	Gap = non-Go memory (cgo, mmap). Growing gap = non-Go leak.

Goroutine leak detection

PromQL	What to look for
`go_goroutines`	Should correlate with load. Growing independently of traffic = leak.
`delta(go_goroutines[1h])`	Net goroutine change over 1h. Positive without load increase = leak.

CPU saturation

PromQL	What to look for
`rate(process_cpu_seconds_total[5m])`	CPU cores consumed. Compare to GOMAXPROCS.
`rate(process_cpu_seconds_total[5m]) / <GOMAXPROCS>`	CPU utilization ratio. >0.8 sustained = CPU-saturated.

Post-deploy regression detection

PromQL	What to look for
`rate(go_memstats_alloc_bytes_total[5m])`	Compare before/after deploy window. Significant increase = new allocation pattern introduced.
`histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))`	P99 latency increase after deploy = performance regression. Requires app-level histogram.

Example alerting rules

# GC taking too much time
- alert: HighGCPauseTime
  expr: rate(go_gc_duration_seconds_sum[5m]) / rate(go_gc_duration_seconds_count[5m]) > 0.01
  for: 10m
  annotations:
    summary: "Average GC pause >10ms — reduce allocations or tune GOGC"

# Goroutine leak
- alert: GoroutineLeak
  expr: go_goroutines > 10000
  for: 5m
  annotations:
    summary: "Goroutine count >10K — check for leaked goroutines"

# Memory approaching container limit
- alert: MemoryNearLimit
  expr: predict_linear(process_resident_memory_bytes[1h], 3600) > <container_limit_bytes>
  for: 15m
  annotations:
    summary: "RSS projected to exceed container limit within 1h"

Adjust thresholds to your application — a data pipeline has different baselines than an API server.

Host-Level Correlation

Go runtime metrics alone don't show the full picture. Host-level metrics reveal whether the problem is in your application or the infrastructure.

`node_exporter` — host CPU, memory, disk I/O, network. Correlate with Go app metrics: high node_cpu_seconds_total with low process_cpu_seconds_total = noisy neighbor, not your app.
`process-exporter` — per-process metrics on Linux. Useful when multiple Go services share a host.

Cost Warnings

Profiles and traces are expensive to collect. Keep them short-term and localized:

pprof CPU profiling — CPU-intensive during the capture window. Don't run 30s profiles back-to-back in production. Space them out.
Pyroscope continuous profiling — ~2-5% CPU overhead per instance, always-on. At scale (hundreds of instances), this adds up in compute cost and backend storage. Enable on a subset of instances or on-demand via environment variable. → See samber/cc-skills-golang@golang-observability skill for Pyroscope setup.
Execution traces — generate large files quickly (MB/s). Capture 5-10s max. Longer traces are unwieldy and slow to analyze.
Debug log level — significant throughput impact due to allocation and I/O overhead. Never leave on permanently.
All costly features SHOULD be toggleable via environment variables for instant on/off without recompile. Design for this from day one.

pprof Reference

go tool pprof is the primary tool for understanding where CPU time, memory, and contention go in Go programs. This file covers how to use the CLI and interpret the output. For enabling pprof endpoints on running services (net/http/pprof import, authentication, security), → See samber/cc-skills-golang@golang-troubleshooting skill.

Profile Types

Each profile type answers a different performance question. Choosing the wrong profile type wastes investigation time — match the symptom to the profile before capturing.

Profile	Flag / Endpoint	Use when	Why this profile and not another
CPU	`-cpuprofile` or `/debug/pprof/profile?seconds=30`	High CPU usage, slow functions	Samples which functions are on-CPU at 100Hz; misses off-CPU time (I/O, sleep)
Heap (alloc_objects)	`-memprofile` then `pprof -alloc_objects`	GC pressure, too many allocations	Counts allocation events regardless of size; useful when allocation frequency and object churn dominate
Heap (alloc_space)	`pprof -alloc_space`	Finding largest allocation sites by volume	Measures total bytes allocated; use when you need to reduce peak memory, not just GC frequency
Heap (inuse_space)	`pprof -inuse_space`	Memory growing over time, suspected leaks	Shows currently live heap objects; compare two snapshots to isolate leak sources
Heap (inuse_objects)	`pprof -inuse_objects`	Object count growth, suspected leak of small objects	Counts live objects regardless of size; useful when leak is many small objects not visible in inuse_space
Goroutine	`/debug/pprof/goroutine`	Blocked I/O, goroutine leaks, pool exhaustion	Snapshots all goroutine stacks; look for goroutines piling up on the same call site
Mutex	`/debug/pprof/mutex`	Lock contention between goroutines	Measures cumulative time goroutines waited to acquire mutexes. Must enable first: `runtime.SetMutexProfileFraction(5)`
Block	`/debug/pprof/block`	Goroutines blocked on channels, mutexes, timers, select	Measures cumulative time goroutines spent blocked on synchronization primitives. Must enable first: `runtime.SetBlockProfileRate(1)`
Threadcreate	`/debug/pprof/threadcreate`	Excessive OS thread creation	Shows stack traces that created new OS threads; typically from cgo calls or blocking syscalls that pin a thread

Choosing between alloc_objects and alloc_space

alloc_objects — "where do I allocate the most often?" — use when allocation frequency and object churn are driving GC work
alloc_space ("where do I allocate the most bytes?"): use it to cut total allocation volume, such as large buffers or repeated slice growth; judge peak live memory and RSS with inuse_space instead
In practice, start with alloc_objects because GC churn is the most common allocation-related bottleneck in Go.

Choosing between inuse_space and alloc_space

alloc_space is cumulative since program start — it includes objects already freed by GC
inuse_space is a point-in-time snapshot — only currently live objects
Use alloc_space to find allocation hot spots for optimization. Use inuse_space to debug memory leaks.

Enabling mutex and block profiles

These profiles are disabled by default because they add overhead. Enable them before capturing:

import "runtime"

// Mutex profiling: fraction of mutex contention events recorded.
// 5 means 1 out of 5 events is recorded. Higher = less overhead but less detail.
runtime.SetMutexProfileFraction(5)

// Block profiling: time-based sampling rate.
// 1 = record all blocking events. Higher values sample about one event per rate nanoseconds blocked.
// Use 1 for debugging, higher values (e.g. 1000000 = 1ms) for production.
runtime.SetBlockProfileRate(1)

Disable after investigation to eliminate overhead:

runtime.SetMutexProfileFraction(0)
runtime.SetBlockProfileRate(0)

Generating Profiles

From benchmarks (no HTTP server needed)

# CPU profile — measures where compute time goes during benchmark execution
go test -bench=BenchmarkParse -cpuprofile=cpu.prof ./pkg/parser

# Memory profile — captures allocation patterns during benchmark
go test -bench=BenchmarkParse -memprofile=mem.prof ./pkg/parser

# Both at once — but be aware CPU profiling adds ~5% overhead which can skew memory results
go test -bench=BenchmarkParse -cpuprofile=cpu.prof -memprofile=mem.prof ./pkg/parser

From running service

Requires import _ "net/http/pprof" (see samber/cc-skills-golang@golang-troubleshooting skill for secure setup):

# CPU profile — captures 30 seconds of CPU samples
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"   # quoted so the shell does not interpret "?"

# Heap profile — snapshots current heap state
go tool pprof -alloc_objects http://localhost:6060/debug/pprof/heap

# Goroutine profile — snapshots all goroutine stacks
go tool pprof http://localhost:6060/debug/pprof/goroutine

# Mutex profile — contention data since last reset
go tool pprof http://localhost:6060/debug/pprof/mutex

# Block profile — blocking data since last reset
go tool pprof http://localhost:6060/debug/pprof/block

From code (programmatic)

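import (
    "os"
    "runtime/pprof"
)

// CPU profile (errors are ignored here for brevity; check them in real code)
cpuFile, _ := os.Create("cpu.prof")
defer cpuFile.Close()
pprof.StartCPUProfile(cpuFile)
defer pprof.StopCPUProfile() // runs before cpuFile.Close (defers are LIFO)

// Heap snapshot at a specific point
heapFile, _ := os.Create("heap.prof")
pprof.WriteHeapProfile(heapFile)
heapFile.Close()

// Named profile (goroutine, threadcreate, etc.), written to its own open file
goroutineFile, _ := os.Create("goroutine.prof")
pprof.Lookup("goroutine").WriteTo(goroutineFile, 0)
goroutineFile.Close()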

Interactive CLI Commands

Open a profile in interactive mode:

go tool pprof cpu.prof
# or from a URL:
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

`top` — self time ranking (start here)

The first command to run. Shows functions ranked by the time (or allocations) spent in the function itself:

(pprof) top
Showing nodes accounting for 4.2s, 84% of 5s total
      flat  flat%   sum%        cum   cum%
     1.50s 30.00% 30.00%      2.80s 56.00%  encoding/json.Marshal
     0.80s 16.00% 46.00%      0.80s 16.00%  runtime.mallocgc
     0.60s 12.00% 58.00%      0.60s 12.00%  runtime.memmove
     0.50s 10.00% 68.00%      0.50s 10.00%  runtime.scanobject
     0.40s  8.00% 76.00%      1.90s 38.00%  myapp/pkg/parser.Parse
     0.30s  6.00% 82.00%      0.30s  6.00%  syscall.syscall
     0.10s  2.00% 84.00%      0.10s  2.00%  runtime.futex

Column	Meaning	How to read it
flat	Time spent in the function itself, excluding callees	High flat = the function's own code is expensive
flat%	flat as percentage of total sample time	Quick way to see relative cost
sum%	Running total of flat% going down the list	"The top 3 functions account for 58% of total time"
cum	Time in function + all functions it calls (cumulative)	High cum with low flat = the function delegates to expensive callees
cum%	cum as percentage of total	Compare with flat% — big gap means the cost is in callees

Limiting output:

(pprof) top 5              # show only top 5 functions
(pprof) top -cum 10        # top 10 by cumulative time
(pprof) top -flat 20       # top 20 by flat time (default sort)

`top -cum` — cumulative time ranking

Critical when top shows runtime functions (runtime.mallocgc, runtime.memmove, runtime.scanobject) dominating. These are symptoms, not causes. top -cum reveals which application functions trigger them:

(pprof) top -cum
      flat  flat%   sum%        cum   cum%
     0.40s  8.00%  8.00%      3.80s 76.00%  myapp/pkg/handler.HandleRequest
     0.10s  2.00% 10.00%      2.80s 56.00%  myapp/pkg/handler.serializeResponse
     1.50s 30.00% 40.00%      2.80s 56.00%  encoding/json.Marshal

Now you can see that HandleRequest → serializeResponse → json.Marshal is the hot path. The optimization target is serializeResponse, not runtime.mallocgc.

`list funcName` — annotated source

Shows the source code of a function with per-line cost annotations. This is how you pinpoint the exact line causing the bottleneck:

(pprof) list serializeResponse
Total: 5s
ROUTINE ======================== myapp/pkg/handler.serializeResponse
     0.10s      2.80s (flat, cum) 56.00% of Total
         .          .     38:func serializeResponse(w http.ResponseWriter, data any) {
         .      0.10s     39:    w.Header().Set("Content-Type", "application/json")
     0.10s      2.60s     40:    buf, err := json.Marshal(data)
         .          .     41:    if err != nil {
         .          .     42:        http.Error(w, err.Error(), 500)
         .          .     43:        return
         .          .     44:    }
         .      0.10s     45:    w.Write(buf)
         .          .     46:}

Left column = flat time (work done by this line itself)
Right column = cumulative time (this line + everything it calls)
Line 40 accounts for 2.60s cumulative because json.Marshal is expensive

Use `list` with a regex to find all matching functions:

(pprof) list Parse.*       # all functions starting with Parse
(pprof) list \.Handle      # all Handle methods across packages

`peek funcName` — callers and callees

Shows who calls a function and what it calls — the one-hop neighborhood in the call graph. Use to trace the responsibility chain when a function appears hot but you're unsure whether the problem is upstream (too many calls) or downstream (expensive callees):

(pprof) peek json.Marshal
Showing nodes accounting for 5s, 100% of 5s total
----------------------------------------------+-------------
                                               |      flat  flat%   sum%        cum   cum%
 myapp/pkg/handler.serializeResponse 2.60s     |
 myapp/pkg/api.buildResponse         0.20s     |     1.50s 30.00% 30.00%      2.80s 56.00%  encoding/json.Marshal
----------------------------------------------+-------------
                                               |
 reflect.Value.MapRange              0.40s     |
 encoding/json.(*encodeState).marshal 0.30s    |
 runtime.mallocgc                     0.80s    |

Top section = callers (who calls json.Marshal). Bottom section = callees (what json.Marshal calls internally).

`tree` — hierarchical call tree

Displays the full call tree with cumulative costs at each level. Useful when you need more context than peek provides:

(pprof) tree
     0.40s  8.00%  8.00%      3.80s 76.00%  myapp/pkg/handler.HandleRequest
              0.10s  myapp/pkg/handler.serializeResponse
                     1.50s  encoding/json.Marshal
                            0.80s  runtime.mallocgc
              0.20s  myapp/pkg/handler.validateInput
              0.10s  myapp/pkg/handler.fetchData

`traces` — raw stack traces

Dumps all raw sample stack traces. Each stack trace shows what the program was doing at the moment it was sampled:

(pprof) traces
-----------+-------------------------------------------------------
     bytes:  1.5MB
     1.50s   encoding/json.Marshal
             myapp/pkg/handler.serializeResponse
             myapp/pkg/handler.HandleRequest
             net/http.(*ServeMux).ServeHTTP
-----------+-------------------------------------------------------

Useful for spotting unexpected call paths (e.g., a function you didn't expect being called from a hot path).

`web` / `svg` — graphical call graph

web opens a call graph in the browser. svg saves it to a file. Both require graphviz installed (brew install graphviz or apt install graphviz).

Visual encoding:

Thicker edges = more time flows through that call
Larger font in a node = more flat time (time spent in the function itself)
Redder nodes = higher cumulative time (the function plus its callees); values near zero are grey
Edge labels = time flowing through that call path

Use when the text commands don't reveal the full picture — the visual layout often reveals call patterns that are hard to see in text.

`disasm funcName` — assembly-level

Shows generated assembly with per-instruction cost. Use for micro-optimization: verifying SIMD instructions, bounds check elimination, or inlining at the instruction level:

(pprof) disasm Parse
Total: 5s
ROUTINE ======================== myapp/pkg/parser.Parse
     0.40s      1.90s (flat, cum) 38.00% of Total
     0.10s      0.10s    4a3b20: MOVQ 0x8(SP), AX          ;parser.go:15
     0.20s      0.20s    4a3b28: CMPQ AX, $0x100           ;parser.go:16
         .      0.10s    4a3b2f: JGE 0x4a3b80              ;parser.go:16
     0.10s      1.50s    4a3b35: CALL runtime.makeslice(SB) ;parser.go:17

`weblist funcName` — annotated source in browser

Like list but opens the annotated source in a browser with color-coded cost highlighting. Each line is shaded from white (no cost) to red (hot). More visually immediate than the text version:

(pprof) weblist serializeResponse

Requires a browser. Falls back to list if no browser is available.

`tags` — profile label breakdown

Shows tag values present in the profile. Go runtime profiles carry tags like thread_id; custom profiles can add arbitrary labels via pprof.Do():

labels := pprof.Labels("request_type", "api", "endpoint", "/users")
pprof.Do(ctx, labels, func(ctx context.Context) {
    handleRequest(ctx)
})

(pprof) tags
request_type: api (85%), batch (15%)
endpoint: /users (40%), /orders (35%), /products (25%)

`tagroot` and `tagleaf` — group by labels

Group the profile data by tag values, creating a virtual call tree rooted on tag names:

(pprof) tagroot=request_type    # group everything by request_type first
(pprof) top                     # now shows breakdown per request_type
(pprof) tagleaf=endpoint        # add endpoint as leaf grouping

Useful for multi-tenant profiling or breaking down by request type. Once the labels exist in the code, regrouping by a different label needs no further code changes.

`granularity` — control grouping level

Changes how samples are aggregated:

(pprof) granularity=functions    # default — group by function name
(pprof) granularity=filefunctions # group by file:function
(pprof) granularity=files        # group by file only
(pprof) granularity=lines        # group by exact source line
(pprof) granularity=addresses    # group by instruction address (most granular)

lines is especially useful when a single function has multiple hot spots — it reveals which specific lines are expensive without needing list.

`sort` — change sort order

(pprof) sort=flat     # sort by flat time (default for top)
(pprof) sort=cum      # sort by cumulative time (same as top -cum)

`list regex` — annotated source for every matching function

pprof has no separate source command. list takes a regular expression and prints annotated source for every function whose name matches it:

(pprof) list handler     # show annotated source for all functions matching "handler"

`focus`, `ignore`, `hide`, `show` — filtering

Narrow the analysis to specific functions or exclude noise. These are stateful — they persist across commands until explicitly cleared:

(pprof) focus=myapp            # only show call paths that pass through "myapp"
(pprof) ignore=runtime         # drop samples whose stacks pass through runtime functions
(pprof) hide=testing           # hide testing framework noise from graphs
(pprof) show=handler           # only show functions matching "handler"
(pprof) tagfocus=endpoint=/users  # only show samples with this tag value
(pprof) tagignore=request_type=batch  # exclude samples with this tag value

Difference between `focus`, `show`, `hide`, and `ignore`:

focus — keeps only samples whose stack contains a matching function; all other samples are dropped
ignore — drops every sample whose stack contains a matching function, so that cost disappears from the totals
show — keeps only matching frames within each stack; a sample is kept as long as at least one frame remains
hide — removes matching frames from each stack; the sample and its cost stay, attributed to the remaining frames

Clear all filters:

(pprof) reset

`normalize` — normalize against a base profile

When comparing two profiles with -base, values are deltas by default. normalize scales the source (main) profile so that its total matches the total of the base profile before subtracting, which keeps the comparison meaningful even if run durations differ:

(pprof) normalize

`sample_index` — switch metric in multi-metric profiles

Heap profiles contain multiple metrics (alloc_objects, alloc_space, inuse_objects, inuse_space). Switch between them without reloading:

(pprof) sample_index=alloc_objects
(pprof) top                       # now shows allocation counts
(pprof) sample_index=inuse_space
(pprof) top                       # now shows live memory

`unit` — change display units

(pprof) unit=ms         # display time in milliseconds
(pprof) unit=seconds    # display in seconds
(pprof) unit=MB         # display memory in megabytes
(pprof) unit=auto       # automatic (default)

`callgrind` — export for KCachegrind

Exports the profile in callgrind format, which can be opened in KCachegrind or QCachegrind for advanced visualization:

(pprof) callgrind
Generating report in callgrind format

`proto` — save processed profile

Save the current profile (after filtering) in protobuf format for sharing or later analysis:

(pprof) proto > filtered.pb.gz

`help` — list all commands

(pprof) help             # full command list with descriptions
(pprof) help top         # detailed help for a specific command

`show_from=regex` — trim callers above match

Hides all frames above the first matching function. Useful when you're only interested in a specific subsystem and want to remove framework/routing noise above it:

(pprof) show_from=handler.Handle   # start the graph from Handle, hide all callers above

`noinlines` — flatten inlined functions

Attributes inlined functions to their first out-of-line caller. Useful when inlined functions create confusing call chains in the graph:

(pprof) noinlines

Full command reference

Every command below works both as a standalone shell command and inside the interactive (pprof) prompt. The interactive form omits go tool pprof and the profile path — e.g., go tool pprof -top cpu.prof becomes just top inside the prompt.

Reporting commands:

# Top functions by self (flat) cost — the first command to run
go tool pprof -top cpu.prof

# Top 20 functions by cumulative cost (self + callees)
go tool pprof -cum -top -nodecount=20 cpu.prof

# Annotated source for a specific function — pinpoints the exact expensive line
go tool pprof -list=json.Marshal cpu.prof

# Callers and callees of a function — trace the responsibility chain
go tool pprof -peek=serializeResponse cpu.prof

# Hierarchical call tree with costs at each level
go tool pprof -tree cpu.prof

# Raw sample stack traces — spot unexpected call paths
go tool pprof -traces cpu.prof

# Per-instruction assembly cost — verify SIMD, bounds checks, inlining
go tool pprof -disasm=Parse cpu.prof

# Annotated source for all functions matching a regex (-list takes a regular expression)
go tool pprof -list='handler\..*' cpu.prof

# Text output (flat table, alternative to -top)
go tool pprof -text cpu.prof

Graph/export commands:

# SVG call graph (rendering requires graphviz; the resulting file opens in any browser)
go tool pprof -svg cpu.prof > cpu.svg

# SVG of only the subgraph matching a regex
go tool pprof -svg -focus=handler cpu.prof > handler.svg

# PDF call graph
go tool pprof -pdf cpu.prof > cpu.pdf

# PNG call graph
go tool pprof -png cpu.prof > cpu.png

# GIF call graph
go tool pprof -gif cpu.prof > cpu.gif

# DOT format (for custom graphviz processing: dot -Tsvg cpu.dot > cpu.svg)
go tool pprof -dot cpu.prof > cpu.dot

# Callgrind format (open with KCachegrind / QCachegrind)
go tool pprof -callgrind cpu.prof > cpu.callgrind

# Save current profile (with filters applied) in protobuf format
go tool pprof -proto -focus=handler cpu.prof > handler-only.pb.gz

# Annotated source in browser with color-coded cost per line
go tool pprof -weblist=serializeResponse cpu.prof

Filtering flags — narrow analysis to relevant functions:

# Focus: keep only call paths passing through matching functions
go tool pprof -focus=myapp/pkg/handler -top cpu.prof

# Ignore: drop samples whose stacks pass through matching functions (their cost leaves the totals)
go tool pprof -ignore=runtime -top cpu.prof

# Show: display only matching functions (display-only, does not change cost accounting)
go tool pprof -show=handler -top cpu.prof

# Hide: hide matching functions from display (does not change cost accounting)
go tool pprof -hide=testing -svg cpu.prof > clean.svg

# Show_from: trim all frames above the first match — hides framework/routing callers
go tool pprof -show_from=handler.Handle -top cpu.prof

# Noinlines: attribute inlined functions to their first out-of-line caller
go tool pprof -noinlines -top cpu.prof

# Combine multiple filters
go tool pprof -cum -top -nodecount=10 -focus=handler -ignore=runtime cpu.prof

Tag-based filtering — for profiles with labels (via pprof.Do()):

# Show all tag keys and their value distributions
go tool pprof -tags cpu.prof

# Keep only samples tagged with a specific key=value
go tool pprof -tagfocus=endpoint=/users -top cpu.prof

# Exclude samples with a specific tag
go tool pprof -tagignore=request_type=batch -top cpu.prof

# Group by tag — insert pseudo frames at root, breaking down by tag value
go tool pprof -tagroot=request_type -top cpu.prof

# Group by tag as leaf — breaks down each function by tag value
go tool pprof -tagleaf=endpoint -top cpu.prof

# Show/hide tags as annotations in graph output
go tool pprof -tagshow=endpoint -svg cpu.prof > tagged.svg
go tool pprof -taghide=thread_id -svg cpu.prof > clean.svg

Granularity and display control:

# Group by source line instead of function — reveals hot lines in multi-hot-spot functions
go tool pprof -granularity=lines -top cpu.prof

# Group by file:function
go tool pprof -granularity=filefunctions -top cpu.prof

# Group by file only
go tool pprof -granularity=files -top cpu.prof

# Group by instruction address (most granular)
go tool pprof -granularity=addresses -top cpu.prof

# Change display units
go tool pprof -unit=ms -top cpu.prof

# Edge/node fraction cutoffs — hide small contributions from graphs
go tool pprof -edgefraction=0.01 -nodefraction=0.005 -svg cpu.prof > clean.svg

# Disable trimming — show the full graph including tiny nodes
go tool pprof -trim=false -svg cpu.prof > full.svg

Heap profile commands:

# Top allocation sites by object count — diagnose GC churn
go tool pprof -top -alloc_objects mem.prof

# Top allocation sites by bytes — diagnose peak memory
go tool pprof -top -alloc_space mem.prof

# Currently live objects — diagnose memory leaks
go tool pprof -top -inuse_space mem.prof

# Currently live object count — diagnose leak of many small objects
go tool pprof -top -inuse_objects mem.prof

# Annotated source showing allocation sites by object count
go tool pprof -alloc_objects -list=Parse mem.prof

# SVG call graph colored by allocation objects
go tool pprof -alloc_objects -svg mem.prof > allocs.svg

# Compare two heap snapshots — show only growth (memory leak detection)
go tool pprof -top -base heap-baseline.prof heap-after.prof

# Diff with normalization — makes ratios comparable when capture durations differ
go tool pprof -normalize -top -base heap-baseline.prof heap-after.prof

# Diff as SVG — visualize what grew
go tool pprof -base heap-baseline.prof -svg heap-after.prof > leak.svg

# Diff with annotated source for a specific function
go tool pprof -base heap-baseline.prof -list=handleRequest heap-after.prof

Fetching profiles from a running service:

# CPU profile — fetch 30 seconds of samples and open interactive mode
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# CPU profile — fetch and immediately generate SVG (no interactive mode)
go tool pprof -svg "http://localhost:6060/debug/pprof/profile?seconds=10" > cpu.svg

# CPU profile — fetch with a timeout
go tool pprof -timeout=60 "http://localhost:6060/debug/pprof/profile?seconds=30"

# Heap profile — fetch and show top allocation sites
go tool pprof -top -alloc_objects http://localhost:6060/debug/pprof/heap

# Goroutine profile — fetch and show top goroutine stacks
go tool pprof -top http://localhost:6060/debug/pprof/goroutine

# Mutex profile — fetch contention data
go tool pprof -top http://localhost:6060/debug/pprof/mutex

# Block profile — fetch blocking data
go tool pprof -top http://localhost:6060/debug/pprof/block

# Fetch and save to a file without analysis (using curl)
curl -o heap.prof http://localhost:6060/debug/pprof/heap

# Human-readable goroutine dump (no go tool pprof needed)
curl "http://localhost:6060/debug/pprof/goroutine?debug=1"

# Goroutine dump with full stack traces, creation site, and labels
curl "http://localhost:6060/debug/pprof/goroutine?debug=2"

# Human-readable heap stats
curl "http://localhost:6060/debug/pprof/heap?debug=1"

# Fetch over TLS with client certificate
go tool pprof -tls_cert=client.crt -tls_key=client.key -tls_ca=ca.crt "https://myservice:6060/debug/pprof/profile?seconds=30"

# Fetch over TLS skipping server certificate verification
go tool pprof "https+insecure://myservice:6060/debug/pprof/profile?seconds=30"

Comparison commands (diff two profiles):

# Diff: subtract base from source — all values become deltas
go tool pprof -base cpu-before.prof cpu-after.prof

# Diff base: percentages shown relative to base profile
go tool pprof -diff_base=cpu-before.prof cpu-after.prof

# Diff with normalization: scale source so its total matches the base total before subtracting
go tool pprof -normalize -base heap-before.prof heap-after.prof

# Diff as top report
go tool pprof -top -base cpu-before.prof cpu-after.prof

# Diff as SVG graph
go tool pprof -svg -base cpu-before.prof cpu-after.prof > diff.svg

Web UI:

# Open interactive web UI with flamegraph, graph, source, and disassembly views
go tool pprof -http=:8080 cpu.prof

# Open on a different port
go tool pprof -http=:9090 mem.prof

# Open with a specific sample type pre-selected
go tool pprof -http=:8080 -alloc_objects mem.prof

# Open with filters pre-applied
go tool pprof -http=:8080 -focus=handler cpu.prof

# Open a diff view in the web UI
go tool pprof -http=:8080 -base heap-baseline.prof heap-after.prof

# Open with no browser auto-launch (just start the server)
go tool pprof -http=:8080 -no_browser cpu.prof

Symbolization flags:

# Disable symbolization (show raw addresses)
go tool pprof -symbolize=none cpu.prof

# Only use local binaries for symbolization (don't contact remote)
go tool pprof -symbolize=local cpu.prof

# Contact running service for symbol information
go tool pprof -symbolize=remote "http://localhost:6060/debug/pprof/profile?seconds=10"

# Show mangled C++ names (relevant for cgo profiles)
go tool pprof -symbolize=demangle=none cpu.prof

# Full demangling without simplification
go tool pprof -symbolize=demangle=full cpu.prof

Environment variables:

Variable	Purpose
`PPROF_BINARY_PATH`	Search path for local binaries used in symbolization (default: `$HOME/pprof/binaries`). Set when profiling remote servers where binaries aren't in the default path.
`PPROF_TOOLS`	Directory containing binutils tools (`addr2line`, `nm`, `objdump`). Set when these tools aren't in `$PATH`.

Graphical / Web UI

When CLI output is insufficient and you need interactive exploration:

# Opens browser with interactive UI
go tool pprof -http=:8080 cpu.prof

# Specify a different port if 8080 is taken
go tool pprof -http=:9090 mem.prof

# Open with specific sample type pre-selected
go tool pprof -http=:8080 -alloc_objects mem.prof

# Open with filters pre-applied
go tool pprof -http=:8080 -focus=handler cpu.prof

# Compare two profiles — open with -base
go tool pprof -http=:8080 -base heap-baseline.prof heap-after.prof

The web UI provides:

Flamegraph (most intuitive) — horizontal width proportional to cost; click to zoom into subtrees; inverted flamegraph available (icicle graph)
Graph — directed call graph with edge weights; nodes and edges sized/colored by cost; interactive zoom and click-to-focus
Top — same as top command but sortable columns, clickable to navigate to source
Source — annotated source with per-line cost; browsable across all functions
Disassembly — same as disasm but browsable across functions
Peek — interactive peek view with expandable callers/callees

Default to CLI commands for quick diagnosis — use the web UI when exploring unfamiliar call graphs, comparing profiles visually, or presenting findings to others.

Comparing Profiles

Memory leak detection with `-base`

Compare two heap profiles to isolate what grew between them:

# Step 1: take a baseline snapshot
curl http://localhost:6060/debug/pprof/heap > heap-baseline.prof

# Step 2: wait for the suspected leak to accumulate (minutes to hours)

# Step 3: take a second snapshot
curl http://localhost:6060/debug/pprof/heap > heap-after.prof

# Step 4: diff — shows only what grew between the two snapshots
go tool pprof -base heap-baseline.prof heap-after.prof
# Then use top, list, peek as usual — all values are deltas

Comparing CPU profiles across code versions

# Before your change
go test -bench=BenchmarkParse -cpuprofile=cpu-before.prof ./pkg/parser

# After your change
go test -bench=BenchmarkParse -cpuprofile=cpu-after.prof ./pkg/parser

# Compare visually — load both in separate browser tabs
go tool pprof -http=:8080 cpu-before.prof
go tool pprof -http=:8081 cpu-after.prof

For statistical comparison of benchmark numbers (not profiles), use benchstat instead.

Common Patterns

Learn to recognize these recurring shapes — they tell you what class of problem you're dealing with before you start fixing.

Flat high + cum high

The function itself is the bottleneck. It does expensive work directly (tight loop, heavy computation, complex string processing). Optimize the function's own code — algorithm, data structure, or implementation.

Flat low + cum high

The function calls slow things but does little work itself. It's a coordinator or dispatcher. Drill into callees with list or peek. The fix is usually in the called functions, or reducing how often they're called.

`alloc_objects` high, `inuse_space` low

Short-lived allocations creating GC churn. Objects are allocated and freed rapidly — each one is cheap individually but the aggregate volume triggers frequent GC cycles. Common sources: fmt.Errorf in hot paths (allocates every call), interface boxing (any arguments), string-to-byte conversions, slice growth without preallocation. → See samber/cc-skills-golang@golang-performance skill for allocation reduction patterns.

`inuse_space` growing over time

Memory leak. Take two heap snapshots minutes apart and compare with -base (see Comparing Profiles above). Growing types reveal the leak source. Common causes: unbounded caches, maps that never shrink (Go maps don't release bucket memory on delete), goroutine leaks holding references.

Mutex/block profile hot

Contention, not CPU. The CPU is waiting, not working. The goroutines are all trying to acquire the same lock or read from the same channel. Reduce critical section scope, shard locks across multiple mutexes, or use lock-free structures (sync/atomic, sync.Map for read-heavy workloads). → See samber/cc-skills-golang@golang-concurrency skill.
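As an illustration of lock sharding, the following sketch splits one mutex-protected map into several independently locked shards. The ShardedCounter type, the shard count of 16, and the FNV hash are assumptions made for the example. The right shard count and hash depend on the workload, so treat this as a starting point rather than a recommended implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 16 // illustrative; tune against a mutex profile

// shard is one independently locked slice of the key space.
type shard struct {
	mu sync.Mutex
	m  map[string]int
}

// ShardedCounter spreads keys over shardCount locks so goroutines that
// touch different keys rarely wait on each other.
type ShardedCounter struct {
	shards [shardCount]shard
}

func NewShardedCounter() *ShardedCounter {
	c := &ShardedCounter{}
	for i := range c.shards {
		c.shards[i].m = make(map[string]int)
	}
	return c
}

// shardFor picks the shard for key by hashing it.
func (c *ShardedCounter) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &c.shards[h.Sum32()%shardCount]
}

func (c *ShardedCounter) Inc(key string) {
	s := c.shardFor(key)
	s.mu.Lock()
	s.m[key]++
	s.mu.Unlock()
}

func (c *ShardedCounter) Get(key string) int {
	s := c.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.m[key]
}

func main() {
	c := NewShardedCounter()
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Inc(fmt.Sprintf("key-%d", i%32))
			}
		}()
	}
	wg.Wait()
	fmt.Println("key-0 =", c.Get("key-0"))
}
```

Re-run the mutex profile after a change like this: sharding only helps when contention was spread across many keys, and a single hot key still serializes on one shard.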

Many goroutines blocked on same channel/mutex

Serialization bottleneck. All work funnels through a single point. The throughput ceiling is the speed of that single point. Consider worker pools with multiple independent queues, sharding the work, or buffered channels to smooth bursts.

`runtime.mallocgc` dominates CPU profile

Allocation rate is the bottleneck, not computation. The Go runtime is spending more time allocating and collecting garbage than running your code. Switch to the alloc_objects heap profile to find which functions allocate the most. → See samber/cc-skills-golang@golang-performance skill for reduction patterns.

`runtime.memmove` high in CPU profile

Large memory copies — usually from slice append growing beyond capacity, copy() of large slices, or string-to-byte conversions. Pre-allocate slices to final capacity, reuse buffers, or work with []byte directly.
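The effect of pre-allocation can be checked without a full benchmark. This sketch uses testing.AllocsPerRun from the standard library to compare a slice that grows through append against one created with its final capacity. The function names, the element count, and the package-level sink are assumptions for the example.

```go
package main

import (
	"fmt"
	"testing"
)

const n = 10_000 // illustrative element count

// sink is package-level so the slices escape to the heap and the
// allocations are not optimized away.
var sink []int

// growing lets append reallocate and copy each time capacity runs out.
func growing() {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	sink = s
}

// preallocated reserves the final capacity once, so append never copies.
func preallocated() {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	sink = s
}

func main() {
	fmt.Println("growing allocs/run:     ", testing.AllocsPerRun(100, growing))
	fmt.Println("preallocated allocs/run:", testing.AllocsPerRun(100, preallocated))
}
```

Each reallocation in growing copies the existing elements into the new backing array, which is the work that surfaces as runtime.memmove in a CPU profile.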

`runtime.scanobject` high in CPU profile

GC pointer scanning. The heap contains many pointers that the GC must trace. Reduce pointer density: use value types instead of pointers in slices/maps, flatten nested structures, consider [N]byte arrays instead of string in hot structs.

Which Profile for Which Symptom?

Symptom	Profile	Flag/Command
High CPU, slow function	CPU	`-cpuprofile` or `pprof/profile`
Too many allocations (GC pressure)	Heap (alloc_objects)	`-memprofile` then `pprof -alloc_objects`
Large allocations (memory usage)	Heap (alloc_space)	`pprof -alloc_space`
Memory growing over time (leak)	Heap (inuse_space)	`pprof -inuse_space`, compare with `-base`
Lock contention	Mutex	`pprof/mutex` (enable `SetMutexProfileFraction` first)
Goroutines blocked on sync	Block	`pprof/block` (enable `SetBlockProfileRate` first)
Too many goroutines / leak	Goroutine	`pprof/goroutine`
High latency but low CPU	Goroutine + Block + Trace	Scheduling delays, I/O waits — see Trace Reference
Excessive thread creation	Threadcreate	`pprof/threadcreate`

Prometheus Go Runtime Metrics Reference

Complete listing of the Go runtime metrics that the prometheus/client_golang library actually exposes as Prometheus metrics.

---

Important Clarification

`runtime/metrics` are NOT Prometheus metrics. They're Go runtime data structures.

The Prometheus Go client library (prometheus/client_golang) selectively converts some runtime/metrics into Prometheus format. By default, it exposes only the traditional go_memstats_* and go_gc_* metrics to keep cardinality low.

This document lists only Prometheus metrics (the ones you actually scrape from /metrics endpoint).

---

Quick Reference

Metrics with Labels

Metric	Label	Values
`go_gc_duration_seconds`	`quantile`	0, 0.25, 0.5, 0.75, 1
`go_info`	`version`	e.g., "go1.21.3"

All Other Metrics

All other metrics have no labels.

---

Default Go Metrics (Always Exposed)

These are exposed by default by prometheus/client_golang.

Memory Allocation

Metric	Type	Description
`go_memstats_alloc_bytes`	gauge	Current bytes allocated on heap
`go_memstats_alloc_bytes_total`	counter	Cumulative bytes allocated
`go_memstats_sys_bytes`	gauge	Total bytes requested from OS

Heap State

Metric	Type	Description
`go_memstats_heap_alloc_bytes`	gauge	Allocated heap bytes
`go_memstats_heap_idle_bytes`	gauge	Idle heap bytes
`go_memstats_heap_inuse_bytes`	gauge	Heap bytes in use
`go_memstats_heap_objects`	gauge	Count of heap objects
`go_memstats_heap_released_bytes`	gauge	Heap bytes released to OS
`go_memstats_heap_sys_bytes`	gauge	Heap bytes reserved from OS

Stack and Metadata

Metric	Type	Description
`go_memstats_stack_inuse_bytes`	gauge	Stack in-use bytes
`go_memstats_stack_sys_bytes`	gauge	Stack reserved bytes
`go_memstats_mspan_inuse_bytes`	gauge	Mspan in-use bytes
`go_memstats_mspan_sys_bytes`	gauge	Mspan reserved bytes
`go_memstats_mcache_inuse_bytes`	gauge	Mcache in-use bytes
`go_memstats_mcache_sys_bytes`	gauge	Mcache reserved bytes
`go_memstats_other_sys_bytes`	gauge	Other runtime bytes
`go_memstats_gc_sys_bytes`	gauge	GC internal bytes
`go_memstats_buck_hash_sys_bytes`	gauge	Profiling bucket hash table bytes

Allocation and Free Counters

Metric	Type	Description
`go_memstats_mallocs_total`	counter	Total malloc calls
`go_memstats_frees_total`	counter	Total free calls

GC Configuration and Timing

Metric	Type	Description
`go_gc_gogc_percent`	gauge	GOGC target percentage
`go_gc_gomemlimit_bytes`	gauge	GOMEMLIMIT soft memory limit
`go_memstats_last_gc_time_seconds`	gauge	Last GC end time (Unix timestamp)
`go_memstats_next_gc_bytes`	gauge	Heap size target for next GC

GC Pause Duration (with labels)

Metric	Type	Labels	Description
`go_gc_duration_seconds`	summary	`quantile` (0, 0.25, 0.5, 0.75, 1)	GC pause durations with quantiles
`go_gc_duration_seconds_count`	counter	—	GC pause count
`go_gc_duration_seconds_sum`	counter	—	GC pause total time

Runtime State

Metric	Type	Description
`go_goroutines`	gauge	Current goroutine count
`go_threads`	gauge	Current OS thread count
`go_sched_gomaxprocs_threads`	gauge	Current GOMAXPROCS value

Version Information (with labels)

Metric	Type	Labels	Description
`go_info`	gauge	`version`	Go version string

---

Optional Go Metrics (Opt-in, Go 1.17+)

Enable via:

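// Keep a reference to the registry; registering on a discarded
// prometheus.NewRegistry() result would expose nothing.
reg := prometheus.NewRegistry()
reg.MustRegister(
    collectors.NewGoCollector(
        collectors.WithGoCollectorRuntimeMetrics(
            collectors.MetricsAll, // enable every runtime/metrics-based metric
        ),
    ),
)
// Serve reg on /metrics, e.g. with promhttp.HandlerFor(reg, promhttp.HandlerOpts{}).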

GC Cycles

Metric	Type	Description
`go_gc_cycles_automatic_gc_cycles_total`	counter	Automatic GC cycles (heap growth)
`go_gc_cycles_forced_gc_cycles_total`	counter	Forced GC cycles (runtime.GC())

Additional Heap Metrics

Metric	Type	Description
`go_gc_heap_allocs_bytes_total`	counter	Cumulative heap allocations (bytes)
`go_gc_heap_allocs_objects_total`	counter	Cumulative heap allocations (count)
`go_gc_heap_frees_bytes_total`	counter	Cumulative heap frees (bytes)
`go_gc_heap_frees_objects_total`	counter	Cumulative heap frees (count)
`go_gc_heap_goal_bytes`	gauge	Heap size target for next GC
`go_gc_heap_live_bytes`	gauge	Live heap bytes
`go_gc_heap_objects_objects`	gauge	Total heap objects count

GC Pauses Distribution

Metric	Type	Description
`go_gc_pauses_seconds`	histogram	GC pause durations

CPU Classes

Metric	Type	Description
`go_cpu_classes_gc_mark_assist_cpu_seconds_total`	counter	GC mark assist CPU time
`go_cpu_classes_gc_mark_dedicated_cpu_seconds_total`	counter	GC dedicated workers CPU time
`go_cpu_classes_gc_mark_idle_cpu_seconds_total`	counter	GC idle workers CPU time
`go_cpu_classes_gc_pause_cpu_seconds_total`	counter	GC pause CPU time
`go_cpu_classes_gc_total_cpu_seconds_total`	counter	Total GC CPU time
`go_cpu_classes_idle_cpu_seconds_total`	counter	Idle CPU time
`go_cpu_classes_scavenge_assist_cpu_seconds_total`	counter	Scavenger assist CPU time
`go_cpu_classes_scavenge_background_cpu_seconds_total`	counter	Background scavenger CPU time
`go_cpu_classes_scavenge_total_cpu_seconds_total`	counter	Total scavenger CPU time
`go_cpu_classes_total_cpu_seconds_total`	counter	Total CPU time (all classes)
`go_cpu_classes_user_cpu_seconds_total`	counter	User-mode CPU time

Memory Classes

Metric	Type	Description
`go_memory_classes_heap_free_bytes`	gauge	Free heap memory
`go_memory_classes_heap_objects_bytes`	gauge	Allocated heap objects
`go_memory_classes_heap_released_bytes`	gauge	Heap released memory
`go_memory_classes_heap_stacks_bytes`	gauge	Stack memory
`go_memory_classes_heap_unused_bytes`	gauge	Unused heap
`go_memory_classes_metadata_mcache_free_bytes`	gauge	Free mcache memory
`go_memory_classes_metadata_mcache_inuse_bytes`	gauge	In-use mcache memory
`go_memory_classes_metadata_mspan_free_bytes`	gauge	Free mspan memory
`go_memory_classes_metadata_mspan_inuse_bytes`	gauge	In-use mspan memory
`go_memory_classes_other_bytes`	gauge	Other memory
`go_memory_classes_total_bytes`	gauge	Total memory

Scheduler Metrics

Metric	Type	Description
`go_sched_goroutines_running_goroutines`	gauge	Running goroutines
`go_sched_goroutines_runnable_goroutines`	gauge	Runnable goroutines waiting
`go_sched_goroutines_goroutines`	gauge	Current total goroutines
`go_sched_goroutines_created_goroutines_total`	counter	Total goroutines ever created
`go_sched_goroutines_waiting_goroutines`	gauge	Goroutines waiting (not runnable)
`go_sched_latencies_seconds`	histogram	Goroutine scheduling latency
`go_sched_pauses_stopping_gc_seconds`	histogram	STW pause time (GC stop)
`go_sched_pauses_stopping_other_seconds`	histogram	STW pause time (other stop)
`go_sched_pauses_total_gc_seconds`	histogram	Total GC pause duration
`go_sched_pauses_total_other_seconds`	histogram	Total other pause duration
`go_sched_threads_total_threads`	counter	Total OS threads ever created
`go_sync_mutex_wait_total_seconds_total`	counter	Total time goroutines waited on mutex

CGO Metrics

Metric	Type	Description
`go_cgo_go_to_c_calls_total`	counter	Total calls from Go to C

---

Process Metrics

Exposed by Prometheus process collector (not Go-specific):

CPU and Memory

Metric	Type	Description
`process_cpu_seconds_total`	counter	Total CPU time (user + system)
`process_resident_memory_bytes`	gauge	RSS (physical memory used)
`process_virtual_memory_bytes`	gauge	Virtual memory allocated
`process_virtual_memory_max_bytes`	gauge	Maximum virtual memory allowed

File Descriptors

Metric	Type	Description
`process_open_fds`	gauge	Open file descriptors
`process_max_fds`	gauge	Maximum file descriptors allowed

Process Information

Metric	Type	Description
`process_start_time_seconds`	gauge	Process start time (Unix timestamp)

Page Faults

Metric	Type	Description
`process_page_faults_total`	counter	Total page faults
`process_page_faults_minor_total`	counter	Minor page faults
`process_page_faults_major_total`	counter	Major page faults

---

Common PromQL Queries

Memory Leak Detection

# Current heap allocation (should be stable under constant load)
go_memstats_alloc_bytes

# Live heap bytes (optional metric)
go_gc_heap_live_bytes

# Allocation rate in bytes/s (a churn indicator; a leak shows up as go_memstats_alloc_bytes trending upward)
rate(go_memstats_alloc_bytes_total[5m])

GC Pressure

# Worst-case GC pause (quantile 1 = max)
go_gc_duration_seconds{quantile="1"}

# Average GC pause
rate(go_gc_duration_seconds_sum[5m]) / rate(go_gc_duration_seconds_count[5m])

# GC frequency (cycles per second)
rate(go_gc_duration_seconds_count[5m])

Goroutine Leaks

# Current goroutine count
go_goroutines

# Goroutine growth (leak indicator)
delta(go_goroutines[1h])

CPU Usage

# Total CPU time consumed
rate(process_cpu_seconds_total[5m])

# CPU utilization ratio (0-1)
rate(process_cpu_seconds_total[5m]) / <GOMAXPROCS>

File Descriptor Leaks

# FD growth
delta(process_open_fds[1h])

# FD saturation ratio
process_open_fds / process_max_fds

→ See samber/cc-skills@promql-cli skill for executing these queries directly against your Prometheus instance from the CLI.

References

Diagnostic Tools Quick Reference

Use these tools to validate the root cause of a slowdown BEFORE applying any optimization. Do NOT use auto-fix flags (e.g. --fix) — let the coding agent interpret results and apply changes manually with explanatory comments.

For detailed usage of each tool, see the dedicated reference files:

pprof Reference — profiling (CPU, heap, goroutine, mutex, block)
benchstat Reference — statistical benchmark comparison
Trace Reference — execution tracer
Compiler Analysis — escape analysis, inlining, SSA, assembly

GC and Runtime Diagnostics

Configure via environment variables — no recompile needed.

Command	Use for
`GODEBUG=gctrace=1 ./app`	GC frequency, pause times, heap sizes, CPU% — one line per GC cycle
`GODEBUG=gcpacertrace=1 ./app`	Why GC triggers when it does — pacer decisions (trigger ratio, heap goal)
`GODEBUG=schedtrace=1000 ./app`	Load balancing, goroutine distribution across Ps — prints every 1000ms
`GODEBUG=schedtrace=1000,scheddetail=1 ./app`	Per-goroutine state detail on top of schedtrace
Heap/alloc profiles (`go tool pprof -alloc_objects`)	Allocation sites and object churn; use instead of removed/stale allocation trace flags

→ See samber/cc-skills-golang@golang-troubleshooting skill for detailed GODEBUG usage and interpretation.

Programmatic APIs

`runtime.ReadMemStats` — heap size, NumGC, pause durations (PauseNs circular buffer), TotalAlloc (cumulative). Use for dashboards, alerting on heap growth.
`debug.ReadGCStats` — GC-specific statistics: pause percentiles, pause timeline, total pause duration. More focused than ReadMemStats.
`runtime/metrics` (Go 1.16+) — stable API, safe for concurrent reads, lower overhead than ReadMemStats. Keys: /gc/cycles/total:gc-cycles, /gc/heap/allocs:bytes, /gc/pauses:seconds, /sched/latencies:seconds, /memory/classes/heap/released:bytes.
`debug.FreeOSMemory()` — forces GC + returns memory to OS. One-off use after large temporary allocations (not for regular use — let the runtime manage this).
`expvar` — stdlib metrics at /debug/vars as JSON. import _ "expvar" auto-registers. Lightweight, no dependencies. Integrates with Netdata, Telegraf, or custom dashboards.

Static Analysis

Command	Use for
`fieldalignment ./...`	Detect suboptimal struct field ordering (padding waste). Do NOT use `-fix` flag — let the coding agent apply changes manually with explanatory comments.
`unsafe.Sizeof` / `Alignof` / `Offsetof`	Inspect struct memory layout at compile time — compare before/after reordering to quantify savings.
`go vet ./...`	Suspicious constructs: printf format mismatches, unreachable code, unused results, suspicious shifts.
`staticcheck ./...`	Advanced linter: performance pitfalls (SA9003: empty branch, SA4006: unused value, SA1019: deprecated API).
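`go test -race ./...`	Data race detection at runtime. It does not flag false sharing, which is a cache-line effect between distinct variables and not a race; confirm that with benchmarks or hardware counters.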
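
A small sketch of the `unsafe.Sizeof` comparison from the table above. The two struct types are hypothetical and hold the same fields in different orders.

```go
package main

import (
	"fmt"
	"unsafe"
)

// padded interleaves small and large fields, so padding follows each bool.
type padded struct {
	active bool
	count  int64
	dirty  bool
}

// compact puts the widest field first, letting the two bools sit side by side.
type compact struct {
	count  int64
	active bool
	dirty  bool
}

// unsafe.Sizeof yields compile-time constants, so these cost nothing at run time.
const (
	paddedSize  = unsafe.Sizeof(padded{})
	compactSize = unsafe.Sizeof(compact{})
)

func main() {
	var p padded
	var c compact
	fmt.Printf("padded:  %d bytes (count at offset %d)\n", paddedSize, unsafe.Offsetof(p.count))
	fmt.Printf("compact: %d bytes (count at offset %d)\n", compactSize, unsafe.Offsetof(c.count))
	fmt.Printf("saved:   %d bytes per value\n", paddedSize-compactSize)
}
```

On amd64 or arm64 this reports 24 and 16 bytes. Exact sizes depend on the target architecture's alignment rules, so measure on the platform you deploy to.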

Third-Party Profiling

Tool	What it adds	When to use
fgprof (`github.com/felixge/fgprof`)	Full goroutine profiler — captures both on-CPU and off-CPU (I/O wait) time in a single profile. Standard pprof CPU profiles only show on-CPU time.	pprof CPU profile shows low CPU% but latency is high.
Pyroscope / Parca	Continuous profiling platforms — aggregate pprof profiles over time, compare across deployments, detect regressions.	Production performance monitoring, historical trend analysis. → See `samber/cc-skills-golang@golang-observability` skill for setup.
Linux perf (`perf record -g ./app && perf report`)	Hardware performance counters: cache misses, branch mispredictions, TLB misses. Requires `perf_data_converter` for pprof format.	CPU microarchitecture-level analysis when pprof isn't granular enough.

Execution Trace Reference

go tool trace shows what pprof cannot: scheduling delays, GC stop-the-world phases, goroutine state transitions, and why goroutines are not running. pprof samples what's on-CPU; trace records every state transition at nanosecond precision.

Use the execution tracer when:

pprof shows low CPU% but latency is high (goroutines waiting, not working)
You suspect GC pauses are causing tail latency spikes
You need to understand goroutine scheduling and contention
You want to see the wall-clock timeline of concurrent operations

Generating Traces

From benchmarks

go test -bench=BenchmarkParse -trace=trace.out ./pkg/parser
go tool trace trace.out

From running service

Requires import _ "net/http/pprof":

# Capture 5 seconds of trace data (adjust duration as needed); quote the URL so the shell does not glob the ?
curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=5'
go tool trace trace.out

Warning: traces produce megabytes of data per second. Keep captures short: 5-10 seconds is typical. Longer traces are unwieldy, slow to parse, and may consume significant memory when opened.

From tests

go test -trace=trace.out ./pkg/parser
go tool trace trace.out

From code (programmatic)

import "runtime/trace"

f, _ := os.Create("trace.out")
trace.Start(f)
defer trace.Stop()

Or capture a region of interest:

import "runtime/trace"

// Start tracing only when needed
f, _ := os.Create("trace.out")
trace.Start(f)

doExpensiveWork()

trace.Stop()
f.Close()
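
A fuller sketch of programmatic capture with error handling, wrapped in a reusable helper. The names `traceToFile` and `traceDemo` are hypothetical, and the busy loop stands in for real work.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"runtime/trace"
)

// traceToFile records an execution trace of work into the file at path.
func traceToFile(path string, work func()) (err error) {
	f, err := os.Create(path)
	if err != nil {
		return fmt.Errorf("create trace file: %w", err)
	}
	// Deferred calls run last-in first-out: trace.Stop flushes, then the file closes.
	defer func() {
		if cerr := f.Close(); cerr != nil && err == nil {
			err = cerr
		}
	}()
	if err := trace.Start(f); err != nil {
		return fmt.Errorf("start trace: %w", err)
	}
	defer trace.Stop()

	work()
	return nil
}

// traceDemo traces a trivial workload into a temp dir and returns the file size.
func traceDemo() (int64, error) {
	dir, err := os.MkdirTemp("", "trace-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "trace.out")
	err = traceToFile(path, func() {
		sum := 0
		for i := 0; i < 1_000_000; i++ {
			sum += i
		}
		_ = sum
	})
	if err != nil {
		return 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	size, err := traceDemo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("wrote %d bytes of trace data\n", size)
}
```

`trace.Start` returns an error if tracing is already enabled, so checking it avoids silently producing an empty file.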

Full Command Reference

Opening traces

# Open trace in web browser (default — starts HTTP server, opens browser)
go tool trace trace.out

# Open on a specific port
go tool trace -http=:8080 trace.out

# Open on a specific host:port (e.g., for remote access)
go tool trace -http=0.0.0.0:8080 trace.out

Extracting pprof profiles from traces

go tool trace can convert trace data into pprof-compatible profiles. This bridges the two tools — you capture with the tracer (nanosecond events) and analyze with pprof (statistical aggregation with top, list, peek):

# Network blocking profile — where goroutines wait on network I/O
go tool trace -pprof=net trace.out > net.prof
go tool pprof -top net.prof

# Synchronization blocking profile — mutexes, channels, wait groups
go tool trace -pprof=sync trace.out > sync.prof
go tool pprof -top sync.prof

# Syscall blocking profile — system calls that block goroutines
go tool trace -pprof=syscall trace.out > syscall.prof
go tool pprof -top syscall.prof

# Scheduler latency profile — time between becoming runnable and actually running
go tool trace -pprof=sched trace.out > sched.prof
go tool pprof -top sched.prof

You can chain with any pprof command — e.g., annotated source for a blocking function:

go tool trace -pprof=sync trace.out > sync.prof
go tool pprof -list=handleRequest sync.prof
go tool pprof -svg sync.prof > sync-blocking.svg

Full capture-to-analysis workflows

# Workflow 1: benchmark trace — capture, view, extract blocking profile
go test -bench=BenchmarkParse -trace=trace.out ./pkg/parser
go tool trace trace.out                                         # visual timeline
go tool trace -pprof=sync trace.out > sync.prof                 # extract sync blocking
go tool pprof -top -cum sync.prof                               # find worst sync blockers
go tool pprof -list=processOrder sync.prof                      # annotated source

# Workflow 2: production trace — capture from running service, analyze scheduling
curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=5'
go tool trace trace.out                                         # visual timeline
go tool trace -pprof=sched trace.out > sched.prof               # extract scheduling latency
go tool pprof -top sched.prof                                   # call sites with the worst scheduling delay
go tool pprof -svg sched.prof > sched.svg                       # graph of scheduling bottlenecks

# Workflow 3: test trace — capture during test run
go test -trace=trace.out -run=TestSlowIntegration ./pkg/api
go tool trace trace.out                                         # visual timeline
go tool trace -pprof=net trace.out > net.prof                   # extract network blocking
go tool pprof -top net.prof                                     # find network wait sites

`go tool trace` flags summary

Flag	Example	Purpose
(none)	`go tool trace trace.out`	Open trace in web browser (default)
`-http=:PORT`	`go tool trace -http=:9090 trace.out`	Set HTTP server address for the web UI
`-pprof=TYPE`	`go tool trace -pprof=net trace.out > net.prof`	Extract pprof profile from trace. Types: `net`, `sync`, `syscall`, `sched`

HTTP endpoints served by the web UI

When go tool trace trace.out starts its HTTP server, it exposes these pages:

Endpoint	What it shows
`/`	Index page with links to all views
`/trace`	Interactive timeline viewer (Chrome trace viewer) — the main visualization
`/goroutines`	Goroutine analysis — summary table of all goroutine types, counts, and execution stats
`/goroutine/<id>`	Detailed view of a specific goroutine — its full lifecycle timeline

From /goroutines, click on a goroutine type to see all instances and their execution statistics (total time, scheduled time, blocked time). Click an individual goroutine to see its timeline.

Web UI

Main views

The web UI (opened by go tool trace trace.out) shows a timeline where each horizontal lane represents a processor (P), goroutine, or system event:

Trace viewer (/trace) — interactive timeline with:
P lanes — one per logical processor (GOMAXPROCS), showing which goroutine runs on each P at each moment
Goroutine lanes — each goroutine's lifecycle: created → runnable → running → waiting → running → …
GC events — mark phases, sweep, STW pauses shown as colored bands across all P lanes
System events — syscalls, network I/O, timer events
User annotations — tasks, regions, and log messages from runtime/trace API

Goroutine analysis (/goroutines) — summary table:
Groups goroutines by creation stack trace (type)
Shows count, total execution time, total scheduling wait, total blocking time
Click a type to see individual goroutine statistics
Click an individual goroutine to see its timeline

Navigating the trace viewer

The trace viewer uses the Chrome tracing UI (also used by Chrome DevTools):

Key/Action	Effect
`W` / scroll up	Zoom in (time axis)
`S` / scroll down	Zoom out (time axis)
`A`	Pan left
`D`	Pan right
Click on event	Show details panel at bottom — goroutine ID, duration, stack trace
`Shift+click`	Select a time range — highlights all events in that window
`M`	Mark current selection
`/`	Search for events by name
`?`	Show keyboard shortcuts

Reading the timeline

Color coding:

Green bars on P lanes = goroutine actively executing
Blue bars = syscall (goroutine pinned to OS thread)
Orange/yellow marks = scheduling events (goroutine becoming runnable)
Red bands across all P lanes = GC stop-the-world pause
Light blue bands = GC concurrent mark phase
Purple = user-defined regions (from trace.WithRegion)

Gaps in P lanes = the processor was idle (no runnable goroutines, or goroutines blocked). Many idle gaps while goroutines sit runnable suggest scheduling contention.

What to Look For

Goroutine states

The trace timeline color-codes goroutine states:

Color	State	Meaning	What it indicates
Green	Running	Actively executing on a P	Normal — doing useful work
Yellow/Orange	Runnable	Ready to run but waiting for a P	CPU-saturated — too many runnable goroutines competing for too few processors
Red/Pink	Waiting	Blocked on I/O, channel, mutex, sleep, select	I/O-bound or contention — investigate what it's waiting on
Blue	GC assist	Drafted by GC to help mark/sweep	GC pressure — too many allocations forcing goroutines to help the collector

GC phases

GC events appear as colored bands across all P lanes:

Mark assist — goroutines drafted to help GC scan the heap. Visible as gaps in application goroutine execution. The runtime forces goroutines to assist with GC work in proportion to their allocation rate — heavy allocators get taxed more.
STW (stop-the-world) — brief phases where all goroutines are stopped (mark setup, mark termination). These cause latency spikes visible as vertical bands across all lanes.
Sweep — concurrent sweep of unreachable objects. Usually low overhead but can accumulate if the heap is large.

Diagnosing GC issues from traces:

Frequent GC cycles with long mark assist = too many allocations (reduce allocation rate)
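Long mark phases = many pointers for the GC to scan (reduce pointer density). The STW phases themselves do not scan the heap, so an unusually long one deserves a separate look at what delayed stopping the goroutines.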
GC cycles clustering after specific operations = those operations allocate heavily

Scheduling latency

Time between a goroutine becoming runnable and actually running. High scheduling latency means:

Too many goroutines competing for GOMAXPROCS processors
OS scheduling interference (noisy neighbors, CPU throttling)
Goroutines pinned to busy threads by cgo or long syscalls

What to look for:

Yellow (runnable) gaps before green (running) segments — the longer the yellow gap, the higher the scheduling latency
Many goroutines in runnable state simultaneously — indicates CPU saturation
Uneven distribution across Ps — one P overloaded while others are idle suggests work imbalance

Network/sync blocking

Long red/pink periods on a goroutine = it's blocked waiting. Click the block event to see what it's waiting on (channel receive, mutex lock, network read, etc.)
Many goroutines blocked on the same channel or mutex = serialization bottleneck. All work funnels through one point.
Goroutines blocked on network I/O = external dependency latency. The Go code can't do anything faster — the bottleneck is upstream. Use -pprof=net to generate a pprof profile of network wait locations.

Goroutine creation and destruction

The trace shows goroutine lifecycle events. Look for:

Goroutines created in a loop without bound = potential goroutine leak
Goroutines that are created but never finish = leak — they accumulate over time
Very short-lived goroutines created repeatedly = high overhead from goroutine creation/scheduling (consider batching or worker pools)
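
One way to replace per-item goroutines with a bounded worker pool, as a sketch. The function name `processAll` and the squaring workload are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll applies fn to every item using a fixed number of goroutines,
// so the trace shows a few long-lived workers and not one goroutine per item.
func processAll(items []int, workers int, fn func(int) int) []int {
	if workers < 1 {
		workers = 1
	}
	results := make([]int, len(items))
	jobs := make(chan int) // carries indexes into items

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = fn(items[i]) // each index is written by exactly one worker
			}
		}()
	}

	for i := range items {
		jobs <- i
	}
	close(jobs) // lets the workers' range loops end
	wg.Wait()
	return results
}

func main() {
	squares := processAll([]int{1, 2, 3, 4, 5}, 2, func(n int) int { return n * n })
	fmt.Println(squares)
}
```

Closing the channel and waiting on the WaitGroup guarantees every worker exits, which also rules out the leak patterns listed above.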

Custom Annotations

Add application-level context to traces so you can correlate runtime events with business operations.

Tasks

A task represents a logical operation that may span multiple goroutines:

import (
    "context"
    "runtime/trace"
)

func processOrder(ctx context.Context, order Order) error {
    ctx, task := trace.NewTask(ctx, "processOrder")
    defer task.End()

    // All trace events in this context are grouped under the task
    validate(ctx, order)
    charge(ctx, order)
    fulfill(ctx, order)
    return nil
}

Tasks appear as named groups in the trace timeline. You can filter the trace view to show only events belonging to a specific task.

Regions

A region represents a phase within a task or goroutine:

func validate(ctx context.Context, order Order) {
    trace.WithRegion(ctx, "validateAddress", func() {
        // this block is annotated as a region
        validateAddress(order.Address)
    })

    trace.WithRegion(ctx, "validatePayment", func() {
        validatePayment(order.Payment)
    })
}

Regions appear as labeled spans on the goroutine's timeline, making it easy to see which phase of processing takes the most wall-clock time.

Log messages

Add point-in-time log messages to the trace:

trace.Log(ctx, "orderID", order.ID)
trace.Log(ctx, "status", "payment_verified")

Logs appear as markers on the timeline — useful for correlating trace events with specific data.

When to use annotations

Always in server request handlers — wrap each request in a task
Performance-critical paths — add regions to phases you want to measure wall-clock time for
Debugging intermittent latency — add logs at key decision points to see what happened in the slow trace

Annotations add negligible overhead when tracing is disabled (they check a flag and return immediately).
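
The three annotation types can be combined in one runnable sketch. `handleOrder` and `traceOrder` are hypothetical names, the regions are empty placeholders, and the trace is written to memory only to keep the example self-contained.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
	"runtime/trace"
)

// handleOrder is annotated with one task, one log message, and two regions.
func handleOrder(ctx context.Context, orderID string) {
	ctx, task := trace.NewTask(ctx, "handleOrder")
	defer task.End()

	trace.Log(ctx, "orderID", orderID)
	trace.WithRegion(ctx, "validate", func() { /* validation work */ })
	trace.WithRegion(ctx, "charge", func() { /* payment work */ })
}

// traceOrder records one handleOrder call and returns the trace size in bytes.
func traceOrder(orderID string) (int, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return 0, err
	}
	handleOrder(context.Background(), orderID)
	trace.Stop() // flushes buffered events into buf
	return buf.Len(), nil
}

func main() {
	n, err := traceOrder("order-42")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("captured %d bytes of annotated trace data\n", n)
}
```

Passing the context returned by `trace.NewTask` down the call chain is what ties the regions and log messages to the task.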

Flight Recorder (Go 1.25+)

The flight recorder solves a fundamental problem with execution traces in long-running services: when a problem occurs (timeout, failed health check), it's already too late to call trace.Start(). The flight recorder keeps a circular buffer of recent trace data in memory, and you snapshot it to disk when something goes wrong — like an airplane's black box.

Setup

import (
    "runtime/trace"
    "time"
)

fr := trace.NewFlightRecorder(trace.FlightRecorderConfig{
    MinAge:   10 * time.Second, // keep at least 10s of data
    MaxBytes: 5 << 20,          // cap at 5 MiB to limit memory usage
})
if err := fr.Start(); err != nil {
    return err
}

Sizing guidance:

MinAge — set to ~2x your problem window. For 5-second timeout debugging, use 10 seconds. The runtime may retain more data than MinAge if MaxBytes allows.
MaxBytes — busy services generate ~1-10 MB/s of trace data. Start with 1-5 MiB and adjust. MaxBytes takes precedence over MinAge — when the buffer fills, older data is discarded regardless of age.
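
A rough worked estimate of how MaxBytes bounds the retained window. The helper `retainedWindow` is hypothetical and assumes a steady trace data rate, which real services will not have.

```go
package main

import (
	"fmt"
	"time"
)

// retainedWindow estimates how much history fits in maxBytes at a steady rate.
func retainedWindow(maxBytes uint64, bytesPerSecond float64) time.Duration {
	if bytesPerSecond <= 0 {
		return 0
	}
	seconds := float64(maxBytes) / bytesPerSecond
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	const maxBytes = 5 << 20 // 5 MiB, matching the setup above
	for _, rate := range []float64{1e6, 10e6} { // 1 MB/s and 10 MB/s
		window := retainedWindow(maxBytes, rate).Round(100 * time.Millisecond)
		fmt.Printf("at %.0f MB/s: about %v of history\n", rate/1e6, window)
	}
}
```

At those rates a 5 MiB buffer holds roughly 5 seconds and half a second of data, so a 10-second MinAge would not be honored. Raise MaxBytes if the problem window needs to fit.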

Snapshot on error

Capture the trace buffer when something unexpected happens. Use sync.Once to prevent multiple snapshots overwriting each other:

var snapshotOnce sync.Once

func captureSnapshot(fr *trace.FlightRecorder) {
    snapshotOnce.Do(func() {
        f, err := os.Create("snapshot.trace")
        if err != nil {
            log.Printf("snapshot file: %v", err)
            return
        }
        defer f.Close()

        if _, err := fr.WriteTo(f); err != nil {
            log.Printf("snapshot write: %v", err)
            return
        }
        fr.Stop()
        log.Printf("captured snapshot to %s", f.Name())
    })
}

Trigger patterns

// Pattern 1: slow request detection
http.HandleFunc("/api/order", func(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    // ... handler logic ...

    if fr.Enabled() && time.Since(start) > 100*time.Millisecond {
        go captureSnapshot(fr)
    }
})

// Pattern 2: health check failure
if !healthCheck() && fr.Enabled() {
    go captureSnapshot(fr)
}

// Pattern 3: HTTP endpoint for on-demand capture
http.HandleFunc("/debug/flightrecorder", func(w http.ResponseWriter, r *http.Request) {
    if !fr.Enabled() {
        http.Error(w, "flight recorder not active", http.StatusServiceUnavailable)
        return
    }
    w.Header().Set("Content-Type", "application/octet-stream")
    w.Header().Set("Content-Disposition", "attachment; filename=trace.out")
    if _, err := fr.WriteTo(w); err != nil {
        log.Printf("flight recorder write: %v", err)
    }
})
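
Pattern 1 can be factored into middleware so every route gets the same trigger. This is a sketch: `snapshotOnSlow` is a hypothetical name, and the `onSlow` callback stands in for a call such as `captureSnapshot(fr)`, which keeps the example independent of the Go 1.25 flight recorder API.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// snapshotOnSlow wraps next and runs onSlow in a new goroutine whenever a
// request takes longer than threshold.
func snapshotOnSlow(threshold time.Duration, onSlow func(), next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		if time.Since(start) > threshold {
			go onSlow() // keep the handler goroutine from waiting on trace I/O
		}
	})
}

// demoSlowTrigger reports whether a deliberately slow handler fires the callback.
func demoSlowTrigger() bool {
	fired := make(chan struct{}, 1)
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(20 * time.Millisecond) // simulated slow work
		w.WriteHeader(http.StatusOK)
	})
	h := snapshotOnSlow(5*time.Millisecond, func() { fired <- struct{}{} }, slow)

	req := httptest.NewRequest(http.MethodGet, "/api/order", nil)
	h.ServeHTTP(httptest.NewRecorder(), req)

	select {
	case <-fired:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("slow request triggered snapshot:", demoSlowTrigger())
}
```

Because the callback can fire from many requests at once, it should be idempotent, which the sync.Once guard in captureSnapshot provides.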

Analyzing a snapshot

go tool trace snapshot.trace

The snapshot contains the same data as a regular trace — use all the same analysis techniques (timeline viewer, goroutine analysis, pprof extraction). The flight recorder's flow events are particularly useful for diagnosing lock contention and goroutine stalls that caused the anomaly.

Constraints

At most one flight recorder may be active at a time (this restriction may be relaxed in future Go versions)
A flight recorder can run concurrently with trace.Start — both can be active simultaneously
Only one goroutine may call WriteTo at a time — the sync.Once pattern handles this naturally
Stop() blocks until any concurrent WriteTo completes

When to use flight recorder vs regular tracing

Scenario	Tool	Why
Investigating a known slow operation	`go test -trace` or `trace.Start`/`Stop`	You know when to start and stop
Intermittent latency spikes in production	Flight recorder	You don't know when the spike will happen — the buffer captures it retroactively
Post-mortem after a timeout or crash	Flight recorder	The problem already happened; regular tracing would miss it
Continuous performance monitoring	`samber/cc-skills-golang@golang-observability` (Pyroscope)	Flight recorder is for one-shot diagnosis, not continuous collection

Overhead and Practical Limits

Concern	Guidance
Runtime overhead	~1-2% CPU during capture; negligible when not capturing
Data volume	Traces generate MB/s of data. A 10-second trace of a busy service can be 50-100 MB
Capture duration	5-10 seconds is typical. Longer traces are slow to open and hard to navigate
Memory to view	`go tool trace` loads the entire trace into memory. Large traces may need 1 GB+ RAM
Browser performance	The web UI can struggle with traces >100 MB. Use short captures.
Production use	Safe for short captures on a single instance. Do not capture continuously.

Trace vs pprof: When to Use Which

Question	Tool	Why
Where does CPU time go?	pprof CPU profile	Statistical sampling, low overhead, good for aggregate view
Why is latency high but CPU low?	go tool trace	Shows goroutine waiting states — I/O, channels, mutexes
Where do allocations happen?	pprof heap profile	Per-function allocation counts and sizes
Why are GC pauses long?	go tool trace	Shows STW phases, mark assist, GC timeline
Is there lock contention?	pprof mutex/block + trace	pprof quantifies it; trace shows the timeline
Are goroutines leaking?	pprof goroutine + trace	pprof shows the stack; trace shows creation/lifecycle
Which goroutines compete for CPU?	go tool trace	Shows runnable vs running states across all Ps
What's the wall-clock breakdown of a request?	go tool trace (with annotations)	Timeline view with tasks and regions

When in doubt, start with pprof (lower overhead, simpler output). Use trace when pprof doesn't explain the latency or when you need the wall-clock timeline view.

How it compares

Use golang-benchmark for unit microbenchmarks; use golang-observability for continuous pprof and Pyroscope profiling in production.

FAQ

Why does golang-benchmark require b.Loop in Go 1.24?

golang-benchmark requires b.Loop() because Go 1.24 introduced it to replace legacy for-range b.N loops that need manual b.ResetTimer calls and sink variables. Agents without the skill often emit outdated benchmark patterns.

What benchmark mistake does golang-benchmark prevent?

golang-benchmark prevents Go agents from writing for i := 0; i < b.N; i++ or for range b.N loops without sinks. Without a sink the compiler can eliminate the benchmarked call as dead code, which skews timing. The skill asserts b.Loop with proper fixture setup such as 1MB SHA-256 inputs.

Is Golang Benchmark safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
