Golang Performance

Name: Golang Performance
Author: samber

samber/cc-skills-golang

35.1k installs
2.8k repo stars
Updated July 27, 2026

Golang-performance is a code-review skill teaching performance optimization via profiling and benchmarking.

About

Go performance optimization skill for identifying and fixing bottlenecks using profiling-first methodology. Covers allocation reduction, CPU optimization, memory layout, GC tuning, connection pooling, caching strategies, and benchmarking with tools like pprof and benchstat.

Iterative optimization cycle: profile > measure > diagnose > improve
Decision tree mapping bottleneck signals to specific optimization strategies
Memory, CPU, I/O, GC tuning, and caching patterns with benchmarking

Golang Performance by the numbers

35,115 all-time installs (skills.sh)
+608 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #22 of 4,386 Backend & APIs skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/samber/cc-skills-golang --skill golang-performance

Security audit	3 / 3 scanners passed

How do you detect goroutine leaks in Go production?

Profile production systems, identify bottlenecks, and apply targeted optimizations with measured impact.

Who is it for?

Go backend engineers tuning production services who need agents to write Prometheus alerts and profiling steps for GC, goroutines, and memory.

Skip if: Early prototype code without metrics infrastructure or frontend JavaScript performance tuning unrelated to Go runtime behavior.

When should I use this skill?

Go services show high GC pause times, goroutine counts above 10K, or RSS memory approaching container limits.

What you get

Prometheus alert rules for GC pauses and goroutine counts, profiling guidance, and memory limit forecasts.

Prometheus alert rules
profiling workflow steps
GOGC tuning guidance

By the numbers

GoroutineLeak alert fires when go_goroutines exceeds 10000 for 5 minutes
HighGCPauseTime alert threshold is average GC pause greater than 10ms

Files

SKILL.md

Persona: You are a Go performance engineer. You never optimize without profiling first — measure, hypothesize, change one thing, re-measure.

Thinking mode: Use ultrathink for performance optimization. Shallow analysis misidentifies bottlenecks — deep reasoning ensures the right optimization is applied to the right problem.

Modes:

Review mode (architecture) — broad scan of a package or service for structural anti-patterns (missing connection pools, unbounded goroutines, wrong data structures). Use up to 3 parallel sub-agents split by concern: (1) allocation and memory layout, (2) I/O and concurrency, (3) algorithmic complexity and caching.
Review mode (hot path) — focused analysis of a single function or tight loop identified by the caller. Work sequentially; one sub-agent is sufficient.
Optimize mode — a bottleneck has been identified by profiling. Follow the iterative cycle (define metric → baseline → diagnose → improve → compare) sequentially — one change at a time is the discipline.

Dependencies:

benchstat: go install golang.org/x/perf/cmd/benchstat@latest

Go Performance Optimization

Core Philosophy

1. Profile before optimizing: intuition about bottlenecks is wrong ~80% of the time. Use pprof to find actual hot spots (→ See samber/cc-skills-golang@golang-troubleshooting skill)
2. Allocation reduction yields the biggest ROI: Go's GC is fast but not free. Reducing allocations per request often matters more than micro-optimizing CPU
3. Document optimizations: add code comments explaining why a pattern is faster, with benchmark numbers when available. Future readers need context to avoid reverting an "unnecessary" optimization

Rule Out External Bottlenecks First

Before optimizing Go code, verify the bottleneck is in your process — if 90% of latency is a slow DB query or API call, reducing allocations won't help.

Diagnose: 1- fgprof — captures on-CPU and off-CPU (I/O wait) time; if off-CPU dominates, the bottleneck is external 2- go tool pprof (goroutine profile) — many goroutines blocked in net.(*conn).Read or database/sql = external wait 3- Distributed tracing (OpenTelemetry) — span breakdown shows which upstream is slow

When external: optimize that component instead — query tuning, caching, connection pools, circuit breakers (→ See samber/cc-skills-golang@golang-database skill, Caching Patterns).

Iterative Optimization Methodology

The cycle: Define Goals → Benchmark → Diagnose → Improve → Benchmark

1. Define your metric: latency, throughput, memory, or CPU? Without a target, optimizations are random
2. Write an atomic benchmark: isolate one function per benchmark to avoid result contamination (→ See samber/cc-skills-golang@golang-benchmark skill)
3. Measure baseline: go test -bench=BenchmarkMyFunc -benchmem -count=6 ./pkg/... | tee /tmp/report-1.txt
4. Diagnose: use the Diagnose lines in each deep-dive section to pick the right tool
5. Improve: apply ONE optimization at a time with an explanatory comment
6. Compare: benchstat /tmp/report-1.txt /tmp/report-2.txt to confirm statistical significance
7. Commit: paste the benchstat output in the commit body so reviewers and future readers see the exact improvement; follow the perf(scope): summary commit type
8. Repeat: increment report number, tackle next bottleneck

Refer to library documentation for known patterns before inventing custom solutions. Keep all /tmp/report-*.txt files as an audit trail.
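As a hedged sketch of steps 2 and 3, an atomic benchmark might look like the following. The function `joinIDs` and the benchmark name are hypothetical. In a real package the benchmark would live in a `_test.go` file and run through `go test -bench`; here `testing.Benchmark` drives it from `main` only so the snippet runs on its own.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"testing"
)

// sink keeps the result alive so the call is not optimized away.
var sink string

// joinIDs is a hypothetical function under test: it renders IDs as "1,2,3".
func joinIDs(ids []int) string {
	var sb strings.Builder
	for i, id := range ids {
		if i > 0 {
			sb.WriteByte(',')
		}
		sb.WriteString(strconv.Itoa(id))
	}
	return sb.String()
}

// BenchmarkJoinIDs isolates one function so the measurement is not
// contaminated by setup work or unrelated code paths.
func BenchmarkJoinIDs(b *testing.B) {
	ids := make([]int, 1000)
	for i := range ids {
		ids[i] = i
	}
	b.ReportAllocs() // report allocs/op, like -benchmem does
	b.ResetTimer()   // exclude the setup above from the timing
	for i := 0; i < b.N; i++ {
		sink = joinIDs(ids)
	}
}

func main() {
	res := testing.Benchmark(BenchmarkJoinIDs)
	fmt.Println(res.String(), res.MemString())
}
```

Running the same benchmark with `-count=6` before and after a change gives benchstat enough samples to judge whether a difference is noise.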

Decision Tree: Where Is Time Spent?

Bottleneck	Signal (from pprof)	Action
Too many allocations	`alloc_objects` high in heap profile	Memory optimization
CPU-bound hot loop	function dominates CPU profile	CPU optimization
GC pauses / OOM	high GC%, container limits	Runtime tuning
Network / I/O latency	goroutines blocked on I/O	I/O & networking
Repeated expensive work	same computation/fetch multiple times	Caching patterns
Wrong algorithm	O(n²) where O(n) exists	Algorithmic complexity
Lock contention	mutex/block profile hot	→ See `samber/cc-skills-golang@golang-concurrency` skill
Slow queries	DB time dominates traces	→ See `samber/cc-skills-golang@golang-database` skill

Common Mistakes

Mistake	Fix
Optimizing without profiling	Profile with pprof first — intuition is wrong ~80% of the time
Default `http.Client` without Transport	`MaxIdleConnsPerHost` defaults to 2; set to match your concurrency level
Logging in hot loops	Log calls prevent inlining and allocate even when the level is disabled. Use `slog.LogAttrs`
`panic`/`recover` as control flow	panic allocates a stack trace and unwinds the stack; use error returns
`unsafe` without benchmark proof	Only justified when profiling shows >10% improvement in a verified hot path
No GC tuning in containers	Set `GOMEMLIMIT` to 80-90% of container memory to prevent OOM kills
`reflect.DeepEqual` in production	50-200x slower than typed comparison; use `slices.Equal`, `maps.Equal`, `bytes.Equal`

Deep Dives

Memory Optimization — allocation patterns, backing array leaks, sync.Pool, struct alignment
CPU Optimization — inlining, cache locality, false sharing, ILP, reflection avoidance
I/O & Networking — HTTP transport config, streaming, JSON performance, cgo, batch operations
Runtime Tuning — GOGC, GOMEMLIMIT, GC diagnostics, GOMAXPROCS, PGO
Caching Patterns — algorithmic complexity, compiled patterns, singleflight, work avoidance
Production Observability — Prometheus metrics, PromQL queries, continuous profiling, alerting rules

CI Regression Detection

Automate benchmark comparison in CI to catch regressions before they reach production. → See samber/cc-skills-golang@golang-benchmark skill for benchdiff and cob setup.

Cross-References

→ See samber/cc-skills-golang@golang-benchmark skill for benchmarking methodology, benchstat, and b.Loop() (Go 1.24+)
→ See samber/cc-skills-golang@golang-troubleshooting skill for pprof workflow, escape analysis diagnostics, and performance debugging
→ See samber/cc-skills-golang@golang-data-structures skill for slice/map preallocation and strings.Builder
→ See samber/cc-skills-golang@golang-concurrency skill for worker pools, sync.Pool API, goroutine lifecycle, and lock contention
→ See samber/cc-skills-golang@golang-safety skill for defer in loops, slice backing array aliasing
→ See samber/cc-skills-golang@golang-database skill for connection pool tuning and batch processing
→ See samber/cc-skills-golang@golang-observability skill for continuous profiling in production

# GC taking too much time per cycle
- alert: HighGCPauseTime
  expr: rate(go_gc_duration_seconds_sum[5m]) / rate(go_gc_duration_seconds_count[5m]) > 0.01
  for: 10m
  annotations:
    summary: "Average GC pause >10ms — reduce allocations or tune GOGC"

# Goroutine leak
- alert: GoroutineLeak
  expr: go_goroutines > 10000
  for: 5m
  annotations:
    summary: "Goroutine count >10K — check for leaked goroutines"

# Memory approaching container limit
- alert: MemoryNearLimit
  expr: predict_linear(process_resident_memory_bytes[1h], 3600) > <container_limit_bytes>
  for: 15m
  annotations:
    summary: "RSS projected to exceed container limit within 1h"

Caching Patterns

The fastest code is code that doesn't run. Caching pre-computed results, deduplicating concurrent requests, and avoiding unnecessary work are often the highest-leverage performance improvements.

Compiled Pattern Caching

Diagnose: 1- go tool pprof (CPU profile) — look for regexp.Compile, regexp.MustCompile, or template.Parse appearing in hot paths; their presence means patterns are being recompiled per call instead of once 2- go test -bench -benchmem — benchmark per-call compilation vs cached version; expect 10-12x improvement and allocs/op dropping to zero for the compilation step

Regexp at package level

regexp.Compile parses a pattern into a state machine at a cost of ~5,700ns per compilation, while a match on an already compiled regexp costs ~450ns. Compiling on every call therefore makes each call roughly 10-12x slower than it needs to be:

// Bad — compiled on every call
func isValid(email string) bool {
    re := regexp.MustCompile(`^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$`)
    return re.MatchString(email)
}

// Good — compiled once, safe for concurrent use
var emailRegex = regexp.MustCompile(`^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$`)

func isValid(email string) bool { return emailRegex.MatchString(email) }

Note: regexp.MustCompile panics on invalid patterns — fine for package-level constants (caught at startup). Use regexp.Compile for user-provided patterns. Go's regexp uses linear-time matching (no backtracking).
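For user-provided patterns, the same compile-once idea can be kept with a small cache keyed by the pattern string. This is a sketch only: `compileCached` and `patternCache` are hypothetical names, and the cache is unbounded, so it assumes the set of distinct patterns is small or trusted.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// patternCache maps pattern strings to *regexp.Regexp values.
var patternCache sync.Map

// compileCached compiles a pattern at most a handful of times and returns
// the shared compiled value afterwards. Invalid patterns yield an error
// instead of a panic.
func compileCached(pattern string) (*regexp.Regexp, error) {
	if re, ok := patternCache.Load(pattern); ok {
		return re.(*regexp.Regexp), nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	// If another goroutine stored the pattern first, reuse its value.
	actual, _ := patternCache.LoadOrStore(pattern, re)
	return actual.(*regexp.Regexp), nil
}

func main() {
	re, err := compileCached(`^[a-z]+[0-9]*$`)
	if err != nil {
		fmt.Println("invalid pattern:", err)
		return
	}
	fmt.Println(re.MatchString("abc123"))
}
```

With untrusted input, bound the cache (for example with an LRU) so callers cannot grow it without limit.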

Template caching

template.Parse is equally expensive. Parse once at startup:

var reportTmpl = template.Must(template.ParseFiles("templates/report.html"))
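A self-contained variant of the same idea, parsing from a string so it runs without template files. It uses `text/template` for simplicity, which is an assumption; the original line may well refer to `html/template`, whose `Must`/`Parse`/`Execute` calls have the same shape. `greetTmpl` and `renderGreeting` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// greetTmpl is parsed once at package initialization; a malformed
// template panics at startup instead of failing on the first request.
var greetTmpl = template.Must(template.New("greet").Parse("Hello, {{.Name}}!"))

// renderGreeting only executes the already parsed template.
func renderGreeting(name string) (string, error) {
	var sb strings.Builder
	err := greetTmpl.Execute(&sb, struct{ Name string }{Name: name})
	return sb.String(), err
}

func main() {
	out, err := renderGreeting("Ada")
	fmt.Println(out, err)
}
```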

Precomputed lookup tables

When a computation is pure (same input → same output) and the input space is small, replace calculation with array lookup:

var hexDigit = [16]byte{'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'}

func byteToHex(b byte) (byte, byte) {
    return hexDigit[b>>4], hexDigit[b&0x0f] // two array lookups vs branching logic
}

If the table fits in L1/L2 cache, a lookup is often faster than even simple computation. Benchmark to confirm, since a cold table costs a cache miss.
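As a usage sketch, the table above can drive a whole-slice encoder. The table and `byteToHex` are repeated here so the snippet compiles on its own; `encodeHex` is a hypothetical helper, and in production `encoding/hex` already provides this.

```go
package main

import "fmt"

var hexDigit = [16]byte{'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'}

// byteToHex returns the two hex characters for b using table lookups.
func byteToHex(b byte) (byte, byte) {
	return hexDigit[b>>4], hexDigit[b&0x0f]
}

// encodeHex writes two output bytes per input byte into a buffer that is
// allocated once with its final size.
func encodeHex(src []byte) string {
	dst := make([]byte, len(src)*2)
	for i, b := range src {
		dst[i*2], dst[i*2+1] = byteToHex(b)
	}
	return string(dst)
}

func main() {
	fmt.Println(encodeHex([]byte{0x00, 0xff, 0x1a}))
}
```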

Request-Level Caching

Diagnose: 1- go tool pprof (goroutine profile) — look for many goroutines blocked on the same external call (HTTP fetch, DB query); this signals a cache stampede where N goroutines all miss the cache simultaneously 2- fgprof — shows off-CPU wait time; look for the same fetch function dominating wall-clock time across many goroutines, confirming duplicated concurrent work 3- go tool pprof -alloc_objects — check if cache miss handling allocates heavily; high alloc counts on fetch functions confirm the stampede is also generating GC pressure

singleflight for cache stampede prevention

When a cache entry expires, many goroutines may simultaneously discover the miss and all request the same expensive computation. singleflight ensures only one goroutine fetches while others wait:

import "golang.org/x/sync/singleflight"

var (
    cache sync.Map
    sf    singleflight.Group
)

func GetWeather(city string) (string, error) {
    if val, ok := cache.Load(city); ok {
        return val.(string), nil
    }

    // Only one goroutine fetches; others block on the same key
    result, err, _ := sf.Do(city, func() (any, error) {
        data, err := fetchFromAPI(city)
        if err == nil { cache.Store(city, data) }
        return data, err
    })
    return result.(string), err
}

→ See samber/cc-skills-golang@golang-concurrency skill for singleflight API details and sync.Map vs RWMutex decision guidance. → Generics alternative: Use github.com/samber/go-singleflightx to avoid interface{} boxing overhead; expect 2-4x faster result retrieval compared to the standard library's singleflight.Group.

LRU caches

For bounded caches with eviction, the standard library's container/list works but has poor cache locality (each node is a separate heap allocation). For high-performance LRU:

`github.com/hashicorp/golang-lru` — thread-safe, simple API
`github.com/elastic/go-freelru` — merges hashmap and ringbuffer into contiguous memory, ~37x faster than sharded implementations

When using third-party cache libraries, refer to the library's official documentation for current API signatures.
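For reference, the standard-library approach mentioned above can be sketched with `container/list` plus a map. The `LRU` type and its methods are hypothetical names; the sketch is not safe for concurrent use and fixes keys to `string` and values to `int` for brevity.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key   string
	value int
}

// LRU is a bounded cache; the front of order is the most recently used.
type LRU struct {
	capacity int
	order    *list.List
	items    map[string]*list.Element
}

// NewLRU creates a cache holding at most capacity entries (minimum 1).
func NewLRU(capacity int) *LRU {
	if capacity < 1 {
		capacity = 1
	}
	return &LRU{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the value and marks the entry as recently used.
func (c *LRU) Get(key string) (int, bool) {
	el, ok := c.items[key]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el)
	return el.Value.(entry).value, true
}

// Put inserts or updates a value, evicting the least recently used entry
// when the cache is full.
func (c *LRU) Put(key string, value int) {
	if el, ok := c.items[key]; ok {
		el.Value = entry{key, value}
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(entry).key)
	}
	c.items[key] = c.order.PushFront(entry{key, value})
}

func main() {
	c := NewLRU(2)
	c.Put("a", 1)
	c.Put("b", 2)
	c.Get("a")    // "a" becomes most recently used
	c.Put("c", 3) // evicts "b"
	_, ok := c.Get("b")
	fmt.Println(ok)
}
```

Every `PushFront` allocates a list node on the heap, which is the cache-locality cost the paragraph above refers to.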

Algorithmic Complexity

Diagnose: 1- go tool pprof (CPU profile) — look for functions with high cumulative time that contain nested loops or repeated linear scans; these are algorithmic complexity bottlenecks 2- go test -bench — benchmark with different input sizes (100, 1K, 10K, 100K); if time grows quadratically (10x input → 100x time), the algorithm is O(n²) and needs replacement

Before micro-optimizing, check that the algorithm itself isn't the bottleneck. A constant-factor improvement on an O(n²) algorithm loses to a naive O(n log n) implementation at scale.

Common complexity traps in Go:

Pattern	Complexity	Fix	Fixed complexity
`slices.Contains` in a loop	O(n·m)	Build `map[T]struct{}` first, then lookup	O(n+m)
Nested loops for matching	O(n²)	Index with a map, sort+binary search, or `slices.BinarySearch`	O(n log n) or O(n)
Repeated `append` without prealloc	O(n) amortized, but O(log n) reallocations and copies	`make([]T, 0, n)`	O(n) with a single allocation
String concatenation with `+=`	O(n²) total copies	`strings.Builder`	O(n)
Linear scan for min/max/dedup	O(n) per query	Sort once, query many times	O(n log n) + O(log n) per query

Think in Big-O first, then optimize constants. A 10x constant-factor improvement matters; switching from O(n²) to O(n) matters more.

Work Avoidance

Diagnose: 1- go tool pprof (CPU profile) — look for linear scan functions (slices.Contains, slices.Index) or iterator chains (Filter, Map) consuming CPU in hot paths 2- go test -bench — benchmark the current approach vs a map-based or early-return version; expect O(n) → O(1) for membership tests, significant improvement for short-circuit loops

Map lookups over slice scanning

Contains(slice, element) is O(n). Map lookups are O(1). When doing multiple membership tests against the same collection, build a map once:

// Bad — O(n*m), checking Contains per element
for _, item := range subset {
    if !Contains(collection, item) { return false } // O(n) per check
}

// Good — O(n+m), build map once, O(1) lookups
seen := make(map[T]struct{}, len(collection))
for _, item := range collection { seen[item] = struct{}{} }
for _, item := range subset {
    if _, ok := seen[item]; !ok { return false }
}

Use struct{} (0 bytes) instead of bool (1 byte) for set maps.
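Wrapped into a complete generic function, the map-based version might look like this. `containsAll` is a hypothetical name; the body follows the fragment above.

```go
package main

import "fmt"

// containsAll reports whether every element of subset appears in
// collection. It builds the set once (O(n)) and then does O(1) lookups.
func containsAll[T comparable](collection, subset []T) bool {
	seen := make(map[T]struct{}, len(collection))
	for _, item := range collection {
		seen[item] = struct{}{}
	}
	for _, item := range subset {
		if _, ok := seen[item]; !ok {
			return false // early return on the first missing element
		}
	}
	return true
}

func main() {
	fmt.Println(containsAll([]string{"a", "b", "c"}, []string{"a", "c"}))
	fmt.Println(containsAll([]int{1, 2, 3}, []int{2, 4}))
}
```

For a single membership test, building the map costs more than one linear scan; the map pays off when the same collection is queried repeatedly.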

Early returns and short-circuit loops

Return immediately when the answer is known. Finding the target on iteration 3 of 1000 saves 997 iterations:

// Bad — always iterates full collection
found := false
for _, item := range collection {
    if item == target { found = true }
}
return found

// Good — returns on first match
for i := range collection {
    if collection[i] == target { return true }
}
return false

Avoid iterator chains

Chaining iterator operations (Filter → Map → First) creates closures and intermediate machinery. A direct loop is simpler and faster:

// Bad — creates 2 iterators with closures
result, ok := First(Filter(collection, predicate))

// Good: single pass, early return, no closures
for i := range collection {
    if predicate(collection[i]) { return collection[i], true }
}
var zero T // no match: return the zero value
return zero, false

Replace indirect function calls with direct loops

When a function wraps another function (e.g., FromSlicePtr calling Map with a closure), the closure indirection prevents inlining. Replace with a direct loop:

// Bad — Map() with closure, per-element function call overhead
func FromSlicePtr(items []*T) []T {
    return Map(items, func(p *T) T { return *p })
}

// Good — direct loop, inlineable, -13% to -17% time
func FromSlicePtr(items []*T) []T {
    result := make([]T, len(items))
    for i := range items { result[i] = *items[i] }
    return result
}

CPU Optimization

CPU-bound bottlenecks show up as functions dominating the CPU profile. The patterns below target the most common causes: missed inlining opportunities, poor cache utilization, and unnecessary computation.

Function Inlining

Diagnose: 1- go tool pprof (CPU profile) — look for hot functions with high cumulative CPU time; if a small helper dominates the profile, it's likely not being inlined 2- go build -gcflags="-m" — grep for "cannot inline" on your hot-path functions; the reason (e.g., "function too complex", "unhandled op") tells you what to simplify

The Go compiler inlines small functions, eliminating call overhead. Functions that exceed its inlining budget (many statements, calls to non-inlineable functions, and on older toolchains any loop) won't be inlined, which matters in tight loops called millions of times.

// Bad — log call prevents inlining
func abs(x int) int {
    if x < 0 {
        log.Printf("negative: %d", x) // blocks inlining
        return -x
    }
    return x
}

// Good — simple enough to inline
func abs(x int) int {
    if x < 0 { return -x }
    return x
}

Check inlining decisions:

go build -gcflags="-m" ./... 2>&1 | grep "can inline"
go build -gcflags="-m" ./... 2>&1 | grep "inlining call"

Move side effects (logging, metrics) outside hot-path functions or guard them with conditional checks.

Value receivers enable inlining

In fluent method chains, value receivers can let the compiler inline the whole chain and keep the value on the stack. Pointer receivers are inlineable too, but they add indirection and the returned pointer can force a heap allocation; confirm with -gcflags="-m" and a benchmark:

// Pointer receiver: indirection on every call, and c may escape to the heap
func (c *config) WithTimeout(d time.Duration) *config { c.timeout = d; return c }

// Value receiver: copy stays on the stack, -80% time in fluent chains
func (c config) WithTimeout(d time.Duration) config { c.timeout = d; return c }

Cache Locality

Diagnose: 1- go tool pprof (CPU profile) — look for loops over slices/matrices consuming disproportionate CPU; cache-miss-heavy code shows high runtime.memmove or flat time in simple index operations 2- go test -bench — benchmark row-first vs column-first traversal; expect 10-50x difference on large matrices purely from cache effects

Modern CPUs fetch data in 64-byte cache lines. Sequential memory access is dramatically faster than random access because the prefetcher can load the next cache line before you need it.

Row-major traversal

Go stores 2D arrays in row-major order. Column-first traversal jumps across memory, causing cache misses:

// Bad: column-first, jumps across memory (up to ~1M cache misses, one per element)
for col := 0; col < 1024; col++ {
    for row := 0; row < 1024; row++ {
        sum += matrix[row][col]
    }
}

// Good — row-first, sequential access (~125K cache misses)
for row := 0; row < 1024; row++ {
    for col := 0; col < 1024; col++ {
        sum += matrix[row][col]
    }
}

Performance difference: 10-50x purely from cache effects.
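A small runnable comparison, assuming a 1024x1024 matrix of `int64`. The function names and the `size` constant are hypothetical, and the timing in `main` is a crude single measurement, so use `go test -bench` for real numbers.

```go
package main

import (
	"fmt"
	"time"
)

const size = 1024

// sumRowFirst walks memory sequentially: the inner loop moves along a row.
func sumRowFirst(m *[size][size]int64) int64 {
	var sum int64
	for row := 0; row < size; row++ {
		for col := 0; col < size; col++ {
			sum += m[row][col]
		}
	}
	return sum
}

// sumColFirst jumps size*8 bytes between consecutive reads.
func sumColFirst(m *[size][size]int64) int64 {
	var sum int64
	for col := 0; col < size; col++ {
		for row := 0; row < size; row++ {
			sum += m[row][col]
		}
	}
	return sum
}

func main() {
	m := new([size][size]int64)
	for i := range m {
		for j := range m[i] {
			m[i][j] = int64(i + j)
		}
	}
	start := time.Now()
	r := sumRowFirst(m)
	rowTime := time.Since(start)

	start = time.Now()
	c := sumColFirst(m)
	colTime := time.Since(start)

	fmt.Println(r == c, "row-first:", rowTime, "column-first:", colTime)
}
```

Both functions return the same sum; only the access order differs, so any timing gap comes from the memory system.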

Contiguous 2D allocation

Allocating each row separately scatters data across the heap:

// Bad — N separate allocations, poor cache locality
matrix := make([][]float64, rows)
for i := range matrix { matrix[i] = make([]float64, cols) }

// Good — single contiguous allocation, cache-friendly
data := make([]float64, rows*cols)
matrix := make([][]float64, rows)
for i := range matrix { matrix[i] = data[i*cols : (i+1)*cols] }
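Packaged as a helper, the contiguous layout could look like the sketch below. `newMatrix` is a hypothetical name. The three-index slice expression is an addition to the fragment above: it caps each row so an accidental `append` cannot overwrite the next row.

```go
package main

import "fmt"

// newMatrix returns rows x cols row views over one contiguous backing
// slice, plus the backing slice itself.
func newMatrix(rows, cols int) (matrix [][]float64, backing []float64) {
	backing = make([]float64, rows*cols)
	matrix = make([][]float64, rows)
	for i := range matrix {
		// low:high:max limits both length and capacity to one row.
		matrix[i] = backing[i*cols : (i+1)*cols : (i+1)*cols]
	}
	return matrix, backing
}

func main() {
	m, data := newMatrix(3, 4)
	m[1][2] = 7
	fmt.Println(data[1*4+2], len(m), cap(m[0]))
}
```

This costs two allocations regardless of the row count, instead of one per row plus one for the outer slice.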

Struct of Arrays (SoA) vs Array of Structs (AoS)

When iterating over a single field of a struct, AoS wastes cache space loading unused fields:

// AoS — loading each Point (24 bytes) to read only x (8 bytes) = 66% cache waste
type Point struct { x, y, z float64 }
points := make([]Point, n)
for i := range points { sum += points[i].x }

// SoA — all x values contiguous, 100% cache utilization
type Points struct { xs, ys, zs []float64 }
for i := range ps.xs { sum += ps.xs[i] }

Use SoA when iterating over a subset of fields (physics, graphics, analytics). AoS is fine when accessing all fields together or for small structs.

Pointer-heavy vs value-heavy data

Index-based data structures (nodes stored in a contiguous array, referenced by index) beat pointer-based structures for cache locality:

// Pointer-based tree — each node scattered in heap, random cache misses
type Node struct { value int; left, right *Node }

// Index-based tree — nodes in contiguous array, cache-friendly
type Tree struct { nodes []Node }
type Node struct { value int; left, right int } // indices into nodes

False Sharing

Diagnose: 1- go tool pprof (CPU profile + mutex profile) — look for atomic operations or counter updates consuming unexpectedly high CPU; in the mutex profile, look for contention on variables that shouldn't need locking 2- go test -bench — benchmark concurrent counter increments; if adding goroutines makes it _slower_ instead of faster, false sharing is likely

When goroutines update variables that share the same 64-byte CPU cache line, each write invalidates the other core's cache, causing severe degradation:

// Bad — a and b on same cache line, cores fight for it
type Counters struct { a, b int64 }

// Good — separate cache lines, no interference
type Counters struct {
    a int64    // 8 bytes
    _ [56]byte // 64 - 8 = 56 bytes padding
    b int64    // 8 bytes
}

Only apply cache-line padding when profiling confirms contention on concurrent counters/flags.

Instruction-Level Parallelism

Diagnose: 1- go tool pprof (CPU profile) — look for tight arithmetic loops (sum, dot product) where the loop body itself dominates CPU; these are candidates for multi-accumulator optimization 2- go test -bench — benchmark single vs multi-accumulator versions; expect 2-4x improvement when the loop is truly CPU-bound with a dependency chain

Modern CPUs execute multiple independent instructions simultaneously. A single accumulator creates a dependency chain — each addition waits for the previous one:

// Bad — sequential dependency, CPU pipeline stalls
var total int64
for _, v := range data { total += v }

// Good — 4 independent accumulators, CPU pipelines all 4 in parallel
var s0, s1, s2, s3 int64
limit := len(data) - len(data)%4
for i := 0; i < limit; i += 4 {
    s0 += data[i]; s1 += data[i+1]; s2 += data[i+2]; s3 += data[i+3]
}
for i := limit; i < len(data); i++ { s0 += data[i] }
total := s0 + s1 + s2 + s3

Expect 2-4x improvement for tight arithmetic loops. Only use when profiling shows the loop is a bottleneck.
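Both variants as complete functions, so they can be dropped into a benchmark. `sumSingle` and `sumFour` are hypothetical names; the bodies follow the fragments above.

```go
package main

import "fmt"

// sumSingle uses one accumulator: each add depends on the previous one.
func sumSingle(data []int64) int64 {
	var total int64
	for _, v := range data {
		total += v
	}
	return total
}

// sumFour uses four independent accumulators, then handles the tail that
// does not fill a group of four.
func sumFour(data []int64) int64 {
	var s0, s1, s2, s3 int64
	limit := len(data) - len(data)%4
	for i := 0; i < limit; i += 4 {
		s0 += data[i]
		s1 += data[i+1]
		s2 += data[i+2]
		s3 += data[i+3]
	}
	for i := limit; i < len(data); i++ {
		s0 += data[i]
	}
	return s0 + s1 + s2 + s3
}

func main() {
	data := []int64{1, 2, 3, 4, 5, 6, 7}
	fmt.Println(sumSingle(data), sumFour(data))
}
```

For integers the two functions always agree. With floating-point data the reordered additions can change the rounding, so compare results with a tolerance.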

SIMD (Single Instruction, Multiple Data)

Diagnose: 1- go tool pprof (CPU profile): confirm a numeric inner loop consumes >20% of CPU; SIMD only helps CPU-bound numeric work, not allocation or I/O bottlenecks 2- go test -bench: measure the loop's baseline ns/op; provides the reference point to validate SIMD gains 3- go build -gcflags="-d=ssa/prove/debug=2": list the bounds checks the compiler proved away; the standard gc compiler does not auto-vectorize, but "Proved" eliminations show how tight the scalar loop already is 4- GOSSAFUNC=MyFunc go build: generate an SSA dump (ssa.html) to inspect the code generated for the hot loop 5- go tool objdump -s MyFunc ./binary: check whether the final assembly contains SIMD instructions (e.g., VMOVAPD, VADDPD on amd64) or only scalar equivalents

Go 1.26+ includes an experimental simd/archsimd package (requires the GOEXPERIMENT=simd flag) providing low-level SIMD intrinsics for amd64 with 128/256/512-bit vectors. The standard gc compiler does not auto-vectorize loops, so on other architectures or older toolchains explicit SIMD comes from one of the strategies below.

Options for explicit SIMD in Go:

Experimental `simd/archsimd` (Go 1.26+, speculative) — Direct SIMD intrinsics via vector types with CPU feature detection. Limited to AMD64. Use with caution: this is an experimental, in-progress API (GOEXPERIMENT=simd) whose package path and type names are subject to change before stabilization. Not covered by Go 1 compatibility guarantees, and should never be exposed in public APIs. Verify the actual import path and API against the Go toolchain you are using.

  // Requires: GOEXPERIMENT=simd go build
  // WARNING: experimental API — package path and types may change
  import "simd/archsimd"

  v := archsimd.Int32x4{1, 2, 3, 4}

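Keep the loop scalar but compiler-friendly: write simple, idiomatic loops on []float64/[]int32 slices so bounds checks can be eliminated. This yields fast scalar code rather than vector instructions, because the gc compiler does not auto-vectorize. Check bounds-check elimination: go build -gcflags="-d=ssa/prove/debug=2" ./...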
`math/bits` — operations like OnesCount, LeadingZeros, RotateLeft map directly to hardware instructions (POPCNT, CLZ, ROL)
Hand-written assembly — .s files with AVX2/NEON instructions for critical inner loops. Libraries like klauspost/compress and minio/sha256-simd use this approach
Third-party vectorized libraries — for common operations (hashing, compression, encoding), use libraries that already have optimized SIMD implementations rather than writing your own
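As a small illustration of the `math/bits` option, the helpers below rely on `OnesCount64` and `Len64`, which the compiler can lower to single instructions on CPUs that support them. `hammingDistance` and `nextPowerOfTwo` are hypothetical names chosen for this sketch.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the bit positions in which a and b differ.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// nextPowerOfTwo returns the smallest power of two >= x, for x up to
// 1<<63. Larger inputs overflow uint64 and return 0.
func nextPowerOfTwo(x uint64) uint64 {
	if x <= 1 {
		return 1
	}
	return 1 << bits.Len64(x-1)
}

func main() {
	fmt.Println(hammingDistance(0b1011, 0b0010)) // differing bits: 1001
	fmt.Println(nextPowerOfTwo(5), nextPowerOfTwo(8))
}
```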

Handling CPU-specific instruction sets

Hand-written assembly unlocks higher performance but couples code to specific CPU features (AVX2, NEON, etc.). Three strategies exist:

1. Compile on a production-similar machine

Build for the CPU level of your deployment target. The Go toolchain does not detect the build host's CPU: on amd64 the baseline instruction set is selected with the GOAMD64 environment variable (v1 to v4), so building on production hardware only helps if that variable is set to match it:

# GOAMD64=v3 assumes every production CPU supports the AVX2-class level;
# the resulting binary refuses to start on older CPUs
ssh prod-server "cd /path && GOAMD64=v3 go build -o app ."

Tradeoff: Simplest approach, but requires knowing the production CPU generation and shipping different binaries per CPU type (Intel vs AMD vs Apple Silicon). Breaks CI/CD portability.

2. Runtime CPU feature detection + multiple implementations

Implement the function multiple times — one for each CPU capability — and dispatch at runtime:

// dispatch.go
import "golang.org/x/sys/cpu" // provides cpu.X86.HasAVX2

var sumImpl func([]int64) int64

func init() {
    if cpu.X86.HasAVX2 {
        sumImpl = sumAVX2
    } else {
        sumImpl = sumGeneric
    }
}

func Sum(data []int64) int64 {
    return sumImpl(data)
}

// sum_generic.go
func sumGeneric(data []int64) int64 {
    var total int64
    for _, v := range data { total += v }
    return total
}

// sum_amd64.s
TEXT ·sumAVX2(SB), NOSPLIT, $0-32
    // AVX2 implementation
    VMOVAPD (SI), Y0
    // ...

Tradeoff: Single binary works everywhere; it pays for one indirect function call per dispatch in exchange for full CPU feature utilization. Standard-library packages such as crypto/sha256 use this pattern.

3. Compile-time selection with `//go:build` tags

Use conditional compilation to generate different code at build time for each target:

// sum_fast.go
//go:build amd64 && !nosimd

package mylib

// sumAVX2 is implemented in a Go assembly (.s) file; Go has no inline assembly
func Sum(data []int64) int64 {
    return sumAVX2(data) // or calls to .s file
}

// sum_generic.go
//go:build !amd64 || nosimd

package mylib

func Sum(data []int64) int64 {
    var total int64
    for _, v := range data { total += v }
    return total
}

Build different binaries per target:

GOOS=linux GOARCH=amd64 go build -o app-avx2 .     # Uses sum_fast.go
GOOS=darwin GOARCH=arm64 go build -o app-neon .    # Uses sum_generic.go
go build -tags=nosimd -o app-safe .                # Fallback everywhere

Tradeoff: Zero runtime overhead; each binary is fully optimized for its target. Requires shipping multiple binaries and coordinating which binary runs where.

When SIMD is NOT worth pursuing:

Outside the experimental simd/archsimd package, SIMD in Go requires hand-written assembly: high maintenance burden, platform-specific, and harder to debug
A well-written scalar loop, possibly with multiple accumulators, is often fast enough for common numeric work
If your bottleneck is allocations or I/O, SIMD won't help

Recommendation: Start with a simple scalar loop and measure it. For Go 1.26+, evaluate simd/archsimd for AMD64-only workloads (remembering it's experimental). Move to runtime detection (option 2 above) if profiling shows a bottleneck and the code needs to run on heterogeneous hardware. Only use compile-time selection (option 3) if you control the deployment environment and can test each per-binary variant.

Only invest in hand-written SIMD when profiling shows a numeric inner loop consuming >20% of CPU and the optimized scalar version is still too slow.

Tight Loops and the Scheduler

Diagnose: 1- go tool pprof (goroutine profile) — look for many goroutines stuck in "runnable" state (waiting for CPU) while one goroutine monopolizes execution 2- go tool trace — visualize goroutine scheduling over time; look for long uninterrupted execution spans on one goroutine while others show scheduling gaps 3- GODEBUG=schedtrace=1000 — print scheduler state every second; look for unbalanced runqueue counts across P's indicating one P is starved 4- runtime/metrics (/sched/latencies:seconds) — measure how long goroutines wait before getting CPU; high p99 latencies confirm starvation 5- Prometheus rate(process_cpu_seconds_total[2m]) — monitor if CPU usage hits GOMAXPROCS ceiling; if saturated while other goroutines are starved, a tight loop is monopolizing P's

A goroutine running a CPU-intensive tight loop without function calls may not yield to the scheduler, starving other goroutines. Go 1.14+ added asynchronous preemption, but very tight loops with fully inlined operations can still cause issues:

// Potential starvation — pure computation, no function calls
for { x = x*a + b }

// Safe — non-inlined call triggers preemption check
for item := range work {
    processBatch(item) // function call = preemption point
}

When to use non-inlined calls for scheduling: Use non-inlined function calls when:

The loop runs for a long time (hundreds of milliseconds or more of uninterrupted computation)
Other goroutines are waiting to run (e.g., handling requests, I/O completion, channel operations)
The loop contains only arithmetic or memory operations with no function calls

For short bursts of computation (< 10ms), preemption isn't critical and inlining for CPU efficiency takes priority.

Detecting scheduler starvation: Use these tools to confirm goroutines are being starved:

`go tool pprof` goroutine profile — shows goroutines stuck in "runnable" state (waiting for CPU). If many goroutines are runnable while one dominates CPU, starvation is happening
`go tool trace` — visualizes goroutine scheduling over time. Look for gaps where goroutines aren't running because one goroutine monopolized the scheduler
`runtime/metrics` (Go 1.19+) — measure /sched/latencies:seconds to quantify how long goroutines wait for CPU
Observable symptoms — high response latency, requests timing out, uneven request distribution, goroutine counts climbing
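The `runtime/metrics` check can be sketched as follows. The histogram only gives bucket boundaries, so the result is the upper bound of the bucket containing the requested quantile, not an exact latency, and it may be +Inf for the last bucket. `schedLatencyQuantile` and `quantile` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
	"runtime/metrics"
)

// quantile returns the upper boundary of the bucket that contains the
// q-th quantile. buckets has len(counts)+1 boundaries.
func quantile(counts []uint64, buckets []float64, q float64) (float64, bool) {
	var total uint64
	for _, c := range counts {
		total += c
	}
	if total == 0 {
		return 0, false // no samples recorded yet
	}
	target := uint64(math.Ceil(q * float64(total)))
	var seen uint64
	for i, c := range counts {
		seen += c
		if seen >= target {
			return buckets[i+1], true
		}
	}
	return buckets[len(buckets)-1], true
}

// schedLatencyQuantile reads how long goroutines waited in the runnable
// state before being scheduled, since process start.
func schedLatencyQuantile(q float64) (float64, bool) {
	samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(samples)
	if samples[0].Value.Kind() != metrics.KindFloat64Histogram {
		return 0, false // metric not supported by this toolchain
	}
	h := samples[0].Value.Float64Histogram()
	return quantile(h.Counts, h.Buckets, q)
}

func main() {
	if p99, ok := schedLatencyQuantile(0.99); ok {
		fmt.Printf("approx p99 scheduling latency: %gs\n", p99)
	} else {
		fmt.Println("no scheduling latency samples yet")
	}
}
```

The histogram is cumulative since process start; to watch a time window, read it periodically and subtract the previous counts.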

Preventing inlining with `//go:noinline`: If you have a function that's normally inlinable (small, hot) but you specifically want it to not inline to force scheduler preemption checks, use the //go:noinline compiler directive:

//go:noinline
func processBatch(item WorkItem) {
    // CPU-intensive work here
    // This call site will NOT be inlined, even if the function is small
    // The function call itself becomes a preemption point for the scheduler
}

// In tight loop
for item := range work {
    processBatch(item) // Guaranteed preemption point
}

Trade-off: Using //go:noinline prevents inlining, which:

Pros: Guarantees scheduler preemption checks; prevents goroutine starvation
Cons: Adds function call overhead (~10-30 CPU cycles); reduces instruction-level parallelism (ILP) in the caller

Only use //go:noinline if profiling shows that scheduler preemption starvation is actually blocking other goroutines. Unnecessary //go:noinline directives penalize throughput and latency.

Reflection and Type Assertions

Diagnose: 1- go tool pprof (CPU profile) — look for reflect.Value.*, reflect.DeepEqual, or fmt.Sprintf (which uses reflect internally) appearing in hot paths 2- go test -bench — compare reflection-based vs typed versions; expect 10-200x difference depending on the reflection operation

`reflect` in hot paths — 10-100x slower due to type introspection and boxing. Replace with generics or typed code
`reflect.DeepEqual` — 50-200x slower than typed comparisons. Use bytes.Equal, or slices.Equal and maps.Equal (Go 1.21+)
Type switch vs repeated assertions — type switch dispatches in one evaluation:

// Bad — evaluates interface multiple times
if s, ok := v.(string); ok { return s }
if i, ok := v.(int); ok { return strconv.Itoa(i) }

// Good — single dispatch
switch v := v.(type) {
case string: return v
case int:    return strconv.Itoa(v)
}
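When swapping reflect.DeepEqual for a typed comparison as suggested above, one behavioral difference is worth checking first. The sketch below (function names are illustrative, Go 1.21+ assumed for the slices package) shows that the two disagree on nil versus empty slices.

```go
package main

import (
	"fmt"
	"reflect"
	"slices"
)

// equalReflect compares via reflection: flexible, but slow in hot paths.
func equalReflect(a, b []int) bool { return reflect.DeepEqual(a, b) }

// equalTyped compares element by element with no reflection.
func equalTyped(a, b []int) bool { return slices.Equal(a, b) }

func main() {
	a, b := []int{1, 2, 3}, []int{1, 2, 3}
	fmt.Println(equalReflect(a, b), equalTyped(a, b))

	// DeepEqual treats a nil slice and an empty slice as different;
	// slices.Equal only looks at length and elements.
	var nilSlice []int
	fmt.Println(equalReflect(nilSlice, []int{}), equalTyped(nilSlice, []int{}))
}
```

If callers rely on the nil-versus-empty distinction, keep an explicit nil check next to the typed comparison.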

Monotonic Time

Diagnose: 1- go test -bench — benchmark time.Since(start) vs time.Now().Sub(start); expect at most a small improvement, because time.Since can read only the monotonic clock when start carries a monotonic reading

time.Since(start) uses the monotonic clock reading that time.Now() stored in start, so the result is immune to wall-clock adjustments (NTP, manual clock changes) and the call is slightly cheaper than time.Now().Sub(start):

var appStart = time.Now() // captures monotonic time + wall-clock on program start

func myFunc() {
    // Compare durations, not wall-clock times
    elapsed := time.Since(appStart)
    if elapsed > threshold { ... }
}

I/O & Networking Optimization

Network and I/O bottlenecks show up as goroutines blocked on syscalls or waiting for responses. The key levers are connection reuse, proper timeouts, and streaming instead of buffering.

HTTP Transport Configuration

Diagnose: 1- go tool pprof (goroutine + block profile) — look for goroutines blocked on net/http.(*Transport).dialConn or net/http.(*persistConn).readLoop; many goroutines waiting here means connection pool exhaustion 2- fgprof — captures both on-CPU and off-CPU wait time; look for HTTP calls dominating wall-clock time even when CPU profile shows them as cheap 3- go tool trace — visualize goroutine lifecycles; look for long gaps where goroutines wait for network I/O instead of processing 4- Prometheus go_goroutines — monitor goroutine count in production; steadily rising under stable load suggests connection or goroutine leaks from misconfigured HTTP clients

Connection pooling

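The default http.Transport has conservative pool settings: MaxIdleConnsPerHost defaults to 2. Under high concurrency to a single host, only 2 connections are kept for reuse; the rest are closed after each request, so new requests keep paying for TCP and TLS handshakes instead of reusing pooled connections: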

// Bad — default transport, only 2 idle connections per host
client := &http.Client{}

// Good — tuned for high-concurrency service-to-service calls
var apiClient = &http.Client{
    Timeout: 30 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:          100,             // total idle connections across all hosts
        MaxIdleConnsPerHost:   20,              // per-host idle connections (default is 2!)
        MaxConnsPerHost:       50,              // cap total connections per host (0 = unlimited)
        IdleConnTimeout:       90 * time.Second,
        TLSHandshakeTimeout:  5 * time.Second,
        ResponseHeaderTimeout: 10 * time.Second,
    },
}
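A transport built from an empty struct literal does not inherit the proxy, dialer, and HTTP/2 settings of http.DefaultTransport. One possible variation, sketched below under the assumption that http.DefaultTransport has not been replaced by the application, is to clone the default and override only the pool fields. The function name newTunedTransport is hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newTunedTransport clones http.DefaultTransport so its proxy, dialer, and
// HTTP/2 defaults are kept, then overrides only the pool settings.
// Assumption: http.DefaultTransport is still the stock *http.Transport.
func newTunedTransport() *http.Transport {
	t := http.DefaultTransport.(*http.Transport).Clone()
	t.MaxIdleConns = 100
	t.MaxIdleConnsPerHost = 20
	t.MaxConnsPerHost = 50
	return t
}

func main() {
	client := &http.Client{
		Timeout:   30 * time.Second,
		Transport: newTunedTransport(),
	}
	t := client.Transport.(*http.Transport)
	fmt.Println(t.MaxIdleConnsPerHost, t.Proxy != nil)
}
```

Create the client once and share it; a new transport per request defeats connection pooling entirely.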

For web crawlers hitting many different hosts, disable keep-alive to avoid accumulating idle connections:

crawlerClient := &http.Client{
    Transport: &http.Transport{DisableKeepAlives: true},
}

Timeouts

The zero-value http.Client and http.Server have NO timeouts. A slow or malicious peer holds connections open indefinitely, exhausting file descriptors and memory:

// Server — always set timeouts to prevent Slowloris attacks
server := &http.Server{
    Addr:         ":8080",
    Handler:      handler,
    ReadTimeout:  5 * time.Second,
    WriteTimeout: 10 * time.Second,
    IdleTimeout:  120 * time.Second,
}

Drain response body for connection reuse

Connections are only returned to the pool when the body is read to EOF and closed. Even if you don't need the body, drain it:

resp, err := client.Get(url)
if err != nil { return err }
defer resp.Body.Close()
_, _ = io.Copy(io.Discard, resp.Body) // drain to enable connection reuse

Streaming vs Buffering

Diagnose: 1- go tool pprof -inuse_space — look for large single allocations (MB-sized) from io.ReadAll, bytes.Buffer.Grow, or json.Unmarshal; these indicate buffering entire payloads instead of streaming

Avoid io.ReadAll for large payloads

io.ReadAll loads the entire stream into memory. For large files or HTTP responses, this causes massive memory spikes:

// Bad — 2GB file = 2GB allocation
data, _ := io.ReadAll(f)

// Good — process line by line; memory is bounded by the longest line (bufio.Scanner rejects lines over 64KB by default)
scanner := bufio.NewScanner(f)
for scanner.Scan() { processLine(scanner.Bytes()) }

// Good — stream between reader and writer (32KB internal buffer)
io.Copy(w, resp.Body)

io.ReadAll is fine for small, bounded payloads (< 1MB) where the size is known.

Streaming JSON

Use json.NewDecoder for large JSON payloads instead of json.Unmarshal (which buffers the entire body):

dec := json.NewDecoder(r)
for dec.More() {
    var item Item
    if err := dec.Decode(&item); err != nil { return err }
    process(item) // one item at a time
}
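The loop above suits a stream of concatenated JSON values. When the payload is a single top-level array, the brackets have to be consumed with Token first. A minimal sketch, assuming a hypothetical Item type with one field and illustrative helper names decodeArray and sumIDs:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Item is a hypothetical element type used only for this example.
type Item struct {
	ID int `json:"id"`
}

// decodeArray streams the elements of a top-level JSON array and calls fn for
// each one, so only a single element is held in memory at a time.
func decodeArray(r io.Reader, fn func(Item) error) error {
	dec := json.NewDecoder(r)
	tok, err := dec.Token() // consume the opening '['
	if err != nil {
		return err
	}
	if d, ok := tok.(json.Delim); !ok || d != '[' {
		return fmt.Errorf("expected JSON array, got %v", tok)
	}
	for dec.More() {
		var it Item
		if err := dec.Decode(&it); err != nil {
			return err
		}
		if err := fn(it); err != nil {
			return err
		}
	}
	_, err = dec.Token() // consume the closing ']'
	return err
}

// sumIDs is a small demo built on decodeArray.
func sumIDs(doc string) (int, error) {
	sum := 0
	err := decodeArray(strings.NewReader(doc), func(it Item) error {
		sum += it.ID
		return nil
	})
	return sum, err
}

func main() {
	sum, err := sumIDs(`[{"id":1},{"id":2},{"id":3}]`)
	fmt.Println(sum, err)
}
```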

JSON Performance

Diagnose: 1- go tool pprof (CPU profile) — look for encoding/json.(*Decoder).Decode, reflect.Value.*, or encoding/json.Marshal consuming significant CPU; these indicate reflection-based JSON is the bottleneck 2- go test -bench -benchmem — measure ns/op and allocs/op for marshal/unmarshal; expect high alloc counts from reflection; code-gen alternatives should show 2-5x fewer allocs

The standard encoding/json package uses reflection to inspect struct fields at runtime. For high-throughput services, this creates significant CPU and allocation overhead.

Options for faster JSON:

Custom `MarshalJSON`/`UnmarshalJSON` — hand-written methods for hot-path types eliminate reflection
Code-generation libraries — easyjson, ffjson generate marshal/unmarshal methods at build time, no reflection at runtime
Drop-in replacements — github.com/goccy/go-json, github.com/json-iterator/go, github.com/bytedance/sonic offer 2-5x better performance
`encoding/json/v2` (experimental, behind GOEXPERIMENT=jsonv2) — evaluate deliberately; most production code should keep encoding/json unless the project explicitly opts into the experiment

When using third-party JSON libraries, refer to the library's official documentation for up-to-date API signatures.

Cgo Overhead

Diagnose: 1- go tool pprof (CPU profile + threadcreate profile) — look for runtime.cgocall or runtime.asmcgocall consuming CPU; high threadcreate count means cgo calls are pinning goroutines to OS threads 2- go test -bench — benchmark the cgo call loop vs a pure Go equivalent; expect ~50-100ns overhead per cgo crossing

Each Go-to-C call via cgo costs ~50-100ns due to stack switching, signal mask manipulation, and scheduler coordination:

// Bad — cgo overhead per element dominates for tight loops
for i, v := range values {
    values[i] = float64(C.sqrt(C.double(v))) // ~100ns overhead PER CALL
}

// Good — use pure Go stdlib (math.Sqrt is as fast as C and inlineable)
for i, v := range values { values[i] = math.Sqrt(v) }

// Good — batch when C code is unavoidable
C.batch_sqrt((*C.double)(&values[0]), C.int(len(values))) // amortize overhead

Additional cgo costs: goroutine is pinned to an OS thread, C code cannot be preempted (may delay GC), and function inlining is blocked at the boundary.

Buffered I/O

Diagnose: 1- go test -bench — benchmark buffered vs unbuffered I/O; expect 3-10x improvement from reducing syscall count 2- go tool trace — look for frequent short syscalls (pread, pwrite) in rapid succession; many tiny I/O operations indicate unbuffered access

Unbuffered file reads/writes issue a syscall per operation. bufio.Reader and bufio.Writer batch small operations, reducing syscalls by 10x or more:

// Bad — syscall per line
for _, line := range lines { f.WriteString(line + "\n") }

// Good — buffered, batches writes into larger chunks
w := bufio.NewWriter(f)
for _, line := range lines { w.WriteString(line + "\n") }
w.Flush()

Concurrent Multi-Stage Pipelines

Diagnose: 1- go tool trace — visualize resource utilization across stages; look for sequential idle gaps where CPU, disk, or network sit unused while another resource is busy 2- go tool pprof (CPU + goroutine profile) — confirm each stage saturates a _different_ resource; if multiple stages compete for the same resource (e.g., both CPU-bound), concurrency won't help

In rare scenarios where each pipeline stage saturates a _different_ resource (CPU, disk I/O, network), running stages concurrently instead of sequentially can improve throughput — even with batching between stages.

The unusual scenario

Imagine processing records: Stage A compresses (CPU-bound), Stage B writes to disk (I/O-bound), Stage C uploads to network (network-bound). Sequential execution wastes resources:

Time:    0       10      20      30      40      50
CPU:     AAAAAAAAAA|..........|..........|..........|
Disk:    ..........|BBBBBBBBBB|..........|..........|
Network: ..........|..........|CCCCCCCCCC|..........|

Concurrent stages let resources work in parallel:

Time:    0       10      20      30      40      50
CPU:     AAAAAAAAAA|AA........|
Disk:    ..........|BBBBBBBBBB|BB........|
Network: ..........|..........|CCCCCCCCCC|CC........|

Code pattern:

// Each stage runs in its own goroutine, bounded by channel buffers
compressedCh := make(chan []byte, 100) // A → B buffer
writtenCh := make(chan []byte, 100)    // B → C buffer

// Stage A: CPU-bound compression
go func() {
    defer close(compressedCh)
    for record := range inputCh {
        compressedCh <- compress(record) // saturates CPU
    }
}()

// Stage B: I/O-bound disk writes
go func() {
    defer close(writtenCh)
    for compressed := range compressedCh {
        if _, err := diskFile.Write(compressed); err != nil { // saturates disk I/O
            log.Printf("disk write: %v", err)
            continue
        }
        writtenCh <- compressed // hand the written record to the uploader
    }
}()

// Stage C: network-bound uploads
go func() {
    for data := range writtenCh {
        // contentType is whatever the upload endpoint expects
        resp, err := client.Post(uploadURL, contentType, bytes.NewReader(data)) // saturates network
        if err != nil {
            log.Printf("upload: %v", err)
            continue
        }
        resp.Body.Close() // release the connection
    }
}()

With one goroutine per stage, steady-state throughput = min(A_throughput, B_throughput, C_throughput): the slowest stage sets the pace. Run sequentially, each record costs the sum of the three stage times, so throughput = 1 / (A_time + B_time + C_time). Concurrent stages only help when bottlenecks don't overlap.
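To make the throughput comparison concrete, here is a small worked calculation. The stage times are invented for illustration, and the model is idealized: it ignores channel overhead and assumes the stages do not compete for the same resource.

```go
package main

import "fmt"

// sequentialThroughput returns records per second when every record passes
// through all stages one after another: 1 / (sum of stage times).
func sequentialThroughput(stageSeconds ...float64) float64 {
	var total float64
	for _, s := range stageSeconds {
		total += s
	}
	return 1 / total
}

// pipelinedThroughput returns steady-state records per second when stages
// overlap: the slowest stage sets the pace, so 1 / (max stage time).
func pipelinedThroughput(stageSeconds ...float64) float64 {
	var slowest float64
	for _, s := range stageSeconds {
		if s > slowest {
			slowest = s
		}
	}
	return 1 / slowest
}

func main() {
	// Hypothetical per-record costs: compress 0.25s, disk 0.5s, upload 0.25s.
	fmt.Println("sequential:", sequentialThroughput(0.25, 0.5, 0.25), "records/s")
	fmt.Println("pipelined: ", pipelinedThroughput(0.25, 0.5, 0.25), "records/s")
}
```

With these numbers the pipeline doubles throughput, and speeding up anything other than the disk stage would not improve it further.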

When to use this (and when NOT to)

Use concurrent pipelines only when ALL of these are true:

1. Resource saturation is predictable and non-overlapping — You measured that A saturates one resource (e.g., CPU = 95%), B saturates another (disk I/O = 90%), C saturates a third (network = 85%). Overlapping saturation means concurrency adds no benefit.
2. Bottleneck shifts don't hurt latency — Processing order doesn't matter, or records can flow out-of-order through stages.
3. Buffering overhead is acceptable — Inter-stage channels consume memory. For large records, channel buffers can overflow system limits.
4. You've benchmarked the alternative — Profile both sequential and concurrent versions. Sequential + batching often wins because it is simpler and avoids context-switching overhead.

Avoid concurrent pipelines if:

Records must be ordered — One goroutine per stage preserves order, but scaling a stage out to several workers reorders records; restoring order downstream needs synchronization that erodes the speedup.
Resources overlap — If A and B both compete for CPU (e.g., both compress), concurrency causes context-switching overhead with no resource utilization gain.
Latency matters more than throughput — Each record still passes through all 3 stages and may also wait in the channel buffers between them, so per-record latency can increase.
Memory is tight — Each stage's channel buffer is a memory budget; deeply buffered channels can exhaust available RAM.

→ See samber/cc-skills-golang@golang-concurrency skill for detailed channel patterns and when to use worker pools instead.

Batch Operations

Diagnose: 1- go test -bench — benchmark single-item vs batched operations; expect N-fold improvement in throughput when amortizing per-operation overhead (syscalls, round-trips) 2- go tool trace — look for repeated short network/disk operations with idle gaps between them; these gaps represent wasted round-trip time that batching eliminates

Batching amortizes per-operation overhead (syscalls, network round-trips, transaction costs) across many items. The pattern applies everywhere: I/O, database, network, and even in-memory processing.

Database: batch inserts over row-by-row

Inserting 1,000 rows one at a time means 1,000 round-trips, 1,000 query parses, and 1,000 transaction commits. A single batch insert does it in one round-trip:

// Bad — 1,000 round-trips, ~500ms
for _, user := range users {
    db.Exec("INSERT INTO users (name, email) VALUES ($1, $2)", user.Name, user.Email)
}

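// Good — one COPY stream per batch instead of one round-trip per row, ~5ms
const batchSize = 1000
for i := 0; i < len(users); i += batchSize {
    batch := users[i:min(i+batchSize, len(users))]
    tx, err := db.Begin()
    if err != nil { return err }
    stmt, err := tx.Prepare(pq.CopyIn("users", "name", "email"))
    if err != nil { tx.Rollback(); return err }
    for _, u := range batch {
        if _, err := stmt.Exec(u.Name, u.Email); err != nil { tx.Rollback(); return err }
    }
    if _, err := stmt.Exec(); err != nil { tx.Rollback(); return err } // flush buffered rows
    if err := stmt.Close(); err != nil { tx.Rollback(); return err }
    if err := tx.Commit(); err != nil { return err }
}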
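For drivers without a COPY helper, the multi-row VALUES alternative mentioned in the code comment can be sketched as below. The builder name buildMultiRowInsert is hypothetical, the placeholders use PostgreSQL's $n style, and the table and column names must come from trusted code because they are interpolated into the statement.

```go
package main

import (
	"fmt"
	"strings"
)

// buildMultiRowInsert returns an INSERT statement with `rows` groups of
// PostgreSQL-style placeholders ($1, $2, ...). table and cols are
// interpolated as-is, so never pass user input for them.
func buildMultiRowInsert(table string, cols []string, rows int) string {
	var sb strings.Builder
	fmt.Fprintf(&sb, "INSERT INTO %s (%s) VALUES ", table, strings.Join(cols, ", "))
	n := 1
	for r := 0; r < rows; r++ {
		if r > 0 {
			sb.WriteString(", ")
		}
		sb.WriteByte('(')
		for c := range cols {
			if c > 0 {
				sb.WriteString(", ")
			}
			fmt.Fprintf(&sb, "$%d", n)
			n++
		}
		sb.WriteByte(')')
	}
	return sb.String()
}

func main() {
	fmt.Println(buildMultiRowInsert("users", []string{"name", "email"}, 2))
}
```

The matching argument list is the row values flattened in the same order. Keep rows times columns below the database's bind-parameter limit by choosing the batch size accordingly.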

→ See samber/cc-skills-golang@golang-database skill for detailed batch patterns and connection pool configuration.

HTTP: batch API calls

Instead of N individual HTTP requests, send one request with N items when the API supports it:

// Bad — 100 HTTP round-trips
for _, id := range ids {
    resp, _ := client.Get(fmt.Sprintf("/api/users/%s", id))
    // ...
}

// Good — 1 HTTP request with all IDs
resp, _ := client.Post("/api/users/batch", "application/json",
    bytes.NewReader(marshalIDs(ids)))

Channel: batch processing from a stream

Accumulate items from a channel and process in bulk to reduce per-item overhead:

func batchProcessor(in <-chan Item, batchSize int) {
    batch := make([]Item, 0, batchSize)
    ticker := time.NewTicker(100 * time.Millisecond) // flush on timeout too
    defer ticker.Stop()
    for {
        select {
        case item, ok := <-in:
            if !ok { flush(batch); return }
            batch = append(batch, item)
            if len(batch) >= batchSize { flush(batch); batch = batch[:0] }
        case <-ticker.C:
            if len(batch) > 0 { flush(batch); batch = batch[:0] }
        }
    }
}

Memory Optimization

Allocation reduction is the single highest-ROI optimization in most Go programs. Every allocation eventually requires garbage collection — reducing allocation count and size directly reduces GC pauses and CPU overhead.

Allocation Patterns

Diagnose: 1- go tool pprof -alloc_objects — rank functions by number of heap allocations; expect hot-path functions (request handlers, serializers) near the top with thousands of alloc/op 2- go build -gcflags="-m -m" — verbose escape analysis showing _why_ variables escape; look for "leaking param", "too large for stack", or "captured by closure" on variables you expect to stay on the stack 3- go test -bench -benchmem — measure allocs/op and B/op per benchmark; expect the target function to show >0 allocs/op that can be eliminated

Reuse slices via append(s[:0], ...)

Reslicing to zero length retains the backing array, turning what would be a new allocation into a no-op:

// Bad — allocates new slice, old one becomes garbage
mode = []T{item}

// Good — reuses existing backing array (0 allocations)
mode = append(mode[:0], item)

Direct indexing vs append

When the output size equals the input size, use make([]T, len(input)) with direct assignment instead of make([]T, 0, len(input)) with append. Direct assignment skips append's per-element capacity check and length update:

// Slower — append overhead per element
result := make([]T, 0, len(input))
for i := range input { result = append(result, transform(input[i])) }

// Faster — direct assignment
result := make([]T, len(input))
for i := range input { result[i] = transform(input[i]) }

Use append when the result might be smaller (filtering) or when early error return could discard partial results.

Eliminate redundant map lookups

for k := range m { use(m[k]) } performs an extra hash lookup on every iteration. Capture the value from range:

// Bad — extra lookup of in[k] on every iteration
for k := range in { result[k] = fn(in[k]) }

// Good — the value comes from the iteration itself
for k, v := range in { result[k] = fn(v) }

Map size hints

make(map[K]V) starts with a small number of buckets and rehashes as it grows. Providing a size hint avoids rehashing:

m := make(map[string]int, len(items)) // single allocation, no rehashing

Sentinel errors vs fmt.Errorf

fmt.Errorf allocates on every call. For predictable errors in hot paths, use preallocated sentinels:

var ErrNegative = errors.New("value is negative") // allocated once

func validate(x int) error {
    if x < 0 { return ErrNegative } // zero allocation
    return nil
}

Only use fmt.Errorf when you need dynamic context (field names, values).

Interface boxing

Passing concrete values through any/interface{} usually forces a heap allocation to box them (pointer-shaped values and some small values are exceptions). In hot paths, use typed parameters or generics:

// Bad — boxes each int, allocates
func sum(values []any) int { ... }

// Good — no boxing, no allocation
func sum(values []int) int { ... }

// Good — generic, still no boxing
func sum[T ~int | ~int64](values []T) T { ... }

Backing Array Leaks

Diagnose: 1- go tool pprof -inuse_space — show currently live heap memory by allocation site; look for unexpectedly large live objects (MB-sized) that should have been GC'd — a sign of backing array retention 2- go tool pprof -alloc_space — show cumulative bytes allocated over time; look for allocation sites producing far more bytes than the final data they hold (e.g., 100MB allocated for 16-byte results)

Slice reslicing retains the entire backing array

A small reslice of a large slice keeps the entire original array in memory:

// Bad — retains entire megabyte-sized backing array
func getHeader(data []byte) []byte { return data[:16] }

// Good — independent copy, original can be GC'd
func getHeader(data []byte) []byte {
    header := make([]byte, 16)
    copy(header, data[:16])
    return header
}

Substring memory leaks

Substrings share the backing array of the original string:

// Bad — keeps entire longMsg in memory
func extractID(msg string) string { return msg[:8] }

// Good — independent copy (strings.Clone, Go 1.18+)
func extractID(msg string) string { return strings.Clone(msg[:8]) }

Map never shrinks

Go maps grow but never release bucket memory when entries are deleted. A map that once held millions of entries retains its allocation forever:

// Recreate periodically to reclaim memory
func compact(old map[string]Data) map[string]Data {
    m := make(map[string]Data, len(old))
    for k, v := range old { m[k] = v }
    return m // old map becomes eligible for GC
}

String and Byte Optimization

Diagnose: 1- go tool pprof -alloc_objects — look for string/byte conversion functions (runtime.stringtoslicebyte, runtime.slicebytetostring) appearing as top allocators 2- go test -bench -benchmem — measure allocs/op; expect repeated conversions to show 1+ alloc/op per conversion that can be reduced to zero by caching

Cache string-to-byte conversions — converting between string and []byte allocates a copy each time. Convert once and reuse the result.

Use `bytes` package directly — bytes.Contains, bytes.HasPrefix, bytes.Split, bytes.ToUpper etc. operate on []byte without string conversion. The bytes package mirrors most of strings.

sync.Pool Hot-Path Patterns

Diagnose: 1- go tool pprof -alloc_objects — identify hot allocation sites creating the same object type repeatedly (e.g., []byte buffers, temp structs); expect one site with thousands of allocs/s that can be pooled

sync.Pool recycles objects between GC cycles, reducing allocation pressure; the runtime may drop pooled objects at any collection, so treat it as a cache, not a guarantee. Use it for frequently allocated, short-lived objects in hot paths (HTTP handlers, serialization, logging):

var bufPool = sync.Pool{
    New: func() any {
        buf := make([]byte, 0, 4096)
        return &buf
    },
}

func handleRequest(data []byte) []byte {
    bp := bufPool.Get().(*[]byte)
    buf := (*bp)[:0] // reset length, keep capacity
    defer func() { *bp = buf; bufPool.Put(bp) }()

    // ... process data into buf ...

    result := make([]byte, len(buf))
    copy(result, buf) // return a copy — buf goes back to pool
    return result
}

Rules:

Reset state before Put() — clear references to avoid retaining large object graphs across GC cycles
Return copies, not pooled buffers — callers must not hold references to pooled memory
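Don't pool objects >32KB — sync.Pool has no size classes, so an occasional oversized buffer stays pinned in the pool and gets handed out for small requests; let buffers that grew past the threshold go to the GC instead of calling Put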
Don't pool infrequently used objects — pool overhead exceeds benefit when allocations are rare
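The rules above can be combined in a small wrapper around the pool. This is a sketch under assumptions: the 32KB cap, the pool name scratchPool, and the helpers putBuf and greet are illustrative and not part of the skill's own code.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// maxPooledCap is an assumed threshold; buffers that grew beyond it are
// dropped instead of being returned to the pool.
const maxPooledCap = 32 << 10

var scratchPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// putBuf resets the buffer and returns it to the pool, unless it has grown
// too large. It reports whether the buffer was pooled.
func putBuf(b *bytes.Buffer) bool {
	if b.Cap() > maxPooledCap {
		return false // let the GC reclaim the oversized buffer
	}
	b.Reset() // clear state before Put
	scratchPool.Put(b)
	return true
}

// greet builds a string in a pooled buffer. String() copies the bytes, so the
// result does not alias pooled memory.
func greet(name string) string {
	b := scratchPool.Get().(*bytes.Buffer)
	defer putBuf(b)
	b.WriteString("hello ")
	b.WriteString(name)
	return b.String()
}

func main() {
	fmt.Println(greet("gopher"))
	big := bytes.NewBuffer(make([]byte, 0, 64<<10))
	fmt.Println("oversized buffer pooled:", putBuf(big))
}
```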

→ See samber/cc-skills-golang@golang-concurrency skill for sync.Pool API reference and basic usage patterns.

Memory Layout

Diagnose: 1- fieldalignment ./... — detect structs with wasted padding bytes; expect warnings like "struct of size 40 could be 24" listing which structs benefit from reordering 2- unsafe.Sizeof/Alignof/Offsetof — measure exact byte sizes and field offsets; use to confirm savings before/after and document them in code comments

Struct field alignment

Go adds padding between fields to satisfy alignment requirements. Reorder fields from largest to smallest:

// Bad — 24 bytes (7 + 3 bytes padding)
type Bad struct {
    a bool    // 1 byte + 7 padding
    b int64   // 8 bytes
    c bool    // 1 byte + 3 padding
    d int32   // 4 bytes
}

// Good — 16 bytes (2 bytes padding)
type Good struct {
    b int64   // 8 bytes
    d int32   // 4 bytes
    a bool    // 1 byte
    c bool    // 1 byte + 2 padding
}

Alignment requirements on 64-bit platforms: bool/byte = 1, int16 = 2, int32/float32 = 4, int64/float64/string/[]T/*T = 8.

Inspect layout: unsafe.Sizeof(T{}), unsafe.Alignof(T{}), unsafe.Offsetof(T{}.field)
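The sizes quoted above can be confirmed with a short program. The sketch below redeclares the Bad and Good structs from this section and assumes a 64-bit platform such as amd64 or arm64; on 32-bit targets int64 may be 4-byte aligned and the numbers differ.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Bad interleaves small and large fields, forcing padding.
type Bad struct {
	a bool
	b int64
	c bool
	d int32
}

// Good orders fields from largest to smallest alignment.
type Good struct {
	b int64
	d int32
	a bool
	c bool
}

// sizes reports the in-memory size of both layouts.
func sizes() (bad, good uintptr) {
	return unsafe.Sizeof(Bad{}), unsafe.Sizeof(Good{})
}

func main() {
	bad, good := sizes()
	fmt.Println("Bad:", bad, "bytes, Good:", good, "bytes") // 24 and 16 on 64-bit
	fmt.Println("Bad.b offset:", unsafe.Offsetof(Bad{}.b))  // 8 on 64-bit: 7 bytes of padding after a
	fmt.Println("Bad.d offset:", unsafe.Offsetof(Bad{}.d))  // 20 on 64-bit: 3 bytes of padding after c
}
```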

Zero-size field at end of struct

If the last field has zero size (struct{}), the compiler adds word-sized padding to prevent a pointer to that field from overlapping the next memory block:

// Bad — 16 bytes (8 for Value + 8 padding for Flag)
type Entry struct { Value int64; Flag struct{} }

// Good — 8 bytes (0 for Flag + 8 for Value)
type Entry struct { Flag struct{}; Value int64 }

Having a struct{} field in a struct is rare and almost useless.

Pointer receivers for large structs

Value receivers copy the entire struct on every method call. Use pointer receivers for structs larger than ~128 bytes. If any method uses a pointer receiver, all methods should for consistency.

Map of pointers for large, frequently updated structs

Map values are not addressable — you cannot modify a field in place. For large structs with frequent updates, map[K]*V avoids the copy-modify-reassign pattern:

players := map[string]*Player{"alice": {Score: 100}}
players["alice"].Score += 10 // direct modification, no copy

Trade-off: each pointer is a separate heap allocation, adding GC pressure. For small, mostly-read structs, map[K]V (value) is better.

Production Observability for Performance

Third-party monitoring tools complement local profiling (pprof, benchmarks) by providing continuous monitoring, historical trends, and regression detection in production.

Prometheus Metrics for Go

Setup: github.com/prometheus/client_golang — expose /metrics endpoint with promhttp.Handler(). Default collectors automatically export Go runtime metrics (go_goroutines, go_memstats_*, go_gc_duration_seconds, process_cpu_seconds_total, etc.).

→ See samber/cc-skills-golang@golang-benchmark skill (investigation-session.md) for the full runtime metrics table, investigation session setup (scrape interval tuning, env-var toggling), and cost warnings for profiling tools.

PromQL Queries for Performance Diagnosis

GC pressure

PromQL	What to look for
`rate(go_gc_duration_seconds_count[5m])`	GC cycles/s — >2/s sustained suggests excessive allocation rate
`rate(go_gc_duration_seconds_sum[5m]) / rate(go_gc_duration_seconds_count[5m])`	Average GC pause — increasing trend means heap is growing or has too many pointers
`go_gc_duration_seconds{quantile="1"}`	Worst-case GC pause — spikes here cause tail latency

Memory leaks

PromQL	What to look for
`go_memstats_alloc_bytes`	Should be roughly stable under constant load; continuous increase = memory leak
`rate(go_memstats_alloc_bytes_total[5m])`	Allocation rate (bytes/s) — drives GC frequency; compare before/after deploy for regressions
`process_resident_memory_bytes - go_memstats_sys_bytes`	Gap = non-Go memory (cgo, mmap); growing gap = non-Go leak

Goroutine leaks

PromQL	What to look for
`go_goroutines`	Should correlate with load; growing independently of traffic = leak
`delta(go_goroutines[1h])`	Net goroutine change over 1h; positive without load increase = leak

CPU saturation

PromQL	What to look for
`rate(process_cpu_seconds_total[5m])`	CPU cores consumed; compare to GOMAXPROCS to detect saturation
`rate(process_cpu_seconds_total[5m]) / <GOMAXPROCS>`	CPU utilization ratio; >0.8 sustained = CPU-saturated

Regression detection (after deploy)

PromQL	What to look for
`rate(go_memstats_alloc_bytes_total[5m])`	Compare before/after deploy; significant increase = new allocation pattern introduced
`histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))`	p99 latency increase after deploy = regression (requires app-level histogram)

Alerting rules (examples)

Example alerting rules — adjust thresholds to your application; a high-throughput data pipeline will have different baselines than a lightweight API server.

→ See samber/cc-skills@promql-cli skill for interactively testing these PromQL expressions against your Prometheus instance from the CLI.

Grafana Dashboards

→ See samber/cc-skills-golang@golang-observability skill for recommended community Grafana dashboards that visualize Go runtime metrics out of the box.

Continuous Profiling

Continuous profiling collects low-overhead samples in production and stores them for historical comparison. Use it to detect regressions across deploys, compare flamegraphs over time, and feed PGO (see Runtime Tuning).

Tool	Model	Overhead	Best for
Grafana Pyroscope	push SDK or pull (via Alloy)	~2-5%	Grafana ecosystem, historical flamegraph comparison
Parca (Polar Signals)	eBPF-based pull	<1%	Infrastructure-wide profiling, no code changes
Datadog Continuous Profiler	push (agent)	~1-2%	Existing Datadog users
Google Cloud Profiler	push (agent)	~1-2%	GCP-hosted Go services

Pyroscope push mode

import "github.com/grafana/pyroscope-go"

pyroscope.Start(pyroscope.Config{
    ApplicationName: "myapp",
    ServerAddress:   "http://pyroscope:4040",
    ProfileTypes: []pyroscope.ProfileType{
        pyroscope.ProfileCPU,
        pyroscope.ProfileAllocObjects,
        pyroscope.ProfileAllocSpace,
        pyroscope.ProfileInuseObjects,
        pyroscope.ProfileInuseSpace,
        pyroscope.ProfileGoroutines,
    },
})

Pyroscope pull mode (via Grafana Alloy)

No code changes required — Alloy scrapes /debug/pprof/* endpoints periodically. Configure Alloy to target your service's pprof endpoint.

When using third-party profiling libraries, refer to the library's official documentation for current API signatures.

Real-Time Visualization (Development)

Tool	What it does
statsviz (`github.com/arl/statsviz`)	Real-time browser dashboard at `/debug/statsviz` — heap, GC pauses, goroutines, scheduler. Register with `statsviz.Register(mux)`. Great for local development
expvar (stdlib `expvar`)	JSON metrics at `/debug/vars` — lightweight, no dependencies. Integrates with Netdata, Telegraf, or custom dashboards

Runtime Tuning

Runtime settings control garbage collection frequency, memory limits, CPU scheduling, and compiler optimizations. Tune them after profiling — the defaults are well-chosen for most workloads.

Garbage Collector Tuning

Diagnose: 1- GODEBUG=gctrace=1 — print one line per GC cycle; look for high GC frequency (cycles/s), high CPU% (>5% means GC is competing for CPU), or heap growing faster than expected 2- runtime.ReadMemStats — inspect Alloc, TotalAlloc, NumGC, PauseNs; compare Alloc vs Sys to see how much memory the GC is reclaiming vs how much the OS allocated 3- go tool trace — visualize GC stop-the-world pauses and GC assist stealing CPU from application goroutines; look for long STW bars or frequent assist marks 4- debug.ReadGCStats — get pause time percentiles (p50, p95, p99); high p99 pauses indicate large heap scans or too many pointers 5- runtime/metrics — programmatic access to GC stats for dashboards; monitor /gc/cycles/total, /gc/heap/allocs, /gc/pauses 6- GODEBUG=gcpacertrace=1 — trace the GC pacer's decisions; useful to understand why GC triggers earlier or later than expected 7- Prometheus rate(go_gc_duration_seconds_count[5m]) — monitor GC frequency in production; >2 cycles/s sustained suggests excessive allocation rate

GOGC (default: 100)

Controls the heap growth ratio that triggers the next GC cycle. GOGC=100 means GC runs when the heap doubles since the last collection. Higher values reduce GC frequency but use more memory:

GOGC=50  ./myapp  # memory-constrained: more frequent GC cycles, smaller heap, more CPU spent in GC
GOGC=200 ./myapp  # throughput-oriented: less frequent GC, more memory used
GOGC=off ./myapp  # disable GC entirely (testing only!)
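The relationship between GOGC and the next collection can be approximated with simple arithmetic. The function below is a simplified model, not the runtime's exact pacer: it ignores the GC roots (stacks and globals) that newer Go versions include in the goal, and it ignores GOMEMLIMIT.

```go
package main

import "fmt"

// heapGoal approximates the heap size at which the next GC cycle finishes:
// live heap plus GOGC percent of the live heap. Simplified model only.
func heapGoal(liveBytes uint64, gogc uint64) uint64 {
	return liveBytes + liveBytes*gogc/100
}

func main() {
	const live = 50 << 20 // assume 50 MiB survive the last collection
	for _, gogc := range []uint64{50, 100, 200} {
		fmt.Printf("GOGC=%d: next GC near %d MiB\n", gogc, heapGoal(live, gogc)>>20)
	}
}
```

With a 50 MiB live heap this gives goals of 75, 100, and 150 MiB, which is why raising GOGC trades memory for fewer cycles.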

GOMEMLIMIT (Go 1.19+)

Soft memory limit — the runtime increases GC frequency to stay under this limit. Essential for containerized applications where exceeding the container limit triggers an OOM kill:

# Container with 512MB limit: leave headroom for non-heap memory (goroutine stacks, OS buffers)
GOMEMLIMIT=450MiB ./myapp

# Container with 1GB limit
GOMEMLIMIT=900MiB ./myapp

The GC pacer adjusts collection timing based on both GOGC and GOMEMLIMIT. When the heap approaches the limit, the GC runs more aggressively regardless of GOGC.

Programmatic control

import "runtime/debug"

debug.SetGCPercent(200)                    // equivalent to GOGC=200
debug.SetMemoryLimit(450 * 1024 * 1024)   // 450 MiB soft limit

Use programmatic control for dynamic tuning based on observed workload, or when environment variables cannot be set.
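One way to apply the headroom advice programmatically is sketched below. The helper limitWithHeadroom and the hard-coded 512 MiB container limit are assumptions for illustration; in a real service the limit would come from configuration or the orchestrator.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// limitWithHeadroom returns the container limit reduced by headroomPct
// percent, leaving room for memory the Go runtime does not manage.
func limitWithHeadroom(containerBytes, headroomPct int64) int64 {
	return containerBytes * (100 - headroomPct) / 100
}

func main() {
	const containerLimit = 512 << 20 // assumed container limit: 512 MiB
	limit := limitWithHeadroom(containerLimit, 10)
	prev := debug.SetMemoryLimit(limit) // returns the previously configured limit
	fmt.Printf("soft memory limit set to %d bytes (was %d)\n", limit, prev)
}
```

Services with significant cgo or mmap usage typically need more than 10% headroom, since that memory is outside the runtime's accounting.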

Ballast pattern (pre-Go 1.19)

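Before GOMEMLIMIT, teams allocated a large byte slice at startup and kept it reachable to inflate the live heap size, reducing GC frequency:

ballast := make([]byte, 1<<30) // 1 GiB, kept alive for the life of the process (obsolete pattern)

Prefer GOMEMLIMIT: combined with a higher GOGC it gives the same benefit (fewer GC cycles while memory is plentiful) without a dummy allocation, and the runtime still collects harder as the heap approaches the limit. Use GOMEMLIMIT instead.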

GC Profiling and Diagnostics

GODEBUG=gctrace=1

Prints a line per GC cycle to stderr:

GODEBUG=gctrace=1 ./myapp 2>&1 | head -20

Sample output:

gc 5 @1.234s 2%: 0.012+12+0.9 ms clock, 0.25+8.9/20+18 ms cpu, 45->92->50 MB, 200 MB goal, 8 P

Key fields:

gc 5 — 5th GC cycle
@1.234s — time since program start
2% — percentage of total program time spent in GC since startup
45->92->50 MB — heap size at GC start → heap size at GC end → live heap after the cycle
200 MB goal — target heap size (based on GOGC and GOMEMLIMIT)
8 P — number of processors

Watch for: GC frequency (too often = too many allocations), pause times (high = large heap or many pointers), CPU% (high = tune GOGC or reduce allocations).

runtime.ReadMemStats

Programmatic monitoring for dashboards and alerting:

var m runtime.MemStats
runtime.ReadMemStats(&m)

fmt.Printf("Alloc: %d MB\n", m.Alloc/1024/1024)       // currently allocated
fmt.Printf("TotalAlloc: %d MB\n", m.TotalAlloc/1024/1024) // cumulative
fmt.Printf("Sys: %d MB\n", m.Sys/1024/1024)            // requested from OS
fmt.Printf("NumGC: %d\n", m.NumGC)                      // completed collections
fmt.Printf("LastPause: %d µs\n", m.PauseNs[(m.NumGC+255)%256]/1_000) // most recent pause; usually well below 1 ms
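
For alerting, a rate is often more useful than a single snapshot. The sketch below derives an allocation rate from two TotalAlloc readings; the helper allocRate and the simulated 64 KiB workload are assumptions made for this example, not part of the skill.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// allocRate (hypothetical helper) returns bytes allocated per second
// between two TotalAlloc readings taken elapsedSeconds apart.
func allocRate(prevTotal, curTotal uint64, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 || curTotal < prevTotal {
		return 0
	}
	return float64(curTotal-prevTotal) / elapsedSeconds
}

// sink keeps the allocations below reachable so they land on the heap.
var sink []byte

func main() {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	start := time.Now()

	for i := 0; i < 1000; i++ {
		sink = make([]byte, 64<<10) // simulated workload: 64 KiB per iteration
	}

	runtime.ReadMemStats(&after)
	rate := allocRate(before.TotalAlloc, after.TotalAlloc, time.Since(start).Seconds())
	fmt.Printf("allocation rate: %.1f MiB/s, GC cycles: %d\n",
		rate/(1<<20), after.NumGC-before.NumGC)
}
```

In a service, the same calculation would typically run on a ticker and feed a gauge rather than print.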

GC pacing

The GC pacer predicts when to start the next collection based on:

1. Live heap size after the last collection
2. GOGC percentage: how much growth to allow
3. GOMEMLIMIT: soft ceiling (if set)
4. Current allocation rate: how fast the heap is growing

The pacer starts collection early enough to finish before hitting the target. Fast allocation rates cause earlier starts.
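
The first three inputs can be turned into rough arithmetic. The sketch below follows the target-heap formula from the Go GC guide (live heap + (live heap + GC roots) × GOGC / 100) and caps the result at the memory limit. The function heapGoal is a hypothetical illustration, and treating the limit as a plain heap cap is a simplification, since GOMEMLIMIT covers all memory managed by the runtime.

```go
package main

import "fmt"

// heapGoal (hypothetical helper) approximates the pacer's target heap size.
// All sizes share one unit (MiB here); memLimit <= 0 means no limit is set.
func heapGoal(liveHeap, roots, gogc, memLimit int64) int64 {
	goal := liveHeap + (liveHeap+roots)*gogc/100
	if memLimit > 0 && goal > memLimit {
		goal = memLimit // simplified: the soft limit caps the goal
	}
	return goal
}

func main() {
	fmt.Println(heapGoal(50, 0, 100, 0))    // 100: default GOGC doubles the live heap
	fmt.Println(heapGoal(50, 0, 200, 0))    // 150: GOGC=200 allows 2x growth on top
	fmt.Println(heapGoal(400, 0, 100, 450)) // 450: the limit wins over GOGC
}
```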

Allocation Rate Reduction

Diagnose:

1. go tool pprof -alloc_objects: rank functions by allocation count; the top allocators are where allocation reduction will have the biggest GC impact
2. GODEBUG=gctrace=1: monitor GC frequency before and after reducing allocations; expect fewer GC cycles per second as allocation rate drops
3. Prometheus rate(go_memstats_alloc_bytes_total[5m]): track allocation rate trend in production; compare before/after deploy to detect regressions

Reducing allocations helps more than tuning GOGC — it addresses the root cause instead of managing the symptom:

Value types over pointer types where possible: values can stay on the stack (no GC work), while pointers that outlive the function escape to the heap
Pool frequently allocated objects with sync.Pool (see memory.md)
Preallocate slices and maps (see the samber/cc-skills-golang@golang-data-structures skill)
Avoid interface boxing in hot paths: use typed parameters or generics
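
To make the preallocation point measurable, the sketch below compares a slice grown by append with a preallocated one using testing.AllocsPerRun, which can be called from an ordinary program. The function names and the package-level sink are assumptions for this example; exact counts depend on the Go version, so none are asserted here.

```go
package main

import (
	"fmt"
	"testing"
)

// sink forces the slices to escape so the comparison reflects heap allocations.
var sink []int

// grow appends into a nil slice, so the backing array is reallocated as it grows.
func grow(n int) {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	sink = s
}

// prealloc sizes the backing array once up front.
func prealloc(n int) {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	sink = s
}

// measure (hypothetical helper) reports average allocations per call.
func measure(n int) (growAllocs, preallocAllocs float64) {
	growAllocs = testing.AllocsPerRun(100, func() { grow(n) })
	preallocAllocs = testing.AllocsPerRun(100, func() { prealloc(n) })
	return growAllocs, preallocAllocs
}

func main() {
	g, p := measure(1000)
	fmt.Printf("grow: %.0f allocs/run, prealloc: %.0f allocs/run\n", g, p)
}
```

The same comparison is usually written as a benchmark with -benchmem; this standalone form is only meant as a quick check.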

GOMAXPROCS in Containers

Diagnose:

1. go tool pprof (CPU profile): look for high runtime.schedule or runtime.findRunnable overhead; this indicates too many P's competing for work or too few P's starving goroutines
2. go tool trace: check if goroutines are evenly distributed across P's; uneven distribution suggests GOMAXPROCS is misconfigured for the container
3. GODEBUG=schedtrace=1000: print scheduler state every second; look for runqueue imbalances or idle P's when work is available
4. runtime.GOMAXPROCS(0): query the current value; if it returns the host CPU count (e.g., 64) instead of the container limit (e.g., 2), the runtime is over-scheduling
5. Prometheus rate(process_cpu_seconds_total[5m]): monitor CPU cores consumed in production; if consistently near the GOMAXPROCS value, the app is CPU-saturated

Go 1.25+ improves container CPU detection, particularly for cgroup v2. The runtime sets GOMAXPROCS based on:

Logical CPUs on the machine
Process CPU affinity mask
cgroup CPU quota limits (on Linux)

In a container with 2 CPU cores on a 64-core host running Go 1.25+ with cgroup v2, GOMAXPROCS is correctly set to 2 by default. For cgroup v1 environments, validate the detected value at startup and consider using go.uber.org/automaxprocs to ensure correctness.
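
A startup check along those lines might look like the sketch below. The helper checkProcs and the expected value of 2 are assumptions for illustration; in practice the expected CPU count would come from the deployment configuration.

```go
package main

import (
	"fmt"
	"runtime"
)

// checkProcs (hypothetical helper) reports a mismatch between the detected
// GOMAXPROCS and the CPU limit the operator expects for the container.
func checkProcs(got, want int) error {
	if got != want {
		return fmt.Errorf("GOMAXPROCS=%d, expected %d from the container CPU limit", got, want)
	}
	return nil
}

func main() {
	got := runtime.GOMAXPROCS(0) // 0 queries the value without changing it
	fmt.Printf("GOMAXPROCS=%d NumCPU=%d\n", got, runtime.NumCPU())

	const expectedCPUs = 2 // assumed container limit for this example
	if err := checkProcs(got, expectedCPUs); err != nil {
		fmt.Println("warning:", err)
	}
}
```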

For Go 1.24 and earlier, use the go.uber.org/automaxprocs library to handle container CPU detection:

// Pre-Go 1.25: explicit container-aware detection
import _ "go.uber.org/automaxprocs"

func main() {
    // GOMAXPROCS is now correctly set to container CPU limit
    startServer()
}

Manual override (if needed):

GOMAXPROCS=2 ./myapp
GODEBUG=updatemaxprocs=0 ./myapp  # disable dynamic updates (Go 1.25+)

Known limitations (Go 1.25): cgroup v1 on certain systems (Oracle OCPUs) may not properly detect Kubernetes CPU limits. Manually set GOMAXPROCS as a workaround in these cases.

Profile-Guided Optimization (PGO)

Diagnose:

1. go tool pprof (CPU profile): collect a representative production profile (30+ seconds); look for hot interface method calls and deep call chains that PGO can optimize via devirtualization and inlining
2. go test -bench: benchmark before and after placing default.pgo; expect 2-7% improvement on interface-heavy code, less on already-optimized paths

Go 1.21+ supports PGO: the compiler uses a production CPU profile to make better inlining and devirtualization decisions. Expected improvement: 2-7% for minimal effort (the figure reported for Go 1.21; the Go documentation cites roughly 2-14% as of Go 1.22).

Workflow:

1. Collect a production CPU profile (30+ seconds of representative load):

   curl -o cpu.pprof "http://localhost:6060/debug/pprof/profile?seconds=60"  # quote the URL so the shell does not interpret "?"

2. Place as default.pgo in the main package directory:

   cp cpu.pprof ./cmd/myapp/default.pgo

3. Build — go build auto-detects default.pgo:

   go build ./cmd/myapp

What the compiler optimizes:

Inlining — hot function calls are inlined more aggressively
Devirtualization — interface method calls with high probability of targeting specific types become direct calls

When it helps most: code with many interface calls, hot inlining opportunities, deep call stacks. When it helps least: already-optimized code, memory-bound workloads.
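
For orientation, the sketch below shows the shape of call site that devirtualization targets: an interface method called in a hot loop where one concrete type dominates. The Encoder interface and both implementations are hypothetical, and whether the compiler actually devirtualizes this call depends on the profile and the Go version.

```go
package main

import "fmt"

// Encoder is a hypothetical interface with one hot method.
type Encoder interface {
	Encode(v int) int
}

// fastEncoder stands in for the type that dominates production traffic.
type fastEncoder struct{}

func (fastEncoder) Encode(v int) int { return v * 2 }

// rareEncoder stands in for a seldom-used implementation.
type rareEncoder struct{}

func (rareEncoder) Encode(v int) int { return v + 1 }

// encodeAll calls Encode through the interface on every iteration. If a
// profile shows fastEncoder dominating this call site, PGO may insert a
// type check and call fastEncoder.Encode directly.
func encodeAll(e Encoder, values []int) int {
	total := 0
	for _, v := range values {
		total += e.Encode(v)
	}
	return total
}

func main() {
	values := []int{1, 2, 3, 4}
	fmt.Println(encodeAll(fastEncoder{}, values)) // 20
	fmt.Println(encodeAll(rareEncoder{}, values)) // 14
}
```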

Rebuild profiles after significant code changes — stale profiles can mislead the compiler.

Logging Overhead in Hot Paths

Diagnose:

1. go tool pprof (CPU profile): look for fmt.Sprintf, log.Printf, or slog.(*Logger).log appearing in hot paths; these indicate log formatting consuming CPU even when the log level filters the message
2. go build -gcflags="-m": check if log arguments escape to the heap; expect "moved to heap" for arguments boxed into any interface by logging functions
3. go test -bench -benchmem: benchmark with logging enabled vs disabled; if allocs/op doesn't change, the logger is allocating even when the level is off

Log formatting allocates memory and consumes CPU even when the message is discarded because it's below the configured level:

// Bad — fmt.Sprintf runs BEFORE the logger checks the level
logger.Debug(fmt.Sprintf("processing item %d with data %v", item.ID, item.Data))

// Good — slog defers formatting until level check passes (Go 1.21+)
slog.Debug("processing item", slog.Int("id", item.ID), slog.Any("data", item.Data))

// Best — LogAttrs: zero allocations when level is disabled
slog.LogAttrs(ctx, slog.LevelDebug, "processing item",
    slog.Int("id", item.ID))

In hot paths, even slog.Any can allocate. Prefer typed attributes: slog.Int, slog.String, slog.Bool.
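
When an attribute value is itself expensive to compute, typed attributes alone do not help, because the arguments are evaluated before the call. One option is to guard the call with Logger.Enabled, as in the sketch below. The names expensiveDump, logItem, and dumpsFor are hypothetical, and the counter exists only to show whether the expensive work ran.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log/slog"
)

// dumps counts how often the expensive formatting ran.
var dumps int

// expensiveDump stands in for costly work done only to build a log attribute.
func expensiveDump(id int) string {
	dumps++
	return fmt.Sprintf("item-%d-state", id)
}

// logItem checks the level first, so expensiveDump runs only when needed.
func logItem(ctx context.Context, logger *slog.Logger, id int) {
	if !logger.Enabled(ctx, slog.LevelDebug) {
		return
	}
	logger.LogAttrs(ctx, slog.LevelDebug, "processing item",
		slog.Int("id", id),
		slog.String("state", expensiveDump(id)))
}

// dumpsFor logs one item and reports how many dumps were computed.
func dumpsFor(debugEnabled bool) int {
	level := slog.LevelInfo
	if debugEnabled {
		level = slog.LevelDebug
	}
	logger := slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{Level: level}))
	dumps = 0
	logItem(context.Background(), logger, 42)
	return dumps
}

func main() {
	fmt.Println("dumps with debug disabled:", dumpsFor(false)) // 0
	fmt.Println("dumps with debug enabled:", dumpsFor(true))   // 1
}
```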

Panic/Recover Cost

Diagnose:

1. go tool pprof (CPU profile): look for runtime.gopanic or runtime.gorecover in the profile; their presence in hot paths means panic/recover is being used for control flow
2. go test -bench: benchmark panic/recover vs error-return versions; expect 10-100x overhead from stack unwinding and defer execution

panic triggers stack unwinding, running all deferred functions up the call stack. recover catches the panic but the unwinding itself is expensive. Never use panic/recover for control flow:

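// Bad: panic/recover as control flow for a normal condition (invalid input)
func parseOrZero(s string) (v int) {
    defer func() { recover() }() // swallows the panic, v stays 0
    return mustAtoi(s)           // hypothetical helper that panics on invalid input
}

// Good: explicit error check, no panic overhead
v, err := strconv.Atoi(s)
if err != nil {
    continue // skip invalid input (inside the surrounding loop)
}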

Panic is appropriate only for truly unrecoverable situations (programmer errors, corrupted state). Always convert panics to errors at package boundaries.
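
Converting a panic to an error at a boundary can be sketched as follows. The wrapper name safeCall is an assumption for this example; the pattern relies only on defer, recover, and a named return value.

```go
package main

import "fmt"

// safeCall (hypothetical helper) runs fn and turns any panic into an error,
// so callers outside the package only ever see error values.
func safeCall(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	fmt.Println(safeCall(func() {}))                // <nil>
	fmt.Println(safeCall(func() { panic("boom") })) // recovered panic: boom
}
```

This keeps the recover cost on the exceptional path only; ordinary failures should still travel as returned errors.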

How it compares

Pick golang-performance over golang-error-handling when the task is profiling and tuning based on runtime metrics rather than middleware error logging.

FAQ

What goroutine threshold does golang-performance alert on?

golang-performance defines a GoroutineLeak Prometheus alert when go_goroutines exceeds 10000 for 5 minutes, prompting checks for leaked goroutines in Go services.

How does golang-performance detect GC problems?

golang-performance uses a HighGCPauseTime alert that fires when rate(go_gc_duration_seconds_sum[5m]) divided by rate(go_gc_duration_seconds_count[5m]) exceeds 0.01, meaning the average GC pause is above 10 ms.

Is Golang Performance safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
