Golang Observability

Name: Golang Observability
Author: samber

samber/cc-skills-golang

33.8k installs
2.8k repo stars
Updated July 27, 2026

golang-observability is a Go skill for production observability including logging, metrics, tracing, and profiling.

About

Go production observability including structured logging (slog), Prometheus metrics, OpenTelemetry tracing, pprof profiling, and alerting. Essential for production Go services needing real-time insights into system behavior and performance.

Structured logging with slog and multi-handler routing
Prometheus metrics and OpenTelemetry tracing
pprof profiling and alerting setup

Golang Observability by the numbers

33,788 all-time installs (skills.sh)
+515 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #10 of 1,453 DevOps & CI/CD skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/samber/cc-skills-golang --skill golang-observability

Security audit: 3 / 3 scanners passed

How do you add OpenTelemetry and Prometheus to Go?

Production Go services require comprehensive observability covering logs, metrics, traces, and profiling.

Who is it for?

production services, complex distributed systems, performance debugging, incident response

Skip if: local development, simple scripts, non-critical services

When should I use this skill?

Instrumenting Go services for production, migrating zap or logrus to slog, or adding metrics, traces, profiling, or GDPR-compliant RUM tracking.

What you get

slog logging, Prometheus metric exporters, OpenTelemetry trace propagation, Grafana dashboards, and alerting hooks in Go services.

instrumented Go service
Grafana dashboard config
OpenTelemetry trace setup

By the numbers

Covers 161 lines of observability guidance
Includes slog, Prometheus, OpenTelemetry, and pprof

Files

SKILL.md (Markdown)

Persona: You are a Go observability engineer. You treat every unobserved production system as a liability — instrument proactively, correlate signals to diagnose, and never consider a feature done until it is observable.

Modes:

Coding / instrumentation mode (default): add observability to new or existing code by declaring metrics, adding spans, setting up structured logging, and wiring pprof toggles. Follow the sequential instrumentation guide.
Review mode: review a PR's instrumentation changes. Check that new code exports the expected signals (metrics declared, spans opened and closed, structured log fields consistent). Runs sequentially.
Audit mode: audit existing observability coverage across a codebase. Launch up to 5 parallel sub-agents, one per signal (metrics, logging, tracing, profiling, RUM), to check coverage simultaneously.

Community default. A company skill that explicitly supersedes the samber/cc-skills-golang@golang-observability skill takes precedence.

Go Observability Best Practices

Observability is the ability to understand a system's internal state from its external outputs. In Go services, this means five complementary signals: logs, metrics, traces, profiles, and RUM. Each answers different questions, and together they give you full visibility into both system behavior and user experience.

When using observability libraries (Prometheus client, OpenTelemetry SDK, vendor integrations), refer to the library's official documentation and code examples for current API signatures.

Best Practices Summary

1. Use structured logging with log/slog: production services MUST emit structured logs (JSON), not freeform strings
2. Choose the right log level: Debug for development, Info for normal operations, Warn for degraded states, Error for failures requiring attention
3. Log with context: use slog.InfoContext(ctx, ...) to correlate logs with traces
4. Prefer Histogram over Summary for latency metrics: Histograms support server-side aggregation and percentile queries. Every HTTP endpoint MUST have latency and error rate metrics.
5. Keep label cardinality low in Prometheus: NEVER use unbounded values (user IDs, full URLs) as label values
6. Track percentiles (P50, P90, P99, P99.9) using Histograms + histogram_quantile() in PromQL
7. Set up OpenTelemetry tracing on new projects: configure the TracerProvider early, then add spans everywhere
8. Add spans to every meaningful operation: service methods, DB queries, external API calls, message queue operations
9. Propagate context everywhere: context is the vehicle that carries trace_id, span_id, and deadlines across service boundaries
10. Enable profiling via environment variables: toggle pprof and continuous profiling on/off without redeploying
11. Correlate signals: inject trace_id into logs, use exemplars to link metrics to traces
12. A feature is not done until it is observable: declare metrics, add proper logging, create spans
13. [awesome-prometheus-alerts](https://samber.github.io/awesome-prometheus-alerts/) provides ~500 ready-to-use alerting rules organized by technology for infrastructure and dependency monitoring
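
As an illustration of practices 1 to 3, here is a minimal stdlib-only sketch of a JSON logger at Info level, called through the context-aware variants. The helper names (newLogger, logOrderCreated, demoRecord) and the order_id value are hypothetical and exist only for this example.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"os"
)

// newLogger builds a JSON logger at Info level, so Debug records are dropped.
func newLogger(w io.Writer) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, &slog.HandlerOptions{Level: slog.LevelInfo}))
}

// logOrderCreated emits one structured record into a buffer and returns it
// decoded, which makes the key-value structure easy to inspect.
func logOrderCreated(ctx context.Context, orderID string) (map[string]any, error) {
	var buf bytes.Buffer
	logger := newLogger(&buf)
	logger.DebugContext(ctx, "cache miss", "order_id", orderID) // filtered out at Info level
	logger.InfoContext(ctx, "order created", "order_id", orderID)

	var rec map[string]any
	err := json.Unmarshal(buf.Bytes(), &rec)
	return rec, err
}

// demoRecord runs the example with a hypothetical order ID.
func demoRecord() (map[string]any, error) {
	return logOrderCreated(context.Background(), "ord_42")
}

func main() {
	// In a service, install the JSON logger once at startup.
	slog.SetDefault(newLogger(os.Stdout))
	slog.InfoContext(context.Background(), "service started", "port", 8080)

	rec, err := demoRecord()
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(rec["level"], rec["msg"], rec["order_id"])
}
```

Because each record is a single JSON object, a log collector can index fields such as order_id without parsing free-form text.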

Cross-References

See samber/cc-skills-golang@golang-error-handling skill for the single handling rule (log an error or return it, never both).
See samber/cc-skills-golang@golang-troubleshooting skill for using observability signals to diagnose production issues.
See samber/cc-skills-golang@golang-security skill for protecting pprof endpoints and avoiding PII in logs.
See samber/cc-skills-golang@golang-context skill for propagating trace context across service boundaries.
See samber/cc-skills@promql-cli skill for querying and exploring PromQL expressions against Prometheus from the CLI.

Go 1.26+: slog multi-handler

For simple fan-out to multiple slog handlers, prefer stdlib slog.NewMultiHandler before adding third-party handler-composition dependencies.

logger := slog.New(slog.NewMultiHandler(
    slog.NewJSONHandler(os.Stdout, nil),
    auditHandler,
))

Use third-party slog handler libraries only when the stdlib handler composition is insufficient.

The Five Signals

| Signal | Question it answers | Tool | When to use |
| --- | --- | --- | --- |
| Logs | What happened? | `log/slog` | Discrete events, errors, audit trails |
| Metrics | How much / how fast? | Prometheus client | Aggregated measurements, alerting, SLOs |
| Traces | Where did time go? | OpenTelemetry | Request flow across services, latency breakdown |
| Profiles | Why is it slow / using memory? | pprof, Pyroscope | CPU hotspots, memory leaks, lock contention |
| RUM | How do users experience it? | PostHog, Segment | Product analytics, funnels, session replay |

Detailed Guides

Each signal has a dedicated guide with full code examples, configuration patterns, and cost analysis:

[Structured Logging](references/logging.md) — Why structured logging matters for log aggregation at scale. Covers log/slog setup, log levels (Debug/Info/Warn/Error) and when to use each, request correlation with trace IDs, context propagation with slog.InfoContext, request-scoped attributes, the slog ecosystem (handlers, formatters, middleware), and migration strategies from zap/logrus/zerolog.

[Metrics Collection](references/metrics.md): Prometheus client setup and the four metric types (Counter for rate-of-change, Gauge for snapshots, Histogram for latency aggregation, Summary for client-side quantiles). Deep dive: why Histograms beat Summaries (server-side aggregation, support for histogram_quantile in PromQL), naming conventions, the PromQL-as-comments convention (write queries above metric declarations for discoverability), production-grade PromQL examples, multi-window SLO burn rate alerting, and the high-cardinality label problem (why unbounded values like user IDs destroy performance).

[Distributed Tracing](references/tracing.md) — When and how to use OpenTelemetry SDK to trace request flows across services. Covers spans (creating, attributes, status recording), otelhttp middleware for HTTP instrumentation, error recording with span.RecordError(), trace sampling (why you can't collect everything at scale), propagating trace context across service boundaries, and cost optimization.

[Profiling](references/profiling.md) — On-demand profiling with pprof (CPU, heap, goroutine, mutex, block profiles) — how to enable it in production, secure it with auth, and toggle via environment variables without redeploying. Continuous profiling with Pyroscope for always-on performance visibility. Cost implications of each profiling type and mitigation strategies.

[Real User Monitoring](references/rum.md) — Understanding how users actually experience your service. Covers product analytics (event tracking, funnels), Customer Data Platform integration, and critical compliance: GDPR/CCPA consent checks, data subject rights (user deletion endpoints), and privacy checklist for tracking. Server-side event tracking (PostHog, Segment) and identity key best practices.

[Alerting](references/alerting.md): Proactive problem detection. Covers the four golden signals (latency, traffic, errors, saturation), the awesome-prometheus-alerts collection of ~500 ready-to-use rules organized by technology, Go runtime alerts (goroutine leaks, GC pressure, OOM risk), severity levels, and common mistakes that break alerting (using irate instead of rate, omitting the for: duration that prevents flapping).

[Grafana Dashboards](references/dashboards.md) — Prebuilt dashboards for Go runtime monitoring (heap allocation, GC pause frequency, goroutine count, CPU). Explains the standard dashboards to install, how to customize them for your service, and when each dashboard answers a different operational question.
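
To make the profiling guide's environment-variable toggle concrete, here is a minimal stdlib-only sketch. The variable name PPROF_ENABLED, the /healthz route, and the helper names are assumptions for illustration, not taken from the guide. The value is read once at startup, so flipping it needs a restart but no rebuild.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
	"os"
)

// newDebugMux registers the pprof handlers on a dedicated mux only when enabled.
// Note: importing net/http/pprof also registers handlers on http.DefaultServeMux,
// so a service using this pattern should not expose the default mux.
func newDebugMux(enabled bool) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	if enabled {
		mux.HandleFunc("/debug/pprof/", pprof.Index)
		mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
		mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
		mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
		mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	}
	return mux
}

// pprofStatus reports the HTTP status of GET /debug/pprof/ for a given toggle state.
func pprofStatus(enabled bool) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/", nil)
	newDebugMux(enabled).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	// PPROF_ENABLED is a hypothetical variable name.
	enabled := os.Getenv("PPROF_ENABLED") == "true"
	fmt.Println("pprof enabled:", enabled)
	fmt.Println("GET /debug/pprof/ status:", pprofStatus(enabled))
	// A real service would serve this mux on an internal address, for example:
	// http.ListenAndServe("127.0.0.1:6060", newDebugMux(enabled))
}
```

Serving the debug mux on a separate, non-public listener is one way to keep the profiling endpoints away from user traffic; authentication can be layered on top as the guide recommends.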

Correlating Signals

Signals are most powerful when connected. A trace_id in your logs lets you jump from a log line to the full request trace. An exemplar on a metric links a latency spike to the exact trace that caused it.

Logs + Traces: `otelslog` bridge

import "go.opentelemetry.io/contrib/bridges/otelslog"

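// Create a handler that forwards slog records to the OpenTelemetry Logs SDK.
// By default it uses the globally registered LoggerProvider, which must be configured first.
handler := otelslog.NewHandler("my-service")
slog.SetDefault(slog.New(handler))

// Now every slog call that receives ctx carries the active span's trace correlation
slog.InfoContext(ctx, "order created", "order_id", orderID)
// The exported record carries trace_id and span_id alongside msg="order created"
// (the exact output format depends on the configured exporter)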
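
To see why the *Context variants matter, the following stdlib-only sketch uses a small wrapper handler that copies an ID from ctx into each record. It is not how the otelslog bridge is implemented; traceIDKey and ctxHandler are hypothetical stand-ins for the span context that OpenTelemetry stores in ctx.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// traceIDKey is a hypothetical context key standing in for the OTel span context.
type traceIDKey struct{}

// ctxHandler wraps another handler and adds trace_id when ctx carries one.
type ctxHandler struct{ inner slog.Handler }

func (h ctxHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.inner.Enabled(ctx, l)
}

func (h ctxHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(traceIDKey{}).(string); ok {
		r = r.Clone() // avoid mutating state shared with other copies of the record
		r.AddAttrs(slog.String("trace_id", id))
	}
	return h.inner.Handle(ctx, r)
}

func (h ctxHandler) WithAttrs(as []slog.Attr) slog.Handler {
	return ctxHandler{h.inner.WithAttrs(as)}
}

func (h ctxHandler) WithGroup(name string) slog.Handler {
	return ctxHandler{h.inner.WithGroup(name)}
}

// logLines logs the same message with and without ctx and returns both outputs.
func logLines() (withCtx, withoutCtx string) {
	var buf bytes.Buffer
	logger := slog.New(ctxHandler{slog.NewJSONHandler(&buf, nil)})
	ctx := context.WithValue(context.Background(), traceIDKey{}, "abc123")

	logger.InfoContext(ctx, "order created")
	withCtx = buf.String()
	buf.Reset()

	logger.Info("order created") // the handler only sees context.Background()
	withoutCtx = buf.String()
	return withCtx, withoutCtx
}

// hasTraceID reports whether a JSON log line carries the example trace ID.
func hasTraceID(line string) bool {
	return strings.Contains(line, `"trace_id":"abc123"`)
}

func main() {
	with, without := logLines()
	fmt.Println("InfoContext has trace_id:", hasTraceID(with))
	fmt.Println("Info has trace_id:", hasTraceID(without))
}
```

The handler can only read what arrives in ctx, so a plain slog.Info call inside a traced request produces a record with no correlation fields.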

Metrics + Traces: Exemplars

// When recording a histogram observation, attach the trace_id as an exemplar
// so you can jump from a P99 spike directly to the offending trace
obs := histogram.WithLabelValues("POST", "/orders")
if eo, ok := obs.(prometheus.ExemplarObserver); ok {
    eo.ObserveWithExemplar(duration, prometheus.Labels{"trace_id": traceID})
} else {
    obs.Observe(duration)
}

Migrating Legacy Loggers

If the project currently uses zap, logrus, or zerolog, migrate to log/slog. It has been the standard library's structured logger since Go 1.21, has a stable API, and the ecosystem has consolidated around it. Continuing with third-party loggers means maintaining an extra dependency for no benefit.

Migration strategy:

1. Add slog as the new logger with slog.SetDefault()
2. Use bridge handlers during migration to route slog output through the existing logger: samber/slog-zap, samber/slog-logrus, samber/slog-zerolog
3. Gradually replace all zap.L().Info(...) / logrus.Info(...) / log.Info().Msg(...) calls with slog.Info(...)
4. Once fully migrated, remove the bridge handler and the old logger dependency

Definition of Done for Observability

A feature is not production-ready until it is observable. Before marking a feature as done, verify:

[ ] Metrics declared — counters for operations/errors, histograms for latencies, gauges for saturation. Each metric var has PromQL queries and alert rules as comments above its declaration.
[ ] Logging is proper — structured key-value pairs with slog, context variants used (slog.InfoContext), no PII in logs, errors MUST be either logged OR returned (NEVER both).
[ ] Spans created — every service method, DB query, and external API call has a span with relevant attributes, errors recorded with span.RecordError().
[ ] Dashboards and alerts exist — the PromQL from your metric comments is wired into Grafana dashboards and Prometheus alerting rules. Ready-to-use alert rules for common infrastructure dependencies are available at awesome-prometheus-alerts.
[ ] RUM events tracked — key business events tracked server-side (PostHog/Segment), identity key is user_id (not email), consent checked before tracking.
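
The "logged OR returned, never both" item in the checklist can be sketched as follows, using only the standard library. The function names (queryUser, loadProfile, handleRequest, run) and the errNotFound sentinel are hypothetical: lower layers wrap and return, and only the top-level handler logs.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log/slog"
	"net/http"
	"strings"
)

var errNotFound = errors.New("not found")

// queryUser is a hypothetical data-access call that always fails in this sketch.
func queryUser(id string) error { return errNotFound }

// loadProfile adds context and returns the error without logging it.
func loadProfile(id string) error {
	if err := queryUser(id); err != nil {
		return fmt.Errorf("querying user: %w", err)
	}
	return nil
}

// handleRequest is the top level: the only place the error is logged.
// It handles the error by mapping it to a status code instead of returning it.
func handleRequest(logger *slog.Logger, id string) int {
	if err := loadProfile(id); err != nil {
		logger.Error("request failed", "error", fmt.Errorf("loading profile: %w", err))
		if errors.Is(err, errNotFound) {
			return http.StatusNotFound
		}
		return http.StatusInternalServerError
	}
	return http.StatusOK
}

// run executes one request and reports the status, the number of log records, and the raw output.
func run() (status, records int, output string) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	status = handleRequest(logger, "u1")
	output = buf.String()
	return status, strings.Count(output, "\n"), output
}

func main() {
	status, records, output := run()
	fmt.Printf("status=%d log_records=%d\n%s", status, records, output)
}
```

The wrapped message keeps the full call chain in a single log record, and errors.Is still matches the sentinel through both layers of wrapping.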

Common Mistakes

// ✗ Bad — log AND return (error gets logged multiple times up the chain)
if err != nil {
    slog.Error("query failed", "error", err)
    return fmt.Errorf("query: %w", err)
}

// ✓ Good — return with context, log once at the top level
if err != nil {
    return fmt.Errorf("querying users: %w", err)
}

// ✗ Bad — high-cardinality label (unbounded user IDs)
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()

// ✓ Good — bounded label values only
httpRequests.WithLabelValues(r.Method, routePattern).Inc()

// ✗ Bad — not passing context (breaks trace propagation)
result, err := db.Query("SELECT ...")

// ✓ Good — context flows through, trace continues
result, err := db.QueryContext(ctx, "SELECT ...")

// ✗ Bad — using Summary for latency (can't aggregate across instances)
prometheus.NewSummary(prometheus.SummaryOpts{
    Name:       "http_request_duration_seconds",
    Objectives: map[float64]float64{0.99: 0.001},
})

// ✓ Good — use Histogram (aggregatable, supports histogram_quantile)
prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Buckets: prometheus.DefBuckets,
})
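
To illustrate why the Summary variant above cannot be aggregated across instances, here is a stdlib-only sketch. It does not use the Prometheus client: the histogram type, the bucket bounds, and the interpolation are simplified assumptions in the spirit of histogram_quantile(). Bucket counts from two instances can simply be added before estimating a percentile, whereas averaging each instance's own percentile gives a different answer.

```go
package main

import "fmt"

// histogram holds cumulative bucket counts, similar to "le" buckets.
// counts[i] is the number of observations <= bounds[i]; total is the +Inf bucket.
type histogram struct {
	bounds []float64
	counts []uint64
	total  uint64
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

func (h *histogram) observe(v float64) {
	h.total++
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
}

// merge adds the bucket counts of two instances that share the same bounds.
func merge(a, b *histogram) *histogram {
	out := newHistogram(a.bounds)
	out.total = a.total + b.total
	for i := range out.counts {
		out.counts[i] = a.counts[i] + b.counts[i]
	}
	return out
}

// quantile estimates the q-quantile by linear interpolation inside the
// bucket that contains the target rank.
func (h *histogram) quantile(q float64) float64 {
	if h.total == 0 {
		return 0
	}
	rank := q * float64(h.total)
	prevBound, prevCount := 0.0, 0.0
	for i, b := range h.bounds {
		c := float64(h.counts[i])
		if c >= rank {
			if c == prevCount {
				return b
			}
			return prevBound + (b-prevBound)*(rank-prevCount)/(c-prevCount)
		}
		prevBound, prevCount = b, c
	}
	return h.bounds[len(h.bounds)-1] // rank falls beyond the highest finite bucket
}

// demo compares the P95 of merged buckets with the average of per-instance P95s.
func demo() (merged, averaged float64) {
	bounds := []float64{0.1, 0.5, 1, 2}
	fast, slow := newHistogram(bounds), newHistogram(bounds)
	for i := 0; i < 90; i++ {
		fast.observe(0.05) // instance A: 90 fast requests
	}
	for i := 0; i < 10; i++ {
		slow.observe(1.5) // instance B: 10 slow requests
	}
	merged = merge(fast, slow).quantile(0.95)
	averaged = (fast.quantile(0.95) + slow.quantile(0.95)) / 2
	return merged, averaged
}

func main() {
	merged, averaged := demo()
	fmt.Printf("P95 from merged buckets: %.3f s\n", merged)
	fmt.Printf("average of per-instance P95s: %.3f s\n", averaged)
}
```

In this made-up workload the merged estimate lands in the 1 to 2 second bucket where the slow requests actually are, while the averaged figure is noticeably lower, which is the kind of distortion that precomputed per-instance quantiles introduce.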

[
  {
    "id": 1,
    "name": "histogram-bucket-misconfiguration-sub-millisecond",
    "description": "Tests whether the model catches that prometheus.DefBuckets are wrong for sub-millisecond operations",
    "prompt": "I'm adding a Prometheus histogram for an in-memory cache lookup that typically completes in 50-500 microseconds. Here's my declaration:\n\nvar cacheLookupDuration = promauto.NewHistogram(prometheus.HistogramOpts{\n    Namespace: \"myapp\",\n    Name:      \"cache_lookup_duration_seconds\",\n    Buckets:   prometheus.DefBuckets,\n})\n\nDefBuckets are the default so this should be fine, right?",
    "trap": "Without the skill, the model validates the code since DefBuckets are the 'recommended default'. The skill teaches that DefBuckets start at 5ms — all sub-millisecond observations land in the first bucket, making P50/P99 meaningless.",
    "assertions": [
      {"id": "1.1", "text": "Identifies that prometheus.DefBuckets (.005, .01, .025, .05, .1, ... seconds) are wrong for sub-millisecond operations"},
      {"id": "1.2", "text": "Explains that all 50-500µs observations would land in a single bucket, making histogram_quantile() return inaccurate or meaningless percentiles"},
      {"id": "1.3", "text": "Recommends custom bucket boundaries in the microsecond range (e.g., 0.0001, 0.0002, 0.0005, 0.001, 0.002 seconds)"},
      {"id": "1.4", "text": "Shows how to define custom buckets using prometheus.LinearBuckets, prometheus.ExponentialBuckets, or an explicit []float64 slice"},
      {"id": "1.5", "text": "Explains the general rule: buckets should cover the expected range of values with sufficient resolution around the target percentiles"}
    ]
  },
  {
    "id": 2,
    "name": "promql-comments-convention-discoverability",
    "description": "Tests the PromQL-as-comments convention above metric variable declarations",
    "prompt": "My team declares Prometheus metrics scattered across many files. When writing alerts or dashboards, engineers have to grep the codebase to find the metric name and then guess what PromQL to write. How do I make metrics more self-documenting without adding a wiki or external doc?",
    "trap": "Without the skill, the model suggests external documentation, a README section, or godoc comments with metric descriptions — missing the PromQL-as-comments convention taught by the skill",
    "assertions": [
      {"id": "2.1", "text": "Recommends placing PromQL queries and alert expressions as comments directly above each metric variable declaration"},
      {"id": "2.2", "text": "Shows example with Dashboard: and Alert: comment lines containing actual PromQL above the var block"},
      {"id": "2.3", "text": "Explains that colocating PromQL with the metric declaration means queries are reviewed in PRs alongside metric changes"},
      {"id": "2.4", "text": "Mentions that when the metric name or labels change, the PromQL comments change in the same commit — preventing stale queries"},
      {"id": "2.5", "text": "Shows that this enables new team members to understand the metric's purpose and how to query it at a glance"}
    ]
  },
  {
    "id": 3,
    "name": "high-cardinality-label-trap",
    "description": "Tests whether the model catches high-cardinality label usage in Prometheus metrics",
    "prompt": "I'm adding Prometheus metrics to my Go API. For tracking request counts, I'm using:\n\nhttpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()\n\nThis gives me per-user, per-endpoint visibility. Any concerns?",
    "trap": "Without the skill, the model may praise the granularity or only mention minor concerns, missing the critical cardinality explosion problem",
    "assertions": [
      {"id": "3.1", "text": "Identifies userID as a high-cardinality label that will cause problems"},
      {"id": "3.2", "text": "Identifies r.URL.Path as potentially high-cardinality (should use route template instead)"},
      {"id": "3.3", "text": "Explains that each unique label combination creates a separate time series"},
      {"id": "3.4", "text": "Warns about memory explosion on the Prometheus server from unbounded labels"},
      {"id": "3.5", "text": "Recommends using route patterns/templates (e.g., /users/:id) instead of actual paths"},
      {"id": "3.6", "text": "Suggests using traces (not metrics) for high-cardinality data like user IDs"}
    ]
  },
  {
    "id": 4,
    "name": "production-json-logging",
    "description": "Tests whether the model recommends JSON handler for production and explains why plain text is problematic",
    "prompt": "I'm setting up slog for my Go production service. I like the TextHandler output because it's readable. Here's my setup:\n\nslog.SetDefault(slog.New(slog.NewTextHandler(os.Stdout, nil)))\n\nShould I use this in production?",
    "trap": "Without the skill, the model may say TextHandler is fine since it produces structured key=value output",
    "assertions": [
      {"id": "4.1", "text": "Recommends JSONHandler for production, not TextHandler"},
      {"id": "4.2", "text": "Explains that plain-text multiline logs (e.g., stack traces) get split into separate records by log collectors"},
      {"id": "4.3", "text": "Suggests TextHandler is appropriate for development only"},
      {"id": "4.4", "text": "Shows the correct JSONHandler setup with slog.LevelInfo for production"}
    ]
  },
  {
    "id": 5,
    "name": "slog-context-variant-trace-correlation",
    "description": "Tests whether the model insists on *Context variants of slog for trace correlation",
    "prompt": "I'm adding logging to my Go service that already has OpenTelemetry tracing configured with otelslog bridge. Here's my logging code:\n\nfunc (s *OrderService) Create(ctx context.Context, req CreateOrderRequest) error {\n    slog.Info(\"creating order\", \"order_id\", req.ID)\n    // ... business logic ...\n    slog.Error(\"order creation failed\", \"error\", err)\n    return err\n}\n\nAnything wrong with my logging?",
    "trap": "Without the skill, the model may not flag the missing context parameter since the logging looks correct",
    "assertions": [
      {"id": "5.1", "text": "Identifies that slog.Info and slog.Error should use their *Context variants (slog.InfoContext, slog.ErrorContext)"},
      {"id": "5.2", "text": "Explains that without ctx, trace_id and span_id won't be injected into log records"},
      {"id": "5.3", "text": "Shows the corrected code using slog.InfoContext(ctx, ...) and slog.ErrorContext(ctx, ...)"},
      {"id": "5.4", "text": "Mentions that the otelslog bridge automatically injects trace correlation when context is passed"}
    ]
  },
  {
    "id": 6,
    "name": "multi-window-burn-rate-slo-alerting",
    "description": "Tests knowledge of multi-window burn-rate SLO alerting instead of simple threshold",
    "prompt": "My Go API has a 99.9% availability SLO. I have this alert:\n\n- alert: HighErrorRate\n  expr: rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m]) > 0.001\n  for: 5m\n\nI get too many false positives from brief spikes but also miss slow degradation that stays just under 0.1%. How do I improve my alerting strategy?",
    "trap": "Without the skill, the model suggests adjusting the threshold or the for: duration, missing the multi-window burn-rate approach that the skill specifically teaches",
    "assertions": [
      {"id": "6.1", "text": "Recommends multi-window burn-rate alerting to address both false positives and slow burn scenarios"},
      {"id": "6.2", "text": "Explains error budget and burn rate concepts"},
      {"id": "6.3", "text": "Shows a fast burn window (e.g., 5m + 1h, ~14x burn rate) for critical/page alerts"},
      {"id": "6.4", "text": "Shows a slow burn window (e.g., 2h + 24h, 1-2x burn rate) for warning/ticket alerts"},
      {"id": "6.5", "text": "Explains that ANDing short and long windows eliminates false positives from transient spikes"}
    ]
  },
  {
    "id": 7,
    "name": "irate-vs-rate-for-alerts",
    "description": "Tests whether the model catches irate() usage in alerting rules",
    "prompt": "I'm writing a Prometheus alerting rule for my Go service to detect high error rates:\n\n- alert: HighErrorRate\n  expr: irate(http_requests_total{status=~\"5..\"}[5m]) > 0.01\n\nDoes this look correct?",
    "trap": "Without the skill, the model may approve irate since it's a valid PromQL function, missing that irate is too volatile for alerting",
    "assertions": [
      {"id": "7.1", "text": "Identifies irate() as inappropriate for alerting rules"},
      {"id": "7.2", "text": "Explains that irate reacts to a single scrape interval and is too volatile, causing false positives"},
      {"id": "7.3", "text": "Recommends rate() instead of irate() for alerts"},
      {"id": "7.4", "text": "Recommends adding a for: duration to avoid firing on transient spikes"},
      {"id": "7.5", "text": "Shows the corrected alert rule using rate() with a for: clause"}
    ]
  },
  {
    "id": 8,
    "name": "alert-missing-for-duration",
    "description": "Tests whether the model catches alerts without a for: duration clause",
    "prompt": "Here's my Prometheus alert for high P99 latency:\n\n- alert: HighLatency\n  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2\n\nShould I deploy this?",
    "trap": "Without the skill, the model may approve it since the PromQL expression itself is correct",
    "assertions": [
      {"id": "8.1", "text": "Identifies the missing for: duration as a problem"},
      {"id": "8.2", "text": "Explains that without for:, a single bad scrape triggers the alert (false positive)"},
      {"id": "8.3", "text": "Recommends adding for: 5m or similar duration"},
      {"id": "8.4", "text": "Distinguishes that binary alerts (service up/down) can use for: 0m, but non-binary alerts need a duration"}
    ]
  },
  {
    "id": 9,
    "name": "promql-comments-convention",
    "description": "Tests whether the model recommends documenting metrics with PromQL comments above declarations",
    "prompt": "I'm declaring Prometheus metrics in my Go service. Here's my pattern:\n\nvar httpRequestsTotal = promauto.NewCounterVec(\n    prometheus.CounterOpts{\n        Namespace: \"myapp\",\n        Subsystem: \"http\",\n        Name:      \"requests_total\",\n        Help:      \"Total number of HTTP requests.\",\n    },\n    []string{\"method\", \"path\", \"status\"},\n)\n\nHow can I make my metrics more discoverable for my team?",
    "trap": "Without the skill, the model may suggest external documentation or wiki pages, missing the PromQL-as-comments convention",
    "assertions": [
      {"id": "9.1", "text": "Recommends adding PromQL queries and alert rules as comments directly above the metric variable declaration"},
      {"id": "9.2", "text": "Shows example Dashboard: and Alert: comment lines above the metric var"},
      {"id": "9.3", "text": "Explains that this keeps PromQL queries reviewed in PRs alongside the metric"},
      {"id": "9.4", "text": "Mentions that queries stay in sync with metric changes (label renames, bucket changes)"},
      {"id": "9.5", "text": "Notes that new team members can understand the metric's purpose at a glance from the comments"}
    ]
  },
  {
    "id": 10,
    "name": "otelslog-bridge-setup",
    "description": "Tests log-trace correlation setup using otelslog bridge",
    "prompt": "I have a Go service with both slog logging and OpenTelemetry tracing. I want to correlate them so that when I see a log line in Grafana Loki, I can jump to the trace in Tempo. How do I connect them?",
    "trap": "Without the skill, the model may suggest manually extracting trace_id from span context and adding it as a slog attribute, missing the otelslog bridge",
    "assertions": [
      {"id": "10.1", "text": "Recommends using the otelslog bridge from go.opentelemetry.io/contrib/bridges/otelslog"},
      {"id": "10.2", "text": "Shows creating a handler with otelslog.NewHandler()"},
      {"id": "10.3", "text": "Shows setting it as default with slog.SetDefault()"},
      {"id": "10.4", "text": "Explains that trace_id and span_id are automatically injected into log records"},
      {"id": "10.5", "text": "Emphasizes using slog.*Context(ctx, ...) variants to enable the automatic injection"}
    ]
  },
  {
    "id": 11,
    "name": "exemplars-metric-trace-link",
    "description": "Tests metrics-to-traces correlation via Prometheus exemplars",
    "prompt": "I have a Prometheus histogram tracking HTTP request latency and OpenTelemetry tracing. When I see a P99 latency spike in Grafana, I want to jump directly to the offending trace. How do I link metrics to traces?",
    "trap": "Without the skill, the model may suggest using metric labels or manual correlation, missing the exemplar mechanism",
    "assertions": [
      {"id": "11.1", "text": "Recommends using Prometheus exemplars to link metrics to traces"},
      {"id": "11.2", "text": "Shows attaching trace_id as an exemplar when recording histogram observations"},
      {"id": "11.3", "text": "Explains that exemplars let you jump from a metric spike directly to the trace that caused it"}
    ]
  },
  {
    "id": 12,
    "name": "span-error-recording-both-calls",
    "description": "Tests that error recording on spans requires both RecordError() and SetStatus(Error)",
    "prompt": "In my Go service using OpenTelemetry, when an operation fails, I do:\n\nif err != nil {\n    span.RecordError(err)\n    return err\n}\n\nIs this the correct way to record errors on spans?",
    "trap": "Without the skill, the model may approve this since RecordError is called, missing that SetStatus must also be called",
    "assertions": [
      {"id": "12.1", "text": "Identifies that span.SetStatus(codes.Error, ...) is also needed alongside RecordError"},
      {"id": "12.2", "text": "Explains that RecordError adds an event but does not mark the span as failed"},
      {"id": "12.3", "text": "Shows the corrected pattern with both span.RecordError(err) and span.SetStatus(codes.Error, ...)"},
      {"id": "12.4", "text": "Notes that on success, no status needs to be set (Unset is fine)"}
    ]
  },
  {
    "id": 13,
    "name": "trace-context-lost-in-background-goroutine",
    "description": "Tests that a new goroutine spawned inside a span loses trace context unless ctx is explicitly propagated",
    "prompt": "My Go service sends a notification after processing an order. I want it to be non-blocking so I spawn a goroutine:\n\nfunc (s *OrderService) Process(ctx context.Context, order Order) error {\n    ctx, span := tracer.Start(ctx, \"process-order\")\n    defer span.End()\n\n    // business logic...\n\n    go func() {\n        s.notifier.Send(context.Background(), order.UserID, \"Order confirmed\")\n    }()\n\n    return nil\n}\n\nI have OpenTelemetry configured. Why doesn't the notification span appear as a child of process-order in my traces?",
    "trap": "Without the skill, the model may suggest the span setup is fine or focus on goroutine lifecycle, missing that context.Background() discards the parent trace context",
    "assertions": [
      {"id": "13.1", "text": "Identifies that context.Background() discards the trace context — the notification has no parent span"},
      {"id": "13.2", "text": "Explains that the trace context (trace_id, span_id) travels only inside the ctx variable"},
      {"id": "13.3", "text": "Shows the fix: capture ctx before the goroutine and pass it in (not context.Background())"},
      {"id": "13.4", "text": "Notes that the notification goroutine may outlive the parent span, but trace propagation still requires passing the original ctx"}
    ]
  },
  {
    "id": 14,
    "name": "db-query-context-propagation",
    "description": "Tests that database calls must use *Context variants for trace propagation",
    "prompt": "My Go service has OpenTelemetry tracing, but I notice my database queries don't appear as spans in traces. Here's my code:\n\nresult, err := db.Query(\"SELECT * FROM users WHERE id = $1\", userID)\n\nWhat am I missing?",
    "trap": "Without the skill, the model may suggest adding manual spans around the query, missing the fundamental issue of not passing context",
    "assertions": [
      {"id": "14.1", "text": "Identifies that db.Query should be db.QueryContext(ctx, ...) to propagate trace context"},
      {"id": "14.2", "text": "Explains that without context, the trace is broken — child spans cannot be created"},
      {"id": "14.3", "text": "Shows the corrected code using db.QueryContext(ctx, ...)"},
      {"id": "14.4", "text": "States that context is the vehicle that carries trace_id and span_id across boundaries"}
    ]
  },
  {
    "id": 15,
    "name": "trace-sampling-cost-control",
    "description": "Tests awareness of trace sampling strategies and cost implications",
    "prompt": "My Go microservice handles 50,000 requests per second. I enabled OpenTelemetry tracing at 100% sampling and my tracing backend costs tripled. How should I control tracing costs without losing visibility?",
    "trap": "Without the skill, the model may suggest only reducing sampling ratio, missing the nuances of ParentBased sampling and the specific recommendation to start at 10%",
    "assertions": [
      {"id": "15.1", "text": "Recommends TraceIDRatioBased sampling with a specific ratio (e.g., 0.1 for 10%)"},
      {"id": "15.2", "text": "Mentions ParentBased sampler to respect parent's sampling decision and keep traces complete across services"},
      {"id": "15.3", "text": "Discusses head-based vs tail-based sampling tradeoffs"},
      {"id": "15.4", "text": "Recommends avoiding large payloads as span attributes — log them instead and correlate via trace_id"},
      {"id": "15.5", "text": "Explains the cost factors: span volume, span attributes, storage and indexing"}
    ]
  },
  {
    "id": 16,
    "name": "where-to-add-spans",
    "description": "Tests knowledge of which operations must have spans in OpenTelemetry",
    "prompt": "I'm adding OpenTelemetry tracing to my existing Go service. Which functions should I add spans to? I don't want to instrument everything unnecessarily.",
    "trap": "Without the skill, the model may give vague guidance like 'important functions', missing the specific categories the skill defines",
    "assertions": [
      {"id": "16.1", "text": "Lists service methods (business logic layer) as requiring spans"},
      {"id": "16.2", "text": "Lists database queries as requiring spans"},
      {"id": "16.3", "text": "Lists external API calls as requiring spans"},
      {"id": "16.4", "text": "Lists message queue publish/consume operations as requiring spans"},
      {"id": "16.5", "text": "States any operation that takes measurable time or could fail should have a span"}
    ]
  },
  {
    "id": 17,
    "name": "four-golden-signals-alerting",
    "description": "Tests knowledge of the four golden signals for service alerting",
    "prompt": "I'm setting up alerting for my new Go API service from scratch. I have Prometheus metrics. What should I alert on? Give me the essential alerts.",
    "trap": "Without the skill, the model may list ad-hoc alerts, missing the structured four golden signals framework",
    "assertions": [
      {"id": "17.1", "text": "References the four golden signals (from Google SRE): latency, traffic, errors, saturation"},
      {"id": "17.2", "text": "Includes a latency alert (e.g., P99 > threshold)"},
      {"id": "17.3", "text": "Includes a traffic alert (e.g., zero requests detection)"},
      {"id": "17.4", "text": "Includes an error rate alert (e.g., 5xx ratio > threshold)"},
      {"id": "17.5", "text": "Includes a saturation alert (e.g., connection pool > 90%)"}
    ]
  },
  {
    "id": 18,
    "name": "awesome-prometheus-alerts-resource",
    "description": "Tests whether the model recommends awesome-prometheus-alerts as a starting point for infrastructure alerting",
    "prompt": "I'm adding PostgreSQL and Redis to my Go service and need alerting rules. Should I write Prometheus alert rules from scratch for each dependency?",
    "trap": "Without the skill, the model will likely suggest writing rules from scratch or generic examples",
    "assertions": [
      {"id": "18.1", "text": "Recommends awesome-prometheus-alerts (samber.github.io/awesome-prometheus-alerts/) as a starting point"},
      {"id": "18.2", "text": "Mentions it contains ~500 ready-to-use Prometheus alerting rules organized by technology"},
      {"id": "18.3", "text": "Suggests the workflow: browse by technology, copy rules, customize thresholds"},
      {"id": "18.4", "text": "Mentions verifying that exporters (postgres_exporter, redis_exporter) are deployed"}
    ]
  },
  {
    "id": 19,
    "name": "go-runtime-alerts",
    "description": "Tests knowledge of Go runtime-specific alerts using default Prometheus client metrics",
    "prompt": "My Go service occasionally becomes unresponsive. I suspect goroutine leaks or GC pressure. What Go runtime-specific Prometheus alerts should I set up?",
    "trap": "Without the skill, the model may suggest only basic goroutine count alerts, missing the full set of runtime alerts",
    "assertions": [
      {"id": "19.1", "text": "Suggests alerting on go_goroutines exceeding a threshold (e.g., > 1000) for goroutine leaks"},
      {"id": "19.2", "text": "Suggests alerting on go_gc_duration_seconds for GC pressure"},
      {"id": "19.3", "text": "Suggests alerting on go_memstats_alloc_bytes / go_memstats_sys_bytes for memory leaks"},
      {"id": "19.4", "text": "Suggests alerting on go_threads for high OS thread count"},
      {"id": "19.5", "text": "Uses for: duration on all non-binary alerts to avoid false positives"}
    ]
  },
  {
    "id": 20,
    "name": "alert-severity-levels",
    "description": "Tests correct severity classification and for: durations",
    "prompt": "I'm categorizing my Prometheus alerts. Should goroutine leaks be critical? What about service down? What for: durations should I use for each severity?",
    "trap": "Without the skill, the model may assign arbitrary severity levels, missing the two-level system with specific for: duration guidance",
    "assertions": [
      {"id": "20.1", "text": "Uses two severity levels: critical (page on-call) and warning (create ticket)"},
      {"id": "20.2", "text": "Critical alerts: for: 2m to 5m for fast detection"},
      {"id": "20.3", "text": "Warning alerts: for: 10m to 30m for confirmed trends"},
      {"id": "20.4", "text": "Classifies service down as critical with short for: duration"},
      {"id": "20.5", "text": "Classifies goroutine leak as warning (not critical)"},
      {"id": "20.6", "text": "States that for: 0m should never be used on non-binary alerts"}
    ]
  },
  {
    "id": 21,
    "name": "multi-window-burn-rate-slo",
    "description": "Tests knowledge of multi-window burn-rate SLO alerting over simple threshold alerts",
    "prompt": "My Go API has a 99.9% availability SLO. I currently alert when error rate exceeds 1%. But I get false positives from brief spikes and miss slow degradation. How should I improve my alerting?",
    "trap": "Without the skill, the model may suggest tuning the threshold or adding for: duration, missing the multi-window burn-rate approach",
    "assertions": [
      {"id": "21.1", "text": "Recommends multi-window burn-rate alerting instead of simple threshold alerts"},
      {"id": "21.2", "text": "Explains the concept of error budget and burn rate"},
      {"id": "21.3", "text": "Includes fast burn window (e.g., 5m + 1h, 14.4x burn rate) as critical/page"},
      {"id": "21.4", "text": "Includes slow burn window (e.g., 2h + 24h, 1x burn rate) as warning/ticket"},
      {"id": "21.5", "text": "Shows PromQL using AND of short and long windows to eliminate false positives from transient blips"}
    ]
  },
  {
    "id": 22,
    "name": "slog-migration-from-zap",
    "description": "Tests the incremental migration strategy from zap to slog using bridge handlers",
    "prompt": "My Go codebase has 500+ files using zap for logging. We want to migrate to slog. How do we do this without a big-bang rewrite?",
    "trap": "Without the skill, the model may suggest a gradual replacement without the bridge handler step, or suggest running both loggers in parallel",
    "assertions": [
      {"id": "22.1", "text": "Recommends a three-step migration: bridge, replace call sites, remove bridge"},
      {"id": "22.2", "text": "Step 1: Use samber/slog-zap bridge handler to route slog output through zap"},
      {"id": "22.3", "text": "Step 2: Gradually replace zap.L().Info(...) calls with slog.Info(...)"},
      {"id": "22.4", "text": "Step 3: Once fully migrated, replace the bridge with native slog JSONHandler and remove zap dependency"},
      {"id": "22.5", "text": "Mentions using parallel sub-agents for large codebase migration (assigning independent packages to each)"}
    ]
  },
  {
    "id": 23,
    "name": "slog-migration-from-logrus",
    "description": "Tests the bridge handler approach for logrus migration",
    "prompt": "We use logrus throughout our Go project and want to standardize on slog. Is there a way to migrate incrementally?",
    "trap": "Without the skill, the model may not know about samber/slog-logrus bridge",
    "assertions": [
      {"id": "23.1", "text": "Recommends using samber/slog-logrus bridge handler for incremental migration"},
      {"id": "23.2", "text": "Explains that slog is the standard library logger since Go 1.21"},
      {"id": "23.3", "text": "Shows the bridge step: route slog output through the existing logrus logger"},
      {"id": "23.4", "text": "Shows the replacement: logrus.WithField(\"key\", val).Info(\"msg\") becomes slog.Info(\"msg\", \"key\", val)"}
    ]
  },
  {
    "id": 24,
    "name": "debug-level-production-cost",
    "description": "Tests awareness of log level cost implications in production",
    "prompt": "I'm setting up slog for my Go production service. To maximize debugging ability, I'm considering setting the log level to Debug so we always have full visibility. What log level should I use?",
    "trap": "Without the skill, the model may suggest Debug with a generic caveat about volume, missing the specific cost analysis",
    "assertions": [
      {"id": "24.1", "text": "Recommends slog.LevelInfo for production, NOT Debug"},
      {"id": "24.2", "text": "Explains that Debug level can generate millions of log lines per minute in busy services"},
      {"id": "24.3", "text": "Mentions cost: CPU for serialization, I/O for disk/network, money for log ingestion/storage"},
      {"id": "24.4", "text": "Mentions Debug can inflate costs by 10-100x"},
      {"id": "24.5", "text": "Suggests samber/slog-sampling as an alternative to sample verbose logs rather than dropping entirely"}
    ]
  },
  {
    "id": 25,
    "name": "ip-address-pii-in-logs",
    "description": "Tests whether the model recognizes IP address as PII that requires care in logs",
    "prompt": "I'm adding observability to my Go authentication service. Here's my access log:\n\nslog.Info(\"login attempt\",\n    \"user_id\", req.UserID,\n    \"ip\", r.RemoteAddr,\n    \"user_agent\", r.UserAgent(),\n    \"success\", success,\n)\n\nThis looks useful for detecting brute-force attacks. Is there a compliance concern?",
    "trap": "Without the skill, the model validates this as good observability practice since IP addresses are legitimately useful for security. The model knows email/SSN are PII but commonly misses that IP address is also regulated PII under GDPR.",
    "assertions": [
      {"id": "25.1", "text": "Identifies IP address (r.RemoteAddr) as PII under GDPR and CCPA"},
      {"id": "25.2", "text": "Notes that user_agent can also be a fingerprinting vector that combines to uniquely identify a user"},
      {"id": "25.3", "text": "Recommends a legal/privacy review before logging IP addresses in European services"},
      {"id": "25.4", "text": "Suggests using hashed, truncated, or anonymized IPs if full IP is not required by the security use case"}
    ]
  },
  {
    "id": 26,
    "name": "gauge-should-not-have-total-suffix",
    "description": "Tests that Gauge metrics must NOT use _total suffix, and counters MUST",
    "prompt": "I'm declaring Prometheus metrics for my Go service. Review these declarations:\n\nvar activeConnections = promauto.NewGauge(prometheus.GaugeOpts{\n    Namespace: \"myapp\",\n    Name:      \"connections_active_total\",\n})\n\nvar requestsProcessed = promauto.NewCounter(prometheus.CounterOpts{\n    Namespace: \"myapp\",\n    Name:      \"requests_processed\",\n})\n\nAre these names correct?",
    "trap": "Without the skill, the model may only flag one issue or swap the corrections. Gauges must NOT use _total; counters MUST use _total. Using _total on a gauge implies it is cumulative when it is not.",
    "assertions": [
      {"id": "26.1", "text": "Flags connections_active_total as incorrect — Gauges must NOT use _total suffix because _total implies a counter (monotonically increasing cumulative value)"},
      {"id": "26.2", "text": "Recommends renaming the gauge to myapp_connections_active (no _total)"},
      {"id": "26.3", "text": "Flags requests_processed as incorrect — Counters MUST end with _total"},
      {"id": "26.4", "text": "Recommends renaming the counter to myapp_requests_processed_total"}
    ]
  },
  {
    "id": 27,
    "name": "pprof-exposed-on-public-mux",
    "description": "Tests that importing net/http/pprof registers on the default mux which may serve public traffic",
    "prompt": "I want to enable pprof for my Go production service. I see that just adding `import _ \"net/http/pprof\"` is enough. My service uses http.ListenAndServe(\":8080\", nil) for its API. Is this safe?",
    "trap": "Without the skill, the model may warn generically about security but miss the specific mechanism: the blank import registers handlers on http.DefaultServeMux which is the nil mux used by ListenAndServe — pprof is served publicly on port 8080",
    "assertions": [
      {"id": "27.1", "text": "Identifies that import _ \"net/http/pprof\" registers /debug/pprof/ routes on http.DefaultServeMux"},
      {"id": "27.2", "text": "Explains that http.ListenAndServe(\":8080\", nil) uses http.DefaultServeMux as its handler — pprof is publicly accessible on port 8080"},
      {"id": "27.3", "text": "Recommends serving pprof on a separate internal port (e.g., :6060) using a dedicated ServeMux or http.Server"},
      {"id": "27.4", "text": "Warns that pprof leaks sensitive runtime information (goroutine stacks, heap profiles, environment) and should never be publicly accessible"}
    ]
  },
  {
    "id": 28,
    "name": "continuous-profiling-env-toggle",
    "description": "Tests the recommendation to toggle continuous profiling via environment variables",
    "prompt": "I want to set up Pyroscope continuous profiling for my Go production service. Should I always have it enabled on all instances?",
    "trap": "Without the skill, the model may recommend always-on profiling on all instances",
    "assertions": [
      {"id": "28.1", "text": "Recommends toggling via environment variable (e.g., PROFILING_ENABLED)"},
      {"id": "28.2", "text": "Mentions ~2-5% CPU overhead for continuous profiling"},
      {"id": "28.3", "text": "Suggests starting with CPU + heap profiles only, adding mutex/block when needed"},
      {"id": "28.4", "text": "For large deployments, recommends enabling on a fraction of replicas (e.g., 1 in 10)"},
      {"id": "28.5", "text": "Shows code that checks the environment variable before starting Pyroscope"}
    ]
  },
  {
    "id": 29,
    "name": "rum-identity-key-email-trap",
    "description": "Tests that RUM distinct_id must be user_id, not email",
    "prompt": "I'm integrating PostHog server-side tracking in my Go service. For the DistinctId, I'm using the user's email since it's a natural identifier users know. Here's my code:\n\nposthogClient.Enqueue(posthog.Capture{\n    DistinctId: user.Email,\n    Event: \"order_completed\",\n})\n\nIs this correct?",
    "trap": "Without the skill, the model may accept email as a valid identifier since it's unique",
    "assertions": [
      {"id": "29.1", "text": "Rejects email as the DistinctId — must use user_id instead"},
      {"id": "29.2", "text": "Explains that email is mutable — users change it, splitting events into two users"},
      {"id": "29.3", "text": "Explains that email is PII, complicating GDPR/CCPA compliance"},
      {"id": "29.4", "text": "Notes that email leaks into third-party analytics systems as the identity key"},
      {"id": "29.5", "text": "Shows corrected code using user.ID (immutable internal identifier)"}
    ]
  },
  {
    "id": 30,
    "name": "gdpr-consent-before-tracking",
    "description": "Tests that GDPR consent must be checked before sending analytics events",
    "prompt": "I'm adding PostHog server-side event tracking to my Go e-commerce service for European users. Here's my order completion handler — it tracks the event after business logic:\n\nfunc (s *OrderService) Complete(ctx context.Context, order Order) error {\n    // ... business logic ...\n    posthogClient.Enqueue(posthog.Capture{\n        DistinctId: order.UserID,\n        Event: \"order_completed\",\n    })\n    return nil\n}\n\nAnything I'm missing for EU compliance?",
    "trap": "Without the skill, the model may suggest a privacy policy or cookie consent without the server-side consent check pattern",
    "assertions": [
      {"id": "30.1", "text": "Identifies that consent must be checked before sending the tracking event"},
      {"id": "30.2", "text": "Shows extracting consent from context and conditionally tracking"},
      {"id": "30.3", "text": "Mentions GDPR fines (up to 4% of global revenue) or CCPA penalties"},
      {"id": "30.4", "text": "References data minimization — only collect what you need"},
      {"id": "30.5", "text": "Mentions data subject rights endpoints (data export and deletion)"}
    ]
  },
  {
    "id": 31,
    "name": "data-subject-rights-endpoints",
    "description": "Tests that GDPR requires data deletion and export endpoints that propagate to all systems",
    "prompt": "A user of my Go SaaS service (with PostHog analytics and Segment CDP) requests deletion of all their data under GDPR. My current implementation just deletes from the database. Is that sufficient?",
    "trap": "Without the skill, the model may say database deletion is sufficient or only mention one additional system",
    "assertions": [
      {"id": "31.1", "text": "States that deletion must propagate to ALL systems holding user data, not just the database"},
      {"id": "31.2", "text": "Lists the analytics platform (PostHog) as needing deletion"},
      {"id": "31.3", "text": "Lists the CDP (Segment) as needing deletion"},
      {"id": "31.4", "text": "References GDPR Article 17 Right to Erasure"},
      {"id": "31.5", "text": "Also mentions the Right of Access (data export endpoint) as a requirement"}
    ]
  },
  {
    "id": 32,
    "name": "five-signals-completeness",
    "description": "Tests knowledge of the five observability signals and their distinct roles",
    "prompt": "I'm building a new Go microservice. What observability signals should I implement for production readiness?",
    "trap": "Without the skill, the model typically covers logs, metrics, traces but misses profiles and RUM",
    "assertions": [
      {"id": "32.1", "text": "Lists all five signals: logs, metrics, traces, profiles, and RUM"},
      {"id": "32.2", "text": "Associates logs with 'what happened' (discrete events, audit trails)"},
      {"id": "32.3", "text": "Associates metrics with 'how much/how fast' (aggregated measurements, alerting, SLOs)"},
      {"id": "32.4", "text": "Associates traces with 'where did time go' (request flow across services)"},
      {"id": "32.5", "text": "Associates profiles with 'why is it slow/using memory' (CPU hotspots, memory leaks)"},
      {"id": "32.6", "text": "Associates RUM with 'how do users experience it' (product analytics, funnels)"}
    ]
  },
  {
    "id": 33,
    "name": "definition-of-done-observability",
    "description": "Tests the observability definition of done checklist before shipping a feature",
    "prompt": "I'm about to ship a new payment processing feature in my Go service. My code works, tests pass, and it's been code-reviewed. Am I ready to deploy?",
    "trap": "Without the skill, the model may say yes or mention generic deployment checks, missing the observability-specific definition of done",
    "assertions": [
      {"id": "33.1", "text": "States that a feature is not production-ready until it is observable"},
      {"id": "33.2", "text": "Checks for metric declarations (counters, histograms, gauges) with PromQL comments"},
      {"id": "33.3", "text": "Checks for proper structured logging with slog and context variants"},
      {"id": "33.4", "text": "Checks for OpenTelemetry spans on service methods, DB queries, and external calls"},
      {"id": "33.5", "text": "Checks for dashboards and alerts being wired up"},
      {"id": "33.6", "text": "Checks that errors are either logged OR returned, never both"}
    ]
  },
  {
    "id": 34,
    "name": "grafana-dashboard-ids",
    "description": "Tests knowledge of specific Grafana dashboard IDs for Go runtime monitoring",
    "prompt": "I want to monitor my Go service's runtime metrics (goroutines, heap, GC) in Grafana. Are there prebuilt dashboards I can use?",
    "trap": "Without the skill, the model will likely suggest building custom dashboards from scratch",
    "assertions": [
      {"id": "34.1", "text": "Recommends specific Grafana dashboard IDs (21221, 6671, or 10826)"},
      {"id": "34.2", "text": "Mentions dashboard 21221 for host + runtime combined view (or similar description)"},
      {"id": "34.3", "text": "Explains that these dashboards use default Go collector metrics from the Prometheus client library"},
      {"id": "34.4", "text": "Shows how to import: Dashboards > New > Import, enter the dashboard ID"}
    ]
  },
  {
    "id": 35,
    "name": "slog-error-as-attr-not-positional",
    "description": "Tests whether the model uses slog.Any/slog.Attr correctly for error values vs positional args",
    "prompt": "I'm logging errors in my Go service using slog. Review this code:\n\nif err != nil {\n    slog.ErrorContext(ctx, \"payment failed\", err, \"order_id\", orderID)\n    return err\n}\n\nDoes this look right?",
    "trap": "Without the skill, the model may miss that passing err directly as a positional argument (not as a named key-value pair) is incorrect. slog expects alternating key-value pairs; err as a positional arg becomes the key with its string representation, losing structured error metadata.",
    "assertions": [
      {"id": "35.1", "text": "Identifies that err is passed as a positional argument without a key name, which is incorrect for slog"},
      {"id": "35.2", "text": "Explains that slog expects alternating key-value pairs — passing err without a key makes slog treat it as a mismatched argument or key"},
      {"id": "35.3", "text": "Shows the corrected form using a named key: slog.ErrorContext(ctx, \"payment failed\", \"error\", err, \"order_id\", orderID)"},
      {"id": "35.4", "text": "Notes that the correct form preserves the error's structured information (message, type) in the log record"}
    ]
  },
  {
    "id": 36,
    "name": "slog-ecosystem-handlers",
    "description": "Tests awareness of the slog handler ecosystem beyond stdlib",
    "prompt": "I need my Go service logs to go to multiple destinations: JSON to stdout, errors to Sentry, and all logs to Datadog. Can slog do this?",
    "trap": "Without the skill, the model may suggest writing custom handlers from scratch",
    "assertions": [
      {"id": "36.1", "text": "Recommends stdlib slog.NewMultiHandler for simple fan-out on Go 1.26+, or samber/slog-multi when stdlib composition is insufficient"},
      {"id": "36.2", "text": "Mentions samber/slog-sentry for sending errors to Sentry"},
      {"id": "36.3", "text": "Mentions samber/slog-datadog for sending logs to Datadog"},
      {"id": "36.4", "text": "Explains that slog supports pluggable handlers"},
      {"id": "36.5", "text": "References the slog ecosystem (go.dev/wiki/Resources-for-slog or similar)"}
    ]
  },
  {
    "id": 37,
    "name": "parallel-observability-audit",
    "description": "Tests the recommendation to use parallel sub-agents for observability audits in large codebases",
    "prompt": "I need to audit observability across a Go monolith with 200+ packages. How should I approach this efficiently?",
    "trap": "Without the skill, the model may suggest a linear, package-by-package approach",
    "assertions": [
      {"id": "37.1", "text": "Recommends using up to 5 parallel sub-agents (via the Agent tool)"},
      {"id": "37.2", "text": "Assigns one sub-agent per signal: metrics, logging, tracing, profiling, RUM"},
      {"id": "37.3", "text": "Sub-agent for metrics: verify metric declarations and PromQL comments"},
      {"id": "37.4", "text": "Sub-agent for logging: check structured logging, PII in logs, error logging patterns"},
      {"id": "37.5", "text": "Sub-agent for tracing: verify span creation in service methods, DB calls, API calls"}
    ]
  },
  {
    "id": 38,
    "name": "predict-linear-for-saturation",
    "description": "Tests knowledge of predict_linear PromQL function for anticipating resource exhaustion",
    "prompt": "My Go service's database connection pool occasionally hits the maximum and requests start failing. I want to be alerted BEFORE it reaches the limit, not after. How can I set up predictive alerting?",
    "trap": "Without the skill, the model may suggest a simple threshold alert at 90%, missing the predict_linear approach",
    "assertions": [
      {"id": "38.1", "text": "Recommends using predict_linear() PromQL function to extrapolate trends"},
      {"id": "38.2", "text": "Shows an expression like: predict_linear(db_connections_active[15m], 600) > db_connections_max"},
      {"id": "38.3", "text": "Explains that predict_linear extrapolates from recent trend to predict future value"},
      {"id": "38.4", "text": "Also suggests a threshold alert (e.g., > 90%) as a complementary alert"}
    ]
  },
  {
    "id": 39,
    "name": "self-hosted-rum-gdpr",
    "description": "Tests the recommendation of self-hosted analytics for GDPR compliance simplification",
    "prompt": "We're building a Go SaaS product targeting EU customers. We need product analytics (funnels, user behavior) but our legal team is concerned about sending user data to US-based analytics vendors. What should we do?",
    "trap": "Without the skill, the model may suggest DPAs and SCCs with SaaS vendors, missing the self-hosted option",
    "assertions": [
      {"id": "39.1", "text": "Recommends self-hosted analytics (PostHog or Matomo) for EU data residency"},
      {"id": "39.2", "text": "Explains that self-hosting eliminates cross-border data transfer concerns"},
      {"id": "39.3", "text": "Compares self-hosted vs SaaS tradeoffs (data residency, cost, maintenance, features)"},
      {"id": "39.4", "text": "Mentions that PostHog can be self-hosted to keep data in your own infrastructure"}
    ]
  },
  {
    "id": 40,
    "name": "oops-structured-errors-tracing",
    "description": "Tests awareness of samber/oops for structured errors in tracing context",
    "prompt": "My Go service records errors on OpenTelemetry spans using span.RecordError(err). But the error messages are generic like 'connection refused' with no stack trace or request context. How can I get richer error information in my traces?",
    "trap": "Without the skill, the model may suggest manually adding attributes to spans or using fmt.Errorf with more context",
    "assertions": [
      {"id": "40.1", "text": "Recommends samber/oops for structured errors with stack traces"},
      {"id": "40.2", "text": "Shows using oops to wrap errors with domain (.In()), error code (.Code()), and structured attributes (.With())"},
      {"id": "40.3", "text": "Explains that oops errors carry stack trace, structured context, and work with span.RecordError()"},
      {"id": "40.4", "text": "Mentions compatibility with errors.Is/errors.As and slog"}
    ]
  }
]

Alerting

See metrics.md for multi-window burn-rate SLO alerting and PromQL patterns for application metrics.

The Four Golden Signals

Alert on what matters to users. Google's SRE book defines four golden signals — every Go service SHOULD have alerts covering all four:

Signal	What it measures	Example metric	Alert trigger
Latency	Time to serve a request	`http_request_duration_seconds` (Histogram)	P99 > 2s for 5 minutes
Traffic	Demand on the system	`http_requests_total` (Counter)	Zero requests for 10 minutes
Errors	Rate of failed requests	`http_requests_total{status=~"5.."}` (Counter)	Error ratio > 1% for 5 minutes
Saturation	How full the system is	`db_connections_active / db_connections_max`	Pool > 90% saturated for 5 minutes
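
For the saturation row, one possible source of the ratio is the connection pool statistics that Go's database/sql package already tracks. The sketch below is only an illustration: the helper names saturation and poolSaturation are hypothetical, and the mapping of InUse and MaxOpenConnections onto the example metrics db_connections_active and db_connections_max is an assumption, not something the table prescribes.

```go
package main

import (
	"database/sql"
	"fmt"
)

// saturation returns inUse/max as a ratio.
// database/sql reports a max of 0 for an unlimited pool, so that case yields 0.
func saturation(inUse, max int) float64 {
	if max <= 0 {
		return 0
	}
	return float64(inUse) / float64(max)
}

// poolSaturation derives the ratio from database/sql pool statistics.
// Assumption: InUse stands in for db_connections_active and
// MaxOpenConnections for db_connections_max.
func poolSaturation(stats sql.DBStats) float64 {
	return saturation(stats.InUse, stats.MaxOpenConnections)
}

func main() {
	// In a real service the stats would come from db.Stats();
	// a literal value keeps this sketch runnable without a database.
	stats := sql.DBStats{MaxOpenConnections: 20, InUse: 19}
	ratio := poolSaturation(stats)
	fmt.Printf("pool saturation: %.2f (above 0.90: %t)\n", ratio, ratio > 0.9)
}
```

A service would typically publish the two values as gauges and leave the division and the 5-minute hold to the alert rule, so the threshold can change without a redeploy.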

Awesome Prometheus Alerts

awesome-prometheus-alerts is a curated collection of ~500 ready-to-use Prometheus alerting rules. This collection serves as a starting point for infrastructure and dependency alerting.

Category	Rules	Covers
Basic Resource Monitoring	~107	Host metrics, Docker containers, hardware
Databases and Brokers	~233	PostgreSQL, MySQL, Redis, MongoDB, Kafka, RabbitMQ, etc.
Reverse Proxies and Load Balancers	~45	Nginx, Apache, HAProxy, Traefik
Runtimes	~4	PHP-FPM, JVM, Sidekiq
Orchestrators	~74	Kubernetes, Nomad, Consul, ArgoCD
Network, Security, and Storage	~40	Ceph, MinIO, SSL/TLS, DNS

How to Use It

The rules are organized by technology. Each rule is a ready-to-use Prometheus alerting rule in YAML format with customizable threshold values and for: durations.

1. Find by technology: locate the relevant database, message broker, or infrastructure component
2. Adapt the YAML rule: adjust threshold values (> 0.01, > 100, etc.) and the for: duration to match your SLOs and traffic patterns
3. Place in Prometheus config: add to the prometheus/rules/ directory

Integration Example

Prometheus loads alerting rules from YAML files referenced in its config. After copying rules from awesome-prometheus-alerts, place them in your rules directory:

# prometheus/rules/postgresql.yml
groups:
  - name: postgresql
    rules:
      # From awesome-prometheus-alerts — PostgreSQL section
      - alert: PostgresqlDown
        expr: pg_up == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL down (instance {{ $labels.instance }})"
          description: "PostgreSQL instance is down.\n  VALUE = {{ $value }}"

      - alert: PostgresqlTooManyConnections
        expr: sum by (instance, datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) > pg_settings_max_connections * 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL too many connections (> 80%) (instance {{ $labels.instance }})"
          description: "PostgreSQL has {{ $value }} connections on {{ $labels.datname }}."

# prometheus.yml
rule_files:
  - "rules/*.yml"

Workflow for New Dependencies

When adding a new infrastructure dependency (database, cache, message broker, reverse proxy) to a Go service:

1. Browse awesome-prometheus-alerts for the technology you are adding (rules are organized by technology)
2. Copy the relevant alert rules and adapt the thresholds to your environment
3. Verify the exporter is deployed (e.g., postgres_exporter, redis_exporter), because the alerts depend on metrics from these exporters
4. Add the rules to your prometheus/rules/ directory

Go Runtime Alerts

The Prometheus Go client automatically exposes runtime metrics. Alert on these to catch resource leaks and GC pressure before they impact users.

# prometheus/rules/go-runtime.yml
groups:
  - name: go-runtime
    rules:
      # Goroutine leak — count growing steadily indicates a leak
      # Diagnose: GET /debug/pprof/goroutine?debug=1 to see goroutine stack traces
      - alert: GoroutineLeak
        expr: go_goroutines > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High goroutine count (instance {{ $labels.instance }})"
          description: "Goroutine count is {{ $value }}, possible leak."

      # GC taking too long: a max GC pause (quantile="1") above 100ms degrades tail latency
      - alert: HighGCDuration
        expr: go_gc_duration_seconds{quantile="1"} > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High GC duration (instance {{ $labels.instance }})"
          description: "Max GC pause is {{ $value }}s. Check heap allocations."

      # Heap growing unbounded — likely a memory leak
      - alert: HighMemoryUsage
        expr: go_memstats_alloc_bytes / go_memstats_sys_bytes > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage (instance {{ $labels.instance }})"
          description: "Allocated heap is {{ $value | humanizePercentage }} of the memory the Go runtime obtained from the OS."

      # Too many threads — usually caused by blocking syscalls or cgo
      - alert: HighThreadCount
        expr: go_threads > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High OS thread count (instance {{ $labels.instance }})"
          description: "Thread count is {{ $value }}. Check for blocking syscalls."
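
To make the rules above easier to reason about, here is a minimal sketch that reads comparable values straight from the Go runtime. It assumes that go_goroutines, go_memstats_alloc_bytes, go_memstats_sys_bytes, and go_threads roughly correspond to the runtime calls shown; the exact source depends on the client library version. The names runtimeSnapshot, snapshot, and allocRatio are hypothetical. The GC value is only the most recent pause, not the summary quantiles that go_gc_duration_seconds exposes.

```go
package main

import (
	"fmt"
	"runtime"
)

// runtimeSnapshot holds raw runtime values comparable to the alert inputs.
type runtimeSnapshot struct {
	Goroutines    int    // compare with go_goroutines
	AllocBytes    uint64 // compare with go_memstats_alloc_bytes
	SysBytes      uint64 // compare with go_memstats_sys_bytes
	Threads       int    // compare with go_threads
	LastGCPauseNs uint64 // most recent stop-the-world pause, 0 before the first GC
}

// snapshot collects the values once; ReadMemStats briefly stops the world,
// so it should not be called in a hot path.
func snapshot() runtimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// Passing nil only asks for the number of thread creation records.
	threads, _ := runtime.ThreadCreateProfile(nil)

	var lastPause uint64
	if m.NumGC > 0 {
		// PauseNs is a circular buffer; this index is the latest entry.
		lastPause = m.PauseNs[(m.NumGC+255)%256]
	}
	return runtimeSnapshot{
		Goroutines:    runtime.NumGoroutine(),
		AllocBytes:    m.Alloc,
		SysBytes:      m.Sys,
		Threads:       threads,
		LastGCPauseNs: lastPause,
	}
}

// allocRatio mirrors the HighMemoryUsage expression: alloc / sys.
func allocRatio(alloc, sys uint64) float64 {
	if sys == 0 {
		return 0
	}
	return float64(alloc) / float64(sys)
}

func main() {
	s := snapshot()
	fmt.Printf("goroutines=%d threads=%d alloc/sys=%.3f last_gc_pause_ns=%d\n",
		s.Goroutines, s.Threads, allocRatio(s.AllocBytes, s.SysBytes), s.LastGCPauseNs)
}
```

Running this locally while reproducing a suspected leak gives a quick sanity check on whether thresholds such as 1000 goroutines or 500 threads are realistic for the service.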

Alert Severity Levels

Use two severity levels to separate "wake someone up" from "look at it tomorrow":

Severity	Action	`for:` duration	Example
critical	Page on-call	2-5 minutes	Service down, error rate > 5%, data loss risk
warning	Create ticket	10-30 minutes	P99 latency high, connection pool > 80%, goroutine leak

The for: duration controls how long a condition must be true before the alert fires. Short durations catch fast incidents but risk false positives from transient spikes. Long durations reduce noise but delay response.

Guidelines:

Critical alerts: for: 2m to for: 5m — fast detection, wake someone up
Warning alerts: for: 10m to for: 30m — confirmed trend, create a ticket
NEVER set for: 0m on non-binary alerts — one bad scrape triggers a false page
Binary alerts (service up/down) can use for: 0m or for: 1m
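
The effect of the for: duration can be sketched as a small simulation. This is only a model of the pending-then-firing behavior described above, not Prometheus code; firesAt and its parameters are hypothetical names, and the duration is expressed in evaluation intervals.

```go
package main

import "fmt"

// firesAt returns the index of the evaluation at which the alert starts
// firing, or -1 if it never fires. condition[i] is the result of the alert
// expression at evaluation i; forEvals is the "for:" duration expressed in
// evaluation intervals (for: 5m with a 1m interval is 5).
func firesAt(condition []bool, forEvals int) int {
	activeSince := -1 // index where the condition last became true
	for i, ok := range condition {
		if !ok {
			activeSince = -1 // condition cleared: the pending state resets
			continue
		}
		if activeSince < 0 {
			activeSince = i
		}
		if i-activeSince >= forEvals {
			return i
		}
	}
	return -1
}

func main() {
	spike := []bool{false, true, false, false, false, false, false, false}
	sustained := []bool{false, true, true, true, true, true, true, true}

	fmt.Println(firesAt(spike, 0))     // for: 0m fires on the single bad evaluation
	fmt.Println(firesAt(spike, 5))     // for: 5m ignores the transient spike
	fmt.Println(firesAt(sustained, 5)) // fires once the condition has held for 5 intervals
}
```

The same spike that pages someone with for: 0m never leaves the pending state with for: 5m, at the cost of a five-interval delay on the sustained incident.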

Common Mistakes

# Bad -- irate() is too volatile for alerts, reacts to a single scrape interval
# A brief spike or a single slow request triggers the alert
- alert: HighErrorRate
  expr: irate(http_requests_total{status=~"5.."}[5m]) > 0.01

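# Good -- rate() smooths over the full window, reducing false positives.
# sum() is required on both sides: without it each 5xx series is divided by itself and the ratio is always 1.
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m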

# Bad -- no "for:" duration, fires on a single bad scrape
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2

# Good -- must be true for 5 minutes to fire
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 5m

# Bad -- raw gauge threshold with no "for:" duration (flaps on every transient spike)
- alert: HighQueueDepth
  expr: myapp_queue_messages_pending > 1000

# Good -- the threshold must hold for 10 minutes, so only a sustained backlog fires
- alert: HighQueueDepth
  expr: myapp_queue_messages_pending > 1000
  for: 10m

Grafana Dashboards for Go Services

These community Grafana dashboards visualize Go runtime performance out of the box. They display the metrics automatically exposed by github.com/prometheus/client_golang — no custom instrumentation needed.

Recommended Dashboards

Dashboard	ID	What it shows
Go Host & Runtime Metrics	21221	Host metrics + Go runtime (goroutines, heap, GC, threads) in one view
Go Processes	6671	Multi-process comparison — CPU, memory, goroutines, GC across all Go services
Go Metrics	10826	Focused Go runtime view — memory breakdown, GC pauses, allocations, goroutines

How to Install

Dashboards are imported via Dashboards > New > Import in Grafana using the dashboard ID (e.g., 21221), then selecting the Prometheus data source.

These dashboards require the default Go collector metrics (go_goroutines, go_memstats_*, go_gc_duration_seconds, process_*). If you use the Prometheus client library with default collectors, everything works out of the box.

When to Use Each

21221 (Host & Runtime) — day-to-day monitoring of a single Go service alongside its host. Best as the default Go dashboard.
6671 (Go Processes) — comparing multiple Go services or replicas side by side. Useful during deployments to spot regressions across instances.
10826 (Go Metrics) — deep-diving into memory and GC behavior of a single service. Best for investigating performance issues.

Structured Logging with `slog`

→ See samber/cc-skills-golang@golang-error-handling skill for the single handling rule.

Why Structured Logging

Structured logs emit key-value pairs instead of freeform strings. Log management systems (Datadog, Grafana Loki, CloudWatch) can index, filter, and aggregate structured fields — something impossible with log.Printf output.

// ✗ Bad — freeform string, impossible to filter by user_id
log.Printf("ERROR: failed to create user %s: %v", userID, err)

// ✓ Good — structured key-value pairs, machine-parseable
slog.Error("user creation failed",
    "user_id", userID,
    "error", err,
)
// JSON output: {"time":"2025-01-15T10:30:00Z","level":"ERROR","msg":"user creation failed","user_id":"u-123","error":"connection refused"}

Handler Setup

// Production MUST use JSON, because plain-text multiline logs (e.g. stack traces) would be split into separate records by log collectors.
// Development can use human-readable text.
// APP_ENV is an illustrative variable name; use whatever your deployment already sets.
var handler slog.Handler
if os.Getenv("APP_ENV") == "production" {
    handler = slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelInfo})
} else {
    handler = slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug})
}

slog.SetDefault(slog.New(handler))

Log Levels

slog.Debug("cache lookup", "key", cacheKey, "hit", false)
slog.Info("order created", "order_id", orderID, "total", amount)
slog.Warn("rate limit approaching", "current_usage", 0.92, "limit", 1000)
slog.Error("payment failed", "order_id", orderID, "error", err)

Rule of thumb: if you're unsure between Warn and Error, ask "did the operation succeed?" If yes (even with degradation), use Warn. If no, use Error.

Cost of Logging

Logging is not free. Each log line costs CPU (serialization), I/O (disk/network), and money (log ingestion/storage in your aggregation platform). The cost scales with volume, which is directly controlled by log level.

Debug level in production can generate millions of log lines per minute in a busy service, overwhelming your log pipeline and inflating costs by 10-100x
Info level is the typical production default — it provides enough visibility without excessive volume
Debug level SHOULD be disabled in production — use slog.LevelInfo in production and slog.LevelDebug only in development or when actively debugging a specific issue
For high-throughput services, consider samber/slog-sampling to sample verbose logs (e.g., emit 1 in 100 Debug logs) rather than dropping them entirely
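
One way to keep Info as the production default and still enable Debug while investigating an issue is a slog.LevelVar, which can be changed at runtime without rebuilding the logger. The sketch below also guards an expensive Debug argument with Enabled. The demo function and the trigger for changing the level are assumptions for illustration.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// demo logs one Debug and one Info record before and after raising verbosity
// at runtime, and returns how many lines each phase produced.
func demo() (atInfo, atDebug int) {
	var buf bytes.Buffer
	var level slog.LevelVar // zero value is slog.LevelInfo
	logger := slog.New(slog.NewJSONHandler(&buf, &slog.HandlerOptions{Level: &level}))

	emit := func() int {
		buf.Reset()
		// Guard expensive argument construction: skip the work when Debug is off.
		if logger.Enabled(context.Background(), slog.LevelDebug) {
			logger.Debug("cache state", "dump", strings.Repeat("x", 1024))
		}
		logger.Info("order created", "order_id", "o-1")
		return strings.Count(buf.String(), "\n")
	}

	atInfo = emit()
	level.Set(slog.LevelDebug) // e.g. triggered by a signal or an admin endpoint
	atDebug = emit()
	return atInfo, atDebug
}

func main() {
	atInfo, atDebug := demo()
	fmt.Println("lines at Info:", atInfo, "lines at Debug:", atDebug)
}
```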

Logging with Context

MUST use the *Context variants to correlate logs with the current trace. When an OpenTelemetry bridge is configured, trace_id and span_id are automatically injected into log records.

// ✗ Bad — no trace correlation
slog.Error("query failed", "error", err)

// ✓ Good — trace_id/span_id attached automatically when OTel bridge is active
slog.ErrorContext(ctx, "query failed", "error", err)
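
To show why the ctx argument matters, here is a minimal standard-library sketch of a handler that copies a trace ID from the context into each record. It only imitates what a bridge does: traceIDKey and traceHandler are hypothetical, and a real OpenTelemetry bridge would read the span context rather than a plain string value.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// traceIDKey is a hypothetical context key used only for this illustration.
type traceIDKey struct{}

// traceHandler wraps another slog.Handler and copies the trace ID found in
// ctx into every record.
type traceHandler struct{ slog.Handler }

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(traceIDKey{}).(string); ok {
		r.AddAttrs(slog.String("trace_id", id))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup must re-wrap, otherwise logger.With(...) would
// return the inner handler and lose the trace injection.
func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{Handler: h.Handler.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{Handler: h.Handler.WithGroup(name)}
}

// demo logs once without and once with a context, and returns both records.
func demo() (plain, correlated map[string]any) {
	var buf bytes.Buffer
	logger := slog.New(traceHandler{Handler: slog.NewJSONHandler(&buf, nil)})
	ctx := context.WithValue(context.Background(), traceIDKey{}, "4bf92f3577b34da6")

	logger.Error("query failed")             // no ctx: nothing to correlate
	logger.ErrorContext(ctx, "query failed") // ctx carries the trace ID

	dec := json.NewDecoder(&buf)
	if err := dec.Decode(&plain); err != nil {
		panic(err)
	}
	if err := dec.Decode(&correlated); err != nil {
		panic(err)
	}
	return plain, correlated
}

func main() {
	plain, correlated := demo()
	fmt.Println(plain["trace_id"], correlated["trace_id"])
}
```

Only the record logged through ErrorContext carries trace_id, because the non-Context call hands the handler an empty background context.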

Adding Request-Scoped Attributes

Use slog.With() to create a child logger that includes attributes on every log line. Middleware can inject request-scoped fields so all downstream logs carry the same context.

func LoggingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        logger := slog.With(
            "request_id", r.Header.Get("X-Request-ID"),
            "method", r.Method,
            "path", r.URL.Path,
        )
        // Store enriched logger in context for downstream use
        ctx := WithLogger(r.Context(), logger)
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

Log Sinks and the `slog` Ecosystem

slog supports pluggable handlers. The Go community provides handlers for most log backends:

Standard library:

slog.JSONHandler — JSON to stdout/stderr
slog.TextHandler — human-readable key=value

Log record handling:

samber/slog-multi — fan-out to multiple handlers, routing, failover
samber/slog-sampling — sample high-volume logs to reduce cost
samber/slog-formatter — format/transform log attributes

HTTP middleware:

samber/slog-http — HTTP server middleware (net/http, chi, fiber, echo, gin)
samber/slog-gin — Gin framework middleware
samber/slog-echo — Echo framework middleware
samber/slog-fiber — Fiber framework middleware
samber/slog-chi — Chi router middleware

Third-party log sinks (see go.dev/wiki/Resources-for-slog):

lmittmann/tint — colorized terminal output
samber/slog-datadog — send logs to Datadog
samber/slog-sentry — send errors to Sentry
samber/slog-loki — send logs to Grafana Loki
samber/slog-nats — send logs to NATS
samber/slog-syslog — send logs to syslog
samber/slog-fluentd — send logs to Fluentd
samber/slog-logrus — bridge to Logrus
samber/slog-zap — bridge to Zap
samber/slog-zerolog — bridge to Zerolog
samber/slog-slack — send critical logs to Slack

Migrating from zap / logrus / zerolog

log/slog is the standard library logger since Go 1.21. If the project uses zap, logrus, or zerolog, migrate to slog — it has a stable API, broad ecosystem support, and eliminates an unnecessary dependency.

Step 1: Bridge — route slog output through the existing logger so you can migrate call sites incrementally without changing log output:

// Example: bridge slog → zap (same pattern for logrus/zerolog)
import slogzap "github.com/samber/slog-zap/v2"

zapLogger, _ := zap.NewProduction()
slog.SetDefault(slog.New(
    slogzap.Option{Level: slog.LevelInfo, Logger: zapLogger}.NewZapHandler(),
))

Available bridges: samber/slog-zap, samber/slog-logrus, samber/slog-zerolog

Step 2: Replace call sites — change all logger calls to slog:

// zap → slog
// Before: zap.L().Info("order created", zap.String("order_id", id))
// After:
slog.Info("order created", "order_id", id)

// logrus → slog
// Before: logrus.WithField("order_id", id).Info("order created")
// After:
slog.Info("order created", "order_id", id)

// zerolog → slog
// Before: log.Info().Str("order_id", id).Msg("order created")
// After:
slog.Info("order created", "order_id", id)

Step 3: Remove the bridge — once all call sites are migrated, replace the bridge handler with a native slog handler and remove the old logger dependency:

slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
    Level: slog.LevelInfo,
})))

Common Logging Mistakes

// ✗ Bad — errors MUST be either logged OR returned, NEVER both (single handling rule violation)
if err != nil {
    slog.Error("query failed", "error", err)
    return fmt.Errorf("query: %w", err) // error gets logged twice up the chain
}

// ✓ Good — return with context, log at the top level
if err != nil {
    return fmt.Errorf("querying users: %w", err)
}

// ✗ Bad — NEVER log PII (emails, SSNs, passwords, tokens)
slog.Info("user logged in", "email", user.Email, "ssn", user.SSN)

// ✓ Good — log identifiers, not sensitive data
slog.Info("user logged in", "user_id", user.ID)
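
A complementary safeguard against accidental PII logging is to give sensitive types a LogValue method, so they redact themselves even when someone passes them to a logger. This is a sketch only: Email and User are hypothetical types, not part of the skill.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// Email is a hypothetical domain type for a sensitive value.
type Email string

// LogValue implements slog.LogValuer: handlers log the returned value
// instead of the address itself.
func (Email) LogValue() slog.Value { return slog.StringValue("[REDACTED]") }

// User is a hypothetical domain type.
type User struct {
	ID    string
	Email Email
}

// demo logs a user's email by mistake and returns the decoded record.
func demo() map[string]any {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	u := User{ID: "u-123", Email: "alice@example.com"}

	logger.Info("user logged in", "user_id", u.ID, "email", u.Email)

	var rec map[string]any
	if err := json.Unmarshal(buf.Bytes(), &rec); err != nil {
		panic(err)
	}
	return rec
}

func main() {
	rec := demo()
	fmt.Println(rec["user_id"], rec["email"])
}
```

This only protects values passed as their own attribute. Logging the whole User struct hands it to the JSON encoder, which does not call LogValue on nested fields, so logging identifiers only remains the primary rule.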

Metrics with Prometheus

→ See samber/cc-skills-golang@golang-troubleshooting skill for using metrics to diagnose production issues.

→ See samber/cc-skills@promql-cli skill for executing and testing PromQL queries via CLI.

When using the Prometheus client library, refer to the library's official documentation for up-to-date API signatures and examples.

Metric Types

Type	What it measures	Example	When to use
Counter	Cumulative total (only goes up)	Total requests, total errors	Counting events
Gauge	Current value (goes up and down)	In-flight requests, queue size, temperature	Current state
Histogram	Distribution of values in configurable buckets	Request duration, response size	Latency, sizes — when you need percentiles
Summary	Client-computed quantiles	Request duration (pre-computed P50, P99)	Rarely — prefer Histogram

Histogram vs Summary

This is one of the most common sources of confusion. Both measure distributions, but they work very differently.

Histogram stores observations in configurable buckets (e.g., 5ms, 10ms, 25ms, 50ms, 100ms, ...). Percentiles are computed at query time by Prometheus using histogram_quantile(). Because the raw bucket counts are stored server-side, histograms can be aggregated across multiple instances — essential for services running multiple replicas.

Summary computes quantiles (P50, P99, etc.) on the client side before sending them to Prometheus. This means the quantile values are pre-baked and cannot be aggregated — if you have 10 instances, you cannot combine their P99 values into a meaningful overall P99.

Recommendation: Histogram SHOULD be preferred over Summary in almost all cases. Summary is only useful when you need exact quantiles for a single instance and don't care about cross-instance aggregation.
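
The aggregation argument can be checked with a small worked example. The sketch below sums cumulative bucket counts from two instances and estimates P99 by linear interpolation inside the matching bucket, in the spirit of histogram_quantile(). It is a simplified model (no +Inf bucket, identical bounds assumed), and bucket, merge, and quantile are hypothetical names.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket: count observations were <= le.
type bucket struct {
	le    float64
	count float64
}

// merge sums the cumulative counts of two histograms with identical bounds.
func merge(a, b []bucket) []bucket {
	out := make([]bucket, len(a))
	for i := range a {
		out[i] = bucket{le: a[i].le, count: a[i].count + b[i].count}
	}
	return out
}

// quantile estimates the q-quantile by interpolating linearly inside the
// first bucket whose cumulative count reaches the target rank.
func quantile(q float64, bs []bucket) float64 {
	if len(bs) == 0 {
		return 0
	}
	rank := q * bs[len(bs)-1].count
	lower, prev := 0.0, 0.0
	for _, b := range bs {
		if b.count >= rank {
			if b.count == prev {
				return b.le
			}
			return lower + (b.le-lower)*(rank-prev)/(b.count-prev)
		}
		lower, prev = b.le, b.count
	}
	return bs[len(bs)-1].le
}

func main() {
	// Instance A: 1000 fast requests. Instance B: 10 slow requests.
	a := []bucket{{0.1, 990}, {0.5, 1000}, {1, 1000}, {5, 1000}}
	b := []bucket{{0.1, 0}, {0.5, 0}, {1, 0}, {5, 10}}

	avgOfP99 := (quantile(0.99, a) + quantile(0.99, b)) / 2
	mergedP99 := quantile(0.99, merge(a, b))

	fmt.Printf("average of per-instance P99: %.2fs\n", avgOfP99)
	fmt.Printf("P99 of merged buckets:       %.2fs\n", mergedP99)
}
```

Averaging the two per-instance P99 values gives the slow instance the same weight as the busy one, while the merged buckets weight every request equally. That is the property a Summary cannot provide.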

Tracking Percentiles (P50, P90, P99, P99.9)

Define a Histogram with appropriate buckets, then query percentiles with histogram_quantile():

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var httpRequestDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace: "myapp",
        Subsystem: "http",
        Name:      "request_duration_seconds",
        Help:      "HTTP request duration in seconds.",
        Buckets:   prometheus.DefBuckets, // .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10
    },
    []string{"method", "path", "status"},
)

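// In your handler or middleware.
// statusWriter is a small http.ResponseWriter wrapper that records the status code (not defined in this snippet).
func instrumentHandler(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        sw := &statusWriter{ResponseWriter: w, status: http.StatusOK} // 200 unless the handler sets another code
        next.ServeHTTP(sw, r)
        httpRequestDuration.WithLabelValues(
            r.Method,
            r.URL.Path, // unbounded for paths like /users/alice: pass the route template in production (see High-Cardinality Labels)
            strconv.Itoa(sw.status),
        ).Observe(time.Since(start).Seconds())
    })
}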
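
The middleware relies on a statusWriter type that is not shown. A minimal sketch of such a wrapper follows; the Unwrap method and the demo helpers are additions for illustration, not part of the original snippet.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusWriter records the status code written by the wrapped handler.
type statusWriter struct {
	http.ResponseWriter
	status int
}

// WriteHeader stores the code before delegating to the real writer.
func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// Unwrap lets http.ResponseController reach the underlying writer.
func (w *statusWriter) Unwrap() http.ResponseWriter { return w.ResponseWriter }

// observedStatus runs h against a test request and returns the status the
// wrapper recorded.
func observedStatus(h http.Handler) int {
	sw := &statusWriter{ResponseWriter: httptest.NewRecorder(), status: http.StatusOK}
	h.ServeHTTP(sw, httptest.NewRequest(http.MethodGet, "/", nil))
	return sw.status
}

// demo returns the recorded status for an explicit 404 and for a handler
// that never calls WriteHeader.
func demo() (explicit, implicit int) {
	explicit = observedStatus(http.NotFoundHandler())
	implicit = observedStatus(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok") // no WriteHeader call: net/http sends 200 implicitly
	}))
	return explicit, implicit
}

func main() {
	explicit, implicit := demo()
	fmt.Println(explicit, implicit)
}
```

Initializing status to 200 matters because handlers that only call Write never trigger WriteHeader on the wrapper.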

PromQL queries for percentiles:

# P50 (median) over the last 5 minutes
histogram_quantile(0.50, rate(myapp_http_request_duration_seconds_bucket[5m]))

# P90
histogram_quantile(0.90, rate(myapp_http_request_duration_seconds_bucket[5m]))

# P99
histogram_quantile(0.99, rate(myapp_http_request_duration_seconds_bucket[5m]))

# P99.9
histogram_quantile(0.999, rate(myapp_http_request_duration_seconds_bucket[5m]))

# P99 broken down by path
histogram_quantile(0.99, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le, path))

Naming Conventions

Metric names MUST follow the Prometheus naming best practices. The pattern is: <namespace>_<subsystem>_<name>_<unit>

Rules:

Use a single-word application prefix (namespace) relevant to the domain
A metric must refer to a single unit and single quantity
Include the unit as a suffix, in plural form
MUST use base units — not derived units

Always use base units:

Measurement	Use	Not
Time	`_seconds`	`_milliseconds`, `_minutes`
Data size	`_bytes`	`_kilobytes`, `_megabytes`
Temperature	`_celsius`	`_fahrenheit`
Ratio/percent	`_ratio` (0–1)	`_percent` (0–100)
Mass	`_grams`	`_kilograms`

Suffix conventions:

Suffix	When to use	Example
`_total`	Counters MUST use this suffix	`myapp_http_requests_total`
`_seconds`	Duration measurements	`myapp_http_request_duration_seconds`
`_bytes`	Data sizes	`myapp_response_size_bytes`
`_info`	Pseudo-metrics exposing metadata	`myapp_build_info`
`_created`	Creation timestamp of a counter	`myapp_http_requests_created`

// ✓ Good — namespace, subsystem, descriptive name, base unit suffix
myapp_http_requests_total              // Counter
myapp_http_request_duration_seconds    // Histogram — seconds, not milliseconds
myapp_http_response_size_bytes         // Histogram — bytes, not kilobytes
myapp_db_connections_active            // Gauge
myapp_queue_messages_pending           // Gauge
process_cpu_seconds_total              // Counter — total CPU time in seconds

// ✗ Bad
request_count             // no namespace, no unit suffix
httpDuration              // camelCase, no unit
request_duration_ms       // milliseconds instead of seconds
myapp_request_size_kb     // kilobytes instead of bytes

Label naming: do not embed label values (such as the HTTP method) in the metric name. Use labels to differentiate characteristics:

// ✗ Bad — operation embedded in metric name
myapp_http_get_requests_total
myapp_http_post_requests_total

// ✓ Good — use a label
myapp_http_requests_total{method="GET"}
myapp_http_requests_total{method="POST"}

Semantic consistency: sum() or avg() over all label dimensions of a metric should be meaningful. If not, split into separate metrics.
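
Some of the naming rules above can be checked mechanically, for example in a unit test that walks your metric names. The sketch below is a partial lint with a hypothetical checkMetricName helper: it catches casing and derived-unit suffixes but cannot tell whether a namespace or unit is missing.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// snakeCase requires lowercase words joined by underscores, with at least
// two words (a prefix plus a name).
var snakeCase = regexp.MustCompile(`^[a-z][a-z0-9]*(_[a-z0-9]+)+$`)

// derivedUnits lists suffixes that should be replaced by a base unit.
var derivedUnits = []string{
	"_ms", "_milliseconds", "_minutes",
	"_kb", "_kilobytes", "_megabytes",
	"_percent",
}

// checkMetricName returns an error describing the first rule the name breaks.
func checkMetricName(name string) error {
	if !snakeCase.MatchString(name) {
		return fmt.Errorf("%q: want lowercase snake_case with a namespace prefix", name)
	}
	for _, suffix := range derivedUnits {
		if strings.HasSuffix(name, suffix) {
			return fmt.Errorf("%q: use a base unit instead of %q", name, suffix)
		}
	}
	return nil
}

func main() {
	for _, name := range []string{
		"myapp_http_requests_total",
		"httpDuration",
		"request_duration_ms",
		"myapp_request_size_kb",
	} {
		fmt.Println(name, "->", checkMetricName(name))
	}
}
```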

Exposing Metrics

import "github.com/prometheus/client_golang/prometheus/promhttp"

mux.Handle("/metrics", promhttp.Handler())

Document Metrics with PromQL Comments

EVERY METRIC declaration SHOULD include the relevant PromQL queries and alert rules as comments directly above the variable. This makes metrics self-documenting — when a developer reads the code, they immediately see how the metric is used in dashboards and alerts, without hunting through Grafana or alert configurations.

// ✗ Bad — metric exists but nobody knows how to query or alert on it
var httpRequestsTotal = promauto.NewCounterVec(...)

// ✓ Good — PromQL queries and alert rules are part of the code
//
// Dashboard: rate(myapp_http_requests_total[5m])
// Dashboard: sum by (status) (rate(myapp_http_requests_total[5m]))
// Alert:     sum(rate(myapp_http_requests_total{status=~"5.."}[5m])) / sum(rate(myapp_http_requests_total[5m])) > 0.01
var httpRequestsTotal = promauto.NewCounterVec(...)

This convention has practical benefits: PromQL queries are reviewed in PRs alongside the metric, queries stay in sync with metric changes (label renames, bucket changes), and new team members can understand the metric's purpose at a glance.

Metric Examples and PromQL Queries

The examples below cover all four metric types, each annotated with the PromQL used for dashboards and alerts.

For infrastructure and dependency alerting (databases, caches, message brokers, reverse proxies, Kubernetes), awesome-prometheus-alerts provides a curated collection of ~500 ready-to-use Prometheus alerting rules organized by technology. See alerting.md for integration details and Go runtime alerts.

NEVER use irate(...) for alerts — use rate(...) instead.

Counters — tracking events

// Dashboard: rate(myapp_http_requests_total[5m])
// Dashboard: sum by (status) (rate(myapp_http_requests_total[5m]))
// Dashboard: sum by (path) (rate(myapp_http_requests_total[5m]))
// Dashboard: topk(5, sum by (path) (rate(myapp_http_requests_total[5m])))
// Dashboard: increase(myapp_http_requests_total[1h])
// SLI:      1 - (sum(rate(myapp_http_requests_total{status=~"5.."}[5m])) / sum(rate(myapp_http_requests_total[5m])))
// Alert:    sum(rate(myapp_http_requests_total{status=~"5.."}[5m])) / sum(rate(myapp_http_requests_total[5m])) > 0.01
// Alert:    sum(rate(myapp_http_requests_total{status=~"5.."}[1m])) / sum(rate(myapp_http_requests_total[1m])) > 0.05
var httpRequestsTotal = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Namespace: "myapp",
        Subsystem: "http",
        Name:      "requests_total",
        Help:      "Total number of HTTP requests.",
    },
    []string{"method", "path", "status"},
)

// Dashboard: sum by (type) (rate(myapp_errors_total[5m]))
// Dashboard: topk(3, sum by (type) (rate(myapp_errors_total[5m])))
// Alert:    rate(myapp_errors_total{type="database"}[5m]) > 0.5
var errorsTotal = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Namespace: "myapp",
        Name:      "errors_total",
        Help:      "Total number of errors by type.",
    },
    []string{"type"}, // "database", "external_api", "validation"
)

// Dashboard: sum by (payment_method) (rate(myapp_orders_created_total[5m]))
// Dashboard: increase(myapp_orders_created_total[24h])
// Alert:    sum(rate(myapp_orders_created_total[30m])) == 0
var ordersCreated = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Namespace: "myapp",
        Subsystem: "orders",
        Name:      "created_total",
        Help:      "Total number of orders created.",
    },
    []string{"payment_method"},
)

Key PromQL patterns for counters:

# Requests per second (smoothed over 5 minutes)
rate(myapp_http_requests_total[5m])

# Traffic by status code — see distribution of 2xx/4xx/5xx
sum by (status) (rate(myapp_http_requests_total[5m]))

# Top 5 busiest endpoints
topk(5, sum by (path) (rate(myapp_http_requests_total[5m])))

# Absolute request count in the last hour (useful for reports)
increase(myapp_http_requests_total[1h])

# Error ratio — fraction of requests returning 5xx (SLI)
sum(rate(myapp_http_requests_total{status=~"5.."}[5m]))
/
sum(rate(myapp_http_requests_total[5m]))

# 4xx error ratio — client errors (useful for spotting bad deployments)
sum(rate(myapp_http_requests_total{status=~"4.."}[5m]))
/
sum(rate(myapp_http_requests_total[5m]))

# Alert: error rate > 1% for 5 minutes (for: 5m)
sum(rate(myapp_http_requests_total{status=~"5.."}[5m]))
/
sum(rate(myapp_http_requests_total[5m]))
> 0.01

# Alert: spike detection — error rate > 5% over 1 minute (for: 2m)
sum(rate(myapp_http_requests_total{status=~"5.."}[1m]))
/
sum(rate(myapp_http_requests_total[1m]))
> 0.05

# Alert: zero orders for 30 minutes — business is broken (for: 30m)
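# sum() aggregates across payment_method, so one unused payment method does not fire the alert
sum(rate(myapp_orders_created_total[30m])) == 0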

Gauges — tracking current state

// Dashboard: myapp_http_in_flight_requests
// Alert:    myapp_http_in_flight_requests > 500
var httpInFlightRequests = promauto.NewGauge(
    prometheus.GaugeOpts{
        Namespace: "myapp",
        Subsystem: "http",
        Name:      "in_flight_requests",
        Help:      "Number of HTTP requests currently being processed.",
    },
)

// Dashboard: myapp_db_connections_active
// Dashboard: myapp_db_connections_active / myapp_db_connections_max
// Alert:    myapp_db_connections_active{pool="write"} / myapp_db_connections_max{pool="write"} > 0.9
// Alert:    predict_linear(myapp_db_connections_active[15m], 600) > myapp_db_connections_max
var dbConnectionsActive = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace: "myapp",
        Subsystem: "db",
        Name:      "connections_active",
        Help:      "Number of active database connections.",
    },
    []string{"pool"}, // "read", "write"
)

var dbConnectionsMax = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace: "myapp",
        Subsystem: "db",
        Name:      "connections_max",
        Help:      "Maximum database connections in the pool.",
    },
    []string{"pool"},
)

// Dashboard: myapp_queue_messages_pending
// Dashboard: deriv(myapp_queue_messages_pending[5m])
// Alert:    myapp_queue_messages_pending{queue_name="orders"} > 1000
// Alert:    deriv(myapp_queue_messages_pending[10m]) > 50
// Alert:    predict_linear(myapp_queue_messages_pending[30m], 3600) > 10000
var queueSize = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace: "myapp",
        Subsystem: "queue",
        Name:      "messages_pending",
        Help:      "Number of messages waiting to be processed.",
    },
    []string{"queue_name"},
)

// Dashboard: myapp_workers_active / myapp_workers_max
// Alert:    myapp_workers_active / myapp_workers_max > 0.8
// (assumes a companion myapp_workers_max gauge, declared like dbConnectionsMax above)
var workersActive = promauto.NewGauge(
    prometheus.GaugeOpts{
        Namespace: "myapp",
        Name:      "workers_active",
        Help:      "Number of worker goroutines currently processing jobs.",
    },
)

// Usage in middleware:
func instrumentMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        httpInFlightRequests.Inc()
        defer httpInFlightRequests.Dec()
        next.ServeHTTP(w, r)
    })
}

Key PromQL patterns for gauges:

# Current value — gauges are queried directly
myapp_http_in_flight_requests

# Saturation — what fraction of the pool is in use
myapp_db_connections_active{pool="write"} / myapp_db_connections_max{pool="write"}

# Rate of change — is the queue growing or shrinking? (items/second)
deriv(myapp_queue_messages_pending[5m])

# Prediction — will the connection pool be exhausted in 10 minutes?
# predict_linear extrapolates the trend from the last 15 minutes
predict_linear(myapp_db_connections_active[15m], 600) > myapp_db_connections_max

# Prediction — will the queue exceed 10k items in 1 hour?
predict_linear(myapp_queue_messages_pending[30m], 3600) > 10000

# Alert: connection pool > 90% saturated (for: 5m)
myapp_db_connections_active{pool="write"} / myapp_db_connections_max{pool="write"} > 0.9

# Alert: queue depth growing faster than 50 items/sec (for: 10m)
deriv(myapp_queue_messages_pending[10m]) > 50

# Alert: worker pool saturated (for: 5m)
myapp_workers_active / myapp_workers_max > 0.8
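
For intuition about predict_linear, the sketch below fits a least-squares line through recent samples and extrapolates it. It mirrors the documented idea of simple linear regression, not the Prometheus implementation; sample and predictLinear are hypothetical names, and it expects at least two samples with distinct timestamps.

```go
package main

import "fmt"

// sample is one gauge observation: value v at time t (seconds).
type sample struct {
	t float64
	v float64
}

// predictLinear fits v = intercept + slope*t by least squares and returns
// the fitted value at evalTime+horizon.
func predictLinear(samples []sample, evalTime, horizon float64) float64 {
	n := float64(len(samples))
	var sumT, sumV float64
	for _, s := range samples {
		sumT += s.t
		sumV += s.v
	}
	meanT, meanV := sumT/n, sumV/n

	var cov, varT float64
	for _, s := range samples {
		cov += (s.t - meanT) * (s.v - meanV)
		varT += (s.t - meanT) * (s.t - meanT)
	}
	if varT == 0 {
		return meanV // no spread in time: no trend to extrapolate
	}
	slope := cov / varT
	intercept := meanV - slope*meanT
	return intercept + slope*(evalTime+horizon)
}

func main() {
	// Active connections sampled every 5 minutes over the last 15 minutes.
	window := []sample{{0, 40}, {300, 45}, {600, 50}, {900, 55}}
	const poolMax = 60.0

	predicted := predictLinear(window, 900, 600) // 10 minutes ahead
	fmt.Printf("predicted in 10m: %.1f (max %.0f), alert: %v\n",
		predicted, poolMax, predicted > poolMax)
}
```

The pool is still below its limit at evaluation time, but the fitted trend crosses the limit within the horizon, which is the early warning that a plain threshold cannot give.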

Histograms — tracking distributions (recommended for latency)

// Dashboard: histogram_quantile(0.50, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le))
// Dashboard: histogram_quantile(0.90, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le))
// Dashboard: histogram_quantile(0.99, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le))
// Dashboard: histogram_quantile(0.99, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le, path))
// SLI:      sum(rate(myapp_http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(myapp_http_request_duration_seconds_count[5m]))
// Alert:    histogram_quantile(0.99, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le)) > 2
var httpRequestDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace: "myapp",
        Subsystem: "http",
        Name:      "request_duration_seconds",
        Help:      "HTTP request duration in seconds.",
        Buckets:   []float64{.005, .01, .025, .05, .1, .25, .3, .5, 1, 2.5, 5, 10}, // .3 is required by the le="0.3" SLI queries
    },
    []string{"method", "path", "status"},
)

// Dashboard: histogram_quantile(0.95, sum(rate(myapp_external_call_duration_seconds_bucket[5m])) by (le, service))
// Alert:    histogram_quantile(0.99, sum(rate(myapp_external_call_duration_seconds_bucket[5m])) by (le, service)) > 5
var externalAPICallDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace: "myapp",
        Subsystem: "external",
        Name:      "call_duration_seconds",
        Help:      "Duration of external API calls in seconds.",
        Buckets:   []float64{.01, .05, .1, .25, .5, 1, 2.5, 5, 10, 30},
    },
    []string{"service", "endpoint"},
)

// Dashboard: histogram_quantile(0.95, sum(rate(myapp_orders_amount_dollars_bucket[5m])) by (le))
var orderAmount = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace: "myapp",
        Subsystem: "orders",
        Name:      "amount_dollars",
        Help:      "Order amount in dollars.",
        Buckets:   []float64{1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000},
    },
    []string{"payment_method"},
)

Key PromQL patterns for histograms:

# Percentile latencies — the core latency dashboard
histogram_quantile(0.50, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le))  # P50
histogram_quantile(0.90, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le))  # P90
histogram_quantile(0.95, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le))  # P95
histogram_quantile(0.99, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le))  # P99
histogram_quantile(0.999, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le)) # P99.9

# P99 latency broken down by endpoint — find the slowest paths
histogram_quantile(0.99, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le, path))

# Average latency (mean) — useful alongside percentiles
sum(rate(myapp_http_request_duration_seconds_sum[5m]))
/
sum(rate(myapp_http_request_duration_seconds_count[5m]))

# Apdex-like SLI — fraction of requests under 300ms (target threshold)
sum(rate(myapp_http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(myapp_http_request_duration_seconds_count[5m]))

# Request throughput from histogram (requests/sec)
sum(rate(myapp_http_request_duration_seconds_count[5m]))

# External API P95 latency per service
histogram_quantile(0.95, sum(rate(myapp_external_call_duration_seconds_bucket[5m])) by (le, service))

# Alert: P99 latency > 2s (for: 5m)
histogram_quantile(0.99, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le)) > 2

# Alert: P95 latency > 500ms (for: 10m)
histogram_quantile(0.95, sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le)) > 0.5

# Alert: external API P99 > 5s (for: 5m)
histogram_quantile(0.99, sum(rate(myapp_external_call_duration_seconds_bucket[5m])) by (le, service)) > 5

# Alert: less than 95% of requests under 300ms (SLO breach) (for: 10m)
(
  sum(rate(myapp_http_request_duration_seconds_bucket{le="0.3"}[5m]))
  /
  sum(rate(myapp_http_request_duration_seconds_count[5m]))
) < 0.95

Summary — client-side quantiles (use sparingly)

Summaries compute quantiles on the client and cannot be aggregated across instances. Use them only for single-process diagnostics where exact quantiles matter. Prefer Histogram in all other cases.

// Dashboard: myapp_jobs_processing_seconds{quantile="0.5"}
// Dashboard: myapp_jobs_processing_seconds{quantile="0.99"}
// Note: these quantiles CANNOT be aggregated across instances
var jobProcessingDuration = promauto.NewSummary(
    prometheus.SummaryOpts{
        Namespace:  "myapp",
        Subsystem:  "jobs",
        Name:       "processing_seconds",
        Help:       "Job processing duration in seconds.",
        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        MaxAge:     10 * time.Minute,
    },
)

Multi-Window Burn-Rate SLO Alerting

For critical services, simple threshold alerts ("error rate > 1%") fire too late for fast incidents and too early for slow ones. Multi-window burn-rate alerting scales alert urgency to how fast you're consuming your error budget.

For a 99.9% availability SLO (0.1% error budget over 30 days):

Window	Burn rate	Error rate	Severity	Meaning
5m + 1h	14.4x	> 1.44%	Critical (page)	Budget exhausted in ~2 days
30m + 6h	6x	> 0.6%	Critical (page)	Budget exhausted in ~5 days
2h + 24h	1x	> 0.1%	Warning (ticket)	Budget exhausted in ~30 days (the full SLO window)

# Fast burn: page immediately (for: 2m)
# Both short and long windows must fire to avoid noise from brief spikes.
# Errors are 5xx responses, matching the SLI defined for myapp_http_requests_total;
# "1 - 2xx / total" would also count redirects (3xx) and client errors (4xx) against the budget.
(
  sum(rate(myapp_http_requests_total{status=~"5.."}[5m])) / sum(rate(myapp_http_requests_total[5m])) > 0.0144
  and
  sum(rate(myapp_http_requests_total{status=~"5.."}[1h])) / sum(rate(myapp_http_requests_total[1h])) > 0.0144
)

# Medium burn: page (for: 15m)
(
  sum(rate(myapp_http_requests_total{status=~"5.."}[30m])) / sum(rate(myapp_http_requests_total[30m])) > 0.006
  and
  sum(rate(myapp_http_requests_total{status=~"5.."}[6h])) / sum(rate(myapp_http_requests_total[6h])) > 0.006
)

# Slow burn: ticket (for: 1h)
(
  sum(rate(myapp_http_requests_total{status=~"5.."}[2h])) / sum(rate(myapp_http_requests_total[2h])) > 0.001
  and
  sum(rate(myapp_http_requests_total{status=~"5.."}[24h])) / sum(rate(myapp_http_requests_total[24h])) > 0.001
)

The short window catches the incident fast; the long window confirms it's sustained. Together they eliminate false positives from transient blips.
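
The thresholds in the table follow from two small formulas: the error-rate threshold is the burn rate times the error budget, and the time to exhaustion is the SLO window divided by the burn rate. A sketch with hypothetical helper names:

```go
package main

import "fmt"

// errorRateThreshold returns the error ratio that corresponds to burning the
// budget of the given SLO (e.g. 0.999) at burnRate times the sustainable pace.
func errorRateThreshold(slo, burnRate float64) float64 {
	return burnRate * (1 - slo)
}

// daysToExhaustion returns how long the budget lasts at a constant burn rate,
// for an SLO window of windowDays.
func daysToExhaustion(windowDays, burnRate float64) float64 {
	return windowDays / burnRate
}

func main() {
	const slo, windowDays = 0.999, 30.0
	for _, burn := range []float64{14.4, 6, 1} {
		fmt.Printf("burn %5.1fx  error rate > %.2f%%  budget gone in %.1f days\n",
			burn,
			100*errorRateThreshold(slo, burn),
			daysToExhaustion(windowDays, burn))
	}
}
```

Changing slo to another target (for example 0.995) yields the thresholds to substitute into the PromQL expressions.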

High-Cardinality Labels

NEVER use high-cardinality labels (user IDs, full URLs, request IDs). Every unique combination of label values creates a separate time series in Prometheus. Unbounded labels cause memory explosion on the Prometheus server, slow queries, and can crash the monitoring stack.

// ✗ Bad — unbounded cardinality (millions of unique values)
// (each call shows only the label under discussion; a real call passes every label of the vector)
httpRequestsTotal.WithLabelValues(r.URL.Path)    // /users/alice, /users/bob, /users/charlie...
httpRequestsTotal.WithLabelValues(userID)          // one series per user
httpRequestsTotal.WithLabelValues(r.Header.Get("X-Request-ID")) // one series per request!

// ✓ Good — bounded, normalized labels
httpRequestsTotal.WithLabelValues(routePattern)   // /users/:id (the route template, not the actual path)
httpRequestsTotal.WithLabelValues(r.Method)        // GET, POST, PUT, PATCH, DELETE (5 values)
httpRequestsTotal.WithLabelValues(statusBucket)    // "2xx", "3xx", "4xx", "5xx" (4 values)

How to limit cardinality:

Use route templates (/users/:id) instead of actual paths (/users/alice)
Bucket status codes (2xx, 4xx, 5xx) instead of exact codes (200, 201, 204, 400, 401, ...)
Never use user IDs, request IDs, session IDs, or email addresses as labels
Use attributes/tags in traces instead — traces handle high cardinality naturally
Rule of thumb: if a label can have more than ~100 unique values, it's too many
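
Two of these normalizations can be sketched with the standard library alone. statusBucket and methodLabel are hypothetical helpers. Bounding the method is an extra precaution not listed above, since clients can send arbitrary method strings.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusBucket collapses an HTTP status code into its class ("2xx", "4xx", ...).
// Codes outside 100-599 map to "other" so the label stays bounded.
func statusBucket(code int) string {
	if code < 100 || code > 599 {
		return "other"
	}
	return fmt.Sprintf("%dxx", code/100)
}

// methodLabel keeps well-known methods and maps everything else to "OTHER".
func methodLabel(method string) string {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodPost, http.MethodPut,
		http.MethodPatch, http.MethodDelete, http.MethodOptions:
		return method
	default:
		return "OTHER"
	}
}

func main() {
	fmt.Println(statusBucket(204), statusBucket(404), statusBucket(503))
	fmt.Println(methodLabel("GET"), methodLabel("PROPFIND"))
}
```

Bucketed values such as "5xx" still match the status=~"5.." selectors used in the queries above, because each dot matches any character.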

Profiling and Continuous Profiling

→ See samber/cc-skills-golang@golang-troubleshooting skill (pprof.md) for on-demand debugging.

What Profiling Is

Profiling analyzes the runtime behavior of your program — where CPU time is spent, how memory is allocated, which goroutines are blocked, and where lock contention occurs. While metrics tell you "the service is slow," profiling tells you "this specific function on line 42 is the bottleneck."

On-Demand Profiling with `pprof`

pprof endpoints MUST be protected with basic auth — NEVER expose them publicly. They leak sensitive runtime information and can be abused for DoS.

→ See samber/cc-skills-golang@golang-troubleshooting pprof.md for the full pprof CLI reference (profile types, capturing, analyzing, commands).
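
A minimal standard-library sketch of protecting the pprof endpoints with basic auth follows. The helper names and the demo credentials are hypothetical; in a real service the credentials would come from configuration or a secret store, and the mux would typically listen on a separate, internal-only port.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// basicAuth rejects requests whose credentials do not match user and pass.
// Constant-time comparison avoids leaking how many characters matched.
func basicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		userOK := subtle.ConstantTimeCompare([]byte(u), []byte(user)) == 1
		passOK := subtle.ConstantTimeCompare([]byte(p), []byte(pass)) == 1
		if !ok || !userOK || !passOK {
			w.Header().Set("WWW-Authenticate", `Basic realm="pprof"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newPprofMux registers the pprof handlers on a dedicated mux. Importing
// net/http/pprof also registers them on http.DefaultServeMux, so the public
// server should not serve the default mux.
func newPprofMux(user, pass string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return basicAuth(user, pass, mux)
}

// statusFor requests the pprof index with or without credentials.
func statusFor(withCreds bool) int {
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/", nil)
	if withCreds {
		req.SetBasicAuth("ops", "s3cret") // demo credentials only
	}
	rec := httptest.NewRecorder()
	newPprofMux("ops", "s3cret").ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	// In a service: http.ListenAndServe("127.0.0.1:6060", newPprofMux(user, pass))
	fmt.Println(statusFor(false), statusFor(true))
}
```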

Continuous Profiling with Pyroscope

On-demand profiling requires you to be there when the problem happens. Continuous profiling runs always-on in the background with low overhead (~2-5% CPU), so you can look at profiles after the fact. Toggle it with an environment variable.

import (
    "log/slog"
    "os"

    "github.com/grafana/pyroscope-go"
)

func setupContinuousProfiling() {
    if os.Getenv("PROFILING_ENABLED") != "true" {
        return
    }

    _, err := pyroscope.Start(pyroscope.Config{
        ApplicationName: "my-service",
        ServerAddress:   os.Getenv("PYROSCOPE_URL"), // e.g., http://user:pass@pyroscope:4040
        ProfileTypes: []pyroscope.ProfileType{
            pyroscope.ProfileCPU,
            pyroscope.ProfileAllocObjects,
            pyroscope.ProfileAllocSpace,
            pyroscope.ProfileInuseObjects,
            pyroscope.ProfileInuseSpace,
            pyroscope.ProfileGoroutines,
            pyroscope.ProfileMutexCount,
            pyroscope.ProfileMutexDuration,
            pyroscope.ProfileBlockCount,
            pyroscope.ProfileBlockDuration,
        },
    })
    if err != nil {
        slog.Error("failed to start pyroscope", "error", err)
    } else {
        // Do not log PYROSCOPE_URL: it may embed basic-auth credentials
        slog.Info("continuous profiling enabled")
    }
}

Cost of Continuous Profiling

Continuous profiling adds overhead to every running instance — CPU for collecting stack samples, memory for buffering, and network for transmitting profiles to the backend. While typically low (~2-5% CPU), this cost is per-instance and always-on.

Cost factors:

CPU overhead — profiling itself consumes CPU cycles. In CPU-bound services, even 2-5% overhead matters.
Network/storage — profile data is continuously shipped to Pyroscope/your backend. High-replica services multiply this.
All profile types enabled — each additional profile type (mutex, block, goroutine) adds incremental overhead.

Mitigation:

Toggle via environment variable (PROFILING_ENABLED) — enable only when needed or on a subset of instances
Start with CPU + heap profiles only; add mutex/block/goroutine profiles when investigating specific issues
In large deployments, enable continuous profiling on a fraction of replicas (e.g., 1 in 10) rather than all of them
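
One possible way to profile only a fraction of replicas is to hash a per-instance identifier and enable profiling when the hash falls into a fixed slice. The sketch below assumes the hostname is a usable instance ID (in Kubernetes it is typically the pod name); shouldProfile is a hypothetical helper, not part of Pyroscope.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"os"
)

// shouldProfile reports whether this instance falls into the 1-in-oneIn slice
// that runs continuous profiling. The decision is deterministic for a given
// instance ID, so it stays the same for as long as the ID does.
func shouldProfile(instanceID string, oneIn uint32) bool {
	if oneIn == 0 {
		return false
	}
	h := fnv.New32a()
	h.Write([]byte(instanceID))
	return h.Sum32()%oneIn == 0
}

func main() {
	host, err := os.Hostname()
	if err != nil {
		return // no stable ID available: leave profiling off
	}
	if os.Getenv("PROFILING_ENABLED") == "true" && shouldProfile(host, 10) {
		fmt.Println("continuous profiling enabled on", host)
		// call setupContinuousProfiling() here
	}
}
```

Hashing gives roughly, not exactly, one instance in ten, so small deployments may end up with no profiled replica at all; an explicit per-replica flag is the simpler choice there.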

When to Profile

1. Metrics show high CPU/memory usage → look at CPU/heap profiles
2. P99 latency spikes → CPU profile + mutex profile to find contention
3. Goroutine count growing → goroutine profile to find leaks
4. Before and after an optimization → compare profiles to verify improvement

Real User Monitoring (RUM) and Product Observability

What RUM Is

Backend observability (logs, metrics, traces, profiles) tells you how your system behaves. RUM tells you how your users experience it. While frontend SDKs capture browser-side signals, the Go backend plays a critical role: tracking server-side business events, feeding Customer Data Platforms, and correlating user sessions with backend traces.

RUM Capabilities

Capability	What it reveals	Example tools
Product Analytics	What users do — page views, clicks, feature adoption, retention	PostHog, Amplitude, Mixpanel
Funnel Analysis	Where users drop off in multi-step flows (signup, checkout, onboarding)	PostHog, Amplitude, Mixpanel
CDP	Unified user profile from all data sources — events, properties, segments	Segment, RudderStack

Identity Key: Use `user_id`, Never Email

The distinct_id (identity key) used across all RUM tracking MUST be your internal, immutable user_id. NEVER use email addresses.

// ✗ Bad — email is mutable, PII, and breaks analytics when users change it
posthogClient.Enqueue(posthog.Capture{
    DistinctId: user.Email, // "alice@example.com" → user changes email → events split into two users
    Event:      "order_completed",
})

// ✓ Good — user_id is immutable, stable, and not PII
posthogClient.Enqueue(posthog.Capture{
    DistinctId: user.ID, // "usr_a1b2c3" — never changes, always the same user
    Event:      "order_completed",
})

Why email is a bad identity key:

Mutable — users change their email. Events before and after the change appear as two different users, breaking funnels, retention analysis, and cohort tracking.
PII — using email as the identity key means every event, session recording, and analytics query contains personally identifiable information. This complicates GDPR/CCPA compliance — you can't anonymize analytics without losing user identity.
Non-unique across systems — the same email might belong to different accounts in different services or environments.
Leaks into third-party systems — the distinct_id is sent to your analytics platform (PostHog, Segment, etc.). If it's an email, you've shared PII with every vendor in your analytics pipeline.

Use user_id as the identity key everywhere: PostHog DistinctId, Segment UserId, Amplitude user_id. Store email as a user property if needed for display, never as the primary key.

Backend Role in RUM

The Go backend tracks server-side events, correlates sessions with traces, and feeds data into CDPs.

1. Server-Side Event Tracking

When critical business events happen server-side (payment completed, subscription upgraded, email sent), track them from Go so they appear in the same analytics pipeline as frontend events.

import (
    "context"
    "log/slog"
    "os"

    "github.com/posthog/posthog-go"
)

var posthogClient posthog.Client

func initPostHog() {
    var err error
    posthogClient, err = posthog.NewWithConfig(
        os.Getenv("POSTHOG_API_KEY"),
        posthog.Config{Endpoint: os.Getenv("POSTHOG_HOST")},
    )
    if err != nil {
        slog.Error("failed to init PostHog", "error", err)
    }
}

func (s *OrderService) Complete(ctx context.Context, order Order) error {
    // ... business logic ...

    // Track server-side event — appears alongside frontend events in PostHog
    posthogClient.Enqueue(posthog.Capture{
        DistinctId: order.UserID, // immutable user_id, not email
        Event:      "order_completed",
        Properties: posthog.NewProperties().
            Set("order_id", order.ID).
            Set("amount", order.Total).
            Set("payment_method", order.PaymentMethod).
            Set("item_count", len(order.Items)),
    })

    return nil
}

2. Connecting Frontend Sessions to Backend Traces

Pass the frontend session ID or distinct ID through HTTP headers so backend traces can be correlated with RUM sessions. When a user reports "the page was slow," you can find their session recording AND the backend trace for the same request.

func TracingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context()
        span := trace.SpanFromContext(ctx)

        // Attach RUM session ID to the backend span
        if sessionID := r.Header.Get("X-Session-ID"); sessionID != "" {
            span.SetAttributes(attribute.String("rum.session_id", sessionID))
        }

        // Attach analytics distinct ID for user correlation
        if distinctID := r.Header.Get("X-Distinct-ID"); distinctID != "" {
            span.SetAttributes(attribute.String("rum.distinct_id", distinctID))
        }

        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

3. CDP Event Ingestion

If you use a Customer Data Platform (Segment, RudderStack), the Go backend sends events through the CDP's server-side SDK. The CDP unifies these with frontend events into a single user profile.

import (
    "context"
    "os"
    "time"

    "github.com/segmentio/analytics-go/v3"
)

var segmentClient analytics.Client

func initSegment() {
    segmentClient = analytics.New(os.Getenv("SEGMENT_WRITE_KEY"))
}

func (s *UserService) Upgrade(ctx context.Context, userID string, plan string) error {
    // ... business logic ...

    // Track through CDP — unified with frontend events
    segmentClient.Enqueue(analytics.Track{
        UserId: userID, // immutable user_id, not email
        Event:  "plan_upgraded",
        Properties: analytics.NewProperties().
            Set("plan", plan).
            Set("source", "api"),
    })

    // Update user profile in CDP
    segmentClient.Enqueue(analytics.Identify{
        UserId: userID,
        Traits: analytics.NewTraits().
            Set("plan", plan).
            Set("upgraded_at", time.Now()),
    })

    return nil
}

GDPR and CCPA Compliance

RUM collects user behavior data such as clicks, page views, and session recordings, which triggers privacy regulation requirements. Compliance is not optional; violations carry heavy fines (GDPR: up to €20 million or 4% of global annual turnover, whichever is higher; CCPA: up to $7,500 per intentional violation).

Consent Management

GDPR/CCPA consent SHOULD be obtained before loading RUM SDKs or sending tracking events. This applies to both frontend scripts and server-side event tracking.

// Server-side: check consent before tracking
func (s *OrderService) Complete(ctx context.Context, order Order) error {
    // ... business logic ...

    // Only track if user has consented to analytics
    consent := auth.ConsentFromContext(ctx)
    if consent.Analytics {
        posthogClient.Enqueue(posthog.Capture{
            DistinctId: order.UserID,
            Event:      "order_completed",
            Properties: posthog.NewProperties().
                Set("order_id", order.ID).
                Set("amount", order.Total),
        })
    }

    return nil
}

Data Subject Rights Endpoints

GDPR and CCPA require you to let users access, export, and delete their data. Implement API endpoints that propagate these requests to all systems that hold user data — your database, your analytics platform, your CDP.

// DELETE /api/users/:id/data — GDPR Article 17 "Right to Erasure"
func (h *PrivacyHandler) HandleDataDeletion(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    userID := chi.URLParam(r, "id")

    // 1. Delete from your database
    if err := h.userRepo.DeleteAllData(ctx, userID); err != nil {
        slog.ErrorContext(ctx, "failed to delete user data", "user_id", userID, "error", err)
        http.Error(w, "internal error", http.StatusInternalServerError)
        return
    }

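    // 2. Delete from analytics platform; on failure, keep going so every system is still attempted
    incomplete := false
    if err := h.posthog.DeleteUser(ctx, userID); err != nil {
        incomplete = true
        slog.ErrorContext(ctx, "failed to delete analytics data", "user_id", userID, "error", err)
    }

    // 3. Delete from CDP
    if err := h.segment.DeleteUser(ctx, userID); err != nil {
        incomplete = true
        slog.ErrorContext(ctx, "failed to delete CDP data", "user_id", userID, "error", err)
    }

    // Do not report success while a downstream system still holds the user's data
    if incomplete {
        http.Error(w, "deletion incomplete, retry later", http.StatusInternalServerError)
        return
    }

    slog.InfoContext(ctx, "user data deletion completed", "user_id", userID)
    w.WriteHeader(http.StatusNoContent)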
}

// GET /api/users/:id/data — GDPR Article 15 "Right of Access"
func (h *PrivacyHandler) HandleDataExport(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    userID := chi.URLParam(r, "id")

    export, err := h.userRepo.ExportAllData(ctx, userID)
    if err != nil {
        slog.ErrorContext(ctx, "failed to export user data", "user_id", userID, "error", err)
        http.Error(w, "internal error", http.StatusInternalServerError)
        return
    }

    w.Header().Set("Content-Type", "application/json")
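    if err := json.NewEncoder(w).Encode(export); err != nil {
        slog.ErrorContext(ctx, "failed to write data export", "user_id", userID, "error", err)
    }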
}

Privacy Checklist

[ ] Consent before tracking — no analytics scripts load and no server-side events fire until the user consents
[ ] Consent + cookie banner — clear opt-in (not pre-checked boxes), separate consent for analytics vs marketing vs functional (frontend responsibility, but backend must respect the consent flag)
[ ] Data minimization — only collect what you need, never track PII in analytics events
[ ] Data retention policy — auto-delete old analytics data (e.g., 2 years for aggregated analytics)
[ ] Data subject rights — endpoints for data export (right of access) and deletion (right to erasure)
[ ] Data processing agreements — signed DPAs with all third-party analytics/CDP vendors
[ ] Privacy policy — lists all RUM tools, what data they collect, and how long it's retained
[ ] Identity key is not PII — use user_id, not email, as the distinct_id across all platforms
[ ] Self-hosted option — consider self-hosting (PostHog, Matomo) to keep data in your infrastructure and simplify compliance

Self-Hosted vs SaaS

Factor	Self-hosted (PostHog, Matomo)	SaaS (Amplitude, Mixpanel)
Data residency	Full control — data stays in your infra	Data on vendor's servers
GDPR compliance	Simpler — no cross-border data transfer	Requires DPA, SCCs, or adequacy decision
Cost	Infrastructure cost, scales with volume	Per-event or per-seat pricing
Maintenance	You manage upgrades, scaling, backups	Vendor handles everything
Features	Catching up but improving fast	Often more polished and feature-rich

For EU-focused products or strict data residency requirements, self-hosting PostHog on EU infrastructure is the pragmatic choice because it eliminates most GDPR concerns around cross-border data transfer.

Cost of RUM

RUM costs scale with event volume:

Event-based pricing — every page view, click, and custom event counts. A busy SaaS app can generate millions of events/month per user segment.
CDP costs — CDPs charge per tracked user and per event. Segment at scale can cost more than your entire backend infrastructure.

Mitigation:

Use server-side event filtering to drop low-value events before they reach the analytics platform
Self-host where possible to convert per-event pricing into fixed infrastructure cost
Set data retention limits on aggregated analytics
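
Server-side event filtering, as suggested above, could look roughly like the sketch below: an allowlist that drops unknown events and forwards only one in N of the high-volume ones. EventFilter and the event names are hypothetical, and the sketch is not tied to any analytics SDK.

```go
package main

import (
	"fmt"
	"sync"
)

// EventFilter decides which events are forwarded to the analytics platform.
// Events missing from everyN are dropped; listed events are forwarded 1 in N.
type EventFilter struct {
	mu     sync.Mutex
	everyN map[string]uint64
	seen   map[string]uint64
}

// NewEventFilter builds a filter from an event-name to keep-1-in-N table.
func NewEventFilter(everyN map[string]uint64) *EventFilter {
	return &EventFilter{everyN: everyN, seen: make(map[string]uint64)}
}

// Allow reports whether this occurrence of the event should be forwarded.
func (f *EventFilter) Allow(event string) bool {
	n, ok := f.everyN[event]
	if !ok || n == 0 {
		return false // not on the allowlist: drop
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	f.seen[event]++
	// Forward the 1st, (N+1)th, (2N+1)th ... occurrence.
	return (f.seen[event]-1)%n == 0
}

func main() {
	f := NewEventFilter(map[string]uint64{"order_completed": 1, "page_viewed": 10})
	for _, e := range []string{"order_completed", "page_viewed", "page_viewed", "mouse_moved"} {
		fmt.Printf("%-16s forward=%v\n", e, f.Allow(e))
	}
}
```

Sampled events undercount by design, so dashboards built on them need to scale counts by N; business-critical events such as order_completed are best kept at 1.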

Distributed Tracing with OpenTelemetry

→ See samber/cc-skills-golang@golang-context skill for propagating context across service boundaries.

→ See samber/cc-skills-golang@golang-samber-oops skill for structured errors with stack traces in spans.

When using the OpenTelemetry Go SDK, refer to the library's official documentation for up-to-date API signatures and examples.

Why Tracing

When a request crosses multiple services, logs from each service are isolated. Tracing connects them: a single trace shows the full request path with timing for every operation. This is how you answer "why was this request slow?" in a microservices architecture.

OTel SDK Setup

Set up the TracerProvider early in your application. On new projects, do this first — then add spans everywhere incrementally.

import (
    "context"
    "fmt"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

func initTracer(ctx context.Context) (func(), error) {
    exporter, err := otlptracegrpc.New(ctx)
    if err != nil {
        return nil, fmt.Errorf("creating OTLP exporter: %w", err)
    }

    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceNameKey.String("my-service"),
            semconv.ServiceVersionKey.String("1.0.0"),
        ),
    )
    if err != nil {
        return nil, fmt.Errorf("creating resource: %w", err)
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
    )
    otel.SetTracerProvider(tp)

    // Register W3C Trace Context and Baggage propagation. The default global
    // propagator is a no-op, so without this otelhttp does not carry trace
    // context across service boundaries.
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))

    shutdown := func() {
        _ = tp.Shutdown(context.Background())
    }
    return shutdown, nil
}

Creating Spans

Every meaningful operation should have a span. Think of spans as the building blocks of a trace — they show where time was spent.

import (
    "context"
    "fmt"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("myapp/order-service")

func (s *OrderService) Create(ctx context.Context, req CreateOrderRequest) (*Order, error) {
    ctx, span := tracer.Start(ctx, "OrderService.Create")
    defer span.End()

    // Add attributes that help with debugging
    span.SetAttributes(
        attribute.String("order.payment_method", req.PaymentMethod),
        attribute.Float64("order.amount", req.Amount),
    )

    order, err := s.repo.Insert(ctx, req.ToOrder())
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, fmt.Errorf("inserting order: %w", err)
    }

    return order, nil
}

func (r *OrderRepo) Insert(ctx context.Context, order Order) (*Order, error) {
    ctx, span := tracer.Start(ctx, "OrderRepo.Insert")
    defer span.End()

    _, err := r.db.ExecContext(ctx, "INSERT INTO orders ...", order.ID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, fmt.Errorf("exec insert: %w", err)
    }
    return &order, nil
}

Where to add spans — spans MUST be created for:

Every service method (business logic layer)
Every database query
Every external API call
Every message queue publish/consume
Any operation that takes measurable time or could fail

HTTP Middleware with `otelhttp`

Automatically creates spans for incoming and outgoing HTTP requests:

import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

// Incoming requests — wrap your handler
mux.Handle("/orders", otelhttp.NewHandler(orderHandler, "CreateOrder"))

// Outgoing requests — HTTP clients MUST use otelhttp for automatic span propagation
client := &http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}

Span Status and Recording Errors

import (
    "go.opentelemetry.io/otel/codes"
)

// On success — no need to set status (Unset is fine)

// On error — MUST call both RecordError() and SetStatus(Error)
if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, "operation failed")
    return err
}

Structured Errors with `samber/oops`

Standard Go errors lose critical debugging information: there's no stack trace, no structured context, and no way to attach request-scoped metadata. When an error surfaces in a trace, you see "connection refused" but not where it originated or which user/tenant was affected.

`samber/oops` is a drop-in error library that fills these gaps. Every oops error carries a stack trace, structured attributes, and integrates naturally with both OpenTelemetry spans and slog:

import (
    "context"

    "github.com/samber/oops"
    "go.opentelemetry.io/otel/codes"
)

func (s *OrderService) Create(ctx context.Context, req CreateOrderRequest) (*Order, error) {
    ctx, span := tracer.Start(ctx, "OrderService.Create")
    defer span.End()

    order, err := s.repo.Insert(ctx, req.ToOrder())
    if err != nil {
        // oops wraps the error with stack trace, structured context, and error code
        err = oops.
            In("order-service").
            Code("order_insert_failed").
            With("order_id", req.OrderID).
            With("user_id", req.UserID).
            Wrapf(err, "inserting order")

        // Record the enriched error on the span, as on every error path
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }

    return order, nil
}

When this error is logged or recorded on a span, you get the full stack trace, the domain (order-service), an error code (order_insert_failed), and structured attributes (order_id, user_id) — all machine-parseable and searchable in your observability platform.

oops errors work with span.RecordError(), errors.Is/errors.As, and slog — see the samber/cc-skills-golang@golang-error-handling and samber/cc-skills-golang@golang-samber-oops skills for full usage patterns.

Trace Sampling

In high-throughput services, tracing every request is expensive. Use sampling to control the volume:

tp := sdktrace.NewTracerProvider(
    // Sample 10% of traces in production
    sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)),
    sdktrace.WithBatcher(exporter),
    sdktrace.WithResource(res),
)

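For more nuanced control, wrap the root sampler in sdktrace.ParentBased(), for example sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1)). Child spans then follow the parent's sampling decision and the ratio applies only to new root traces, which keeps traces complete across service boundaries.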

Cost of Tracing

Tracing can be one of the most expensive observability signals. Every span generates data that must be serialized, transmitted, stored, and indexed. In a microservices architecture, a single user request can produce dozens or hundreds of spans across services.

Cost factors:

Span volume — a service handling 10k req/s with 5 spans per request generates 50k spans/s. At 100% sampling, this is enormous.
Span attributes — each attribute adds to the payload size. Large attributes (request/response bodies) multiply cost.
Storage and indexing — tracing backends (Jaeger, Tempo, Datadog) charge by volume. Unsampled traces can easily become the largest line item in your observability bill.
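
The span-volume arithmetic above is easy to rerun for your own traffic. The sketch below reuses the figures from the text (10k req/s, 5 spans per request) and adds a 10% sampling ratio for comparison; exportedSpansPerSecond is a hypothetical helper.

```go
package main

import "fmt"

// exportedSpansPerSecond estimates how many spans per second a service sends
// to the tracing backend for a given request rate, spans per request, and
// sampling ratio (1.0 = every trace is kept).
func exportedSpansPerSecond(reqPerSec, spansPerReq, sampleRatio float64) float64 {
	return reqPerSec * spansPerReq * sampleRatio
}

func main() {
	const secondsPerDay = 86_400

	full := exportedSpansPerSecond(10_000, 5, 1.0)
	sampled := exportedSpansPerSecond(10_000, 5, 0.1)

	fmt.Printf("100%% sampling: %.0f spans/s, %.2e spans/day\n", full, full*secondsPerDay)
	fmt.Printf(" 10%% sampling: %.0f spans/s, %.2e spans/day\n", sampled, sampled*secondsPerDay)
}
```

At 100% sampling the example service emits about 4.3 billion spans per day; at 10% it is about 430 million, which is the number to compare against the backend's per-volume pricing.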

Mitigation:

Use sampling (see above) — start with 10% (TraceIDRatioBased(0.1)) and adjust based on traffic volume and budget
TraceIDRatioBased is head-based sampling: the decision is made at trace start, before the outcome is known. For high-throughput services, consider tail-based sampling instead (decide after the trace completes, keeping only interesting traces such as errors or slow requests)
Avoid attaching large payloads as span attributes — log them instead and correlate via trace_id

How it compares

Use golang-observability for production telemetry; use golang-benchmark when writing Go 1.24 microbenchmarks to prove performance regressions.

FAQ

Does golang-observability replace pprof for Go performance work?

golang-observability includes continuous profiling with pprof and Pyroscope for always-on production signals, but the skill explicitly excludes temporary deep-dive performance investigations. Use it for steady-state monitoring, not one-off benchmark hunts.

Which loggers does golang-observability migrate to slog?

golang-observability documents migration from legacy Go loggers including zap, logrus, and zerolog to slog while preserving structured fields and correlating logs with OpenTelemetry trace IDs in production services.

Is Golang Observability safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
