
Python Observability
Instrument Python services with Prometheus-style metrics, decorators, and the four golden signals so solo builders can see latency, traffic, errors, and saturation in production.
Overview
Python Observability is an agent skill most often used in Operate (also Build backend, Ship perf) that teaches Prometheus metrics, golden signals, and request instrumentation for Python services.
Install
npx skills add https://github.com/wshobson/agents --skill python-observability
What is this skill?
- Documents the Four Golden Signals (latency, traffic, errors, saturation) with Prometheus client types
- Includes Histogram, Counter, and Gauge examples with label dimensions for method, endpoint, and status
- Provides a track_request async decorator pattern for automatic request timing and error counting
- Covers DB connection pool saturation as a Gauge for resource pressure visibility
- Four Golden Signals framework (latency, traffic, errors, saturation)
- Histogram bucket set with 10 explicit upper bounds in the example
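To make the bucket bullet concrete, here is a small stdlib-only sketch of how observations land in cumulative histogram buckets. The bounds are the ten from the skill's example; the helper name `cumulative_bucket_counts` and the sample latencies are made up for illustration, and the sketch only mimics the "less than or equal" rule that Prometheus histograms use rather than calling the client library.

```python
from bisect import bisect_left

# The ten explicit upper bounds (seconds) from the skill's Histogram example.
BUCKETS = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


def cumulative_bucket_counts(samples, bounds=BUCKETS):
    """Return {upper_bound: number of samples <= upper_bound}, plus "+Inf"."""
    per_bucket = [0] * (len(bounds) + 1)  # last slot catches values above every bound
    for value in samples:
        # bisect_left returns the index of the first bound >= value,
        # which is the smallest bucket whose "le" bound includes the sample.
        per_bucket[bisect_left(bounds, value)] += 1

    counts = {}
    running = 0
    for bound, n in zip(bounds, per_bucket):
        running += n
        counts[bound] = running
    counts["+Inf"] = running + per_bucket[-1]
    return counts


# Hypothetical request latencies in seconds.
samples = [0.004, 0.03, 0.1, 0.7, 12.0]
counts = cumulative_bucket_counts(samples)
```

If a large share of real traffic ends up only in the `+Inf` bucket, or everything falls under the first bound, the bounds probably need adjusting for that service.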
Adoption & trust: 7.1k installs on skills.sh; 36.5k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your Python API is in production but you cannot answer how slow, how busy, or how error-prone each endpoint is.
Who is it for?
Solo builders shipping Python async APIs who want copy-paste Golden Signals patterns without reading entire observability vendor docs first.
Skip if: Teams that only need browser RUM or log-only debugging with no metrics pipeline, or non-Python stacks.
When should I use this skill?
You are instrumenting Python HTTP or async services and need Prometheus metrics patterns for production monitoring.
What do I get? / Deliverables
You get consistent Prometheus metrics and decorator-based instrumentation so dashboards and alerts can reflect latency, traffic, errors, and saturation.
- Metric definitions (Histogram, Counter, Gauge) with label schemas
- Instrumented endpoint or decorator wrapper emitting request and error series
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Operate / monitoring is the canonical shelf because the worked examples center on production metrics and pool saturation gauges, even though you instrument during Build. The Monitoring subphase matches the Golden Signals, the Histogram/Counter/Gauge patterns, and endpoint instrumentation for live reliability.
Where it fits
Add Histogram and Counter definitions when scaffolding a new async API module before first deploy.
Validate bucket choices and error labels before launch so perf regressions show up in metrics, not only user complaints.
Wire DB pool Gauges and error counters when investigating saturation during an on-call spike.
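As a rough illustration of the pool-saturation step, the sketch below wires a Gauge to a toy connection pool. It assumes `prometheus_client` is installed. `TrackedPool` and the `db_connection_pool_size` metric are hypothetical stand-ins, not part of the skill; a real pool would expose its own checkout hooks.

```python
from contextlib import contextmanager

from prometheus_client import CollectorRegistry, Gauge

# A private registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

DB_POOL_USAGE = Gauge(
    "db_connection_pool_used",
    "Number of database connections in use",
    registry=registry,
)
DB_POOL_SIZE = Gauge(
    "db_connection_pool_size",  # hypothetical companion metric
    "Configured maximum number of database connections",
    registry=registry,
)


class TrackedPool:
    """Hypothetical stand-in for a real connection pool."""

    def __init__(self, size: int):
        self.size = size
        self.in_use = 0
        DB_POOL_SIZE.set(size)

    @contextmanager
    def connection(self):
        if self.in_use >= self.size:
            raise RuntimeError("pool exhausted")
        self.in_use += 1
        DB_POOL_USAGE.inc()  # one more connection checked out
        try:
            yield object()  # placeholder for a real connection object
        finally:
            self.in_use -= 1
            DB_POOL_USAGE.dec()  # connection returned to the pool


pool = TrackedPool(size=2)
with pool.connection():
    with pool.connection():
        peak = registry.get_sample_value("db_connection_pool_used")
after = registry.get_sample_value("db_connection_pool_used")
```

Exporting the configured size next to the in-use count lets a dashboard show saturation as a ratio instead of a raw number, which tends to be easier to alert on.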
How it compares
Skill package for in-repo instrumentation patterns, not a hosted APM product or MCP metrics server.
Common Questions / FAQ
Who is python-observability for?
Indie developers and small teams running Python HTTP services who need Prometheus-friendly metrics during build-out and production operations.
When should I use python-observability?
Use it while building backend routes (Build) to add instrumentation early, and in Operate when tuning alerts and diagnosing saturation under real traffic.
Is python-observability safe to install?
It describes code patterns only; review the Security Audits panel on this page and avoid exposing metric endpoints without auth on public deployments.
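One possible way to keep a metrics endpoint from being publicly readable is a small WSGI wrapper that requires a bearer token. This is a stdlib-only sketch, not something the skill itself provides: `require_bearer_token`, `demo_metrics_app`, `call`, and the token value are all hypothetical names chosen for illustration. In a real service the wrapped app could be any WSGI metrics exporter, and the token would come from configuration or a secret store.

```python
import hmac


def require_bearer_token(app, token: str, path: str = "/metrics"):
    """Wrap a WSGI app so requests to `path` need 'Authorization: Bearer <token>'."""
    expected = f"Bearer {token}".encode()

    def guarded(environ, start_response):
        if environ.get("PATH_INFO", "") == path:
            supplied = environ.get("HTTP_AUTHORIZATION", "").encode()
            # Constant-time comparison avoids leaking the token through timing.
            if not hmac.compare_digest(supplied, expected):
                start_response(
                    "401 Unauthorized",
                    [("Content-Type", "text/plain"), ("WWW-Authenticate", "Bearer")],
                )
                return [b"unauthorized\n"]
        return app(environ, start_response)

    return guarded


def demo_metrics_app(environ, start_response):
    """Stand-in for a real metrics exporter app."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"http_requests_total 1.0\n"]


def call(app, path, auth=None):
    """Invoke a WSGI app in-process and return (status, body)."""
    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    if auth is not None:
        environ["HTTP_AUTHORIZATION"] = auth
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


protected = require_bearer_token(demo_metrics_app, token="s3cret")
```

Calling `call(protected, "/metrics")` without a header yields a 401, while passing `"Bearer s3cret"` reaches the wrapped app. Network-level controls such as binding the exporter to a private interface are a common alternative.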
SKILL.md - Python Observability
# python-observability - detailed worked examples

## Advanced Patterns

### Pattern 5: The Four Golden Signals with Prometheus

Track these metrics for every service boundary:

```python
from prometheus_client import Counter, Histogram, Gauge

# Latency: How long requests take
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["method", "endpoint", "status"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
)

# Traffic: Request rate
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)

# Errors: Error rate
ERROR_COUNT = Counter(
    "http_errors_total",
    "Total HTTP errors",
    ["method", "endpoint", "error_type"],
)

# Saturation: Resource utilization
DB_POOL_USAGE = Gauge(
    "db_connection_pool_used",
    "Number of database connections in use",
)
```

Instrument your endpoints:

```python
import time
from functools import wraps

# Assumption: a Starlette/FastAPI-style Request; swap the import for your framework.
from starlette.requests import Request


def track_request(func):
    """Decorator to track request metrics."""

    @wraps(func)
    async def wrapper(request: Request, *args, **kwargs):
        method = request.method
        # Raw paths that embed IDs are unbounded; prefer a route template if available.
        endpoint = request.url.path
        # Default status so `finally` always has a value, even on cancellation.
        status = "500"
        start = time.perf_counter()
        try:
            response = await func(request, *args, **kwargs)
            status = str(response.status_code)
            return response
        except Exception as e:
            ERROR_COUNT.labels(
                method=method,
                endpoint=endpoint,
                error_type=type(e).__name__,
            ).inc()
            raise
        finally:
            duration = time.perf_counter() - start
            REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()
            REQUEST_LATENCY.labels(method=method, endpoint=endpoint, status=status).observe(duration)

    return wrapper
```

### Pattern 6: Bounded Cardinality

Avoid labels with unbounded values to prevent metric explosion.

```python
# BAD: User ID has potentially millions of values
REQUEST_COUNT.labels(method="GET", user_id=user.id)  # Don't do this!

# GOOD: Bounded values only
REQUEST_COUNT.labels(method="GET", endpoint="/users", status="200")

# If you need per-user metrics, use a different approach:
# - Log the user_id and query logs
# - Use a separate analytics system
# - Bucket users by type/tier
#   (the counter must be declared with a matching `user_tier` label)
REQUEST_COUNT.labels(
    method="GET",
    endpoint="/users",
    user_tier="premium",  # Bounded set of values
)
```

### Pattern 7: Timed Operations with Context Manager

Create a reusable timing context manager for operations.

```python
from contextlib import contextmanager
import time

import structlog

logger = structlog.get_logger()


@contextmanager
def timed_operation(name: str, **extra_fields):
    """Context manager for timing and logging operations."""
    start = time.perf_counter()
    logger.debug("Operation started", operation=name, **extra_fields)
    try:
        yield
    except Exception as e:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.error(
            "Operation failed",
            operation=name,
            duration_ms=round(elapsed_ms, 2),
            error=str(e),
            **extra_fields,
        )
        raise
    else:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "Operation completed",
            operation=name,
            duration_ms=round(elapsed_ms, 2),
            **extra_fields,
        )


# Usage (inside an async function)
with timed_operation("fetch_user_orders", user_id=user.id):
    orders = await order_repository.get_by_user(user.id)
```

### Pattern 8: OpenTelemetry Tracing

Set up distributed tracing with OpenTelemetry.

**Note:** OpenTelemetry is actively evolving. Check the [official Python documentation](https://opentelemetry.io/docs/languages/python/) for the latest API patterns and best practices.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import Tr