
Python Resilience
Add retries, timeouts, and structured retry logging to Python services and agent tools that call flaky APIs or databases.
Overview
python-resilience is an agent skill most often used in Build (also Operate) that documents Python retry, timeout, and retry-logging patterns for unreliable dependencies.
Install
npx skills add https://github.com/wshobson/agents --skill python-resilience
What is this skill?
- Tenacity retries with exponential backoff and stop conditions
- before_sleep hooks for structured logging of each retry attempt
- Reusable async timeout decorators via asyncio.wait_for
- Worked examples for observability-friendly failure handling
- 6+ advanced pattern examples including logging and timeout decorators
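You can work out by hand how exponential backoff and a stop condition interact. The standard-library sketch below reproduces the wait schedule. It assumes tenacity's `wait_exponential` sleeps `multiplier * exp_base ** (attempt_number - 1)` seconds after a failed attempt and clamps that value to `[min, max]`. `backoff_schedule` is a hypothetical helper, not part of tenacity.

```python
def backoff_schedule(
    attempts: int,
    multiplier: float = 1.0,
    exp_base: float = 2.0,
    min_wait: float = 0.0,
    max_wait: float = 10.0,
) -> list[float]:
    """Return the sleeps taken between `attempts` tries (hypothetical helper).

    Assumes the wait after failed attempt n is
    multiplier * exp_base ** (n - 1), clamped to [min_wait, max_wait].
    A stop-after-N-attempts condition means only N - 1 sleeps happen.
    """
    waits = []
    for attempt_number in range(1, attempts):  # no sleep after the final attempt
        raw = multiplier * exp_base ** (attempt_number - 1)
        waits.append(max(max(0.0, min_wait), min(raw, max_wait)))
    return waits


# Three attempts with multiplier=1, max=10: two sleeps of 1s and 2s.
print(backoff_schedule(3))  # [1.0, 2.0]
# Seven attempts: the 10s cap takes over from the fifth sleep onward.
print(backoff_schedule(7))  # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

Under that assumption, three attempts with `multiplier=1, max=10` sleep about 3 seconds in total. The `max` cap only takes effect from the fifth sleep, which requires six or more attempts.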
Adoption & trust: 6.8k installs on skills.sh; 36.5k GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your Python service calls external APIs or async work without consistent retries, timeouts, or visibility into repeated failures.
Who is it for?
Solo builders writing Python APIs, workers, or agent backends that integrate with third-party services and need standard resilience without reinventing backoff logic.
Skip if: Teams that only need static infra-level circuit breakers with no application code changes, or non-Python stacks where these patterns do not apply.
When should I use this skill?
User is implementing or refactoring Python code that calls external services and needs retries, timeouts, or retry observability.
What do I get? / Deliverables
You adopt tenacity and asyncio timeout patterns with optional structured retry logs so flaky dependencies fail predictably and are easier to debug in production.
- Retry and timeout decorator implementations adapted to your call sites
- Structured retry logging hooks for operations debugging
Journey fit
Spans multiple journey phases: Build is the primary shelf, with Operate as an alternate fit.
Resilience patterns are implemented while writing backend and integration code, before those paths ever reach production traffic. The skill centers on Python decorators and libraries (tenacity, asyncio timeouts, structlog) for external calls and async work—the core backend integration surface.
Where it fits
Wrap a payment or LLM API client with tenacity retries before merging your integration PR.
Add asyncio timeouts so agent tool calls cannot block the whole worker under slow upstreams.
Turn on retry attempt logging to correlate outage spikes with specific exception types in logs.
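The timeout point above needs nothing beyond `asyncio`. In this sketch, `asyncio.wait_for` bounds each tool call. A slow upstream raises a timeout and is cancelled, while the other calls still finish. `fake_tool`, the tool names, and the delays are hypothetical stand-ins for real upstream calls.

```python
import asyncio


async def fake_tool(name: str, delay: float) -> str:
    """Hypothetical tool call whose latency is simulated with sleep."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_tool(name: str, delay: float, timeout: float) -> str:
    """Bound a single tool call; report a timeout instead of hanging."""
    try:
        return await asyncio.wait_for(fake_tool(name, delay), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the inner call before raising, so nothing keeps running.
        return f"{name}: timed out after {timeout}s"


async def main() -> list[str]:
    # Run the calls concurrently; the slow one cannot hold the others past its timeout.
    return await asyncio.gather(
        run_tool("search", 0.01, timeout=0.5),
        run_tool("weather", 5.0, timeout=0.1),
        run_tool("calendar", 0.02, timeout=0.5),
    )


results = asyncio.run(main())
print(results)
# ['search: done', 'weather: timed out after 0.1s', 'calendar: done']
```

One design consequence: `wait_for` bounds the entire awaited call. Wrapping a tenacity-decorated coroutine therefore limits all retry attempts and their sleeps combined, not each attempt individually.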
How it compares
Use for in-process Python retry/timeout decorators instead of ad-hoc sleep loops or untyped except-and-retry blocks.
Common Questions / FAQ
Who is python-resilience for?
Indie developers and small teams shipping Python backends, CLIs, or agent tools that must survive rate limits, network blips, and slow upstreams.
When should I use python-resilience?
During Build when wiring HTTP or async clients; before Ship when hardening integration paths; in Operate when you need logged retry behavior for incident triage.
Is python-resilience safe to install?
Review the Security Audits panel on this Prism page and inspect the skill package in your repo before granting your agent shell or network access.
SKILL.md - Python Resilience
# python-resilience: detailed worked examples

## Advanced Patterns

### Pattern 5: Logging Retry Attempts

Track retry behavior for debugging and alerting.

```python
from tenacity import retry, stop_after_attempt, wait_exponential
import structlog

logger = structlog.get_logger()


def log_retry_attempt(retry_state):
    """Log detailed retry information (used as a tenacity before_sleep hook)."""
    exception = retry_state.outcome.exception()
    logger.warning(
        "Retrying operation",
        attempt=retry_state.attempt_number,
        exception_type=type(exception).__name__,
        exception_message=str(exception),
        next_wait_seconds=retry_state.next_action.sleep if retry_state.next_action else None,
    )


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=10),
    before_sleep=log_retry_attempt,  # runs after each failed attempt, before sleeping
)
def call_with_logging(request: dict) -> dict:
    """External call with retry logging."""
    ...
```

### Pattern 6: Timeout Decorator

Create reusable timeout decorators for consistent timeout handling.

```python
import asyncio
from functools import wraps
from typing import Awaitable, Callable, TypeVar

import httpx

T = TypeVar("T")


def with_timeout(seconds: float):
    """Decorator to add a timeout to async functions."""

    def decorator(func: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        @wraps(func)
        async def wrapper(*args, **kwargs) -> T:
            # Cancels func and raises asyncio.TimeoutError (TimeoutError on 3.11+) on overrun.
            return await asyncio.wait_for(
                func(*args, **kwargs),
                timeout=seconds,
            )

        return wrapper

    return decorator


@with_timeout(30)
async def fetch_with_timeout(url: str) -> dict:
    """Fetch URL with 30 second timeout."""
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.json()
```

### Pattern 7: Cross-Cutting Concerns via Decorators

Stack decorators to separate infrastructure from business logic.
```python
from functools import wraps
from typing import Awaitable, Callable, TypeVar

import structlog
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

# with_timeout is the decorator defined in Pattern 6.

logger = structlog.get_logger()
T = TypeVar("T")


def traced(name: str | None = None):
    """Add tracing to function calls."""

    def decorator(func: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        span_name = name or func.__name__

        @wraps(func)
        async def wrapper(*args, **kwargs) -> T:
            logger.info("Operation started", operation=span_name)
            try:
                result = await func(*args, **kwargs)
                logger.info("Operation completed", operation=span_name)
                return result
            except Exception as e:
                logger.error("Operation failed", operation=span_name, error=str(e))
                raise

        return wrapper

    return decorator


# Stack multiple concerns. Decorators apply bottom-up, so the 30 s timeout
# bounds all retry attempts together, and tracing wraps everything.
@traced("fetch_user_data")
@with_timeout(30)
@retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter())
async def fetch_user_data(user_id: str) -> dict:
    """Fetch user with tracing, timeout, and retry."""
    ...
```

### Pattern 8: Dependency Injection for Testability

Pass infrastructure components through constructors for easy testing.

```python
import time
from dataclasses import dataclass
from typing import Protocol


class Logger(Protocol):
    def info(self, msg: str, **kwargs) -> None: ...
    def error(self, msg: str, **kwargs) -> None: ...


class MetricsClient(Protocol):
    def increment(self, metric: str, tags: dict | None = None) -> None: ...
    def timing(self, metric: str, value: float) -> None: ...


# User and UserRepository are domain types the example assumes; minimal stand-ins.
@dataclass
class User:
    id: str


class UserRepository(Protocol):
    async def get(self, user_id: str) -> User: ...


@dataclass
class UserService:
    """Service with injected infrastructure."""

    repository: UserRepository
    logger: Logger
    metrics: MetricsClient

    async def get_user(self, user_id: str) -> User:
        self.logger.info("Fetching user", user_id=user_id)
        start = time.perf_counter()
        try:
            user = await self.repository.get(user_id)
            self.metrics.increment("user.fetch.success")
            return user
        except Exception as e:
            self.metrics.increment("user.fetch.error")
            # The source example is truncated here; re-raising is the minimal
            # completion that keeps the failure visible to callers.
            raise
```