Python Resilience

Name: Python Resilience
Author: wshobson

wshobson/agents

8.6k installs
38.3k repo stars
Updated July 22, 2026

python-resilience is an agent skill that covers Python resilience patterns including automatic retries, exponential backoff, timeouts, and fault-tolerant decorators. Use when adding retry logic, implementing timeouts, building fault-tolerant services, or handling transient failures.

About

Python resilience patterns including automatic retries, exponential backoff, timeouts, and fault-tolerant decorators. Use when adding retry logic, implementing timeouts, building fault-tolerant services, or handling transient failures.

Python Resilience by the numbers

8,553 all-time installs (skills.sh)
+218 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #80 of 1,041 Cloud & Infrastructure skills by installs in the Skillselion catalog
Security screen: CRITICAL risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

python-resilience capabilities & compatibility

Capabilities: python resilience patterns · adding retry logic to external service calls · implementing timeouts for network operations · building fault tolerant microservices · handling rate limiting and backpressure
Use cases: documentation

npx skills add https://github.com/wshobson/agents --skill python-resilience

Security audit: 1 / 3 scanners passed

Who is it for?

Developers who need the Python resilience patterns described in the skill documentation.

Skip if: the docs are empty or the task is outside the skill's documented scope.

What you get

Actionable workflows and conventions from SKILL.md for python-resilience.

retry decorators
logging callbacks
timeout guards

Files

SKILL.md (Markdown)

Python Resilience Patterns

Build fault-tolerant Python applications that gracefully handle transient failures, network issues, and service outages. Resilience patterns keep systems running when dependencies are unreliable.

When to Use This Skill

Adding retry logic to external service calls
Implementing timeouts for network operations
Building fault-tolerant microservices
Handling rate limiting and backpressure
Creating infrastructure decorators
Designing circuit breakers

Core Concepts

1. Transient vs Permanent Failures

Retry transient errors (network timeouts, temporary service issues). Don't retry permanent errors (invalid credentials, bad requests).
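The distinction can be written down as a small classifier. This is a minimal sketch using only the standard library; `AuthenticationError` is a hypothetical class standing in for whatever your client raises on bad credentials, and both tuples are assumptions to extend for a real service.

```python
# Hypothetical exception standing in for a permanent failure (not part of the skill).
class AuthenticationError(Exception):
    """Credentials were rejected; retrying cannot help."""


# Transient: likely to succeed if attempted again after a pause.
TRANSIENT = (ConnectionError, TimeoutError)
# Permanent: retrying repeats the same failure.
PERMANENT = (AuthenticationError, ValueError, TypeError)


def is_transient(exc: BaseException) -> bool:
    """Return True only for errors that are worth retrying."""
    if isinstance(exc, PERMANENT):
        return False
    return isinstance(exc, TRANSIENT)
```

Unknown exception types are treated as non-retryable here, which matches the whitelist approach recommended later in the document.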

2. Exponential Backoff

Increase wait time between retries to avoid overwhelming recovering services.
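To make the schedule concrete, here is a small sketch that computes the wait before each retry. The function name and default values are illustrative assumptions, not part of the skill or of tenacity.

```python
def backoff_delays(
    initial: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 30.0,
    attempts: int = 5,
) -> list[float]:
    """Delay before each retry: initial * factor**n, capped at max_delay."""
    return [min(initial * factor ** n, max_delay) for n in range(attempts)]


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

The cap matters once the exponent grows: with the defaults above, the sixth and later delays would all be 30.0 seconds.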

3. Jitter

Add randomness to backoff to prevent thundering herd when many clients retry simultaneously.
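One common way to add that randomness is "full jitter", where each client picks a uniformly random delay between zero and the capped exponential delay. The sketch below assumes that variant; tenacity's `wait_exponential_jitter` may compute its jitter differently, and the function name and defaults here are hypothetical.

```python
import random
from typing import Optional


def jittered_delay(
    attempt: int,
    initial: float = 1.0,
    max_delay: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Pick a random delay in [0, capped exponential delay] for this attempt."""
    source = rng if rng is not None else random
    ceiling = min(initial * 2 ** attempt, max_delay)
    return source.uniform(0, ceiling)
```

Passing a seeded `random.Random` makes the delays reproducible in tests, while production callers can omit `rng` and use the module-level generator.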

4. Bounded Retries

Cap both attempt count and total duration to prevent infinite retry loops.
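A hand-rolled loop shows both bounds working together. This is a sketch under stated assumptions: `retry_bounded` and its parameters are hypothetical names, the retryable exceptions are limited to two built-ins, and `sleep` and `clock` are injectable only so the loop can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_bounded(
    func: Callable[[], T],
    max_attempts: int = 5,
    max_seconds: float = 60.0,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call func until it succeeds, attempts run out, or the deadline passes."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = clock() + max_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            # Give up when either bound is hit and re-raise the last error.
            if attempt == max_attempts or clock() + delay > deadline:
                raise
            sleep(delay)
    raise RuntimeError("unreachable")
```

The deadline check happens before sleeping, so the loop never starts a wait that would end after the time budget.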

Quick Start

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

# Up to 3 attempts in total, waiting with exponential backoff plus jitter (capped at 10 s).
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential_jitter(initial=1, max=10),
)
def call_external_service(request: dict) -> dict:
    return httpx.post("https://api.example.com", json=request, timeout=10).json()

Fundamental Patterns

Pattern 1: Basic Retry with Tenacity

Use the tenacity library for production-grade retry logic. For simpler cases, consider the retry support built into the client library you already use, or a lightweight custom implementation.

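import httpx
from tenacity import (
    retry,
    stop_after_attempt,
    stop_after_delay,
    wait_exponential_jitter,
    retry_if_exception_type,
)

# httpx raises its own exception hierarchy rather than the built-ins, so
# httpx.TransportError is listed to cover connection failures and timeouts.
# OSError is left out because it also matches permanent errors such as
# FileNotFoundError.
TRANSIENT_ERRORS = (httpx.TransportError, ConnectionError, TimeoutError)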

@retry(
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
    stop=stop_after_attempt(5) | stop_after_delay(60),
    wait=wait_exponential_jitter(initial=1, max=30),
)
def fetch_data(url: str) -> dict:
    """Fetch data with automatic retry on transient failures."""
    response = httpx.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

Pattern 2: Retry Only Appropriate Errors

Whitelist specific transient exceptions. Never retry:

ValueError, TypeError - These are bugs, not transient issues
AuthenticationError - Invalid credentials won't become valid
HTTP 4xx errors (except 429) - Client errors are permanent

from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)
import httpx

# Define what's retryable
RETRYABLE_EXCEPTIONS = (
    ConnectionError,
    TimeoutError,
    httpx.ConnectTimeout,
    httpx.ReadTimeout,
)

@retry(
    retry=retry_if_exception_type(RETRYABLE_EXCEPTIONS),
    stop=stop_after_attempt(3),
    wait=wait_exponential_jitter(initial=1, max=10),
)
def resilient_api_call(endpoint: str) -> dict:
    """Make API call with retry on network issues."""
    return httpx.get(endpoint, timeout=10).json()

Pattern 3: HTTP Status Code Retries

Retry specific HTTP status codes that indicate transient issues.

from tenacity import retry, retry_if_result, stop_after_attempt, wait_exponential_jitter
import httpx

RETRY_STATUS_CODES = {429, 502, 503, 504}

def should_retry_response(response: httpx.Response) -> bool:
    """Check if response indicates a retryable error."""
    return response.status_code in RETRY_STATUS_CODES

@retry(
    retry=retry_if_result(should_retry_response),
    stop=stop_after_attempt(3),
    wait=wait_exponential_jitter(initial=1, max=10),
)
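def http_request(method: str, url: str, **kwargs) -> httpx.Response:
    """Make HTTP request with retry on transient status codes.

    If the final attempt still returns a retryable status, tenacity raises
    tenacity.RetryError instead of returning that response.
    """
    return httpx.request(method, url, timeout=30, **kwargs)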
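For 429 and 503 responses, servers may send a standard `Retry-After` header containing either a number of seconds or an HTTP date. The sketch below parses both forms with the standard library only. The function name is hypothetical, and whether a given API sends this header is an assumption to verify against its documentation.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    header_value: Optional[str],
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Convert a Retry-After header value to seconds, or None if unusable."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        # Delay expressed directly in seconds.
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Not a valid HTTP date either.
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    # A date in the past means no further waiting is requested.
    return max(0.0, (when - current).total_seconds())
```

A caller could use the returned value as a lower bound on the next wait and fall back to the usual backoff schedule when it is None.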

Pattern 4: Combined Exception and Status Retry

Handle both network exceptions and HTTP status codes.

from tenacity import (
    retry,
    retry_if_exception_type,
    retry_if_result,
    stop_after_attempt,
    wait_exponential_jitter,
    before_sleep_log,
)
import logging
import httpx

logger = logging.getLogger(__name__)

TRANSIENT_EXCEPTIONS = (
    ConnectionError,
    TimeoutError,
    httpx.ConnectError,
    httpx.ReadTimeout,
)
RETRY_STATUS_CODES = {429, 500, 502, 503, 504}

def is_retryable_response(response: httpx.Response) -> bool:
    return response.status_code in RETRY_STATUS_CODES

@retry(
    retry=(
        retry_if_exception_type(TRANSIENT_EXCEPTIONS) |
        retry_if_result(is_retryable_response)
    ),
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=1, max=30),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def robust_http_call(
    method: str,
    url: str,
    **kwargs,
) -> httpx.Response:
    """HTTP call with comprehensive retry handling."""
    return httpx.request(method, url, timeout=30, **kwargs)

Detailed worked examples and patterns

Detailed sections (starting with ## Advanced Patterns) live in references/details.md. Read that file when the navigation summary above is insufficient.

Best Practices Summary

1. Retry only transient errors - Don't retry bugs or authentication failures
2. Use exponential backoff - Give services time to recover
3. Add jitter - Prevent thundering herd from synchronized retries
4. Cap total duration - stop_after_attempt(5) | stop_after_delay(60)
5. Log every retry - Silent retries hide systemic problems
6. Use decorators - Keep retry logic separate from business logic
7. Inject dependencies - Make infrastructure testable
8. Set timeouts everywhere - Every network call needs a timeout
9. Fail gracefully - Return cached/default values for non-critical paths
10. Monitor retry rates - High retry rates indicate underlying issues

python-resilience — detailed worked examples

Advanced Patterns

Pattern 5: Logging Retry Attempts

Track retry behavior for debugging and alerting.

from tenacity import retry, stop_after_attempt, wait_exponential
import structlog

logger = structlog.get_logger()

def log_retry_attempt(retry_state):
    """Log detailed retry information."""
    exception = retry_state.outcome.exception()
    logger.warning(
        "Retrying operation",
        attempt=retry_state.attempt_number,
        exception_type=type(exception).__name__,
        exception_message=str(exception),
        next_wait_seconds=retry_state.next_action.sleep if retry_state.next_action else None,
    )

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=10),
    before_sleep=log_retry_attempt,
)
def call_with_logging(request: dict) -> dict:
    """External call with retry logging."""
    ...

Pattern 6: Timeout Decorator

Create reusable timeout decorators for consistent timeout handling.

import asyncio
from functools import wraps
from typing import TypeVar, Callable

import httpx

T = TypeVar("T")

def with_timeout(seconds: float):
    """Decorator to add timeout to async functions."""
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        async def wrapper(*args, **kwargs) -> T:
            return await asyncio.wait_for(
                func(*args, **kwargs),
                timeout=seconds,
            )
        return wrapper
    return decorator

@with_timeout(30)
async def fetch_with_timeout(url: str) -> dict:
    """Fetch URL with 30 second timeout."""
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.json()
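The behaviour of the timeout decorator can be checked without any network access. The sketch below repeats the decorator so it runs on its own and applies it to two hypothetical coroutines that only sleep; the names and durations are illustrative assumptions.

```python
import asyncio
from functools import wraps


def with_timeout(seconds: float):
    """Decorator to add a timeout to async functions (same shape as above)."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            return await asyncio.wait_for(func(*args, **kwargs), timeout=seconds)
        return wrapper
    return decorator


@with_timeout(0.05)
async def slow_operation() -> str:
    await asyncio.sleep(1)  # Longer than the 0.05 s budget, so it is cancelled.
    return "done"


@with_timeout(0.5)
async def fast_operation() -> str:
    await asyncio.sleep(0)  # Finishes well inside the budget.
    return "done"


async def run_demo() -> tuple[str, str]:
    try:
        slow = await slow_operation()
    except asyncio.TimeoutError:
        slow = "timed out"
    fast = await fast_operation()
    return slow, fast


print(asyncio.run(run_demo()))  # ('timed out', 'done')
```

When the budget is exceeded, `asyncio.wait_for` cancels the wrapped coroutine and raises `asyncio.TimeoutError` to the caller.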

Pattern 7: Cross-Cutting Concerns via Decorators

Stack decorators to separate infrastructure from business logic.

from functools import wraps
from typing import TypeVar, Callable
import structlog
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

logger = structlog.get_logger()
T = TypeVar("T")

def traced(name: str | None = None):
    """Add tracing to function calls."""
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        span_name = name or func.__name__

        @wraps(func)
        async def wrapper(*args, **kwargs) -> T:
            logger.info("Operation started", operation=span_name)
            try:
                result = await func(*args, **kwargs)
                logger.info("Operation completed", operation=span_name)
                return result
            except Exception as e:
                logger.error("Operation failed", operation=span_name, error=str(e))
                raise
        return wrapper
    return decorator

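# Stack multiple concerns; with_timeout is the decorator defined in Pattern 6.
# Decorators apply bottom-up, so the 30-second timeout bounds all retry attempts together.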
@traced("fetch_user_data")
@with_timeout(30)
@retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter())
async def fetch_user_data(user_id: str) -> dict:
    """Fetch user with tracing, timeout, and retry."""
    ...
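The order of stacked decorators changes what the timeout covers. The sketch below uses a hypothetical `simple_retry` decorator in place of tenacity so that it runs with the standard library alone. It compares a timeout applied inside the retry (one budget per attempt) with a timeout applied outside it (one budget for all attempts).

```python
import asyncio
from functools import wraps


def with_timeout(seconds: float):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            return await asyncio.wait_for(func(*args, **kwargs), timeout=seconds)
        return wrapper
    return decorator


def simple_retry(attempts: int):
    """Hypothetical stand-in for a retry decorator: retries only on timeouts."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except asyncio.TimeoutError:
                    if attempt == attempts:
                        raise
        return wrapper
    return decorator


calls = {"per_attempt": 0, "overall": 0}


@simple_retry(3)
@with_timeout(0.02)  # Inside the retry: each attempt gets its own budget.
async def per_attempt_budget() -> None:
    calls["per_attempt"] += 1
    await asyncio.sleep(1)


@with_timeout(0.02)  # Outside the retry: one budget shared by all attempts.
@simple_retry(3)
async def overall_budget() -> None:
    calls["overall"] += 1
    await asyncio.sleep(1)


async def run_demo() -> dict:
    for operation in (per_attempt_budget, overall_budget):
        try:
            await operation()
        except asyncio.TimeoutError:
            pass
    return dict(calls)


print(asyncio.run(run_demo()))  # {'per_attempt': 3, 'overall': 1}
```

With the timeout outermost, the first slow attempt consumes the whole budget and no retry happens, which is the ordering used in the stacked example above.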

Pattern 8: Dependency Injection for Testability

Pass infrastructure components through constructors for easy testing.

import time
from dataclasses import dataclass
from typing import Protocol

# UserRepository and User are assumed to be defined elsewhere in the application.

class Logger(Protocol):
    def info(self, msg: str, **kwargs) -> None: ...
    def error(self, msg: str, **kwargs) -> None: ...

class MetricsClient(Protocol):
    def increment(self, metric: str, tags: dict | None = None) -> None: ...
    def timing(self, metric: str, value: float) -> None: ...

@dataclass
class UserService:
    """Service with injected infrastructure."""

    repository: UserRepository
    logger: Logger
    metrics: MetricsClient

    async def get_user(self, user_id: str) -> User:
        self.logger.info("Fetching user", user_id=user_id)
        start = time.perf_counter()

        try:
            user = await self.repository.get(user_id)
            self.metrics.increment("user.fetch.success")
            return user
        except Exception as e:
            self.metrics.increment("user.fetch.error")
            self.logger.error("Failed to fetch user", user_id=user_id, error=str(e))
            raise
        finally:
            elapsed = time.perf_counter() - start
            self.metrics.timing("user.fetch.duration", elapsed)

# Easy to test with fakes
service = UserService(
    repository=FakeRepository(),
    logger=FakeLogger(),
    metrics=FakeMetrics(),
)
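The source does not define the fakes it passes in. One possible shape, as a sketch: in-memory classes that satisfy the `Logger` and `MetricsClient` protocols and record every call so a test can assert on them. All class bodies, and the dict-based user store, are assumptions.

```python
import asyncio
from typing import Optional


class FakeLogger:
    """Records log calls in memory instead of emitting them."""

    def __init__(self) -> None:
        self.records = []

    def info(self, msg: str, **kwargs) -> None:
        self.records.append(("info", msg, kwargs))

    def error(self, msg: str, **kwargs) -> None:
        self.records.append(("error", msg, kwargs))


class FakeMetrics:
    """Counts increments and stores timings for later assertions."""

    def __init__(self) -> None:
        self.counters = {}
        self.timings = []

    def increment(self, metric: str, tags: Optional[dict] = None) -> None:
        self.counters[metric] = self.counters.get(metric, 0) + 1

    def timing(self, metric: str, value: float) -> None:
        self.timings.append((metric, value))


class FakeRepository:
    """Serves users from a dict; unknown ids raise LookupError."""

    def __init__(self, users: Optional[dict] = None) -> None:
        self.users = users or {}

    async def get(self, user_id: str):
        if user_id not in self.users:
            raise LookupError(f"no such user: {user_id}")
        return self.users[user_id]


repo = FakeRepository({"u1": {"name": "Ada"}})
metrics = FakeMetrics()
user = asyncio.run(repo.get("u1"))
metrics.increment("user.fetch.success")
print(user, metrics.counters)  # {'name': 'Ada'} {'user.fetch.success': 1}
```

Passed to `UserService`, these let a test call `get_user` and then check `metrics.counters` and `logger.records` directly, without patching any real infrastructure.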

Pattern 9: Fail-Safe Defaults

Degrade gracefully when non-critical operations fail.

from functools import wraps
from typing import TypeVar
from collections.abc import Callable
import structlog

logger = structlog.get_logger()
T = TypeVar("T")

def fail_safe(default: T, log_failure: bool = True):
    """Return default value on failure instead of raising."""
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        async def wrapper(*args, **kwargs) -> T:
            try:
                return await func(*args, **kwargs)
            except Exception as e:
                if log_failure:
                    logger.warning(
                        "Operation failed, using default",
                        function=func.__name__,
                        error=str(e),
                    )
                return default
        return wrapper
    return decorator

@fail_safe(default=[])
async def get_recommendations(user_id: str) -> list[str]:
    """Get recommendations, return empty list on failure."""
    ...
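One caveat with `default=[]`: the same list object is returned on every failure, so a caller that mutates it changes what later callers receive. A hypothetical variant that takes a factory avoids this. The sketch below uses the standard `logging` module instead of structlog so it runs on its own, and the function names are assumptions.

```python
import asyncio
import logging
from functools import wraps
from typing import Callable, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def fail_safe_factory(default_factory: Callable[[], T]):
    """Return a freshly built default on failure instead of raising."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                return await func(*args, **kwargs)
            except Exception as exc:
                logger.warning("%s failed, using default: %s", func.__name__, exc)
                return default_factory()
        return wrapper
    return decorator


@fail_safe_factory(list)
async def get_recommendations(user_id: str) -> list[str]:
    # Simulated outage so the fallback path is exercised.
    raise ConnectionError("recommendation service unavailable")


first = asyncio.run(get_recommendations("u1"))
first.append("mutated by caller")
second = asyncio.run(get_recommendations("u1"))
print(second)  # []
```

For immutable defaults such as `None`, a tuple, or a string, the original `fail_safe(default=...)` form has no such sharing problem.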

How it compares

Use python-resilience over generic error-handling snippets when tenacity plus structlog retry telemetry is the target stack.
