Cost-Aware LLM Pipeline

Most solo builders first need this when wiring Claude or GPT calls into product code during the Build phase; cost control is part of integration design, not an afterthought. The skill covers API client patterns and pipeline composition, so its canonical placement is Build integrations rather than generic PM docs.

Where it fits

Example use

Wire `select_model` thresholds before shipping a document summarizer that mixes short tweets and long PDFs.

Example use

Estimate whether a 30-item batch prototype can stay on Haiku for a demo deadline.

Example use

Adjust routing constants after monthly API spend overshoots your indie budget.

Example use

Grow: Analytics & insights

Use immutable cost trackers per job type to see which user workflows drive the most token spend.

How it compares

Procedural integration patterns for your codebase, not a hosted gateway product or a one-off prompt template skill.

Common Questions / FAQ

Who is cost-aware-llm-pipeline for?

It is for solo and small-team developers implementing Claude or GPT calls who pay per token and need routing and budget discipline baked into their pipeline code.

When should I use cost-aware-llm-pipeline?

Use it in Build integrations when adding LLM features, in Operate infra when tuning production spend after traffic grows, and in Validate prototype when estimating API cost for a batch MVP; also when processing mixed-complexity item lists.

Is cost-aware-llm-pipeline safe to install?

It only documents code patterns, but those patterns assume your app holds API keys; review the Security Audits panel on this Prism page and never commit live secrets into skills or repos.

SKILL.md

# Cost-Aware LLM Pipeline

Patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a composable pipeline.

## When to Activate

- Building applications that call LLM APIs (Claude, GPT, etc.)
- Processing batches of items with varying complexity
- Staying within a budget for API spend
- Optimizing cost without sacrificing quality on complex tasks

## Core Concepts

### 1. Model Routing by Task Complexity

Automatically select cheaper models for simple tasks, reserving expensive models for complex ones.

```python
MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30     # items

def select_model(
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Select model based on task complexity."""
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (3-4x cheaper)
```
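
If the thresholds need tuning, for example after a month of higher-than-expected spend, one option is to pass them in rather than hard-code them. The sketch below is a variant of `select_model`, not part of the skill itself: `RoutingThresholds` and `select_model_with` are hypothetical names, and the model IDs and default values are copied from the block above.

```python
from __future__ import annotations

from dataclasses import dataclass

MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"


@dataclass(frozen=True)
class RoutingThresholds:
    """Hypothetical config object; defaults mirror the constants above."""
    text_chars: int = 10_000
    items: int = 30


def select_model_with(
    thresholds: RoutingThresholds,
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Same routing rule as select_model, with thresholds passed in."""
    if force_model is not None:
        return force_model
    if text_length >= thresholds.text_chars or item_count >= thresholds.items:
        return MODEL_SONNET
    return MODEL_HAIKU


defaults = RoutingThresholds()
relaxed = RoutingThresholds(items=50)  # assumed value, for illustration only

# The comparison is inclusive, so exactly 30 items already routes to Sonnet.
batch_on_defaults = select_model_with(defaults, text_length=500, item_count=30)
# Raising the item threshold keeps the same batch on the cheaper model.
batch_on_relaxed = select_model_with(relaxed, text_length=500, item_count=30)
```

Because the rule uses `>=`, a 30-item batch sits exactly on the boundary: it only stays on Haiku if the item threshold is raised or `force_model` is set.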

### 2. Immutable Cost Tracking

Track cumulative spend with frozen dataclasses. Each call to `add` returns a new tracker and never mutates the existing one.

```python
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return new tracker with added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit
```
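
The tracker stores `cost_usd` but does not say where that number comes from. A minimal sketch of one way to derive it from token counts follows. The price table holds placeholder values chosen for illustration (assumptions, not published pricing), and `estimate_cost_usd` and `would_exceed` are hypothetical helper names.

```python
from __future__ import annotations

# Placeholder prices in USD per million tokens: (input, output).
# These numbers are illustrative assumptions; check your provider's pricing.
PRICES_PER_MTOK: dict[str, tuple[float, float]] = {
    "claude-haiku-4-5-20251001": (1.00, 5.00),
    "claude-sonnet-4-6": (3.00, 15.00),
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Convert token counts to dollars using the placeholder price table."""
    try:
        input_price, output_price = PRICES_PER_MTOK[model]
    except KeyError:
        raise ValueError(f"no price configured for model {model!r}") from None
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def would_exceed(total_cost: float, budget_limit: float, next_cost: float) -> bool:
    """True if adding next_cost would push the running total past the limit."""
    return total_cost + next_cost > budget_limit
```

The result can be passed as `cost_usd` when building a `CostRecord`. `would_exceed` is a stricter pre-check than `over_budget`, which only becomes true after the limit has already been passed.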

### 3. Narrow Retry Logic

Retry only on transient errors. Fail fast on authentication or bad request errors.

```python
import time

from anthropic import (
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)

_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3

def call_with_retry(func, *, max_retries: int = _MAX_RETRIES):
    """Retry only on transient errors, fail fast on others."""
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
    # AuthenticationError, BadRequestError etc. → raise immediately
```
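
When many workers hit a rate limit at once, fixed exponential delays can make them all retry at the same moment. The sketch below is a variant that adds random jitter to the backoff, which is a different technique from the plain backoff above. It is written against a stand-in exception so it runs without the SDK: `TransientError`, `retry_with_jitter`, and the injectable `sleep` parameter are all hypothetical; in real code the `retryable` tuple would be the SDK error classes listed above.

```python
from __future__ import annotations

import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for the SDK's retryable exceptions."""


def retry_with_jitter(
    func: Callable[[], T],
    *,
    retryable: tuple[type[BaseException], ...] = (TransientError,),
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying only on `retryable` errors with jittered backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except retryable:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last error
            # Full jitter: wait a random time up to the exponential cap
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")  # Loop always returns or raises
```

Passing `sleep` as a parameter keeps tests fast, since a test can record the delays instead of actually waiting. Any exception outside `retryable` propagates on the first attempt.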

### 4. Prompt Caching

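Cache long system prompts so that repeat requests are billed at the reduced cached-read rate instead of the full input price. The prompt is still sent with every request; the cache saves the cost of reprocessing it.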

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this
            },
            {
                "type": "text",
                "text": user_input,  # Variable part
            },
        ],
    }
]
```
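
To confirm that caching is taking effect, it helps to look at the token usage reported on each response. The sketch below makes two assumptions based on the Anthropic Messages API as I understand it: that the top-level `system` parameter accepts a list of text blocks carrying `cache_control`, and that the response's `usage` object reports `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` as separate counts. `build_cached_request` and `cache_read_fraction` are hypothetical helper names.

```python
from __future__ import annotations

from types import SimpleNamespace


def build_cached_request(system_prompt: str, user_input: str) -> dict:
    """Keyword arguments for messages.create with a cacheable system block."""
    return {
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_input}],
    }


def cache_read_fraction(usage) -> float:
    """Share of input tokens that were served from the cache (0.0 to 1.0)."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = getattr(usage, "input_tokens", 0) or 0
    total = read + written + fresh
    return read / total if total else 0.0


# Stand-in for response.usage so the sketch runs without an API key.
second_call = SimpleNamespace(
    input_tokens=40, cache_creation_input_tokens=0, cache_read_input_tokens=1960
)
fraction = cache_read_fraction(second_call)  # 1960 of 2000 input tokens
```

The request dict would be unpacked into the call alongside `model` and `max_tokens`, which still need to be supplied. A fraction that stays at zero across repeated calls suggests the prefix is not being cached, for instance because it changes between requests or falls below the model's minimum cacheable length.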

## Composition

Combine all four techniques in a single pipeline function:

```python
def process(text: str, config: Config, tracker: CostTracker) -> tuple[Result, CostTracker]:
    # 1. Route model
    model = select_model(len(text), estimated_items, config.force_model)

    # 2. Check budget
    if tracker.over_budget:
        raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)

    # 3. Call with retry + caching
    response = call_with_retry(lambda: client.messages.creat
```

What is this skill?

Model routing by task complexity with configurable Sonnet vs Haiku thresholds (10,000 text chars and 30 items)

Immutable frozen-dataclass style cost tracking per API call

Retry logic and prompt caching patterns bundled into one composable pipeline

Force-model override hook for exceptions to automatic routing

Activates for batch processing, budget caps, and quality-sensitive mixed workloads

Default routing thresholds: 10,000 characters and 30 items for Sonnet-class selection

Documents Haiku as roughly 3–4× cheaper than Sonnet for simple tasks

Compatible agents: Claude Code, Cursor, Codex, Windsurf

Adoption & trust: 4.8k installs on skills.sh; 210k GitHub stars; 3/3 security scanners passed (skills.sh audits).

What do I get? / Deliverables

You implement selective Sonnet versus Haiku routing, cumulative cost visibility, and retry or cache helpers so spend scales with task difficulty instead of flat maximum pricing.

Composable routing, cost tracker, retry, and caching module patterns

Documented model constants and threshold configuration

Spend-aware call path suitable for batch and interactive flows

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.
