
Cost-Aware LLM Pipeline
Compose model routing, budget tracking, retries, and prompt caching so LLM-heavy features stay within API spend without dumbing down hard tasks.
Overview
Cost-Aware LLM Pipeline is an agent skill most often used in Build (also Operate infra and Grow analytics) that routes LLM requests by complexity while tracking budget, retries, and prompt cache usage.
Install
npx skills add https://github.com/affaan-m/everything-claude-code --skill cost-aware-llm-pipeline
What is this skill?
- Model routing by task complexity with configurable Sonnet vs Haiku thresholds (10,000 text chars and 30 items)
- Immutable frozen-dataclass style cost tracking per API call
- Retry logic and prompt caching patterns bundled into one composable pipeline
- Force-model override hook for exceptions to automatic routing
- Activates for batch processing, budget caps, and quality-sensitive mixed workloads
- Default routing thresholds: 10,000 characters and 30 items for Sonnet-class selection
- Documents Haiku as roughly 3–4× cheaper than Sonnet for simple tasks
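The threshold rule and the price gap listed above can be sketched in a few lines. This is a minimal illustration, not the skill's own code: `select_tier`, the sample batch, and the relative prices (a flat 3:1 ratio with equal token usage per request) are assumptions made for the example.

```python
SONNET_TEXT_THRESHOLD = 10_000  # characters, the default listed above
SONNET_ITEM_THRESHOLD = 30      # items, the default listed above

# Hypothetical relative prices, chosen only to illustrate a 3x gap.
RELATIVE_PRICE = {"sonnet": 3.0, "haiku": 1.0}


def select_tier(text_length: int, item_count: int) -> str:
    """Pick the Sonnet tier at or above either threshold, else Haiku."""
    if text_length >= SONNET_TEXT_THRESHOLD or item_count >= SONNET_ITEM_THRESHOLD:
        return "sonnet"
    return "haiku"


# (text_length, item_count) pairs for a hypothetical mixed batch.
batch = [(280, 1), (9_999, 29), (10_000, 1), (500, 30)]
tiers = [select_tier(chars, items) for chars, items in batch]

# Compare routed spend against sending every request to the Sonnet tier.
routed_cost = sum(RELATIVE_PRICE[tier] for tier in tiers)
all_sonnet_cost = RELATIVE_PRICE["sonnet"] * len(batch)
```

Under these assumptions, the two requests below both thresholds go to the cheaper tier, so the routed batch costs 8 units against 12 for an all-Sonnet run. Real savings depend on actual token counts and current pricing.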
Adoption & trust: 4.8k installs on skills.sh; 210k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your agent or SaaS feature calls expensive models for every request, so batches and simple prompts burn budget without a routing or tracking strategy.
Who is it for?
Indie builders shipping LLM-backed APIs or agents who need explicit routing thresholds and spend accounting in application code.
Skip if: Teams with flat enterprise contracts and no marginal cost concern, or products with no LLM API usage at all.
When should I use this skill?
Use it when building applications that call LLM APIs, when processing batches of varying complexity, when you need to stay within an API spend budget, or when optimizing cost without sacrificing quality on complex tasks.
What do I get? / Deliverables
You implement selective Sonnet versus Haiku routing, cumulative cost visibility, and retry or cache helpers so spend scales with task difficulty instead of flat maximum pricing.
- Composable routing, cost tracker, retry, and caching module patterns
- Documented model constants and threshold configuration
- Spend-aware call path suitable for batch and interactive flows
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Most solo builders first need this when wiring Claude or GPT calls into product code during Build; cost control is part of integration design, not an afterthought. The skill is about API client patterns and pipeline composition—canonical placement is Build integrations rather than generic PM docs.
Where it fits
Wire `select_model` thresholds before shipping a document summarizer that mixes short tweets and long PDFs.
Estimate whether a 30-item batch prototype can stay on Haiku for a demo deadline.
Adjust routing constants after monthly API spend overshoots your indie budget.
Use immutable cost trackers per job type to see which user workflows drive the most token spend.
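One way to get the per-workflow visibility described in the last item is to tag each cost record with a job type and aggregate on demand. The sketch below is an assumption-laden illustration: the `job_type` field, the `spend_by_job_type` helper, and the sample costs are hypothetical and not part of the skill's documented `CostRecord`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRecord:
    job_type: str    # hypothetical label, e.g. "summarize" or "classify"
    cost_usd: float


@dataclass(frozen=True)
class CostTracker:
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        # Return a new tracker; the existing one is never mutated.
        return CostTracker(records=(*self.records, record))

    def spend_by_job_type(self) -> dict[str, float]:
        # Sum recorded cost per job type.
        totals: dict[str, float] = {}
        for record in self.records:
            totals[record.job_type] = totals.get(record.job_type, 0.0) + record.cost_usd
        return totals


tracker = CostTracker()
for job, cost in [("summarize", 0.25), ("classify", 0.5), ("summarize", 0.5)]:
    tracker = tracker.add(CostRecord(job, cost))

by_job = tracker.spend_by_job_type()
top_job = max(by_job, key=by_job.get)  # workflow with the highest spend
```

Because every `add` returns a fresh tracker, an earlier snapshot can be kept and compared against a later one without defensive copying.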
How it compares
Procedural integration patterns for your codebase, not a hosted gateway product or a one-off prompt template skill.
Common Questions / FAQ
Who is cost-aware-llm-pipeline for?
It is for solo and small-team developers implementing Claude or GPT calls who pay per token and need routing and budget discipline baked into their pipeline code.
When should I use cost-aware-llm-pipeline?
Use it in Build integrations when adding LLM features, in Operate infra when tuning production spend after traffic grows, and in Validate prototype when estimating API cost for a batch MVP; also when processing mixed-complexity item lists.
Is cost-aware-llm-pipeline safe to install?
It documents code patterns and may imply API keys in your app; review the Security Audits panel on this Prism page and never commit live secrets into skills or repos.
SKILL.md
# Cost-Aware LLM Pipeline

Patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a composable pipeline.

## When to Activate

- Building applications that call LLM APIs (Claude, GPT, etc.)
- Processing batches of items with varying complexity
- Need to stay within a budget for API spend
- Optimizing cost without sacrificing quality on complex tasks

## Core Concepts

### 1. Model Routing by Task Complexity

Automatically select cheaper models for simple tasks, reserving expensive models for complex ones.

```python
MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30  # items


def select_model(
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Select model based on task complexity."""
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (3-4x cheaper)
```

### 2. Immutable Cost Tracking

Track cumulative spend with frozen dataclasses. Each API call returns a new tracker and never mutates existing state.

```python
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return new tracker with added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit
```

### 3. Narrow Retry Logic

Retry only on transient errors. Fail fast on authentication or bad request errors.

```python
import time

from anthropic import (
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)

_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3


def call_with_retry(func, *, max_retries: int = _MAX_RETRIES):
    """Retry only on transient errors, fail fast on others."""
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last transient error
            time.sleep(2 ** attempt)  # Exponential backoff
        # AuthenticationError, BadRequestError etc. → raise immediately
```

### 4. Prompt Caching

Mark long system prompts as cacheable so repeated requests can reuse the cached prefix instead of reprocessing it at full input cost.

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this
            },
            {
                "type": "text",
                "text": user_input,  # Variable part
            },
        ],
    }
]
```

## Composition

Combine all four techniques in a single pipeline function:

```python
def process(text: str, config: Config, tracker: CostTracker) -> tuple[Result, CostTracker]:
    # 1. Route model
    model = select_model(len(text), estimated_items, config.force_model)

    # 2. Check budget
    if tracker.over_budget:
        raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)

    # 3. Call with retry + caching
    response = call_with_retry(lambda: client.messages.creat