
Content Hash Cache Pattern
Add SHA-256 content-hash caching around expensive file processing so repeat runs skip work when bytes are unchanged—even after renames or moves.
Overview
Content-hash-cache-pattern is an agent skill for the Build phase that caches expensive file processing using SHA-256 content hashes as path-independent, auto-invalidating keys.
Install
npx skills add https://github.com/affaan-m/everything-claude-code --skill content-hash-cache-pattern
What is this skill?
- SHA-256 over file contents yields path-independent cache keys; large files are read in 64KB chunks (`_HASH_CHUNK_SIZE = 65536`)
- Survives renames and moves; auto-invalidates when content changes without a separate index file
- Service-layer separation so pure processing functions stay untouched while caching wraps the boundary
- Designed for PDF parsing, text extraction, image analysis, and similar high-cost transforms
- Supports `--cache` / `--no-cache` CLI ergonomics for solo builder tooling
Adoption & trust: 4.6k installs on skills.sh; 210k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Reprocessing the same PDFs or images on every CLI run wastes time, and path-based caches break when files move or get renamed.
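To make the rename problem concrete, here is a minimal sketch showing that a key derived from file bytes stays the same after a rename and changes after an edit. The names `content_key` and `demo_rename_and_edit` are hypothetical and used only for illustration; the skill's own helper is `compute_file_hash`.

```python
import hashlib
import tempfile
from pathlib import Path


def content_key(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of the file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def demo_rename_and_edit() -> tuple[str, str, str]:
    """Hash a file, rename it, then edit it; return the three keys."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "report.pdf"
        original.write_bytes(b"example bytes")
        key_before = content_key(original)

        # Same bytes under a new name: the key should not change.
        moved = original.rename(Path(tmp) / "renamed.pdf")
        key_after_rename = content_key(moved)

        # Different bytes under the same name: the key should change.
        moved.write_bytes(b"example bytes, edited")
        key_after_edit = content_key(moved)
    return key_before, key_after_rename, key_after_edit
```

A path-keyed cache would behave the opposite way: it would miss after the rename and could return a stale result after the edit.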
Who is it for?
Solo builders shipping Python file pipelines (PDF, images, text) where repeat processing dominates runtime and files get reorganized often.
Skip if: Tiny one-off scripts that never rerun, real-time streams without stable file artifacts, or caches keyed only on URL metadata without reading content.
When should I use this skill?
Use it when you are building file processing pipelines (PDF, images, text extraction), when processing cost is high and the same files are processed repeatedly, when you need a `--cache/--no-cache` CLI option, or when you want to wrap existing pure functions with caching without modifying them.
What do I get? / Deliverables
You wrap processing behind content-keyed cache lookups with optional `--cache` control and keep expensive functions pure behind a service boundary.
- `compute_file_hash` (or equivalent) helper with chunked SHA-256
- Cache-backed service wrapper around processing
- CLI or config flag for cache enable/disable
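One possible shape for the cache flag, sketched with the standard library's `argparse.BooleanOptionalAction` (Python 3.9+), which generates both `--cache` and `--no-cache` from a single argument. The `build_parser` function, the `files` positional, and the `--cache-dir` option are assumptions for illustration, not part of the skill.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with a paired --cache / --no-cache switch."""
    parser = argparse.ArgumentParser(
        description="Process files with an optional content-hash cache."
    )
    parser.add_argument("files", nargs="+", type=Path, help="Files to process")
    # BooleanOptionalAction registers --cache and --no-cache together.
    parser.add_argument(
        "--cache",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Reuse cached results when file contents are unchanged",
    )
    # Hypothetical option; mirrors a cache_dir parameter on the service layer.
    parser.add_argument(
        "--cache-dir",
        type=Path,
        default=Path(".cache"),
        help="Directory holding one JSON file per content hash",
    )
    return parser
```

The parsed `args.cache` value can then be passed straight through as the `cache_enabled` argument of a service-layer wrapper, so the processing function itself never sees the flag.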
Journey fit
Caching and service-layer separation are implementation concerns while building pipelines and APIs that process files. The pattern targets backend file-processing cost, cache keys, and optional CLI flags—not frontend UI or launch distribution.
How it compares
Use for deterministic file-byte caching in your app, not as an agent brainstorming workflow or a hosted CDN edge cache product.
Common Questions / FAQ
Who is content-hash-cache-pattern for?
Developers building CLIs or backend workers that repeatedly parse PDFs, extract text, or analyze images and need invalidation that follows content, not folder paths.
When should I use content-hash-cache-pattern?
During Build/backend work when processing cost is high, the same files are processed many times, or you want `--cache` / `--no-cache` without rewriting core logic.
Is content-hash-cache-pattern safe to install?
The skill describes local hashing and cache storage patterns—review the Security Audits panel on this page and ensure cache directories do not store secrets or unredacted sensitive documents.
SKILL.md
# Content-Hash File Cache Pattern

Cache expensive file processing results (PDF parsing, text extraction, image analysis) using SHA-256 content hashes as cache keys. Unlike path-based caching, this approach survives file moves/renames and auto-invalidates when content changes.

## When to Activate

- Building file processing pipelines (PDF, images, text extraction)
- Processing cost is high and same files are processed repeatedly
- Need a `--cache/--no-cache` CLI option
- Want to add caching to existing pure functions without modifying them

## Core Pattern

### 1. Content-Hash Based Cache Key

Use file content (not path) as the cache key:

```python
import hashlib
from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # 64KB chunks for large files


def compute_file_hash(path: Path) -> str:
    """SHA-256 of file contents (chunked for large files)."""
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {path}")
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(_HASH_CHUNK_SIZE)
            if not chunk:
                break
            sha256.update(chunk)
    return sha256.hexdigest()
```

**Why content hash?** File rename/move = cache hit. Content change = automatic invalidation. No index file needed.

### 2. Frozen Dataclass for Cache Entry

```python
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class CacheEntry:
    file_hash: str
    source_path: str
    document: ExtractedDocument  # The cached result
```

### 3. File-Based Cache Storage

Each cache entry is stored as `{hash}.json`: O(1) lookup by hash, no index file required.
```python
import json
from typing import Any


def write_cache(cache_dir: Path, entry: CacheEntry) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{entry.file_hash}.json"
    data = serialize_entry(entry)
    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")


def read_cache(cache_dir: Path, file_hash: str) -> CacheEntry | None:
    cache_file = cache_dir / f"{file_hash}.json"
    if not cache_file.is_file():
        return None
    try:
        raw = cache_file.read_text(encoding="utf-8")
        data = json.loads(raw)
        return deserialize_entry(data)
    except (json.JSONDecodeError, ValueError, KeyError):
        return None  # Treat corruption as cache miss
```

### 4. Service Layer Wrapper (SRP)

Keep the processing function pure. Add caching as a separate service layer.

```python
import logging

logger = logging.getLogger(__name__)


def extract_with_cache(
    file_path: Path,
    *,
    cache_enabled: bool = True,
    cache_dir: Path = Path(".cache"),
) -> ExtractedDocument:
    """Service layer: cache check -> extraction -> cache write."""
    if not cache_enabled:
        return extract_text(file_path)  # Pure function, no cache knowledge

    file_hash = compute_file_hash(file_path)

    # Check cache
    cached = read_cache(cache_dir, file_hash)
    if cached is not None:
        logger.info("Cache hit: %s (hash=%s)", file_path.name, file_hash[:12])
        return cached.document

    # Cache miss -> extract -> store
    logger.info("Cache miss: %s (hash=%s)", file_path.name, file_hash[:12])
    doc = extract_text(file_path)
    entry = CacheEntry(file_hash=file_hash, source_path=str(file_path), document=doc)
    write_cache(cache_dir, entry)
    return doc
```

## Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| SHA-256 content hash | Path-independent, auto-invalidates on content change |
| `{hash}.json` file naming | O(1) lookup, no index file needed |
| Service layer wrapper | SRP: extraction stays pure, cache is a separate concern |
| Manual JSON serialization | Full control over frozen dataclass serialization |
| Corruption retu