Content Hash Cache Pattern

Name: Content Hash Cache Pattern
Author: affaan-m

affaan-m/everything-claude-code

Add SHA-256 content-hash caching around expensive file processing so repeat runs skip work when bytes are unchanged—even after renames or moves.

Overview

Content-hash-cache-pattern is an agent skill for the Build phase that caches expensive file processing using SHA-256 content hashes as path-independent, auto-invalidating keys.

Install

npx skills add https://github.com/affaan-m/everything-claude-code --skill content-hash-cache-pattern

What is this skill?

SHA-256 over file contents, read in 64KB chunks (`_HASH_CHUNK_SIZE = 65536`) for large files, giving path-independent cache keys
Survives renames and moves; auto-invalidates when content changes without a separate index file
Service-layer separation so pure processing functions stay untouched while caching wraps the boundary
Designed for PDF parsing, text extraction, image analysis, and similar high-cost transforms
Supports `--cache` / `--no-cache` CLI ergonomics for solo builder tooling

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 4.6k installs on skills.sh; 210k GitHub stars; 3/3 security scanners passed (skills.sh audits).

What problem does it solve?

Reprocessing the same PDFs or images on every CLI run wastes time, and path-based caches break when files move or get renamed.

Who is it for?

Solo builders shipping Python file pipelines (PDF, images, text) where repeat processing dominates runtime and files get reorganized often.

Skip if: Tiny one-off scripts that never rerun, real-time streams without stable file artifacts, or caches keyed only on URL metadata without reading content.

When should I use this skill?

Use it when you are building file processing pipelines (PDF, images, text extraction), when processing cost is high and the same files recur, when you need a `--cache/--no-cache` CLI option, or when you want to wrap existing pure functions with caching.

What do I get? / Deliverables

You wrap processing behind content-keyed cache lookups with optional `--cache` control and keep expensive functions pure behind a service boundary.

`compute_file_hash` (or equivalent) helper with chunked SHA-256
Cache-backed service wrapper around processing
CLI or config flag for cache enable/disable

Journey fit

Primary fit

Build: Backend, data & payments

Caching and service-layer separation are implementation concerns while building pipelines and APIs that process files. The pattern targets backend file-processing cost, cache keys, and optional CLI flags—not frontend UI or launch distribution.

How it compares

Use for deterministic file-byte caching in your app, not as an agent brainstorming workflow or a hosted CDN edge cache product.

Common Questions / FAQ

Who is content-hash-cache-pattern for?

Developers building CLIs or backend workers that repeatedly parse PDFs, extract text, or analyze images and need invalidation that follows content, not folder paths.

When should I use content-hash-cache-pattern?

During Build/backend work when processing cost is high, the same files are processed many times, or you want `--cache` / `--no-cache` without rewriting core logic.

Is content-hash-cache-pattern safe to install?

The skill describes local hashing and cache storage patterns—review the Security Audits panel on this page and ensure cache directories do not store secrets or unredacted sensitive documents.

SKILL.md

# Content-Hash File Cache Pattern

Cache expensive file processing results (PDF parsing, text extraction, image analysis) using SHA-256 content hashes as cache keys. Unlike path-based caching, this approach survives file moves/renames and auto-invalidates when content changes.

## When to Activate

- Building file processing pipelines (PDF, images, text extraction)
- Processing cost is high and the same files are processed repeatedly
- Need a `--cache/--no-cache` CLI option
- Want to add caching to existing pure functions without modifying them

## Core Pattern

### 1. Content-Hash Based Cache Key

Use file content (not path) as the cache key:

```python
import hashlib
from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # 64KB chunks for large files

def compute_file_hash(path: Path) -> str:
    """SHA-256 of file contents (chunked for large files)."""
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {path}")
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(_HASH_CHUNK_SIZE)
            if not chunk:
                break
            sha256.update(chunk)
    return sha256.hexdigest()
```

**Why content hash?** File rename/move = cache hit. Content change = automatic invalidation. No index file needed.
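
To make the rename and invalidation behavior concrete, here is a small sketch that hashes a temporary file, renames it, and then edits it. The file names and contents are made up for illustration, and `compute_file_hash` is restated in compact form so the snippet runs on its own.

```python
import hashlib
import tempfile
from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # same 64KB chunk size as above


def compute_file_hash(path: Path) -> str:
    """SHA-256 of file contents, read in chunks."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(_HASH_CHUNK_SIZE), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


def demo_hash_behavior() -> tuple:
    """Return hashes (original, after rename, after edit) for a temp file."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "report.txt"  # hypothetical file name
        original.write_bytes(b"quarterly numbers")
        before = compute_file_hash(original)

        renamed = Path(tmp) / "archive-report.txt"  # hypothetical new name
        original.rename(renamed)
        after_rename = compute_file_hash(renamed)

        renamed.write_bytes(b"quarterly numbers, revised")
        after_edit = compute_file_hash(renamed)
    return before, after_rename, after_edit


before, after_rename, after_edit = demo_hash_behavior()
print(before == after_rename)  # same bytes under a new path: key is unchanged
print(before == after_edit)    # different bytes: key changes
```

Because the key depends only on the bytes, two different files with identical content also share one cache entry.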

### 2. Frozen Dataclass for Cache Entry

```python
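from dataclasses import dataclass

# ExtractedDocument is the result type returned by your processing function.
@dataclass(frozen=True, slots=True)  # slots=True requires Python 3.10+
class CacheEntry:
    file_hash: str
    source_path: str  # informational only; lookups use file_hash
    document: ExtractedDocument  # The cached result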
```

### 3. File-Based Cache Storage

Each cache entry is stored as `{hash}.json` — O(1) lookup by hash, no index file required.

```python
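import json
from pathlib import Path
from typing import Any

# serialize_entry / deserialize_entry convert a CacheEntry to and from a
# JSON-compatible dict; they are written by hand and defined separately.
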
def write_cache(cache_dir: Path, entry: CacheEntry) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{entry.file_hash}.json"
    data = serialize_entry(entry)
    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")

def read_cache(cache_dir: Path, file_hash: str) -> CacheEntry | None:
    cache_file = cache_dir / f"{file_hash}.json"
    if not cache_file.is_file():
        return None
    try:
        raw = cache_file.read_text(encoding="utf-8")
        data = json.loads(raw)
        return deserialize_entry(data)
    except (json.JSONDecodeError, ValueError, KeyError):
        return None  # Treat corruption as cache miss
```
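
The storage functions above call `serialize_entry` and `deserialize_entry`, which this excerpt does not show. The following is one possible sketch, not the skill's own implementation. It assumes a hypothetical `ExtractedDocument` with `text` and `page_count` fields, so adjust the field handling to your real result type. `slots=True` is left out here only to keep the sketch runnable on older Python versions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class ExtractedDocument:
    """Hypothetical result type; real fields depend on your extractor."""
    text: str
    page_count: int


@dataclass(frozen=True)
class CacheEntry:
    file_hash: str
    source_path: str
    document: ExtractedDocument


def serialize_entry(entry: CacheEntry) -> dict:
    """Turn a CacheEntry into a JSON-compatible dict."""
    return {
        "file_hash": entry.file_hash,
        "source_path": entry.source_path,
        "document": asdict(entry.document),
    }


def deserialize_entry(data: Any) -> CacheEntry:
    """Rebuild a CacheEntry; raises KeyError or ValueError on malformed data."""
    try:
        doc = data["document"]
        return CacheEntry(
            file_hash=str(data["file_hash"]),
            source_path=str(data["source_path"]),
            document=ExtractedDocument(
                text=str(doc["text"]),
                page_count=int(doc["page_count"]),
            ),
        )
    except TypeError as exc:
        # Wrong-typed fields become ValueError so read_cache treats them as a miss.
        raise ValueError(f"malformed cache entry: {exc}") from exc


entry = CacheEntry(
    file_hash="abc123",
    source_path="docs/a.pdf",
    document=ExtractedDocument(text="hello", page_count=2),
)
# Round trip through JSON text, as write_cache and read_cache would do.
restored = deserialize_entry(json.loads(json.dumps(serialize_entry(entry))))
print(restored == entry)
```

Converting `TypeError` to `ValueError` keeps the set of exceptions in line with what `read_cache` catches, so a damaged entry falls back to a cache miss instead of crashing the run.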

### 4. Service Layer Wrapper (SRP)

Keep the processing function pure. Add caching as a separate service layer.

```python
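import logging

logger = logging.getLogger(__name__)

# extract_text is the existing pure processing function; it is not modified.
def extract_with_cache(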
    file_path: Path,
    *,
    cache_enabled: bool = True,
    cache_dir: Path = Path(".cache"),
) -> ExtractedDocument:
    """Service layer: cache check -> extraction -> cache write."""
    if not cache_enabled:
        return extract_text(file_path)  # Pure function, no cache knowledge

    file_hash = compute_file_hash(file_path)

    # Check cache
    cached = read_cache(cache_dir, file_hash)
    if cached is not None:
        logger.info("Cache hit: %s (hash=%s)", file_path.name, file_hash[:12])
        return cached.document

    # Cache miss -> extract -> store
    logger.info("Cache miss: %s (hash=%s)", file_path.name, file_hash[:12])
    doc = extract_text(file_path)
    entry = CacheEntry(file_hash=file_hash, source_path=str(file_path), document=doc)
    write_cache(cache_dir, entry)
    return doc
```
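
The `cache_enabled` parameter maps naturally onto a `--cache/--no-cache` CLI option. The excerpt does not show the CLI wiring, so what follows is only a sketch using the standard library's `argparse.BooleanOptionalAction` (Python 3.9+). The `parse_cli` helper, the positional `file` argument, and the `--cache-dir` flag are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Extract text with optional caching")
    parser.add_argument("file", type=Path, help="file to process")
    # BooleanOptionalAction generates both --cache and --no-cache from one option.
    parser.add_argument(
        "--cache",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="reuse cached results keyed by content hash",
    )
    # Hypothetical flag; the default mirrors extract_with_cache's cache_dir.
    parser.add_argument("--cache-dir", type=Path, default=Path(".cache"))
    return parser


def parse_cli(argv: list) -> argparse.Namespace:
    """Parse arguments; a real tool would pass the result to the service layer."""
    args = build_parser().parse_args(argv)
    # e.g. extract_with_cache(args.file, cache_enabled=args.cache,
    #                         cache_dir=args.cache_dir)
    return args


args = parse_cli(["--no-cache", "report.pdf"])
print(args.cache, args.file, args.cache_dir)
```

Keeping the flag handling in the CLI layer means the service wrapper only ever sees a boolean, and the pure processing function sees nothing about caching at all.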

## Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| SHA-256 content hash | Path-independent, auto-invalidates on content change |
| `{hash}.json` file naming | O(1) lookup, no index file needed |
| Service layer wrapper | SRP: extraction stays pure, cache is a separate concern |
| Manual JSON serialization | Full control over frozen dataclass serialization |
| Corruption retu
