Chunking Strategy

Name: Chunking Strategy
Author: giuseppe-trisciuoglio

giuseppe-trisciuoglio/developer-kit

Pick and implement the right RAG text-splitting approach instead of default fixed-size chunks that wreck retrieval quality.

Overview

Chunking-strategy is an agent skill for the Build phase that documents eleven RAG chunking strategies with tradeoffs and implementation starting points.

Install

npx skills add https://github.com/giuseppe-trisciuoglio/developer-kit --skill chunking-strategy

What is this skill?

Documents eleven advanced chunking strategies with complexity, use case, and key benefit per strategy
Covers fixed-length, sentence, paragraph, sliding window, semantic, recursive, context-enriched, modality-specific, agen
Includes implementation patterns such as LangChain CharacterTextSplitter and tiktoken-oriented sizing
Strategy overview table supports baseline vs production-grade RAG design tradeoffs
Oriented toward comprehensive RAG systems rather than one-size-fits-all splitting
11 advanced chunking strategies documented
Strategy overview table with complexity and use-case columns

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 1.2k installs on skills.sh; 271 GitHub stars; 2/3 security scanners passed (skills.sh audits).

What problem does it solve?

You are building RAG but default fixed-size chunks break context on technical docs, long prose, or mixed modalities.

Who is it for?

Indie developers implementing ingestion for agents, support bots, or doc Q&A who need a strategy menu before tuning embeddings and evals.

Skip if: you only need a five-line CharacterTextSplitter snippet, have no retrieval-quality goals, or do not use a Python/LangChain stack.

When should I use this skill?

Use it when you are designing or debugging RAG ingestion, chunk sizes, overlap, or splitter choice for production retrieval.

What do I get? / Deliverables

You select a documented strategy (sentence, semantic, hybrid, or others) and implement splitters with explicit overlap and hierarchy rules suited to your corpus.

Chosen chunking strategy with rationale
Starter splitter configuration or code pattern
Overlap and hierarchy parameters aligned to use case

Journey fit

Primary fit

Build: Backend, data & payments

Chunking decisions belong in Build when you wire ingestion, embeddings, and retrieval behind your agent or SaaS knowledge base. Backend is where document pipelines, token budgets, and retriever behavior are defined before ship-time evals.

Also useful

Ship: Testing & QA

How it compares

Use as a strategy catalog and implementation guide—not a hosted chunking API or embedding service.

Common Questions / FAQ

Who is chunking-strategy for?

Solo and small-team builders designing RAG ingestion who want explicit tradeoffs between eleven chunking approaches before coding pipelines.

When should I use chunking-strategy?

Use it in Build while defining backend ingestion for agents or APIs—especially when moving beyond naive fixed-length splits for PDFs, manuals, or mixed content.

Is chunking-strategy safe to install?

It is primarily documentation and sample code; check the Security Audits panel on this Prism page and review any copied Python before running in your environment.

SKILL.md

# Advanced Chunking Strategies

This document provides detailed implementations of 11 advanced chunking strategies for comprehensive RAG systems.

## Strategy Overview

| Strategy | Complexity | Use Case | Key Benefit |
|----------|------------|----------|-------------|
| Fixed-Length | Low | Simple documents, baseline | Easy implementation |
| Sentence-Based | Medium | General text processing | Natural language boundaries |
| Paragraph-Based | Medium | Structured documents | Context preservation |
| Sliding Window | Medium | Context-critical queries | Overlap for continuity |
| Semantic | High | Complex documents | Thematic coherence |
| Recursive | Medium | Mixed content types | Hierarchical structure |
| Context-Enriched | High | Technical documents | Enhanced context |
| Modality-Specific | High | Multi-modal content | Specialized handling |
| Agentic | Very High | Dynamic requirements | Adaptive chunking |
| Subdocument | Medium | Large documents | Logical grouping |
| Hybrid | Very High | Complex systems | Best-of-all approaches |
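
One way to use the table when shortlisting approaches is to encode it as data and filter by a complexity budget. The sketch below is illustrative only: the `Strategy` tuple, `COMPLEXITY_ORDER`, and `strategies_up_to` are hypothetical names that are not part of the skill, and the ordering of complexity levels is an assumption read off the table.

```python
from typing import List, NamedTuple


class Strategy(NamedTuple):
    name: str
    complexity: str
    use_case: str
    key_benefit: str


# Assumed ordering of the complexity labels, simplest first
COMPLEXITY_ORDER = ["Low", "Medium", "High", "Very High"]

# Rows copied from the strategy overview table
STRATEGIES = [
    Strategy("Fixed-Length", "Low", "Simple documents, baseline", "Easy implementation"),
    Strategy("Sentence-Based", "Medium", "General text processing", "Natural language boundaries"),
    Strategy("Paragraph-Based", "Medium", "Structured documents", "Context preservation"),
    Strategy("Sliding Window", "Medium", "Context-critical queries", "Overlap for continuity"),
    Strategy("Semantic", "High", "Complex documents", "Thematic coherence"),
    Strategy("Recursive", "Medium", "Mixed content types", "Hierarchical structure"),
    Strategy("Context-Enriched", "High", "Technical documents", "Enhanced context"),
    Strategy("Modality-Specific", "High", "Multi-modal content", "Specialized handling"),
    Strategy("Agentic", "Very High", "Dynamic requirements", "Adaptive chunking"),
    Strategy("Subdocument", "Medium", "Large documents", "Logical grouping"),
    Strategy("Hybrid", "Very High", "Complex systems", "Best-of-all approaches"),
]


def strategies_up_to(max_complexity: str) -> List[str]:
    """Return names of strategies at or below the given complexity level."""
    limit = COMPLEXITY_ORDER.index(max_complexity)
    return [
        s.name for s in STRATEGIES
        if COMPLEXITY_ORDER.index(s.complexity) <= limit
    ]


if __name__ == "__main__":
    print(strategies_up_to("Medium"))
```

Filtering this way only narrows the menu; the use-case column still has to be matched against the actual corpus.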

## 1. Fixed-Length Chunking

### Overview
Divide documents into chunks of a fixed character or token count, regardless of content structure.

### Implementation
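```python
import tiktoken

try:
    # Newer LangChain releases ship the splitters in their own package
    from langchain_text_splitters import CharacterTextSplitter
except ImportError:
    # Older releases expose the same class under langchain.text_splitter
    from langchain.text_splitter import CharacterTextSplitter


class FixedLengthChunker:
    def __init__(self, chunk_size=1000, chunk_overlap=200, encoding_name="cl100k_base"):
        self._validate(chunk_size, chunk_overlap)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.encoding = tiktoken.get_encoding(encoding_name)

    @staticmethod
    def _validate(chunk_size, chunk_overlap):
        # An overlap >= chunk_size would stop the token window from advancing
        if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
            raise ValueError("Require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    def chunk_by_characters(self, text):
        """Chunk by character count, splitting only at paragraph breaks"""
        # CharacterTextSplitter splits on the separator and then merges pieces,
        # so a single paragraph longer than chunk_size is kept whole.
        splitter = CharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separator="\n\n"
        )
        return splitter.split_text(text)

    def chunk_by_tokens(self, text, chunk_size=None, chunk_overlap=None):
        """Chunk by token count using tiktoken"""
        size = self.chunk_size if chunk_size is None else chunk_size
        overlap = self.chunk_overlap if chunk_overlap is None else chunk_overlap
        self._validate(size, overlap)

        tokens = self.encoding.encode(text)
        chunks = []

        start = 0
        while start < len(tokens):
            end = min(start + size, len(tokens))
            chunks.append(self.encoding.decode(tokens[start:end]))

            # Stop once the window has reached the final token
            if end == len(tokens):
                break

            # Step forward, keeping `overlap` tokens of shared context
            start = end - overlap

        return chunks

    def chunk_optimized(self, text, strategy="balanced"):
        """Chunk by tokens using a named size/overlap preset"""
        strategies = {
            "conservative": {"chunk_size": 500, "overlap": 100},
            "balanced": {"chunk_size": 1000, "overlap": 200},
            "aggressive": {"chunk_size": 2000, "overlap": 400}
        }

        # Unknown names fall back to the balanced preset
        config = strategies.get(strategy, strategies["balanced"])

        # Pass the preset per call instead of overwriting the instance settings
        return self.chunk_by_tokens(text, config["chunk_size"], config["overlap"])
```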

### Best Practices
- Start with 1000 tokens for general use
- Use 10-20% overlap for context preservation
- Adjust based on embedding model context window
- Consider document type for optimal sizing
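
The sizing guidance above reduces to simple arithmetic: the overlap is a fraction of the chunk size, the window advances by the chunk size minus the overlap, and the chunk count follows from the document length. The helper below is a minimal sketch of that arithmetic; `plan_fixed_chunks` and its parameter names are hypothetical and not part of the skill.

```python
import math


def plan_fixed_chunks(total_tokens, chunk_size=1000, overlap_ratio=0.15):
    """Estimate overlap, stride, and chunk count for a sliding token window."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("Require chunk_size > 0 and 0 <= overlap_ratio < 1")

    overlap = round(chunk_size * overlap_ratio)
    stride = chunk_size - overlap  # tokens the window advances per chunk
    if stride <= 0:
        raise ValueError("Overlap leaves no room for the window to advance")

    if total_tokens <= 0:
        chunks = 0
    elif total_tokens <= chunk_size:
        chunks = 1
    else:
        # One full first chunk, then one more chunk per stride of remaining tokens
        chunks = 1 + math.ceil((total_tokens - chunk_size) / stride)

    return {"overlap": overlap, "stride": stride, "chunks": chunks}


if __name__ == "__main__":
    print(plan_fixed_chunks(5000, chunk_size=1000, overlap_ratio=0.2))
```

A quick estimate like this helps when budgeting embedding calls before running the real splitter, since a larger overlap shrinks the stride and raises the chunk count.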

## 2. Sentence-Based Chunking

### Overview
Split documents at sentence boundaries while maintaining target chunk sizes.
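
The core idea can be sketched without any NLP dependency: split the text into sentences, then group them into windows that share a few sentences of overlap. This is a simplified illustration, not the skill's implementation. The regex splitter is a naive stand-in for spaCy or NLTK (it will mis-split abbreviations such as "e.g."), and `split_sentences` and `chunk_sentences` are hypothetical names. The `max_sentences` and `overlap_sentences` parameters mirror the constructor arguments used in the implementation.

```python
import re
from typing import List

# Naive boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences with a simple punctuation rule."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def chunk_sentences(text: str, max_sentences: int = 10,
                    overlap_sentences: int = 2) -> List[str]:
    """Group sentences into chunks that share `overlap_sentences` sentences."""
    if not 0 <= overlap_sentences < max_sentences:
        raise ValueError("Require 0 <= overlap_sentences < max_sentences")

    sentences = split_sentences(text)
    step = max_sentences - overlap_sentences
    chunks = []

    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + max_sentences]))
        # Stop once a window has covered the last sentence
        if start + max_sentences >= len(sentences):
            break

    return chunks


if __name__ == "__main__":
    sample = "One. Two! Three? Four. Five."
    for chunk in chunk_sentences(sample, max_sentences=3, overlap_sentences=1):
        print(chunk)
```

Because chunks are sized in sentences rather than tokens, chunk length varies with sentence length; a token cap can be layered on top if the embedding model has a strict limit.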

### Implementation
```python
from typing import List

import nltk
import spacy


class SentenceChunker:
    def __init__(self, max_sentences=10, overlap_sentences=2, library="spacy"):
        self.max_sentences = max_sentences
        self.overlap_sentences = overlap_sentences
        self.library = library

        if library == "spacy":
            # Requires the model: python -m spacy download en_core_web_sm
            self.nlp = spacy.load("en_core_web_sm")
        elif library == "nltk":
            # Fetch the Punkt sentence tokenizer data
            nltk.download('punkt')
        else:
            raise ValueError(f"Unsupported library: {library}")

    def extract_sentences_spacy(self, text) -> List[str]:
        """Extract sentences using spaCy"""
        doc = self.nlp(text)
        return [sent.text.strip() for sent in doc.sents]

    # The source listing is cut off here, partway through the next
    # method definition ("def extract_sentences").
```
