chunking is the first step in the rag pipeline. you take your raw documents and split them into smaller pieces that can be indexed, searched, and used for qa generation.
the cgft package ships two chunkers: MarkdownChunker for docs and EmailChunker for email threads. both produce ChunkCollection objects you can save, inspect, and pass downstream.
MarkdownChunker
a 3-stage pipeline for markdown files:
- split by headers: uses LangChain’s `MarkdownHeaderTextSplitter` to split on H1/H2/H3 boundaries. header hierarchy is preserved in each chunk’s metadata so you always know where a chunk came from.
- fuse short sections: adjacent sections below `min_char` get merged together. this prevents tiny chunks that lack enough context.
- split large sections: anything above `max_char` gets recursively character-split with overlap to maintain continuity across chunk boundaries (see the sketch after this list).
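to make the stages concrete, here’s a rough sketch of the same flow written against LangChain’s splitters directly. it’s illustrative only, not MarkdownChunker’s actual implementation; among other simplifications it drops the header metadata after stage 1, which the real chunker keeps:

```python
# illustrative sketch only, not MarkdownChunker's actual implementation.
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

MIN_CHAR, MAX_CHAR, OVERLAP = 1024, 2048, 128

def chunk_markdown(text: str) -> list[str]:
    # stage 1: split on H1/H2/H3 boundaries (header metadata lives on each Document)
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
    )
    sections = [doc.page_content for doc in header_splitter.split_text(text)]

    # stage 2: fuse a too-short section into the one before it
    fused: list[str] = []
    for section in sections:
        if fused and len(fused[-1]) < MIN_CHAR:
            fused[-1] += "\n\n" + section
        else:
            fused.append(section)

    # stage 3: recursively character-split anything still over MAX_CHAR, with overlap
    splitter = RecursiveCharacterTextSplitter(chunk_size=MAX_CHAR, chunk_overlap=OVERLAP)
    chunks: list[str] = []
    for piece in fused:
        chunks.extend(splitter.split_text(piece) if len(piece) > MAX_CHAR else [piece])
    return chunks
```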
```python
from trainer.chunkers.markdown import MarkdownChunker

chunker = MarkdownChunker(min_char=1024, max_char=2048, chunk_overlap=128)

# chunk a single file
chunks = chunker.chunk_file("./docs/getting-started.md")

# chunk a folder (scans for .md files by default)
chunks = chunker.chunk_folder("./docs/", file_extensions=[".md", ".mdx"])
```
| parameter | default | description |
|---|---|---|
| min_char | 1024 | sections shorter than this get fused with neighbors |
| max_char | 2048 | sections longer than this get recursively split |
| chunk_overlap | 128 | character overlap between split chunks |
EmailChunker
reply-graph-aware chunking for email threads. instead of treating each email as a standalone document, it reconstructs conversation threads and creates overlapping sliding windows across them.
key behaviors:
- handles forked threads (where someone replies to an earlier message) by compacting shared prefixes
- the sliding-window approach preserves conversational context across chunk boundaries (sketched below)
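a minimal sketch of the sliding-window part, assuming the emails of one thread are already in order; `window_thread` is a hypothetical helper, and the real EmailChunker additionally walks the reply graph and compacts the shared prefixes of forked threads:

```python
# sliding windows of up to max_per_chunk emails, overlapping by `overlap` emails.
def window_thread(emails: list[str], max_per_chunk: int = 10, overlap: int = 2) -> list[list[str]]:
    step = max_per_chunk - overlap
    windows: list[list[str]] = []
    for start in range(0, len(emails), step):
        windows.append(emails[start:start + max_per_chunk])
        if start + max_per_chunk >= len(emails):
            break
    return windows
```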
```python
from trainer.chunkers.email import EmailChunker

chunker = EmailChunker(max_emails_per_chunk=10, max_chars=2048, overlap_emails=2)
chunks = chunker.chunk_folder("./email-export/")
```
| parameter | default | description |
|---|---|---|
| max_emails_per_chunk | 10 | max emails in a single chunk window |
| max_chars | 2048 | max character length per chunk |
| overlap_emails | 2 | number of emails to overlap between adjacent chunks |
each chunk includes metadata: thread_id, chunk_id, parent_chunk_id, child_chunk_ids, subject, participants, and date ranges.
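the field names come straight from the list above, but the access pattern (a metadata dict on each chunk) is an assumption, so adjust it to however your Chunk actually exposes metadata:

```python
# field names per the docs above; the dict-style access is an assumption.
first = chunks[0]
print(first.metadata["thread_id"], first.metadata["subject"])
print(first.metadata["parent_chunk_id"], first.metadata["child_chunk_ids"])
```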
Chunk and ChunkCollection
both chunkers return a ChunkCollection, a list-like container of Chunk objects.
Chunk is an immutable dataclass. each chunk gets an auto-computed SHA256 hash of its content, which is used for deduplication.
ChunkCollection is file-structure aware and provides navigation methods:
- `get_file_chunks()`: all chunks from a specific source file
- `get_neighboring_chunks()`: chunks adjacent to a given chunk in the original document
- `get_chunk_with_context()`: a chunk plus its surrounding context
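the method names are the ones listed above; the arguments shown here are assumptions (a source path for the first, a chunk object for the other two), so check the ChunkCollection source for the real signatures:

```python
# arguments are assumptions; only the method names are documented.
file_chunks = chunks.get_file_chunks("./docs/getting-started.md")
neighbors = chunks.get_neighboring_chunks(file_chunks[0])
chunk_in_context = chunks.get_chunk_with_context(file_chunks[0])
```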
duplicate chunks (by hash) are automatically removed.
persistence
save and load chunks as yaml so you don’t have to re-chunk every time:
```python
from trainer.chunkers.storage import save_chunks, load_chunks

save_chunks(chunks, "chunks.yaml")
chunks = load_chunks("chunks.yaml")
```
inspection
ChunkInspector helps you sanity-check your chunks before moving to corpus upload or qa generation:
```python
from trainer.chunkers.inspector import ChunkInspector

inspector = ChunkInspector(chunks)
print(inspector.summary())  # chunk count, size distribution, etc.
inspector.read_chunk(0)     # print a specific chunk
inspector.sample_file()     # randomly sample and display chunks from one file
```
this is useful for tuning min_char / max_char. you want chunks that are long enough to be self-contained but short enough to be focused on a single topic.
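a simple tuning loop is to re-chunk the same folder at a couple of candidate settings and compare the summaries side by side. the parameter pairs below are just illustrative starting points, not recommendations:

```python
from trainer.chunkers.markdown import MarkdownChunker
from trainer.chunkers.inspector import ChunkInspector

# re-chunk at a few settings and eyeball the size distributions
for min_char, max_char in [(512, 1024), (1024, 2048)]:
    candidate = MarkdownChunker(min_char=min_char, max_char=max_char, chunk_overlap=128)
    collection = candidate.chunk_folder("./docs/")
    print(f"min_char={min_char}, max_char={max_char}")
    print(ChunkInspector(collection).summary())
```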