corpus

rag · Mar 23, 2026 · 2 min read

a corpus is an indexed, searchable collection of your chunks. once you’ve chunked your documents, you upload them to a corpus backend. the qa generation pipeline and the training environment both search against it.

castform corpus api

castform’s hosted search backend. it uses BM25 keyword matching, so no embeddings are required. this is the default backend and is included with your castform account.

from trainer.corpus.corpora.source import CorporaChunkSource

source = CorporaChunkSource(
    api_key="sk_...",
    corpus_name="my-docs",
    base_url="https://app.castform.com",
)

# load from a local folder (chunks + uploads in one step)
source.populate_from_folder("./docs/")

# or attach to a corpus you've already uploaded under corpus_name
source.populate_from_existing_corpus_name()

third-party backends

we also support third-party vector databases as corpus backends. each has a dedicated integration guide:

  • turbopuffer: lexical, vector, and hybrid search. good if you already have data in turbopuffer or need vector/hybrid retrieval.
  • pinecone: managed vector search with hosted inference. good if you already have data in pinecone or want zero infrastructure overhead.
  • chroma: self-hosted vector, lexical, and hybrid search. good if you want full control over your search infrastructure.

all backends implement the same ChunkSource interface, so you can swap between them without changing downstream code.
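
to make the swap concrete, here's a minimal sketch of backend-agnostic downstream code. the third-party class name and import path are placeholders, not the real api; see the matching integration guide for the actual constructor and arguments.

from trainer.corpus.corpora.source import CorporaChunkSource
# from trainer.corpus.turbopuffer.source import TurbopufferChunkSource  # placeholder path

def retrieve(source, query: str, top_k: int = 5):
    # any backend works here: search() is part of the shared ChunkSource interface
    return source.search({"mode": "lexical", "text_query": query, "top_k": top_k})

source = CorporaChunkSource(
    api_key="sk_...",
    corpus_name="my-docs",
    base_url="https://app.castform.com",
)
# source = TurbopufferChunkSource(...)  # swapping backends changes only this line
results = retrieve(source, "kubernetes pod limits")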

ChunkSource interface

every corpus backend implements ChunkSource. this means all the methods below work identically regardless of which backend you’re using.

uploading chunks

# from a folder of documents (chunks automatically)
source.populate_from_folder("./docs/")

# from a ChunkCollection you built during the chunking step
source.populate_from_chunks(chunks)

searching

# simple text search
results = source.search_text("kubernetes pod limits")

# structured search with a SearchSpec
results = source.search({
    "mode": "lexical",
    "text_query": "kubernetes pod limits",
    "top_k": 5,
})

# find chunks related to a source chunk (any chunk object, e.g. one from a previous search)
results = source.search_related(source_chunk, queries=["scaling", "limits"], top_k=5)
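
a quick way to sanity-check retrieval is to print what comes back. the attribute names below are assumptions about the result objects, not confirmed api; inspect one result to find the real field names.

# hypothetical attributes (.score, .text); confirm against a real result
for result in results:
    print(f"{result.score:.3f}  {result.text[:80]}")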

sampling and context

# random sample (useful for qa generation seed chunks)
sample = source.sample_chunks(n=10, min_chars=200)

# get a chunk with its neighbors for extra context
context = source.get_chunk_with_context(chunk)
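
these two compose naturally for qa generation: sample seed chunks, then widen each seed with its neighbors before writing questions. a sketch, assuming the objects returned by sample_chunks are valid inputs to get_chunk_with_context:

# sample seeds, then expand each one with its surrounding context
seeds = source.sample_chunks(n=10, min_chars=200)
contexts = [source.get_chunk_with_context(c) for c in seeds]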

filtering

you can filter search results by chunk metadata:

# SearchSpec defines the schema for structured queries like the dict below
from trainer.corpus.search_schema import SearchSpec

results = source.search({
    "mode": "lexical",
    "text_query": "kubernetes pod limits",
    "top_k": 5,
    "filter": {"field": "file_path", "op": "eq", "value": "k8s-docs.md"},
})
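
the dict form above presumably validates against SearchSpec. if you'd rather pass a typed object, the same fields should work as keyword arguments (an assumption about the schema; check trainer.corpus.search_schema if construction differs):

# assumption: SearchSpec accepts the same fields as keyword arguments
spec = SearchSpec(
    mode="lexical",
    text_query="kubernetes pod limits",
    top_k=5,
    filter={"field": "file_path", "op": "eq", "value": "k8s-docs.md"},
)
results = source.search(spec)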