retrieval-augmented generation (rag) lets an llm use a search tool to find relevant information before answering a question. a rag agent can be finetuned to a specific corpus, learning to formulate better search queries and reason more effectively over your data. castform provides the full pipeline for this, from chunking documents through training.
rag step-by-step
the web ui walks you through the full pipeline. go to app.castform.com, click new training run, and select a RAG template.
1. corpus setup
upload your documents or emails directly, or connect an external corpus provider:
- turbopuffer: lexical, vector, and hybrid search
- pinecone: managed vector search with hosted inference
- chroma: self-hosted vector, lexical, and hybrid search
- notion: connect a Notion database as your corpus
the web ui indexes your documents and prepares them for QA generation.
2. task definition
configure the system prompt and completion tags. the web ui pre-populates a RAG-specific system prompt, which works well for most cases.
tips for a good system prompt:
- be specific about the task: “answer customer support questions using the retrieved context” is better than “answer questions”
- specify the output format you want: should it cite sources? use bullet points? keep answers short?
- mention what the model should do when results are irrelevant: “if the search results don’t contain the answer, say so”
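putting these tips together, a starting point might look something like this (illustrative only, not the exact prompt the web ui pre-populates):

```python
# illustrative system prompt following the tips above; edit the
# pre-populated prompt in the web ui to match your own task.
SYSTEM_PROMPT = """You are a customer support assistant.
Answer questions using only the retrieved context.
Cite the source of each claim, and keep answers to a few sentences.
If the search results don't contain the answer, say so."""
```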
3. dataset
the web ui generates synthetic question-answer pairs from your corpus.
dataset size guidance:
- minimum: 16 examples (the platform enforces this)
- recommended for validation: start with ~50 examples to verify your setup works before committing to a full run
- recommended for training: 200+ examples for good results, 1000+ for best results
- expect generation to take time: 1000+ examples can take a few hours to generate. validate your pipeline with a small batch first.
you can preview the generated pairs, adjust the train/eval split, and regenerate if needed.
4. tools
configure the search tool: name, available search modes (depends on your corpus provider), and filterable fields.
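for intuition, this tool is what the model calls during training rollouts. a call might carry arguments like the following (field names and values here are illustrative; the real schema comes from your configuration and corpus provider):

```python
# illustrative arguments for one search tool call; the actual schema
# depends on your corpus provider and the fields you configure.
tool_call = {
    "name": "search",
    "arguments": {
        "query": "refund policy for annual plans",
        "mode": "hybrid",  # must be a mode your provider supports
        "filters": {"source": "support-docs"},  # a filterable field you configured
    },
}
```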
5. rewards
this is the most important step. rewards define what “good” looks like for your model. for RAG, the web ui includes four default reward components:
| component | what it measures | default weight |
|---|---|---|
| correctness (LLM judge) | is the answer correct and supported by the retrieved context? | 1 |
| conciseness (LLM judge) | is the response direct, without filler or repetition? gated on correctness >= 0.5 | 0.5 |
| citation | precision and recall of source citations (e.g. on thread_id) | 1 |
| tool call efficiency | did the model use a reasonable number of search calls? gated on correctness >= 0.5 | 1 |
we recommend starting with these defaults and tweaking them to match what you actually care about. for example:
- if citations don’t matter for your use case, remove the citation component or lower its weight
- if you want longer, more detailed answers, remove or lower the conciseness weight
- if your task requires precise tool usage, increase the tool call efficiency weight
customizing judge prompts: expand any LLM judge component to see its scoring criteria and score levels. you can edit the judge prompt to match your specific quality bar, adjust the score levels (e.g. add a 0.25 for “mostly correct but minor issues”), or change the gating conditions.
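conceptually, the components combine into a single weighted score per completion. the sketch below mirrors the default weights and gating from the table above; it is purely illustrative, not the platform's actual implementation:

```python
# illustrative combination of the default RAG reward components;
# each score is assumed to be in [0, 1].
def total_reward(correctness, conciseness, citation, tool_efficiency):
    # conciseness and tool call efficiency are gated: they contribute
    # nothing unless the answer is at least half correct
    gate = 1.0 if correctness >= 0.5 else 0.0
    return (
        1.0 * correctness
        + 0.5 * conciseness * gate
        + 1.0 * citation
        + 1.0 * tool_efficiency * gate
    )
```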
6. launch
review your config and launch. see launching for what happens next.
after launching
once your run starts, see the quickstart for what to expect: GPU warmup, early metric fluctuations, and what to watch for in completions.
python sdk
to set up a rag training run with the python sdk:
1. chunk your data
split your documents into retrieval-sized pieces. built-in chunkers handle markdown and emails, or bring your own.
```python
from trainer.chunkers.markdown import MarkdownChunker

chunker = MarkdownChunker(min_char=1024, max_char=2048)
chunks = chunker.chunk_folder("path/to/docs")
```
see chunking for configuration options and the email chunker.
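if the built-in chunkers don't fit your format, a custom chunker can be as simple as a function that splits text into retrieval-sized pieces. the sketch below is purely illustrative; see chunking for the interface the sdk actually expects:

```python
# hypothetical custom chunker: greedily packs paragraphs into chunks
# of at most max_char characters. not the trainer chunker interface,
# just the core idea.
def chunk_text(text: str, max_char: int = 2048) -> list[str]:
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_char:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```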
2. upload to a corpus backend
index your chunks for search. the simplest option is the castform corpus API:
```python
from trainer.corpus.corpora.source import CorporaChunkSource

source = CorporaChunkSource(
    api_key=API_KEY,
    corpus_name="my-docs",
    base_url=BASE_URL,
)
source.populate_from_chunks(chunks)
```
for other backends, see corpus (turbopuffer, pinecone, chroma).
3. generate qa pairs
the pipeline generates synthetic question-answer pairs grounded in your corpus:
```python
from trainer.qa_generation.cgft_models import CgftPipelineConfig, PlatformConfig, CorpusConfig, TargetsConfig
from trainer.qa_generation.cgft_pipeline import CgftPipeline

cfg = CgftPipelineConfig(
    platform=PlatformConfig(api_key=API_KEY),
    corpus=CorpusConfig(corpus_name="my-docs", corpus_id=source.corpus_id),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()
pipeline = CgftPipeline(cfg)
result = pipeline.run()
train_data = result["train_dataset"]
eval_data = result["eval_dataset"]
```
see qa generation for the full config reference.
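before training on the generated data, it's worth spot-checking a few pairs. assuming the returned datasets are list-like (see qa generation for the exact record shape):

```python
# sanity-check the generated datasets; field names may vary, so print
# a whole record rather than assuming specific keys.
print(f"train: {len(train_data)} examples, eval: {len(eval_data)} examples")
print(train_data[0])
```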
4. define the environment and launch
```python
import trainer
from trainer.corpus.corpora.search import CorporaSearch
from trainer.envs.search_env import SearchEnv
from trainer.trainer.pipeline import train

search = CorporaSearch(
    api_key=API_KEY,
    corpus_name="my-docs",
    base_url=BASE_URL,
)

experiment_id = train(
    env_class=SearchEnv,
    env_args={"search": search},
    train_dataset=train_data,
    eval_dataset=eval_data,
    prefix="my-rag-model",
    api_key=API_KEY,
    local_modules=[trainer],
)
```
see search environment for reward configuration and custom search backends. see launching for train() parameters and dry run mode.
next steps
for more detail on each stage of the RAG pipeline:
- rag overview: how the pipeline stages fit together
- chunking: customize how documents are split
- corpus: backend options and setup
- qa generation: how CgftPipeline works, full config reference
- search environment: how the RL training environment works