rl trains models to get better at tasks by giving them scores (rewards) when they do well.
instead of training on “here’s the perfect answer” (supervised learning), you let the model try different approaches and reward it when it succeeds. over time, it figures out what works.
why this matters
supervised finetuning teaches models to imitate answers. rl teaches them to achieve outcomes.
this makes rl better for:
- agentic rag: train models to search better, formulate better queries, and retrieve the right chunks
- code generation: optimize for code that actually runs and passes tests (not just looks plausible)
- character/style: tune for specific personas, tones, or formats that match your product needs
- tool use: teach models when and how to use tools effectively
the key difference: you directly optimize for what actually matters to your application.
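for example, a code-generation reward can be nothing more than the test pass rate. a minimal sketch, where code_reward and the sandboxed run_tests helper are hypothetical placeholders for your own test harness:

def code_reward(completion: str, tests: list[str]) -> dict:
    # run_tests is a placeholder for whatever sandboxed test runner you already have
    results = run_tests(code=completion, tests=tests)
    passed = sum(1 for r in results if r.ok)
    # the reward is simply the pass rate: 0.0 (nothing passes) to 1.0 (all green)
    return {"pass_rate": passed / len(tests)}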
key concepts
agent: the llm making decisions (which tool to use, what search query to write, what code to generate)
environment: the system it interacts with (search engines, code execution, APIs)
actions: what the model does (tool calls, search queries, generated responses)
rewards: scores you define based on task success (retrieval accuracy, test pass rate, user preferences)
training loop: model tries task → gets scored → updates itself to do better next time
that’s it. no phd required.
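in code, the loop really is that small. a minimal sketch, where dataset, policy, and score are illustrative names rather than any specific library:

for batch in dataset:
    # the agent acts: the model attempts each task (tool calls, queries, final answers)
    completions = policy.generate(batch.prompts)
    # the environment scores each attempt against the task's ground truth
    rewards = [score(c, gt) for c, gt in zip(completions, batch.ground_truths)]
    # the update: shift the model toward higher-reward behavior
    policy.update(completions, rewards)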
environments
an environment is where your model learns the task. it defines what tools the model can use and how you score its performance. three things you configure:
tools: what actions the model can take (e.g. search, execute code)
rewards: how you measure success. this is what the model learns to optimize for
context: what information the model sees (conversation history, system prompt)
example: search environment
class SearchEnv(BaseEnv):
    def __init__(self, search: SearchClient):
        self.search_client = search

    # define what tools the model can use
    async def list_tools(self):
        return [{
            "name": "search",
            "description": "Search the corpus",
            "parameters": {
                "query": "search query string",
                "mode": "search mode (auto, lexical, vector, hybrid)",
                "limit": "max results (default: 10)"
            }
        }]

    # implement how tools actually work
    async def run_tool(self, rollout_id: str, tool_name: str, **args):
        if tool_name == "search":
            return self.search_client.search(
                query=args["query"],
                mode=args.get("mode", "auto"),
                top_k=args.get("limit", 10),
            )

    # define how to score the model's performance
    async def compute_reward(self, completion, ground_truth):
        # measure overlap between retrieved content and ground truth
        overlap = compute_overlap(completion, ground_truth)
        return {"correctness": overlap}
the environment takes a SearchClient (a backend-agnostic search interface) and gives the model a search tool. rewards are based on how well the retrieved content matches the ground truth.
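compute_overlap can be as simple as token recall against the ground-truth passage. one possible sketch, assuming whitespace tokenization is good enough for your corpus (swap in whatever overlap metric fits your data):

def compute_overlap(completion: str, ground_truth: str) -> float:
    # crude token-level recall: what fraction of ground-truth tokens
    # appear in the content the model retrieved and cited
    truth_tokens = set(ground_truth.lower().split())
    completion_tokens = set(completion.lower().split())
    if not truth_tokens:
        return 0.0
    return len(truth_tokens & completion_tokens) / len(truth_tokens)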
the training flow:
- model gets a question from your dataset
- model tries different search queries
- model provides a final answer
- you compute reward based on retrieved chunks
- model updates to maximize reward next time
that’s the loop. repeat until performance plateaus.
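wiring it up is a few lines. a hedged sketch: SearchEnv and SearchClient come from the example above, while make_search_client, Trainer, and qa_pairs are placeholders for your own search backend setup, training framework, and (question, ground_truth) dataset:

search_client = make_search_client()  # however you construct your backend's SearchClient
env = SearchEnv(search=search_client)

trainer = Trainer(env=env, dataset=qa_pairs)  # hypothetical trainer entry point
trainer.train()  # runs the try → score → update loop until reward plateaus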