what is rl

getting started · Mar 19, 2026 · 3 min read

rl trains models to get better at tasks by giving them scores (rewards) when they do well.

instead of training on “here’s the perfect answer” (supervised learning), you let the model try different approaches and reward it when it succeeds. over time, it figures out what works.

why this matters

supervised finetuning teaches models to imitate answers. rl teaches them to achieve outcomes.

this makes rl better for:

  • agentic rag: train models to search better, formulate better queries, and retrieve the right chunks
  • code generation: optimize for code that actually runs and passes tests (not just looks plausible)
  • character/style: tune for specific personas, tones, or formats that match your product needs
  • tool use: teach models when and how to use tools effectively

the key difference: you directly optimize for what actually matters to your application.

key concepts

agent: the llm making decisions (which tool to use, what search query to write, what code to generate)

environment: the system it interacts with (search engines, code execution, APIs)

actions: what the model does (tool calls, search queries, generated responses)

rewards: scores you define based on task success (retrieval accuracy, test pass rate, user preferences)

training loop: model tries task → gets scored → updates itself to do better next time

that’s it. no phd required.
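the loop above can be sketched as a toy example. this is not a real trainer — the two query strategies, their success rates, and the update rule (a REINFORCE-style preference update on a one-step bandit) are all invented for illustration — but it shows the core mechanic: try, get scored, shift probability toward what got rewarded.

```python
import math
import random

# toy "policy": a preference score per action, softmax'd into probabilities.
# action names and success rates here are made up for illustration.
prefs = {"broad query": 0.0, "specific query": 0.0}

def probs():
    exps = {a: math.exp(s) for a, s in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def reward(action):
    # pretend "specific query" retrieves the right chunk more often
    p_success = 0.8 if action == "specific query" else 0.3
    return 1.0 if random.random() < p_success else 0.0

random.seed(0)
lr = 0.1
for _ in range(500):
    p = probs()
    # model tries an action...
    action = random.choices(list(p), weights=list(p.values()))[0]
    # ...gets scored...
    r = reward(action)
    # ...and updates: push probability toward rewarded actions
    for a in prefs:
        grad = (1.0 if a == action else 0.0) - p[a]
        prefs[a] += lr * (r - 0.5) * grad
```

after a few hundred steps the policy has shifted most of its probability onto the strategy that earns more reward — nobody ever showed it a "correct" query.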

environments

an environment is where your model learns the task. it defines what tools the model can use and how you score its performance. three things you configure:

tools: what actions the model can take (e.g. search, execute code)

rewards: how you measure success. this is what the model learns to optimize for

context: what information the model sees (conversation history, system prompt)

example: search environment

class SearchEnv(BaseEnv):
    def __init__(self, search: SearchClient):
        self.search_client = search

    # define what tools the model can use
    async def list_tools(self):
        return [{
            "name": "search",
            "description": "Search the corpus",
            "parameters": {
                "query": "search query string",
                "mode": "search mode (auto, lexical, vector, hybrid)",
                "limit": "max results (default: 10)"
            }
        }]

    # implement how tools actually work
    # implement how tools actually work
    async def run_tool(self, rollout_id: str, tool_name: str, **args):
        if tool_name == "search":
            return self.search_client.search(
                query=args["query"],
                mode=args.get("mode", "auto"),
                top_k=args.get("limit", 10),
            )
        raise ValueError(f"unknown tool: {tool_name}")

    # define how to score the model's performance
    async def compute_reward(self, completion, ground_truth):
        # measure overlap between retrieved content and ground truth
        overlap = compute_overlap(completion, ground_truth)
        return {"correctness": overlap}

the environment takes a SearchClient (a backend-agnostic search interface) and gives the model a search tool. rewards are based on how well the retrieved content matches the ground truth.

the training flow:

  1. model gets a question from your dataset
  2. model tries different search queries
  3. model provides a final answer
  4. you compute reward based on retrieved chunks
  5. model updates to maximize reward next time

that’s the loop. repeat until performance plateaus.
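here's what one pass through steps 1–4 looks like in code. the stub client, the naive lexical matching, and the overlap metric below are all placeholders invented so the sketch runs on its own (the real SearchClient and SearchEnv come from your framework), and step 5 — the weight update — is the part your trainer handles, so it's not shown.

```python
import asyncio

# stub stand-in for a SearchClient: naive lexical match over a tiny corpus.
# purely hypothetical — a real client would hit an actual search backend.
class StubSearchClient:
    def __init__(self, corpus):
        self.corpus = corpus

    def search(self, query, mode="auto", top_k=10):
        words = set(query.lower().split())
        hits = [d for d in self.corpus if words & set(d.lower().split())]
        return hits[:top_k]

def compute_overlap(retrieved, ground_truth):
    # fraction of ground-truth chunks that were retrieved
    if not ground_truth:
        return 0.0
    return len(set(retrieved) & set(ground_truth)) / len(ground_truth)

async def one_rollout(client, question, ground_truth):
    # 1. model gets a question from the dataset
    # 2. model issues a search query (hard-coded here; in training,
    #    the model writes it — that's what gets optimized)
    retrieved = client.search(query=question, mode="lexical", top_k=5)
    # 4. compute reward from the retrieved chunks
    return {"correctness": compute_overlap(retrieved, ground_truth)}

corpus = ["rl trains models with rewards", "unrelated cooking tips"]
client = StubSearchClient(corpus)
reward = asyncio.run(one_rollout(client, "how does rl use rewards", corpus[:1]))
```

during training this rollout runs many times per question, and the reward dict is what the optimizer uses for step 5.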