after training, check whether your model actually improved. the console gives you three ways to evaluate: the eval tab (held-out metrics), comparison evals (head-to-head against other models), and the playground (interactive testing).
eval tab
the eval tab shows the same metrics as the train tab, but computed on your held-out evaluation dataset. this is the primary way to check for overfitting.
what to look for:
- train reward rising, eval reward flat or declining: the model is overfitting to the training distribution. try a larger or more diverse training set, or stop training earlier.
- train and eval rewards rising together: healthy. the model is generalizing.
- eval reward higher than train: unusual but possible if your eval set is easier than your training set. check your data split.
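if you export the per-step metrics from the console (e.g. as CSV), a short script can flag the first pattern above automatically. this is an illustrative sketch, not a console feature; the file name and column names are assumptions:

```python
# illustrative sketch: flag the overfitting pattern from exported metrics.
# assumes a CSV with hypothetical columns "train_reward" and "eval_reward".
import csv

def trend(values, window=10):
    """average change per step over the last `window` points."""
    tail = values[-window:]
    return (tail[-1] - tail[0]) / max(len(tail) - 1, 1)

with open("metrics.csv") as f:
    rows = list(csv.DictReader(f))

train = [float(r["train_reward"]) for r in rows]
eval_ = [float(r["eval_reward"]) for r in rows]

if trend(train) > 0 and trend(eval_) <= 0:
    print("warning: train reward rising while eval is flat/declining -- likely overfitting")
elif trend(train) > 0 and trend(eval_) > 0:
    print("healthy: train and eval rewards rising together")
```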
metrics include average reward, response length, max reward, and solve rate, displayed in a 2x2 grid. sub-reward components (e.g. citation precision, judge scores) appear in collapsible sections below.
comparison evals
comparison evals let you run your evaluation dataset through an external model (GPT-4, Claude, etc.) and compare results side by side.
running a comparison
- go to the comp tab on your training run page.
- select a model from the dropdown.
- click start batch eval.
the batch runs asynchronously. progress updates every few seconds. once complete, you’ll see:
- bar chart: your model (green) vs the comparison model (gray) on each reward component.
- performance summary: a percentage delta with a label (“outperformed”, “closing in”, or “comparison model leads”).
- per-rollout comparison table: expand any row to see the full conversation for both models side by side.
you can run multiple comparisons against different models. toggle between them using the chips above the chart.
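there's no need to babysit the page while a batch runs. if you'd rather check on it from a script, the pattern looks like the sketch below. note that the endpoint, auth header, and response fields are all hypothetical; this only illustrates the async shape of a batch eval, not a documented API:

```python
# hypothetical polling loop for an async batch eval.
# the URL and response fields are assumptions, not a documented API.
import json
import time
import urllib.request

def poll_batch_eval(run_id: str, token: str, interval_s: float = 5.0) -> dict:
    url = f"https://console.example.com/api/runs/{run_id}/comparison"  # hypothetical endpoint
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    while True:
        with urllib.request.urlopen(req) as resp:
            status = json.loads(resp.read())
        if status.get("state") == "complete":
            return status  # assumed to contain per-component rewards for both models
        time.sleep(interval_s)  # mirrors the UI's every-few-seconds refresh
```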
interpreting results
the comparison uses the same reward function as training. if your model scores higher than the base model on the eval set, it learned something useful. if it scores lower than an external model, try more training or a better reward signal.
reward scores are relative to your reward function, not absolute quality. a model scoring 0.8 on a judge-based reward may produce better answers than one scoring 0.9 on exact match. it depends on what your reward function measures.
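to make that concrete, here is a toy example where the same answer scores 0.0 under exact match but well above it under a (stubbed) judge. both reward functions are stand-ins written for this illustration:

```python
# toy illustration: one answer, two reward functions, two very different scores.
reference = "the mitochondria is the powerhouse of the cell"
answer = "mitochondria generate most of the cell's ATP, acting as its powerhouse"

def exact_match_reward(answer: str, reference: str) -> float:
    # brittle: any paraphrase, however good, scores 0.0
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0

def judge_reward(answer: str, reference: str) -> float:
    # crude stand-in for an LLM judge: word-overlap recall against the reference
    ref_words = set(reference.lower().split())
    return len(ref_words & set(answer.lower().split())) / len(ref_words)

print(exact_match_reward(answer, reference))  # 0.0   -- the paraphrase fails exact match
print(judge_reward(answer, reference))        # ~0.67 -- partial credit for a good paraphrase
```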
playground
the playground lets you chat with your trained model interactively. useful for qualitative testing: checking tone, reasoning quality, and edge cases that metrics won’t capture.
model types
| type | what it connects to | when to use |
|---|---|---|
| debug | live tunnel to in-progress training | test the model mid-training |
| eval | latest checkpoint on castform inference | test after training completes |
| external | GPT-4, Claude, etc. | manual comparison |
the playground supports up to 4 conversation turns and 8 tool calls per turn. responses stream in real time.
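the loop below sketches how those two limits interact: the turn cap bounds the outer chat loop, and the tool-call cap bounds each turn. the `model.generate` client and the message shapes are assumptions for illustration, not the playground's actual implementation:

```python
# illustrative enforcement of the playground's limits: up to 4 conversation
# turns, up to 8 tool calls per turn. the client API here is hypothetical.
MAX_TURNS = 4
MAX_TOOL_CALLS_PER_TURN = 8

def run_turn(model, messages, tools):
    """one turn: the model may call tools until it answers or hits the cap."""
    calls_used = 0
    while calls_used < MAX_TOOL_CALLS_PER_TURN:
        response = model.generate(messages, tools=tools)  # hypothetical client call
        if not response.tool_calls:
            return response  # a plain text answer ends the turn
        for call in response.tool_calls:
            calls_used += 1
            result = tools[call.name](**call.arguments)
            messages.append({"role": "tool", "name": call.name, "content": result})
    return model.generate(messages, tools=None)  # cap reached: force a text answer

def chat(model, tools):
    messages = []
    for _ in range(MAX_TURNS):  # turn cap bounds the whole conversation
        messages.append({"role": "user", "content": input("you: ")})
        reply = run_turn(model, messages, tools)
        messages.append({"role": "assistant", "content": reply.text})
        print("model:", reply.text)
```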
using the playground
- go to the playground tab on your training run page.
- select a model type from the dropdown.
- type a prompt and send.
if your environment has tools (e.g. a search tool for RAG), the model can call them during the conversation. tool calls and results appear inline in the chat.
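the exact way tools are registered depends on your environment code; as a shape reference, here is a hypothetical search tool with a function-calling-style schema. the retrieval backend is stubbed with an in-memory dict purely for illustration:

```python
# hypothetical RAG-style search tool: a callable plus a function-calling schema.
# whether your environment uses exactly this shape is an assumption.
def search(query: str, k: int = 3) -> str:
    """return the top-k matching passages (stubbed in-memory 'corpus')."""
    corpus = {
        "capital of france": "Paris is the capital of France.",
        "speed of light": "The speed of light is about 299,792 km/s.",
    }
    words = query.lower().split()
    hits = [text for key, text in corpus.items() if any(w in key for w in words)]
    return "\n".join(hits[:k]) or "no results"

search_tool_schema = {
    "name": "search",
    "description": "search the document corpus and return the top matching passages",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "search query"},
            "k": {"type": "integer", "description": "max passages to return"},
        },
        "required": ["query"],
    },
}
```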
what to evaluate
| question | where to look |
|---|---|
| is the model overfitting? | compare train tab vs eval tab metrics |
| is it better than the base model? | run a comparison eval against the base |
| is it better than GPT-4/Claude? | run a comparison eval against an external model |
| does it handle edge cases? | test manually in the playground |
| is it using tools correctly? | check tool calls in rollout details and playground |
| is the model gaming the reward? | inspect high-reward rollouts for reward hacking |
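for the reward-hacking row in particular, it helps to sort rollouts by reward and read the top few end to end. a minimal sketch, assuming you've exported rollouts as JSONL with hypothetical `reward` and `response` fields:

```python
# minimal sketch: surface the highest-reward rollouts for manual inspection.
# assumes an exported rollouts.jsonl with hypothetical "reward"/"response" fields.
import json

with open("rollouts.jsonl") as f:
    rollouts = [json.loads(line) for line in f]

for r in sorted(rollouts, key=lambda x: x["reward"], reverse=True)[:10]:
    print(f"reward={r['reward']:.3f}  response_len={len(r['response'])}")
    print(r["response"][:500])  # repetition, keyword stuffing, or format tricks here often mean reward hacking
    print("-" * 60)
```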
next steps
- managing training runs: detailed metrics, rollout inspection, health indicators
- serving + sharing: deploy your model once you’re satisfied with the results