collect: real user interactions; billions of tokens
score: accept / reject signals; distill reward
update weights: full model, on-policy; run evals
deploy: new checkpoint; replace previous
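The four stages above (collect, score, update weights, deploy) can be read as one pass of a loop. The following is a minimal sketch of that reading, not the actual system: every name in it (`Interaction`, `Checkpoint`, `collect`, `score`, `update_weights`, `run_evals`, `deploy`) is hypothetical, the mapping of accept / reject to +1 / -1 is an assumed stand-in for reward distillation, the single float `weight` stands in for the full model, and gating deployment on the evals is an assumption about how the stages connect.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Interaction:
    prompt: str
    response: str
    accepted: bool  # the accept / reject signal attached to this interaction


@dataclass
class Checkpoint:
    version: int
    weight: float  # toy stand-in for the full set of model weights


def collect(raw_log: List[Tuple[str, str, bool]]) -> List[Interaction]:
    """Stage 1: gather logged user interactions into records."""
    return [Interaction(p, r, a) for p, r, a in raw_log]


def score(interactions: List[Interaction]) -> List[float]:
    """Stage 2: turn accept / reject signals into scalar rewards.

    Assumption: +1 for accept, -1 for reject.
    """
    return [1.0 if it.accepted else -1.0 for it in interactions]


def update_weights(ckpt: Checkpoint, rewards: List[float], lr: float = 0.1) -> Checkpoint:
    """Stage 3: toy update that nudges the weight by the mean reward.

    A real system would take an on-policy training step on the full model.
    """
    mean_reward = sum(rewards) / len(rewards) if rewards else 0.0
    return Checkpoint(version=ckpt.version + 1, weight=ckpt.weight + lr * mean_reward)


def run_evals(ckpt: Checkpoint) -> bool:
    """Placeholder evaluation gate (hypothetical); always passes here."""
    return True


def deploy(current: Checkpoint, candidate: Checkpoint) -> Checkpoint:
    """Stage 4: the new checkpoint replaces the previous one if evals pass."""
    return candidate if run_evals(candidate) else current


def run_cycle(current: Checkpoint, raw_log: List[Tuple[str, str, bool]]) -> Checkpoint:
    """One pass through collect -> score -> update weights -> deploy."""
    interactions = collect(raw_log)
    rewards = score(interactions)
    candidate = update_weights(current, rewards)
    return deploy(current, candidate)


# Example: two accepted responses and one rejected one.
log = [("q1", "a1", True), ("q2", "a2", True), ("q3", "a3", False)]
live = run_cycle(Checkpoint(version=0, weight=0.0), log)
```

Each call to `run_cycle` returns the checkpoint that should be live afterwards, so repeating the loop is a matter of feeding that result back in with a fresh log.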