[Figure: four-stage pipeline, run every ~5 hours]

1. collect: real user interactions, billions of tokens
2. score: accept / reject signals, distill reward
3. update weights: full model, on-policy; run evals
4. deploy: new checkpoint, replace previous