[Diagram: a tug-of-war over the policy — the KL penalty pulls it left, toward the reference model, while the reward gradient pulls it right, toward a reward-hacked extreme.]

β → ∞: policy pinned to the reference model, no learning
β tuned: stable training
β → 0: unconstrained reward hacking
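The β trade-off can be sketched as a per-token shaped reward, r − β(log π − log π_ref), the form commonly used in PPO-style RLHF. This is an illustrative sketch with made-up numbers, not any specific library's API:

```python
import numpy as np

def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Shaped reward: task reward minus beta times a per-token
    KL estimate (policy log-prob minus reference log-prob)."""
    kl = logp_policy - logp_ref  # positive where policy drifts from reference
    return reward - beta * kl

# Toy per-token values: the policy has drifted from the reference.
reward      = np.array([0.0, 0.0, 1.0])   # sparse task reward at the end
logp_policy = np.array([-1.0, -0.5, -0.2])
logp_ref    = np.array([-1.2, -1.0, -1.5])

# beta -> 0: shaped reward equals the raw reward (reward-hacking risk).
print(kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.0))
# large beta: the KL term dominates and swamps the task reward.
print(kl_penalized_reward(reward, logp_policy, logp_ref, beta=10.0))
```

With β = 0 the penalty vanishes and only the reward gradient acts; with a very large β the KL term dominates every token and the policy cannot move off the reference, matching the two failure regimes above.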