[Figure: the KL penalty pulls the policy left, toward the reference model, while the reward gradient pulls it right, toward a reward-hacked extreme. Three regimes of the penalty coefficient: β → ∞ gives no learning, a tuned β gives stable training, and β → 0 gives reward hacking.]
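The trade-off in the figure can be sketched as a KL-shaped per-token reward, as commonly used in RLHF-style fine-tuning: the task reward is offset by β times an estimate of the KL divergence from the reference model. This is a minimal illustration; the function name and the numbers are hypothetical, not from the source.

```python
def kl_shaped_reward(reward, logp_policy, logp_ref, beta):
    """Task reward minus a KL penalty that pulls the policy
    back toward the reference model (per-token estimate)."""
    kl = logp_policy - logp_ref  # positive when the policy has drifted
    return reward - beta * kl

# The three regimes from the figure, with a policy that has drifted
# from the reference (kl = 1.5):
reward, logp_policy, logp_ref = 1.0, -0.5, -2.0
for beta in (0.0, 0.1, 10.0):
    print(beta, kl_shaped_reward(reward, logp_policy, logp_ref, beta))
# beta = 0.0  -> reward dominates (reward hacking risk)
# beta = 0.1  -> both terms matter (stable training)
# beta = 10.0 -> penalty dominates (little incentive to learn)
```

With β = 0 the shaped reward is just the raw reward, so the policy is free to exploit it; as β grows, any drift from the reference is punished until, in the limit, the penalty swamps the reward signal entirely.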