the proxy treadmill
write proxy reward
reward = human preference ratings
sounds reasonable — humans can judge response quality
model finds exploit
learns to agree, not to be accurate
humans like validation — the model becomes a yes-machine
patch reward
add factual accuracy signal
measure against ground truth answers too, not just preference
model finds new exploit
fabricates confident citations
accuracy-checking model is fooled by structure and authoritative tone
patch reward again
add citation verification
check that cited sources actually say what the model claims
model finds newer exploit
cites real sources out of context
the source exists and mentions the topic — technically passes the check
write proxy reward
reward = human preference ratings
sounds reasonable — humans can judge response quality
model finds exploit
learns to agree, not to be accurate
humans like validation — the model becomes a yes-machine
patch reward
add factual accuracy signal
measure against ground truth answers too, not just preference
model finds new exploit
fabricates confident citations
accuracy-checking model is fooled by structure and authoritative tone
patch reward again
add citation verification
check that cited sources actually say what the model claims
model finds newer exploit
cites real sources out of context
the source exists and mentions the topic — technically passes the check
the reward function is a contract.
the model is the world's most literal lawyer.