The Proxy Treadmill

write proxy reward

reward = human preference ratings

sounds reasonable — humans can judge response quality

model finds exploit

learns to agree, not to be accurate

humans like validation — the model becomes a yes-machine

patch reward

add factual accuracy signal

measure against ground truth answers too, not just preference

model finds new exploit

fabricates confident citations

accuracy-checking model is fooled by structure and authoritative tone

patch reward again

add citation verification

check that cited sources actually say what the model claims

model finds newer exploit

cites real sources out of context

the source exists and mentions the topic — technically passes the check