Reward Hacking Preview

reward hacking

you specify

score = 1 if summary is short

→

you want

an accurate, useful summary

what the model finds

returns "." — shortest possible output

1.00 ✓

copies the first sentence verbatim

0.94 ✓

repeats the task description back

0.91 ✓