criteria for continual learning

01

it doesn't forget

updating on new data shouldn't degrade what the model already knew. this is the core challenge, known as catastrophic forgetting: gradient descent doesn't care about preserving old behavior, it just minimizes loss on whatever's in front of it, and new data overwrites the representations old tasks depend on.
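a toy sketch of the mechanism (the single-weight model, learning rate, and task targets are all my invention, not the source's): fit one task by gradient descent, then fit a conflicting task, and the first one is gone.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_fit(w, xs, ys, lr=0.1, steps=200):
    """minimize squared error of y ~ w * x on one task's data."""
    for _ in range(steps):
        i = rng.integers(len(xs))
        grad = 2 * (w * xs[i] - ys[i]) * xs[i]
        w -= lr * grad
    return w

def loss(w, xs, ys):
    return float(np.mean((w * xs - ys) ** 2))

x = rng.uniform(-1, 1, 100)
task_a = (x, 2.0 * x)    # task A: y = 2x
task_b = (x, -2.0 * x)   # task B: y = -2x

w = sgd_fit(0.0, *task_a)
loss_a_before = loss(w, *task_a)   # near zero: task A learned

w = sgd_fit(w, *task_b)            # keep training on task B only
loss_a_after = loss(w, *task_a)    # large: task A forgotten
```

nothing in the update rule penalizes moving away from the task-A solution, so the weight just follows the new gradient wherever it leads.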

02

it learns sequentially

standard training mixes data from many tasks into one big batch, which sidesteps the forgetting problem. continual learning means handling data arriving one task at a time, the way the real world works.
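a minimal contrast of the two regimes, under assumed details (the task-gated feature, targets, and learning rate are mine, not the source's): the same linear model trained sequentially forgets, while trained on one shuffled pool it fits both tasks.

```python
import numpy as np

rng = np.random.default_rng(1)

def features(x, t):
    # a shared feature plus a task-gated one; forgetting happens
    # because both tasks write to the shared weight.
    return np.stack([x, t * x], axis=-1)

def sgd(w, X, Y, lr=0.1, steps=2000):
    for _ in range(steps):
        i = rng.integers(len(X))
        w = w - lr * (X[i] @ w - Y[i]) * X[i]
    return w

def mse(w, X, Y):
    return float(np.mean((X @ w - Y) ** 2))

x = rng.uniform(-1, 1, 200)
A = (features(x, 0), 2.0 * x)    # task A: y = 2x
B = (features(x, 1), -2.0 * x)   # task B: y = -2x

# sequential: task A, then task B -> task A degrades
w_seq = sgd(sgd(np.zeros(2), *A), *B)

# mixed: one shuffled pool of both tasks -> both stay fit
Xm = np.concatenate([A[0], B[0]])
Ym = np.concatenate([A[1], B[1]])
w_mix = sgd(np.zeros(2), Xm, Ym, steps=4000)
```

the mixed pool works because every batch re-asserts both tasks' constraints at once; that's exactly the luxury a sequential stream doesn't give you.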

03

it handles varied data

forgetting only matters when new tasks are actually different from old ones. a real system has to handle arbitrary new data, not just new data that happens to resemble the old.

04

it's efficient

you could avoid forgetting by replaying all previous training data alongside the new data, but re-ingesting trillions of tokens on every update isn't viable. an update has to work with the limited new data it's given.
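one standard compromise is experience replay with a small buffer; this is a sketch of that idea under assumed details (the model, buffer size, and replay ratio are mine, not a prescription from the text): keep a handful of old examples and mix them into new-task updates instead of re-ingesting everything.

```python
import numpy as np

rng = np.random.default_rng(2)

def features(x, t):
    # shared feature plus a task-gated one (same toy model throughout)
    return np.stack([x, t * x], axis=-1)

def sgd_replay(w, Xnew, Ynew, Xbuf, Ybuf, lr=0.1, steps=4000):
    for _ in range(steps):
        if Xbuf is not None and rng.random() < 0.5:
            i = rng.integers(len(Xbuf))          # replay an old example
            xi, yi = Xbuf[i], Ybuf[i]
        else:
            i = rng.integers(len(Xnew))          # or take a new one
            xi, yi = Xnew[i], Ynew[i]
        w = w - lr * (xi @ w - yi) * xi
    return w

def mse(w, X, Y):
    return float(np.mean((X @ w - Y) ** 2))

x = rng.uniform(-1, 1, 200)
XA, YA = features(x, 0), 2.0 * x     # old task: y = 2x
XB, YB = features(x, 1), -2.0 * x    # new task: y = -2x

w = sgd_replay(np.zeros(2), XA, YA, None, None)     # learn task A
keep = rng.choice(len(XA), size=10, replace=False)  # tiny buffer, not full replay
w = sgd_replay(w, XB, YB, XA[keep], YA[keep])       # learn B + replay
```

ten replayed examples are enough here only because the toy is exactly realizable; the point is the shape of the trade, a small memory instead of the whole corpus.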

05

it composes skills over time

a system that learns cats vs. dogs, then birds vs. fish, should also be able to tell cats from birds, not just silo each new task.