01
it doesn't forget
updating on new data shouldn't degrade what the model already knows. this is the core challenge: gradient descent doesn't care about preserving old behavior, it just minimizes loss on whatever's in front of it, and new data overwrites old representations.
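a tiny sketch makes this concrete. the synthetic tasks, network size, and hyperparameters below are all invented for illustration: train a small model on task A, then on task B with plain gradient descent, and task-A accuracy collapses.

```python
# minimal sketch of catastrophic forgetting; tasks and model are made up
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(shift):
    # two gaussian classes split along x0; `shift` moves the boundary
    x = torch.randn(512, 2) + shift
    y = (x[:, 0] > shift[0]).long()
    return x, y

def train(model, x, y, steps=200):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
xa, ya = make_task(torch.tensor([0.0, 0.0]))   # task A: boundary at x0 = 0
xb, yb = make_task(torch.tensor([4.0, -4.0]))  # task B: boundary at x0 = 4

train(model, xa, ya)
print("task A accuracy after A:", accuracy(model, xa, ya))  # high
train(model, xb, yb)  # plain gradient descent on task B...
print("task A accuracy after B:", accuracy(model, xa, ya))  # ...drops toward chance
```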
02
it learns sequentially
standard training shuffles data from many tasks into one big i.i.d. mix, which sidesteps the forgetting problem entirely. continual learning means handling data that arrives one task at a time, the way the real world works.
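here's the difference in miniature. the toy tasks and model are stand-ins; only the data ordering changes between the two loops.

```python
# minimal sketch contrasting the two regimes; tasks and model are stand-ins
import random
import torch
import torch.nn as nn

torch.manual_seed(0)

# two toy "tasks": lists of (input, label) pairs from different regions
tasks = [
    [(torch.randn(2), torch.tensor(0)) for _ in range(100)],
    [(torch.randn(2) + 3, torch.tensor(1)) for _ in range(100)],
]

def train(model, stream):
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    loss_fn = nn.CrossEntropyLoss()
    for x, y in stream:
        opt.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        opt.step()

# standard training: pool everything and shuffle, the i.i.d. setting
pooled = [ex for task in tasks for ex in task]
random.shuffle(pooled)
train(nn.Linear(2, 2), pooled)

# continual training: tasks arrive strictly in order, with no revisiting
sequential = [ex for task in tasks for ex in task]  # task 0, then task 1
train(nn.Linear(2, 2), sequential)
```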
03
it handles varied data
forgetting only matters when new tasks are actually different from old ones. a real system has to handle arbitrary new data, not just inputs that resemble what it has already seen.
04
it's efficient
you could avoid forgetting by replaying all previous training data alongside new data, but re-ingesting trillions of tokens on every update isn't viable. it has to learn from the limited new data an update brings, plus at most a small budget of old examples.
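one common compromise is a small fixed-size replay buffer instead of full replay. the sketch below uses reservoir sampling to keep a uniform sample of everything seen so far in bounded memory; the buffer size, stream, and `train_step` placeholder are all illustrative.

```python
# minimal replay-buffer sketch; stream and training step are placeholders
import random

class ReservoirBuffer:
    """keeps a uniform random sample of the stream in O(capacity) memory."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # keep each stream element with probability capacity / seen
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        return random.sample(self.items, min(k, len(self.items)))

stream = ((f"x{i}", i % 2) for i in range(10_000))  # stand-in data stream
buffer = ReservoirBuffer(capacity=100)
for example in stream:
    replay = buffer.sample(4)  # a few old examples per update, not everything
    # train_step(model, [example, *replay])  # hypothetical update call
    buffer.add(example)
print(len(buffer.items), "examples retained out of", buffer.seen)
```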
05
it composes skills over time
a system that learns cats vs. dogs, then birds vs. fish, should also be able to tell cats from birds — not just silo each new task.
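the corresponding test is pairwise accuracy across tasks: restrict the model's output to two classes that were never trained together and check that it can still separate them. the sketch below assumes a class-incremental setup with one logit per class (0=cat, 1=dog, 2=bird, 3=fish); the model and data are stand-ins.

```python
# minimal cross-task evaluation sketch; model and data are stand-ins
import torch

def pairwise_accuracy(model, x, y, class_a, class_b):
    """accuracy when the model only has to choose between two classes,
    even if those classes were learned in different tasks."""
    logits = model(x)[:, [class_a, class_b]]       # restrict to the pair
    pred_idx = logits.argmax(dim=1)                # 0 -> class_a, 1 -> class_b
    pred = torch.tensor([class_a, class_b])[pred_idx]
    return (pred == y).float().mean().item()

# demo with an untrained stand-in model: expect roughly chance accuracy.
# after training task 1 (cat vs. dog) then task 2 (bird vs. fish), a system
# that composes skills should score well on the never-seen cat-vs-bird pair.
model = torch.nn.Linear(8, 4)
x = torch.randn(64, 8)
y = torch.randint(0, 2, (64,)) * 2                 # labels: 0 (cat) or 2 (bird)
print(pairwise_accuracy(model, x, y, class_a=0, class_b=2))
```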