Abstract

Weak-to-strong generalization refers to the ability of a reasoning model to solve "harder" problems than those in its training set. I'll argue that recurrent architectures, in which networks can dynamically scale the amount of computation used to solve a problem, are necessary to achieve dramatic weak-to-strong behavior. I'll present examples where recurrent networks exhibit weak-to-strong generalization on a range of simple reasoning problems. Then I'll show that transformer-based LLMs benefit from recurrence as well, boosting their performance on weak-to-strong arithmetic tasks.
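
To make the idea of dynamically scaling computation concrete, here is a minimal sketch of a recurrent "thinking" network: a single weight-tied block is applied a variable number of times, so more iterations can be spent on harder inputs at test time than were used in training. The class name, dimensions, and iteration counts below are illustrative assumptions, not the speaker's exact architecture.

```python
# Minimal sketch of a recurrent reasoning network (illustrative, not the exact
# architecture from the talk): one weight-tied residual block is iterated a
# variable number of times, decoupling test-time compute from training compute.
import torch
import torch.nn as nn

class RecurrentReasoner(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.embed = nn.Linear(dim, dim)     # project input into a hidden state
        self.block = nn.Sequential(          # weight-tied recurrent block
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
        )
        self.head = nn.Linear(dim, dim)      # read out an answer from the state

    def forward(self, x: torch.Tensor, iterations: int) -> torch.Tensor:
        h = self.embed(x)
        for _ in range(iterations):          # same weights reused every step
            h = self.block(h) + h            # residual update of the state
        return self.head(h)

model = RecurrentReasoner()
x = torch.randn(4, 128)
easy = model(x, iterations=8)    # compute budget comparable to training
hard = model(x, iterations=64)   # extra iterations at test time for harder inputs
```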

Video Recording