
Abstract
Scalable oversight aims to align AI systems with human values by training AI models on human feedback and using AI assistance to strengthen that feedback signal. This talk will cover:
1. Recent theoretical work applying tools from computational complexity, multi-agent training dynamics, and learning theory to design improved scalable oversight methods that achieve theoretical guarantees under simplified assumptions about human feedback.
2. Prospects for extending such methods to weaker (and thus more realistic) assumptions about human feedback, and stronger requirements on solutions.
3. Prospects for integrating these developments into practical ML training.
For (1), we have a new "prover-predictor game" variant of debate that (in a theoretical setting, under sufficiently strong assumptions) avoids the "obfuscated arguments" problem discovered in 2020 during scalable oversight experiments with human participants. Previous versions of debate either assumed infinitely powerful agents or required computational complexity proportional to the length of a human-checkable argument. The new method instead lets ML systems spend time scaling with the length of an ML-checkable argument, which can be much shorter when superhuman heuristics are involved.
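As a rough schematic of this complexity claim (the notation below is mine, introduced only for illustration): let |π_H| denote the length of the shortest human-checkable argument for a claim, and |π_M| the length of the shortest ML-checkable one. Then, loosely,

    T_{\text{prior debate}} = \mathrm{poly}(|\pi_H|) \qquad \text{vs.} \qquad T_{\text{prover-predictor}} = \mathrm{poly}(|\pi_M|),

where |π_M| can be far smaller than |π_H| when superhuman heuristics compress the argument.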
For (2), the talk will lay out some sources of optimism, in the hope of encouraging more work in this area. The current methods have concrete theoretical limitations that may be addressable with existing theoretical tools. It is not clear that this work will succeed, but it is importantly orthogonal to much of the safety research occurring at AI labs today, and I believe there are strong prospects for bringing in new ideas from other areas of theoretical computer science that have not yet been applied to AI safety.
For (3), the new method has the structure of a zero-sum, adversarial team game, and both theoretical and empirical evidence shows that such games admit practical, convergent training methods. Importantly, while the asymptotic guarantees provided by this type of theory are weaker than full verification, they may also be more likely to translate into practice.
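To make "convergent training methods" concrete in the simplest possible setting, here is a minimal sketch (mine, not from the talk) of self-play via multiplicative weights on rock-paper-scissors, a two-player zero-sum game; the time-averaged strategies of such no-regret dynamics provably converge to a Nash equilibrium. The function name, step size, and iteration count are illustrative choices, not part of the proposed method.

    import numpy as np

    # Rock-paper-scissors payoff matrix for the row player; the column
    # player receives the negation (zero-sum).
    A = np.array([[ 0., -1.,  1.],
                  [ 1.,  0., -1.],
                  [-1.,  1.,  0.]])

    def mwu_selfplay(A, steps=5000, eta=0.05):
        # Deliberately skewed starting strategies, so convergence is non-trivial
        # (the uniform strategy is already an equilibrium fixed point here).
        x = np.array([0.8, 0.1, 0.1])   # row player (maximizer)
        y = np.array([0.1, 0.8, 0.1])   # column player (minimizer)
        x_avg, y_avg = np.zeros_like(x), np.zeros_like(y)
        for _ in range(steps):
            x = x * np.exp(eta * (A @ y))      # reweight toward high-payoff rows
            x /= x.sum()
            y = y * np.exp(-eta * (A.T @ x))   # reweight toward low-payoff columns
            y /= y.sum()
            x_avg += x
            y_avg += y
        # In two-player zero-sum games, the time-averaged strategies of
        # no-regret dynamics converge to a Nash equilibrium.
        return x_avg / steps, y_avg / steps

    x_star, y_star = mwu_selfplay(A)
    print(np.round(x_star, 3), np.round(y_star, 3))  # both approach [1/3, 1/3, 1/3]

The open question the talk points at is carrying this style of guarantee from small matrix games up to the adversarial team games that arise in practical ML training.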