Abstract

As LLM capabilities improve rapidly across a range of domains (including ones their designers didn’t intend), it becomes increasingly challenging to rule out catastrophic harms. I’ll argue for the need to make affirmative safety cases for LLMs. Once LLMs are capable of carrying out complex autonomous plans, understanding their motivational structures becomes central to safety. I’ll highlight the need for a science of LLM generalization, so that we can understand how training data shapes a model’s beliefs and motivations.