Abstract
AI systems form a complex pipeline from training data, to learned representations, to observed behaviors. Can we use AI to help us understand each of these objects, and use this understanding to steer and align the system? I will present a series of tools that use AI to understand AI, at the level of both behaviors and learned representations.
First, we consider behavior elicitation: the problem of finding prompts that elicit a specified model behavior, such as "the model hallucinates the signing date of the Declaration of Independence". We train investigator agents that automatically elicit a behavior from such a description, formulating elicitation as a reinforcement-learning problem and applying a combination of supervised fine-tuning and DPO. We use these agents to construct strong jailbreaks, surface hallucinations, and invert models.
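As a rough illustration of the preference-based stage, the sketch below implements the standard DPO objective on pairs of candidate prompts, where a "chosen" prompt elicited the target behavior and a "rejected" prompt did not. The function and variable names are illustrative, not taken from the work itself.

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        """Direct Preference Optimization loss on elicitation prompts.

        Each argument is a tensor of per-sequence log-probabilities of a
        candidate prompt under the investigator policy or a frozen
        reference model, conditioned on the behavior description.
        """
        chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
        rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
        # Widen the margin between prompts that elicited the behavior
        # and prompts that did not.
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()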
Second, we consider neuron description: understanding what causes a neuron to activate and describing this in natural language. We significantly improve on previous description pipelines, obtaining descriptions at or slightly above human quality. Our pipeline is also cheap, consisting of 8B-parameter open-weight models.
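To make the description step concrete, here is a minimal sketch of one explainer call: format a neuron's top-activating snippets and ask a small open-weight model for a one-sentence description. The exemplar format, prompt wording, and the `generate` callable (standing in for any local 8B chat model) are assumptions for illustration, not the exact pipeline.

    def describe_neuron(exemplars, generate):
        """exemplars: list of (snippet, peak_token, activation) tuples."""
        formatted = "\n".join(
            f"- ...{snippet}...  (peak on '{tok}', activation {act:.2f})"
            for snippet, tok, act in exemplars
        )
        prompt = (
            "The following text snippets strongly activate a single neuron "
            "in a language model; the peak token and activation are shown.\n"
            f"{formatted}\n"
            "In one sentence, describe what this neuron responds to."
        )
        return generate(prompt)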
Finally, we deploy our neuron descriptions in an observability interface called Monitor. We use Monitor to understand several puzzling model behaviors, including why language models often say that 9.8 is smaller than 9.11.