Abstract

Large language models, vision-language models, and other generative AI systems are rapidly permeating society: when it was released, ChatGPT was the fastest-growing app in history. As this technology proliferates, society needs tools to understand and steer its effects.

The promise (and complexity) of generative AI lies in its open-ended behavior. To tackle this complexity, we need tools that can adaptively query an AI model to find unexpected behaviors, then categorize them into human-interpretable patterns. I will describe systems we built for this task, and show how we can leverage other AI systems as part of this pipeline.

Finally, it is not enough to understand systems; we also need to steer them based on that understanding. I will show how, by understanding the structure of neural representations, we can steer models to be more accurate and truthful.