Abstract

Steering methods manipulate the representations of large language models (LLMs) to induce responses with desired properties, e.g., truthfulness, offering a promising approach to LLM alignment without the need for fine-tuning. However, these methods typically require supervision, e.g., from contrastive pairs of prompts that vary by a single target concept, which is costly to obtain and limits the pace of steering research. An appealing alternative is to use unsupervised approaches such as sparse autoencoders (SAEs) to map LLM embeddings to sparse representations that capture human-interpretable concepts. Yet without further assumptions, SAEs may not be identifiable: they could learn latent dimensions that entangle multiple concepts, leading to unintended steering of unrelated properties. In this talk, I'll introduce sparse shift autoencoders (SSAEs). These models map the differences between embeddings to sparse representations that capture concept shifts. Crucially, we show that SSAEs are identifiable from paired observations that vary by multiple unknown concepts, enabling accurate steering of single concepts without the need for supervision. We empirically demonstrate accurate steering across semi-synthetic and real-world language datasets using Llama-3.1 embeddings.
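
For intuition, below is a minimal sketch of the core idea: an autoencoder that encodes the *difference* between paired embeddings into a sparse code and reconstructs that shift. The architecture, dimensions, sparsity penalty, and names here are illustrative assumptions, not the exact SSAE formulation or training recipe from the talk.

```python
# Sketch of a sparse shift autoencoder (SSAE) on embedding differences.
# All hyperparameters and the linear encoder/decoder choice are assumptions.
import torch
import torch.nn as nn


class SparseShiftAutoencoder(nn.Module):
    def __init__(self, embed_dim: int, code_dim: int):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, code_dim)  # embedding shift -> code
        self.decoder = nn.Linear(code_dim, embed_dim)  # code -> reconstructed shift

    def forward(self, delta: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        code = torch.relu(self.encoder(delta))  # non-negative, sparsity-friendly code
        recon = self.decoder(code)
        return recon, code


def ssae_loss(recon, delta, code, l1_weight: float = 1e-3):
    # Reconstruct the shift, plus an L1 penalty encouraging each code
    # dimension to capture a single concept shift.
    return nn.functional.mse_loss(recon, delta) + l1_weight * code.abs().mean()


if __name__ == "__main__":
    embed_dim, code_dim = 4096, 8192  # assumed hidden size and overcomplete code
    model = SparseShiftAutoencoder(embed_dim, code_dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Stand-in for paired embeddings (x, x') that differ by unknown concepts;
    # in practice these would come from an LLM such as Llama-3.1.
    x = torch.randn(32, embed_dim)
    x_prime = torch.randn(32, embed_dim)
    delta = x_prime - x  # the embedding shift to be explained

    opt.zero_grad()
    recon, code = model(delta)
    loss = ssae_loss(recon, delta, code)
    loss.backward()
    opt.step()
    print(f"loss: {loss.item():.4f}")
```

A decoder column associated with an active code dimension can then be added to an embedding to steer the corresponding concept, under the assumption that each code dimension isolates a single concept shift.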