Abstract

Large language models are "aligned" to bias them toward outputting responses that score well on various measures; for example, we may want them to be helpful, factual, and polite. Alignment procedures often also change off-target aspects of the outputs; for example, aligning a model to be more helpful often substantially increases its average response length. It is natural to wonder whether these off-target effects are (correct) by-products of the alignment goal (e.g., making a response helpful also makes it longer) or a consequence of a "spurious correlation" in the alignment procedure. I'll discuss some recent work from my lab on alignment and its implications for this distinction.

Video Recording