Abstract

Large language models are "aligned" to bias them toward outputting responses that score well on various measures; for example, we may want them to be helpful, factual, and polite. Alignment procedures often also change off-target aspects of the outputs; for example, aligning a model to be more helpful often substantially increases its average response length. It is natural to wonder whether these off-target effects are (correct) by-products of the alignment goal (e.g., making a response helpful also makes it longer) or a consequence of a "spurious correlation" in the alignment procedure. I'll discuss some recent work from my lab on alignment and its implications for this distinction.

Video Recording