Are Aligned Language Models “Adversarially Aligned”?

Workshop

Large Language Models and Transformers

Speaker(s)

Nicholas Carlini (Google DeepMind)

Location

Calvin Lab Auditorium

Date

Wednesday, Aug. 16, 2023

Time

3:30 – 4:30 p.m. PT

Abstract

An "aligned" model is "helpful and harmless". In this talk, I will show that while language models may be aligned in typical situations, they are not "adversarially aligned". Using standard techniques from adversarial examples, we can construct inputs to otherwise-aligned language models that coerce them into emitting harmful text and performing harmful behavior. Creating aligned models robust to adversaries will require significant advances in both alignment and adversarial machine learning.
