Abstract

Current large language models face numerous safety concerns and a wide range of attacks. In this talk, we will identify the common root causes of these failures. Using a simple illustrative problem, we will walk through several defense strategies and evaluate their strengths and weaknesses. Finally, we will draw connections to the broader literature on safety and robustness in machine learning.

Video Recording