How to Fail Interpretability Research

Abstract

Stoic philosophers practiced a method called “premeditation of evils” to prepare today for potential failures. The idea is simple but powerful: think in reverse. Instead of figuring out how to succeed, think about how you might fail, then try to avoid those mistakes. Interpretability is well recognized as an important problem in ML, but it is a complex one with many potential failure modes. In this talk, I’ll share a few failure modes in 1) setting expectations, 2) making an interpretability method, 3) evaluating, and 4) interpreting an explanation. To put things in context, I’ll share statistics on how many of last year’s publications on the topic make these mistakes. I will also share some open theoretical questions that may help us move forward. I hope this talk will offer a new angle on how to make progress in this field.
