Abstract

Performance evaluation plays a central role in decisions about whether and how predictive algorithms should be deployed in high-stakes settings. Yet, in many real-world domains, evaluation is fundamentally difficult: the data available for assessment are often biased, incomplete, or noisy, and the act of deploying a model can itself alter which outcomes are observed. As a result, standard evaluation practices may substantially misrepresent both overall model performance and disparities across groups. In this talk, we examine several common threats to valid evaluation—including measurement error, selection bias, and distribution shift—and present principled evaluation methods that enable valid performance assessment under these challenges when appropriate conditions are met.
