Abstract

Over the past few years, researchers have proposed a myriad of ways to measure the robustness of machine learning models. In the first part of the talk, we will survey the current robustness landscape based on two large-scale experimental studies involving more than 200 different models and test conditions. Despite the large variety of test conditions, three common trends emerge: (i) robustness to natural distribution shifts and robustness to synthetic perturbations are distinct phenomena, (ii) current algorithmic techniques have little effect on robustness to natural distribution shifts, and (iii) training on more diverse datasets offers robustness gains on several natural distribution shifts. In the second part of the talk, we leverage these insights to improve pre-trained models such as OpenAI’s CLIP. CLIP achieved unprecedented robustness on several natural distribution shifts, but only when used as a zero-shot model. The zero-shot evaluation precludes the use of extra data for fine-tuning and hence leads to lower performance when there is a specific task of interest. To address this issue, we introduce weight-space ensembling as a simple yet effective method for fine-tuning zero-shot models that yields large robustness gains without reducing in-distribution performance.
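
To make the idea concrete, below is a minimal sketch of weight-space ensembling: the zero-shot and fine-tuned weights are linearly interpolated parameter by parameter. It assumes two PyTorch models with identical architectures; the names `zero_shot_model`, `finetuned_model`, and `alpha` are illustrative, not taken from the talk.

```python
# Sketch of weight-space ensembling: interpolate between the zero-shot and
# fine-tuned weights, (1 - alpha) * theta_zeroshot + alpha * theta_finetuned.
import copy
import torch


def weight_space_ensemble(zero_shot_model: torch.nn.Module,
                          finetuned_model: torch.nn.Module,
                          alpha: float = 0.5) -> torch.nn.Module:
    """Return a model whose weights interpolate the two input models."""
    theta_zs = zero_shot_model.state_dict()
    theta_ft = finetuned_model.state_dict()
    assert theta_zs.keys() == theta_ft.keys(), "models must share an architecture"

    theta_wse = {}
    for key in theta_zs:
        if torch.is_floating_point(theta_zs[key]):
            # Interpolate every floating-point parameter/buffer in weight space.
            theta_wse[key] = (1.0 - alpha) * theta_zs[key] + alpha * theta_ft[key]
        else:
            # Non-float buffers (e.g. integer counters) are copied from the fine-tuned model.
            theta_wse[key] = theta_ft[key]

    ensembled = copy.deepcopy(zero_shot_model)
    ensembled.load_state_dict(theta_wse)
    return ensembled
```

In practice one would sweep `alpha` between 0 (zero-shot weights only) and 1 (fine-tuned weights only) and evaluate both in-distribution and under distribution shift.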

Video Recording