Abstract
Pre-training plays a central role in modern natural language processing, from the ELMo and BERT models of 2018 to the large language models of today. To what extent does pre-training mitigate the challenges of domain transfer and distribution shift? Does pre-training really teach models to understand tasks, or does it simply make them better at fitting (potentially spurious) patterns? What types of “transfer” even make sense for large language models that have been pre-trained on the entire internet? I will describe my journey in grappling with these questions, from the dawn of language model pre-training to the present day. I will then present several challenges for understanding the role of distribution shift in the modern era, including how to define shifts relative to the pre-training data distribution and how to disentangle distribution shift from intrinsic example difficulty as causes of model errors.