Abstract: Modern multi-modal models such as CLIP require significant engineering effort to train, evaluate, and deploy efficiently. Furthermore, such models typically serve as backbone feature extractors for many downstream tasks. This talk will provide an overview of how we’ve built this infrastructure at Apple, where CLIP now powers a large number of user experiences on iOS. We’ll cover topics such as multi-node, multi-GPU distributed training on billions of examples, transfer learning for downstream tasks, model pruning, efficient on-device inference with transformers, and more.