Abstract

Recent advancements in data augmentation have led to state-of-the-art performance on diverse machine learning tasks. However, it remains unclear how to choose, compare, and schedule augmentations in a principled way. I will present ongoing work, joint with Y. Sun, which provides a theoretical framework for data augmentation in which such questions can be studied directly. Our framework is general enough to unify augmentations such as synthetic noise, CutOut, and label-preserving transformations (color jitter, geometric transformations) together with more traditional stochastic optimization methods (SGD, Mixup). The essence of our approach is that any augmentation corresponds to noisy gradient descent on a time-varying sequence of proxy losses. Specializing our framework to overparameterized linear models, we obtain a Robbins-Monro-type result, which provides conditions for jointly scheduling the learning rate and augmentation strength.
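
As a point of reference, the sketch below records the noisy-gradient-descent viewpoint described above together with the classical Robbins-Monro step-size conditions that the scheduling result generalizes. The notation (parameters $w_t$, learning rate $\eta_t$, proxy losses $L_t$, gradient noise $\xi_t$) is illustrative and not taken from the work itself, and the joint condition on learning rate and augmentation strength is part of the ongoing work and is not reproduced here.

```latex
% Augmentation viewed as noisy gradient descent on a time-varying
% sequence of proxy losses L_t (illustrative notation):
\[
  w_{t+1} \;=\; w_t \;-\; \eta_t \bigl( \nabla L_t(w_t) + \xi_t \bigr),
  \qquad \mathbb{E}[\xi_t \mid w_t] = 0 .
\]
% Classical Robbins-Monro conditions on the step sizes \eta_t, which the
% result above extends to joint schedules of learning rate and
% augmentation strength:
\[
  \sum_{t \ge 1} \eta_t = \infty,
  \qquad
  \sum_{t \ge 1} \eta_t^2 < \infty .
\]
```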