Abstract

When training machine learning models, combining data from different sources is not always beneficial. Although more data generally helps, mixing data from dissimilar sources can reduce overall accuracy, produce unpredictable fairness outcomes, and worsen performance for underrepresented groups. We call this situation the "Data Addition Dilemma". We find that it possibly arises from an empirically observed trade-off between performance gains from data scaling and performance deterioration from distribution shift. We therefore establish baseline strategies for navigating the dilemma, introducing distribution shift heuristics that guide which data sources to add when scaling so that the expected performance improvements are realized. We conclude with a discussion of the considerations required for data collection and suggestions for studying data composition and scale in the age of increasingly large models.
