Abstract
We initiate a study of the problem of PAC-learning from distributed data and analyze the fundamental communication-complexity questions involved. Broadly, we consider a framework where information is distributed among several locations, and our goal is to learn a low-error hypothesis with respect to the overall distribution of data using as little communication, and as few rounds of communication, as possible. As an example, suppose k research groups around the world have collected large scientific datasets, such as genomic sequence data or sky survey data, and we wish to perform learning over the union of all these different datasets without too much communication.
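To make the communication measure concrete, below is a purely illustrative Python sketch of the most naive one-round protocol in this framework: each of the k entities ships a batch of labeled examples to a center, which runs ERM over the pooled sample. Everything here (the names player_sample and baseline_protocol, the 1-D threshold class, counting communication as the number of transmitted examples) is a hypothetical choice for illustration and not an algorithm from the paper.

```python
import random

def player_sample(data, m, rng):
    """Each player transmits m labeled examples drawn from its local dataset."""
    return rng.sample(data, min(m, len(data)))

def erm_threshold(sample):
    """ERM for 1-D thresholds: pick the candidate threshold minimizing
    empirical error on the pooled sample."""
    xs = sorted(x for x, _ in sample)
    candidates = [0.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [1.0]
    def err(t):
        return sum((x >= t) != y for x, y in sample)
    return min(candidates, key=err)

def baseline_protocol(players, m_per_player, rng):
    """Naive one-round protocol: every player ships m examples to the center,
    which learns on the union.  Communication cost = k * m examples."""
    pooled = [ex for data in players for ex in player_sample(data, m_per_player, rng)]
    return erm_threshold(pooled), len(pooled)

# Toy experiment: k players whose data is labeled by a threshold at 0.6.
rng = random.Random(0)
theta, k = 0.6, 4
players = [[(x, x >= theta) for x in (rng.random() for _ in range(1000))]
           for _ in range(k)]
h, sent = baseline_protocol(players, m_per_player=50, rng=rng)
print(f"learned threshold {h:.3f} using {sent} transmitted examples")
```

In this sketch the baseline costs k*m transmitted examples; communication-efficient protocols of the kind the paper studies aim to reach comparable accuracy while transmitting substantially less.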
We provide general upper and lower bounds on the amount of communication needed to learn a given class, as well as broadly applicable techniques for achieving communication-efficient learning. We also analyze a number of important specific classes, giving efficient learning algorithms with especially good communication performance.
We additionally present an analysis of privacy, considering both differential privacy and a notion of distributional privacy that is especially appealing in this context.