Abstract

Random sampling is a classic statistical tool for summarizing and performing analytics on data that is too large to store or to process at scale. Suppose that our dataset consists of keys (cookies, users, queries) with positive numeric weights. We want to estimate the sum of some function $f$ of the weights over keys in a specified segment; we call such a sum an $f$-statistic. For example, the number of users from a certain zipcode can be expressed in this way. Weighted sampling can be applied to estimate such $f$-statistics. Very common statistics are the number of active keys in the segment, the sum of their weights, and capped sums, where each weight is capped at a parameter value. Classic sampling schemes achieve optimal tradeoffs of sample size and estimation quality for a particular $f$-statistic, roughly, by including keys with probability proportional to $f$ applied to their weight.

This talk will focus on extensions of this classic tool to two important regimes. The first concerns multi-objective sampling schemes, which obtain optimal samples that provide estimates with specified statistical guarantees for a set of functions. We will also show that the sample-size overhead (with respect to the optimal sample for a particular $f$) is surprisingly small for important classes of functions. The second regime addresses scalability on streamed or distributed data with unaggregated presentation: each key has multiple elements, and its weight is the sum of these element weights, but we want to sample without costly aggregation. We will present a simple sampling framework with novel estimators which, in particular, provides the first solution for capping statistics, used in online advertising.
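
To make the classic scheme mentioned above concrete, the following Python sketch (not taken from the talk) illustrates Poisson sampling with inclusion probabilities proportional to $f$ applied to the weight, together with an inverse-probability (Horvitz-Thompson) estimate of an $f$-statistic over a segment. The function names and the threshold parameter `tau` are illustrative assumptions, not the specific schemes presented in the talk.

```python
# Illustrative sketch: Poisson sampling with inclusion probability
# p_x = min(1, f(w_x) / tau), and an inverse-probability estimate of
# the f-statistic sum_{x in segment} f(w_x). `tau` trades off expected
# sample size against estimation quality.

import random


def cap(T):
    """Return the capping function cap_T(w) = min(w, T)."""
    return lambda w: min(w, T)


def poisson_sample(weights, f, tau):
    """Include each key independently with probability min(1, f(w)/tau).

    Returns a dict mapping sampled keys to their inclusion probabilities.
    """
    sample = {}
    for key, w in weights.items():
        p = min(1.0, f(w) / tau)
        if random.random() < p:
            sample[key] = p
    return sample


def estimate_f_statistic(sample, weights, f, segment):
    """Unbiased (Horvitz-Thompson) estimate of sum over the segment of f(w_x)."""
    return sum(f(weights[key]) / p
               for key, p in sample.items() if key in segment)


if __name__ == "__main__":
    weights = {"u1": 3.0, "u2": 10.0, "u3": 0.5, "u4": 7.0}
    segment = {"u1", "u2", "u4"}          # e.g. keys from one zipcode
    f = cap(5.0)                          # capped-sum statistic with T = 5
    sample = poisson_sample(weights, f, tau=2.0)
    print(estimate_f_statistic(sample, weights, f, segment))
```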