Abstract

Estimating the support size of a distribution is a classical problem in probability and statistics, dating back to the early work of Fisher, Good-Turing, and the famous paper on "how many words did Shakespeare know" by Efron-Thisted. When the samples are scarce, the challenge lies in how to extrapolate the number of unseen symbols based on what have been seen so far.

We consider the estimation of the support size of a discrete distribution using independent samples whose minimum non-zero mass is at least \frac{1}{k}. We show that, to achieve an additive error of \epsilon k with probability at least 0.9, the sample complexity is within universal constant factors of \frac{k}{\log k}\log^2\frac{1}{\epsilon}, which improves the state-of-the-art result \frac{k}{\log k} \frac{1}{\epsilon^2} due to Valiant-Valiant. The apparatus of best polynomial approximation and its duality to the moment matching problem play a key role in both the impossibility result and the construction of linear-time estimators with provable optimality.

We also investigate the closely-related species problem where the goal is to estimate the number of distinct colors in an ball-urn model from repeated draws. Experimental results on real data will be presented. Time permitted, extensions to other distributional properties will be discussed. (Joint work with Pengkun Yang).

Video Recording