Random sampling is a simple, generic, and universal method for dealing with massive amounts of data across all scientific disciplines. It has wide-ranging applications in statistics, online algorithms, databases, approximation algorithms, machine learning, and other fields. Perhaps the central reason for its wide applicability is that, provably and with high probability, a small number of random samples from a large dataset suffices to "represent" the dataset truthfully.
In this work, we investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: An adversary sends a stream of elements from a universe U to a sampling algorithm, with the goal of making the sample "very unrepresentative" of the underlying data stream. This is modeled by a two-player game between a randomized algorithm and an adaptive adversary. Well-known results indicate that if the stream is chosen non-adaptively, then a random sample of small size is a good approximation of the full data with high probability (a formal definition will be given in the talk).
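As a concrete reference point for the non-adaptive setting, here is a minimal sketch of classic reservoir sampling (Algorithm R), with an empirical check that, on a stream fixed in advance, the sample density of a set is close to its density in the stream. The stream, sample size, and target set here are illustrative choices, not taken from the talk.

```python
import random

def reservoir_sample(stream, k, rng):
    """Classic reservoir sampling: maintains a uniform sample of k
    elements from a stream whose length is unknown in advance."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            # Element i survives with probability k / (i + 1),
            # replacing a uniformly chosen current sample slot.
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x
    return sample

# Non-adaptive stream: 30% of the elements belong to the target set {1}.
stream = [0] * 7000 + [1] * 3000
random.Random(1).shuffle(stream)

sample = reservoir_sample(stream, k=500, rng=random.Random(2))
est = sample.count(1) / len(sample)  # close to the true density 0.3
```

The point of the talk is precisely that this guarantee, which holds for any stream chosen before the randomness, is no longer automatic once the stream may depend on the algorithm's past behavior.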
Does this sample size suffice for robustness against an adaptive adversary?
The short answer is negative, and we demonstrate this by describing an efficient adaptive attack. However, this attack is "theoretical only": it requires the universe of elements to be exponentially large in the stream length. This is no coincidence: we show how to make the sampling algorithm robust against adaptive adversaries when the universe is smaller.
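To see why adaptivity changes the picture, consider the following toy attack on a Bernoulli sampler whose sampling decisions are visible to the adversary. This is only an illustration of the phenomenon under assumed modeling choices (a public sample, a universe large enough to supply fresh elements), not necessarily the attack from the talk: the adversary targets the set B of elements it observed going unsampled, so B has density about 1 - p in the stream but density 0 in the sample.

```python
import random

def bernoulli_sampler(p, rng):
    """Keeps each arriving element independently with probability p.
    The return value of process() reveals the sampling decision,
    modeling an adversary who can observe the current sample."""
    sample = set()
    def process(x):
        if rng.random() < p:
            sample.add(x)
        return x in sample
    return process, sample

process, sample = bernoulli_sampler(p=0.1, rng=random.Random(0))

n = 10_000
B = set()
for x in range(n):        # a fresh element each step: needs a large universe
    if not process(x):    # adaptively collect exactly the unsampled elements
        B.add(x)

stream_density = len(B) / n   # about 1 - p = 0.9
# By construction the sample contains no element of B at all,
# so the sample's estimate of B's density is 0 instead of ~0.9.
```

Note that B here depends on the sampler's random choices, which is exactly what an adaptive adversary is allowed; a non-adaptive stream must fix its target set before the randomness is drawn. The attack also consumes n distinct elements, which is why the size of the universe is the crucial resource.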
The talk will be self-contained.
Joint work with Omri Ben-Eliezer.