I will talk about an optimal online algorithm for Internet Ad markets, such as Google's AdWords market. Although this result was obtained over a decade ago when the algorithmic and economic/game-theoretic issues of this marketplace were just being understood, its impact is becoming clear only in recent years. Our result addresses a central algorithmic issue underlying this marketplace: how to match query keywords to advertisers so as to maximize Google's revenue. I will give an overview of the novel LP-based techniques that led to this result and the simple heuristic, of bid scaling, that is suggested by our algorithm which meets one of the key requirements that Google should be able to make query allocations in real time. These ideas have been widely adopted by Google and other search engine companies. Purely theoretical work, from the 1990s, on the online bipartite matching problem greatly benefitted our work. The latter problem, while mathematically clean and elegant, appears to have no applications. On the other hand, the multi-billion dollar online ad industry has become the key source of revenues for several Internet companies. For algorithms designers, this a very happy story of practical impact from rich theory.

### Monday, April 30th, 2018

Mechanism design studies the design of incentive structures in order to achieve a desired objective in the presence of strategic agents. It has a wide range of applications including online auctions, transportation systems and energy markets. Most work in mechanism design assumes that agents are risk neutral. For example, one of the most celebrated results in mechanism design, Myerson's characterization of the revenue optimal auction for selling a single item implies that the optimal revenue can be achieved via a simple auction. However, this is no longer the case if the agents exhibit risk-aversion, where the optimal auction can be quite complicated.

While risk-aversion is usually exhibited for larger rewards, behavioral studies have shown as well as existing practices in transportation systems and demand response protocols that agents exhibit risk-seeking behavior when the stakes are low. We analyze the optimal auction for risk-seeking agents and show surprisingly that it is quite simpler compared to its risk-averse counterpart. Finally, we discuss further directions in mechanism design beyond risk neutrality including different models for risk as well as risk-robust optimization.

We seek to understand when heterogeneity in agent preferences yields improved outcomes in terms of overall cost. That this might be hoped for is based on the common belief that diversity is advantageous in many multi-agent settings. We investigate this in the context of routing. Our main result is a sharp characterization of the network settings in which diversity always helps, versus those in which it is sometimes harmful.

Specifically, we consider routing games, where diversity arises in the way that agents trade-off two criteria (such as time and money, or, in the case of stochastic delays, expectation and variance of delay). Our main contributions are: 1) A participant-oriented measure of cost in the presence of agent diversity; 2) A full characterization of those net- work topologies for which diversity always helps, for all latency functions and demands.

Bike-sharing systems are changing the urban transportation landscape; we have been working with New York City Bike Share (aka Citi Bike), using analytics and optimization to improve the management of their system. Huge rush-hour usage imbalances the system, and in this talk we focus on methods used to mitigate the imbalances that develop. In particular, we will focus on the use of incentives; we have helped guide the development of Bike Angels, which enlists users to make

“rebalancing rides”, and we will describe tradeoffs among a number of policies for determining when and where rides should be incentivized, all of which are based on a user dissatisfaction function model of the performance of the system. The recent incentive results are joint work with Hangil Chung and Daniel Freund, but the basis for much of the research presented is also joint with Shane Henderson and Eoin O’Mahony.

In the future, a city operator might react in real time to high pollution levels, by signaling to a fleet operator (deliveries, groceries and food, Uber or Lyft, etc.) that it should avoid certain streets. What makes this a challenge? (1) Fleet operators control sizable amounts of traffic in a city, and they work with real-time data feeds on location and job assignment, and they can change their control policies with short notice. In the past, the control problem for a city has involved slow-changing signals sent to predictable users, but now it can involve much more dynamic and high-impact actions. (2) The outcome depends on the interplay between three types of agent, each with their own goals: the user, the fleet operators, and the city operator. Fleet operators can add new features at the speed of app development, far faster than the speed of city planning. Therefore the city operator needs to be able to quickly use data visualisation and ad hoc analysis, Excel-style, tied to on-demand flexible simulation, rather than waiting for academics to devise theoretical models. I will describe a project that's starting at the Alan Turing Institute, prototyping such a dashboard. We believe it will be a useful contribution to the debate about how cities and fleet operators need to interact.

### Tuesday, May 1st, 2018

We consider two interesting generalizations of learning Poisson Binomial Distributions (PBDs), namely learning all powers of a PBD and learning a Graph Binomial Distribution. A PBD is the distribution of X = X_1+...+X_n, where X_1, ..., X_n are independent Bernoulli trials with Exp[X_i] = p_i. The k-th power of X is X^{(k)} = X^{(k)}_1+...+X^{(k)}_n, where X^{(k)}_1, ..., X^{(k)}_n are independent Bernoulli trials with Exp[X^{(k)}_i] = p^k_i. A learning algorithm can take samples from any power. It should compute a sequence of distributions Y^{(1)}, Y^{(2)}, ... such that with probability at least 1-\delta, each Y^{(k)} is within total variation distance at most \eps to X^{(k)}. The question is to which extent a learning algorithm can exploit the special structure of powers and succeed in learning all powers of X using a number of samples comparable to the number of samples required for learning X alone. A Graph Binomial Distribution (GBD) is determined by a graph G(V, E). It is the distribution of Z = \sum_{\{ u, v \} \in E} Z_u Z_v, where Z_v s are independent Bernoulli trials. A learning algorithm can take samples from Z and should compute a distribution Q such that with probability at least 1-\delta, Q is within total variation distance at most \epsilon to Z. The question is to which extent (and for which classes of graphs) the dependencies among edge terms make learning Z more difficult than learning a standard PBD. We present upper and lower bounds on the number of samples required for learning (i) all powers of a PBD and (ii) a GBD. We also provide some insights on the structure of GBDs for simple classes of graphs.

Finding a large matching is the most well-studied graph problem in the data stream model. To bypass the Ω(n) lower bound required to store a matching, recent research has begun to focus on only approximating the size of matchings, resulting in several algorithms with sublinear space bounds. We continue this line of work and present several results for estimating matchings in low arboricity graphs using o(n) space. We give three structural results relating the size of a maximum matching to the arboricity α of a graph and show how those lead to approximation algorithms in different streaming models. For dynamic (insert-delete) streams, we present an algorithm returning (α+2)(1+ ε) approximation of the size of the matching using space O(α ε^{-2} n^{4}^{/5} polylog n). For insert-only streams, we show that the same approximation factor can be achieved in merely O(ε^{-2} log n) space. And finally, we further improve the approximation factor to (α+2) and space to O(1) in adjacency list streams, where edges incident to the same vertex arrive together.

The Fourier transform is one of the most widely used computational tools. It is used in signal processing, communications, audio and video compression, medical imaging, genomics, astronomy, etc. The fastest algorithm for computing the Fourier transform is FFT, which has O(n log n) time complexity. The near-linear time of the FFT made it an indispensable tool in many applications. However, with the emergence of big data problems and the need for real-time decision making, FFT’s runtime is no longer sufficient and faster algorithms that do not sample every data point are required. The Sparse Fourier Transform is a family of algorithms that compute the frequency spectrum faster than FFT. The Sparse Fourier Transform is based on the insight that many real-world signals are sparse –i.e., most of the frequencies have a negligible contribution to the overall signal. Exploiting this sparsity, we can compute the Fourier Transform of sparse signals very quickly (in sub-linear time). In this talk, I will give an overview of the Sparse Fourier Transform algorithms with a focus on real-time applications like GPS receivers, dynamic spectrum access, 5G wireless network, and radio astronomy.

We describe Self-Programming Networks (SPNs), an ongoing research effort at Stanford for making cloud computing networks autonomous; that is, to enable networks to sense and monitor themselves, and program and control themselves. We present the architecture of SPNs and two key algorithms: (i) Simon, for fine-grained network measurement using packet and probe timestamps taken at the network’s edge, and (ii) Huygens, for nanosecond-level clock synchronization in real-time and at scale. We will present the algorithms and results from some deployments.

We consider an unstable scalar linear stochastic system, $X_{n+1}=a X_n + Z_n - U_n$, where $a \geq 1$ is the system gain, $Z_n$'s are random variables with bounded $\alpha$-th moments, and $U_n$'s are the control actions that are chosen by a controller who receives a single element of a finite set $\{1, \ldots, M\}$ as its only information about system state $X_i$. We show that $M = \lfloor a + 1 \rfloor$ is necessary and sufficient for $\beta$-moment stability, for any $\beta < \alpha$. The converse is shown using information-theoretic techniques. The matching achievability scheme is a uniform quantizer of zoom-in / zoom-out type whose performance is analyzed using probabilistic arguments.

This is joint work with Yuval Peres, Gireeja Ranade, Mark Sellke.

High-performance cyber-physical systems rely on many sensors and hardware components for successful operation. Control strategies for these devices require an understanding of how unpredictability in these components might impair performance. In this talk, we aim to quantify the informational bottlenecks imposed by uncertain system models. We can use this to quantify the value of side-information regarding the uncertainty in the system (in bits), in order to answer questions such as: "what is the value of adding an extra sensor to the system?" We will also show that systems with uncertain actuation (e.g., when motors on a drone cannot precisely execute control actions) exhibit surprisingly different behavior than systems with uncertain sensing (e.g., miscalibrated cameras).

The talk will include joint work with Jian Ding, Yuval Peres, Govind Ramnarayan, Anant Sahai, and Alex Zhai.

### Wednesday, May 2nd, 2018

Network infrastructures for transportation, communication, or energy transmission are an important backbone of our society. However, they are also prone to failure and intentional sabotage. In such cases it is desirable to quickly recover the service provided through the network. We introduce reroutable flows, a robust version of network flows in which link failures can be mitigated by rerouting the affected flow. An important new feature of this model, distinguishing it from existing robust network flow models, is that no flow can get lost in the network.

Our goal is to compute maximum flows under this robustness requirement. We investigate different variants depending on the number of failing links, the capacities available for rerouting, and integrality requirements. Using a suitably adapted concept of a minimum cut as an upper bound, we derive approximation algorithms for some of the harder variants of the problem and exact algorithms for important special cases.

This is joint work with Tom McCormick and Gianpaolo Oriolo.

How quickly can information introduced at time zero at a given node can influence decisions at far-away nodes in a distributed network? To understand this question, we study the following generalization of the well-known model of broadcasting on trees. Consider an infinite directed acyclic graph (DAG) with a unique source node X. Let the collection of nodes at distance k from X be called the kth layer. At time zero, the source node is given a bit. At time k each node in the (k - 1)th layer inspects its inputs and sends a bit to its descendants in the kth layer. Each bit is flipped with a probability of error Δ. The goal is to be able to recover the original bit with probability of error better than 1/2 from the values of all nodes at an arbitrarily deep layer k. Besides its natural broadcast interpretation, the DAG broadcast is a natural model of noisy computation. Some special cases of the model represent information flow in biological networks, and other cases are closely related to probabilistic cellular automata models. We show that there exist DAGs with bounded degree and layers of size Ωlog(k) that permit recovery provided Δ is sufficiently small and find the critical value for the DAGs constructed. Our result demonstrates a doubly-exponential advantage for storing a bit in bounded degree DAGs compared to trees. On the negative side, we show that if the DAG is a two-dimensional regular grid, then recovery is impossible for any positive Δ provided all nodes use the same symmetric processing functions (i.e. up to equivalent AND, XOR, NAND).

No abstract available.

At the heart of most algorithms today there is an optimization engine trying to learn online and provide the best decision, for e.g. rankings of objects, at any time with the partial information observed thus far in time. Often it becomes difficult to find near optimal solutions to many problems due to their inherent combinatorial structure that leads to certain computational bottlenecks. Submodularity is a discrete analogue of convexity and is a key property often exploited in tackling combinatorial optimization problems. In the first part of the talk, we will focus on computational bottlenecks that involve submodular functions: convex function minimization over submodular base polytopes (for e.g. permutahedron) and movement along a line inside submodular base polytopes. We give a conceptually simple and strongly polynomial algorithm Inc-Fix for the former, which is useful in computing Bregman projections in first-order projection-based methods like online mirror descent. For the latter, we will bound the iterations of the discrete Newton method which gives a running time improvement by at least n^6 over the state of the art. This is joint work with Michel Goemans and Patrick Jaillet. In the second part of the talk, we will shift our focus from learning decisions (like rankings) to learning which heuristic for Max-Cut (and a related quadratic unconstrained binary optimization problem) would perform best with respect to the given data. With the help of a large scale evaluation using cloud computing and the construction of a feature space to represent inputs, we show that machine learning can be used to predict which heuristic will work best on a previously unseen problem instance. This is joint work with John Silberholz and Iain Dunning.

### Thursday, May 3rd, 2018

The design of efficient online learning algorithms is a central topic in machine learning. Researchers have investigated a wide variety of algorithmic templates that lead to efficient algorithms provided the learning problem has the right structure. Two examples of such algorithmic templates are Follow the Regularized Leader (FTRL) and Follow the Perturbed Leader (FTPL). In this talk, I will briefly describe some recent work that sheds light on the connections between FTRL and FTPL. I will highlight the role that stability plays in the theoretical analyses of these algorithms. I will also try to connect stability to two topics outside of machine learning: the Littlewood-Offord theorem and the theory of Differential Privacy. (Talk is based on joint work with Jacob Abernethy, Chansoo Lee, Audra McMillan and Zifan Li.)

Distributed optimization increasingly plays a central role in economical and sustainable operation of cyber-physical systems. Nevertheless, the complete potential of the technology has not yet been fully exploited in practice due to communication limitations posed by the real-world infrastructures. Our work investigates fundamental properties of dual gradient methods in distributed resource allocation optimization, where the gradient information is communicated using a limited number of bits. Sufficient and necessary conditions are provided on the quantization set to guarantee the optimality and convergence of the algorithms. We also investigate communication complexity of the problem, the minimal number of communicated bits needed to solve some classes of resource allocation problems regardless of the used algorithms.

This is the joint work with Sindri Magnusson, Chinwendu Enyioha, Carlo Fischione, and Vahid Tarokh.

The online allocation of scarce resources is a canonical paradigm of decision-making under uncertainty, and is widely studied in many fields of engineering under a variety of titles (dynamic allocation/revenue management/prophet inequalities/etc.). In this talk, I will re-examine the basic online allocation problem, with the aim of building bridges to the ever-improving predictive capabilities enabled by modern machine-learning methods. To this end, I will present a new Bayesian-learning inspired algorithm for online allocation problems, and show that it achieves the first horizon-independent and budget-independent regret bounds for a wide class of online packing problems with stochastic inputs. The result follows from a simple coupling technique, and I will discuss how this may prove useful for related online decision-making settings.

Mergeable summaries (formalized by Agarwal et al. in PODS 2012) allow one to process many different streams of data independently, and then the summaries computed from each stream can be quickly combined to obtain an accurate summary of various combinations of the datasets (union, intersection, etc.). Among other major benefits, mergeable summaries allow data to be automatically processed in a fully distributed and parallel manner, by partitioning the data arbitrarily across many machines, summarizing each partition, and seamlessly combining the results.

This talk will describe a line of research that has grown out of the development of Data Sketches, an open source library of production-quality implementations of mergeable summaries for basic problems including unique counts, quantiles, frequent items, sampling, and matrix analysis. The library is currently used by several companies and government agencies (Yahoo/Oath, Amazon, Splice Machine, GCHQ, etc.) and enables real-time processing of massive datasets.

Both in theory and practice, coresets provide the state of the art solution for problems such as k-means clustering, logistic regression or low rank approximation (SVD/PCA). This includes streaming and real-time "Big data" that may be distributed e.g. on a network ("cloud"). A coreset (or, core-set) for a given problem is a "compressed" representation of its input, in the sense that a solution for the problem with the (small) coreset as input would yield a provable (1+epsilon) factor approximation to the problem with the original (large) input. In this talk we will survey the main techniques for computing core-sets and see some applications for real-time drones navigation and weak "Internet of Things" devices.

Deployment of Neural network models for real world safety critical applications necessitates going beyond empirical validation of the accuracy on test data. Adversarial examples for image classifiers are evidence for the need for a more formal verification paradigm. While progress is starting to be made on this direction, most existing approaches are applicable to networks with piece-wise linear non-linearities (ReLUs and max-pooling, for example). This paper overcomes this by developing a general framework for model verification that applies to feed-forward networks with arbitrary component-wise non-linearities (including tanh, sigmoid, ELUs) and max-pooling. Our method relies on formulating the verification problem as an optimization problem and solving a Lagrangian relaxation of the same to obtain a bound on the verification objective. We demonstrate the practical efficiency of our approach on several tasks: Verifying adversarial robustness of image classifiers and verifying stability of neural network predictions to features evolving over time.

### Friday, May 4th, 2018

Abstract: We consider relative error low rank approximation of tensors with respect to the Frobenius norm: given an order-q tensor A, output a rank-k tensor B for which |A-B|_F <= (1+eps)OPT, where OPT =inf_{rank-k A'} |A-A'|_F. Despite the success on obtaining relative error low rank approximations for matrices, no such results were known for tensors. One structural issue is there may be no rank-k tensor A' achieving the above infinum. Another, computational issue, is that an efficient relative error low rank approximation algorithm for tensors would allow one to compute the tensor rank, which is NP-hard. We bypass these issues via (1) bicriteria and (2) parameterized complexity solutions. As an example, we give an algorithm which outputs a rank (k/eps)^{q-1} tensor B for which |A-B|_F <= (1+eps)OPT in nnz(A)+n*poly(k/eps) time in the real RAM model for an n x n x ... n tensor. Here nnz(A) is the number of non-zero entries in A. For outputting a rank-k tensor, or even a bicriteria solution with rank-Ck for a certain constant C>1, we show an exp(k^{1-o(1)}) time lower bound under the Exponential Time Hypothesis. Our results also give the first relative error low rank approximations for tensors for a large number of robust error measures for which nothing was known, as well as column row and tube subset selection problems, generalizing and strengthening results for CUR decompositions of matrices. Based on joint with Zhao Song (UT Austin) and Peilin Zhong (Columbia).

Many popular computational primitives in Machine learning and Optimization involve summing up a pairwise function over a large data-set. Such primitives include computing the Kernel Density, Partition Function, and Gradients of Empirical Loss functions. These procedures require computing a discrete integral that depends on vector y (query), that is unknown a priori or might change with time. We describe a technique that utilizes ``Distance Sensitive Hashing" to provide fast approximations to such integrals that in certain cases results in sqrt{n} speedups over the previously best known data structures. Our technique extends some recent results from FOCS'17 and essentially amounts to constructing an efficient Monte Carlo-version of the Riemann Integral over the Unit Sphere for a certain class of functions.

We consider the problem of active linear regression with L2-bounded noise. In this context, the learner receives a set of unlabeled data points, chooses a small subset to receive the labels for, and must give an estimate of the function that performs well on fresh samples. We give an algorithm that is simultaneously optimal in the number of labeled and unlabeled data points, with O(d) labeled samples; previous work required Ω(d log d) labeled samples regardless of the number of unlabeled samples.

Our results also apply to learning linear functions from noisy queries, again achieving optimal sample complexities. The techniques extend beyond linear functions, giving improved sample complexities for learning the family of k-Fourier-sparse signals with continuous frequencies.