Monday, November 14th, 2016

10:10 am10:50 am

Graph Partitioning problems (like Minimum Balanced Cut) form a central topic of study in the algorithms and machine learning research. However, most problems in this class are NP-hard in the worst case, and efficient algorithms with good worst-case guarantees (e.g. good approximation algorithms) have been elusive for many of these problems. Since real-world instances are unlikely to behave like worst-case instances, average-case models for graph partitioning and community detection have been widely studied to reason about most instances that occur in practice. The Stochastic Block Model or the Planted Partition Model is the most widely used probabilistic model in various fields, including machine learning, statistics, and social sciences. Many existing algorithmic approached like spectral methods successfully learn the ground-trut clustering (or communities) when the data is drawn exactly according to the model; but they do not work well in the presence of errors.

In this talk, I will address the following question: Can we design polynomial time algorithms for learning probabilistic models for graph partitioning, that are robust to modeling errors?

I will first survey different average-case models for graph partitioning like semi-random models that are more general than the stochastic block model, and algorithmic results for these models. Then, I will present a new polynomial time algorithm to recover the clustering (or communities) in a stochastic block model even in the presence of various modeling errors.  These algorithms can tolerate even a constant fraction of edges being corrupted adversarially.

This is based on joint works with Konstantin Makarychev and Yury Makarychev.

11:20 am12:00 pm

The k-center problem is a canonical and long-studied facility location and clustering problem with many applications in both its symmetric and asymmetric forms. Both versions of the problem have tight approximation factors on worst case instances: a 2-approximation for symmetric k-center and an O(log*k)-approximation for the asymmetric version. In this work, we go beyond the worst case and provide strong positive results both for the asymmetric and symmetric k-center problems under a very natural input stability (promise) condition called alpha-perturbation resilience (Bilu & Linial 2012) , which states that the optimal solution does not change under any alpha-factor perturbation to the input distances. We show that by assuming 2-perturbation resilience, the exact solution for the asymmetric k-center problem can be found in polynomial time. To our knowledge, this is the first problem that is hard to approximate to any constant factor in the worst case, yet can be optimally solved in polynomial time under perturbation resilience for a constant value of alpha. Furthermore, we prove our result is tight by showing symmetric k-center under (2=E2=88=92epsilon)-perturbation resilience is hard unless N= P=3DRP. This is the first tight result for any problem under perturbation resilience, i.e., this is the first time the exact value of alpha for which the problem switches from being NP-hard to efficiently computable has been found. Our results illustrate a surprising relationship between symmetric and asymmetric k-center instances under perturbation resilience. Unlike approximation ratio, for which symmetric k-center is easily solved to a factor of 2 but asymmetric k-center cannot be approximated to any constant factor, both symmetric and asymmetric k-center can be solved optimally under resilience to 2-perturbations.

Joint work with Nina Balcan and Nika Haghtalab.

2:00 pm2:40 pm

The random planted 3-coloring model generates 3-colorable graph $G$ by first generating a random host graph $H$ of average degree $d$, and then planting in it a random 3-coloring (by giving each vertex a random color and dropping the monochromatic edges). For a sufficiently large constant $c$, Alon and Kahale [SICOMP 1997] presented a spectral algorithm that finds (with high probability) the planted 3-coloring of such graphs whenever $d > c\log n$. Perhaps more surprisingly, they also extend this algorithm to handle the case $d > c$, where in this regime it finds a legal 3-coloring (not necessarily the planted one).

It is natural to ask whether the algorithm of Alon and Kahale works whenever the host graph $H$ is a $d$-regular spectral expander (chosen by an adversary). Likewise, one may ask whether the planted 3-coloring needs to be random, or may we allow an adversary to plant a balanced 3-coloring of his choice after seeing $H$. In this talk we shall review the algorithm of Alon and Kahale, while addressing questions of the above nature.

Joint work with Roee David.

2:40 pm3:20 pm

We consider the problem of comparing algorithms on real-world data.  This raises several challenges which include how to model the problem at an appropriate level, and how to compare algorithms on not just the data sets at hand, but also anticipated (larger) future data sets.  We will use two examples -- scheduling jobs on google datacenters and scheduling a weapons maintenance and decommissioning facility.

3:50 pm4:30 pm

Mixed integer programming is NP-hard, and yet it has been successfully used to solve combinatorial optimization problems in industry. In this talk, we shed some light on this puzzle from an informal, practitioner's perspective. We show the main challenges of crafting an optimization model for real-life problems. We present techniques often used to make the model "solvable in practice", and we explain what this really means and entails.  It turns out that worst-case complexity of proving the optimality is rarely an actual concern, in contrast to: input data quality, careful design of a model, rapid progression towards a feasible solution, and an overall reliability. We hope to open a discussion on plausible theoretical underpinnings of the "unreasonable effectiveness of mixed-integer programming in the industry".

Tuesday, November 15th, 2016

9:30 am10:10 am

In both statistical and algorithmic analysis, a bound f(C) in terms of instance-specific characteristics C (e.g. data distribution) is often described as "optimal" if there exists some C* for which it happens to be tight. And yet, the bound might be extremely loose for the types of C that arise in practice.

Fortunately, for nonparametric methods - that is, algorithms or statistical estimators whose behavior is essentially local - it is frequently possible to do much better and obtain bounds f(C) that are tight for all C.

We'll discuss two such results, one statistical and one algorithmic, relating to nearest neighbor.

  1. What properties of the data distribution influence the statistical rate of convergence of nearest neighbor classification?

For any data distribution, we can obtain distribution-specific upper and lower bounds that are closely matching. These yield, as by-products, solutions to several open questions around nearest neighbor methods, such as a characterization of the metric measure spaces in which nearest neighbor is universally consistent. They also suggest a new notion of smoothness, different from the usual Lipschitz or Holder conditions, that is well-suited to nearest neighbor.

  2. What properties of the data distribution influence the computational complexity of finding the nearest neighbor?

Here, we can obtain a tight characterization of the performance of tree-based NN search, as a simple function of the specific data configuration. This general result is then easily specialized to common types of structure in data.

We'll end with some statistical and algorithmic open problems.

10:10 am10:50 am

We study spectral algorithms for high-dimensional Nearest Neighbor Searching (NNS). In practice, spectral approaches for NNS are quite popular, and often outperform the random-projection methods that are common in theoretical analyses (e.g., of worst-case datasets).  Our work aims to explain this disparity.

Specifically, we consider a semi-random setting where a dataset P in R^d is chosen arbitrarily from an unknown subspace of low dimension (much smaller than d), and then perturbed by full-dimensional Gaussian noise. We design spectral NNS algorithms whose query time is polynomial in d and log |P|, and are effective even when the noise magnitude is much larger than the interpoint distances in P. These algorithms use repeated computation of the top PCA vector/subspace, akin to practical heuristics.

Joint work with Amirali Abdullah, Alexandr Andoni, and Ravindran Kannan.

11:20 am12:00 pm

In many settings, the behavior of a system arises as a consequence of the actions of many self-interested agents optimizing their individual objectives, as opposed to the intervention of a central authority optimizing for a global objective. The concepts of the price of anarchy (POA) and the price of stability (POS) from game theory characterize system performance for the worst and best equilibria, respectively, in such settings relative to the globally optimal solution. A deficiency of these concepts, however, is that they are purely existential and do not shed light on which equilibrium states are reachable in an actual game, i.e., via natural game dynamics. This is particularly striking if these quantities differ significantly in value, such as in network design games where the exponential gap between the best and worst equilibria originally prompted the notion of POS in game theory (Anshelevich et al., FOCS 2002).

In this work, we make progress toward bridging this gap by studying network design games under natural game dynamics. On the one hand, we show that in a completely decentralized setting, where agents arrive, depart, and make improving moves in an arbitrary order, the inefficiency of the equilibrium attained can be polynomially large. On the other hand, if after every batch of (adversarial) arrivals and departures of agents, a game designer is allowed to execute a sequence of improving moves to create an equilibrium, then the resulting equilibrium states attained by the game are exponentially more efficient, i.e., the ratio of costs compared to the optimum is only logarithmic. Overall our establish that in network games the efficiency of equilibrium states is dictated by whether agents are allowed to join or leave the game in arbitrary states, an observation that might be useful in analyzing the dynamics of other classes of games with divergent POS and POA bounds.

This is joint work with Seffi Naor, Debmalya Panigrahi, Mohit Singh, and Seeun Umboh.

2:00 pm2:40 pm

Finding similar items in a dataset is a basic problem in many fields, including information retrieval, natural language processing and machine learning. In theoretical CS much algorithmic work assumes a distance function on instances, or even that the instances are vectors in R^d and distance is defined using some l_p metric.

But in most settings it is rarely the case that the data is a  vector in R^d: for instance treating an image as a vector of pixels in not a very natural representation. Instead, much applied work is devoted to finding a suitable vector representation that captures the semantic "content." For instance Hinton and Salakhutdinov (2007) show how to compute semantic hashing of text using Restricted Boltzmann Machines (RBMs). More recent work uses neural nets and matrix factorization to define sentence and paragraph embeddings.  

The talk will cover recent work that gives a theoretical analysis of such nonlinear, nonconvex methods and what kind of semantics it uncovers, with new theoretical insight into low-dimensiional word embeddings, sentence/paragraph embeddings, and content recommendation. 

(Joint works: [A., Li, Liang, Ma, Risteski] [A., Li and Ma] and [A. and Risteski])

2:40 pm3:20 pm

Quantitative properties such as execution time, memory consumption, and power usage determine whether a software system is useful in practice. During the past years, my collaborators and I have developed a new static program analysis that can automatically determine the complexity of many programs. The technique is called automatic amortized resource analysis (AARA) and it derives worst-case resource bounds which are parametric in the input sizes of programs. The analysis can be integrated in a type system (for functional programs) or a Hoare logic (for imperative programs). A novel algebraic structure, which we call multivariate resource polynomials, enables the reduction of bound inference to LP solving. AARA is naturally compositional and derives bounds for a wide range of non-trivial programs.

In this talk, I will introduce the idea of AARA and explain the connections between bound inference and linear optimization. An interesting observation is that AARA usually produces network-like linear programs that seem to be solvable in linear time in practice by simplex-based LP solvers such as CPLEX.

3:50 pm4:30 pm

In this presentation, I describe a statistical approach for analyzing the scaling of the empirical running time of an algorithm when applied to sets of inputs of growing size. Unlike asymptotic worst-case analysis, this approach is applicable to sophisticated implementations of high-performance, highly heuristic algorithms on arbitrary distributions of inputs. It is based on standard numerical techniques for fitting models, which are then challenged by extrapolation to larger problem sizes and statistically validated using bootstrap confidence intervals. Our approach permits not only the automatic construction of predictive models of a given algorithm’s running time, but also the comparative evaluation of multiple hypotheses on its scaling, in the form of different parametric functions. Using it, we have obtained empirical scaling analysis results for state-of-the-art TSP and SAT algorithms that challenge conventional wisdom regarding the complexity of these problems.

Wednesday, November 16th, 2016

9:30 am10:10 am

We investigate algorithms that learn to run faster: given a series of instances of a computational problem, these algorithms use characteristics of past instances to sharpen their performance for new instances. We have found such *self-improving* algorithms for sorting a list of numbers, and for some problems on planar point sets: computing Delaunay triangulations, coordinate-wise maxima, and convex hulls. Under the assumption that input instances are random variables, each algorithm begins with a training phase during which it adjusts itself to the distribution of the input instances; this is followed by a stationary regime in which the algorithm runs in its optimized version. In this setting, an algorithm must take expected time proportional to the input size n, and to the entropy E of the output: the input must be touched, and the output must be communicated. Our algorithms achieve O(n+E) expected running times in the stationary regime for all the problems mentioned except convex hulls, where O(n*loglog n + E) expected time is needed.

This work was done 2008-2012 by Nir Ailon, Bernard Chazelle, Ding Liu, Wolfgang Mulzer, C. Seshadhri, and the speaker.

10:10 am10:50 am

The best algorithm for a computational problem generally depends on the "relevant inputs," a concept that depends on the application domain and often defies formal articulation. While there is a large literature on empirical approaches to selecting the best algorithm for a given application domain, there has been surprisingly little theoretical analysis of the problem.  This work adapts concepts from statistical and online learning theory to reason about application-specific algorithm selection. Our models capture several state-of-the-art empirical and theoretical approaches to the problem, ranging from self-improving algorithms to empirical performance models, and our results identify conditions under which these approaches are guaranteed to perform well. We present one framework that models algorithm selection as a statistical learning problem, and our work here shows that dimension notions from statistical learning theory, historically used to measure the complexity of classes of binary- and real-valued functions, are relevant in a much broader algorithmic context. We also study the online version of the algorithm selection problem, and give possibility and impossibility results for the existence of no-regret learning algorithms.

Joint work with Tim Roughgarden.

11:20 am12:00 pm

Problems that we encounter in real-world applications are often NP-hard. Therefore, a wealth of algorithms have been developed to provide approximate solutions to these problems efficiently. Max-cut, clustering, and many other partitioning problems fall into this camp, with widespread applications ranging from statistical physics to computational biology. Since the most commonly used heuristics and approximation algorithms for these problems are often characterized by real-valued parameters, it is crucial to decide which algorithm from a parametrized family of infinitely many algorithms is the “best” to use. While conventionally, one would then choose the algorithm with the best worst-case performance, the worst-case scenario may never materialize in a specific application domain.  This calls for an application-specific algorithm selection approach in which one tunes the parameters of the algorithm to suit an unknown distribution over problem instances defining which problems are typical in that application. While this question has been studied empirically over the past several decades, there has been little work in developing a firm theoretical understanding until recently when Gupta and Roughgarden [2016] proposed a learning-theoretic model where one “learns” the best algorithm for an application by sampling problem instances from the underlying distribution. We work in this framework and focus on two widely used classes of algorithms: i) an infinite family of clustering algorithms which involve an agglomerative step followed by dynamic programming ii) a rich class of randomized rounding techniques for SDP relaxations of integer quadratic programs such as max-cut. We provide tight bounds on the learning-theoretic dimensions of these algorithm classes which demonstrates their intrinsic complexity. Based on this we present simple and computationally efficient techniques with sample complexity guarantees to learn the best algorithm from these classes for a given application domain.

Joint work with Maria-Florina Balcan, Ellen Vitercik and Colin White.

12:00 pm12:40 pm

All known algorithms for solving NP-complete prob­lems require exponential time in the worst case; however, these algorithms nevertheless solve many problems of practical impor­tance astoundingly quickly, and are hence relied upon in a broad range of applications. This talk is built around the observation that“Empirical Hardness Models”—statistical models that predict algorithm runtime on novel instances from a given distribution—work surprisingly well. These models can serve as powerful tools for algorithm design, specifically by facilitating automated methods for algorithm design and for constructing algorithm portfolios. They also offer a statistical alternative to beyond-worst-case analysis and a starting point for theoretical investigations.All known algorithms for solving NP-complete prob­lems require exponential time in the worst case; however, these algorithms nevertheless solve many problems of practical impor­tance astoundingly quickly, and are hence relied upon in a broad range of applications. This talk is built around the observation that“Empirical Hardness Models”—statistical models that predict algorithm runtime on novel instances from a given distribution—work surprisingly well. These models can serve as powerful tools for algorithm design, specifically by facilitating automated methods for algorithm design and for constructing algorithm portfolios. They also offer a statistical alternative to beyond-worst-case analysis and a starting point for theoretical investigations.

Thursday, November 17th, 2016

9:30 am10:10 am
Non-negative matrix factorization is a ubiquitous machine learning primitive that concerns decomposing data as a non-negative combination of features, and is known to be NP-hard in the worst case. However, in practice, a popular family of heuristics to solve it proceed by simply alternatingly decomposing the samples using the current estimated features (colloquially called decoding), and updating the features based on the decoding.  
This work proposes a fairly weak generative model for the data and provides guarantees on such a "decode-and-update" type of algorithm for non-negative matrix factorization. Importantly, the generative model does not require separability assumptions on the feature matrix, which is often the case in prior provable works on non-negative matrix factorization. Interestingly, our analysis does not fall into the "local strong convexity" framework for analyzing non-convex optimization. 
Based on joint work with Yuanzhi Li and Andrej Risteski.
11:20 am12:00 pm

An active learner is given a hypothesis class, a large set of unlabeled examples and the ability to interactively query labels of a subset of them; the learner's goal is to learn a hypothesis in the class that fits the data well by making as few label queries as possible. While active learning can yield considerable label savings in the realizable case -- when there is a perfect hypothesis in the class that fits the data -- the savings are not always as substantial when labels provided by the annotator may be noisy or biased. Thus an open question is whether more complex feedback can help active learning in the presence of noise.

In this talk, I will present two separate feedback mechanisms, and talk about when they can help reduce the label complexity of active learning.  The first is when labels are obtained from multiple annotators with slightly different labeling patterns; the second is when the annotator can say "I don't know" instead of providing an incorrect label.

This talk is based on joint work with Chicheng Zhang, Songbai Yan and Tara Javidi.

2:00 pm2:40 pm

We overview several techniques for solving large scale reinforcement learning problems of the type that might commonly arise in advertising and recommendation contexts. We place special emphasis on techniques that exploit the data and models that are used for traditional "myopic" prediction of user behavior (e.g., CTR) to readily construct policies that optimize long-term, cumulative versions of these metrics. We outline challenges and potential solutions that arise in model-free RL in such settings, and derive novel new model-based techniques for the solution of large factored Markov decision processes.We overview several techniques for solving large scale reinforcement learning problems of the type that might commonly arise in advertising and recommendation contexts. We place special emphasis on techniques that exploit the data and models that are used for traditional "myopic" prediction of user behavior (e.g., CTR) to readily construct policies that optimize long-term, cumulative versions of these metrics. We outline challenges and potential solutions that arise in model-free RL in such settings, and derive novel new model-based techniques for the solution of large factored Markov decision processes.

2:40 pm3:20 pm

We investigate dropout for a single neuron (inputs are dropped independently at random with probability half).  When the loss is linear, then we can prove very good properties for dropout perturbation: optimal regret in the worst case without having to tune any parameter, AND optimal regret in the iid case when there is a gap between the best and second best feature.

We give high level intuitions of the new proof techniques and discuss a number of competitor algorithms some of which require tuning.

Joint work with Tim Van Erven and Wojciech Kotlowski.

4:30 pm5:10 pm

An effective technique for solving optimization problems over massive data sets is to partition the data into smaller pieces, solve the problem on each piece, compute a representative solution from it, and finally obtain a solution inside the union of the representative solutions for all pieces. Such an algorithm can be implemented easily in 2 rounds of MapReduces or be applied in an streaming model. This technique can be captured via the concept of {\em composable core-sets}, and has been recently applied to solve diversity maximization problems as well as several clustering problems. However, for coverage and submodular maximization problems, impossibility bounds are known for this technique. In this talk, after an initial discussion about this technique and applications in diversity maximization and clustering problems, we focus on the submodular and coverage maximization problem, and show how to apply a *randomized variant of composable core-set problem*, and achieve 1/3-and 0.58-approximation for non-monotone and monotone submodualr maximization problems. Finally, I show how to improve the above results to achieve a 1-1/e-approximation for coverage function using a new sketching technique, and show how to design almost optimal streaming and distributed algorithms for coverage problems.

The talk summarizes result from a STOC 2015 (with M. ZadiMoghaddam) and a new paper (with H. Bateni and H. Esfandiari). The survey part of the talk are from three recent papers that appeared in PODS 2014, NIPS 2014, and ICML 2016.

Friday, November 18th, 2016

10:10 am10:50 am
Structured prediction tasks in machine learning involve the simultaneous prediction of multiple labels. This is often done by maximizing a score function on the space of labels, which decomposes as a sum of pairwise elements, each depending on two specific labels. The goal of this work is to develop a theoretical explanation of the empirical effectiveness of heuristic inference algorithms for solving such structured prediction problems. We study the minimum-achievable expected Hamming error in such problems, highlighting the case of 2D grid graphs, which are common in machine vision applications. Our main theorems provide tight upper and lower bounds on this error, as well as a polynomial-time algorithm that achieves the bound.
Joint work with Amir Globerson, David Sontag, and Cafer Yildirim.
11:20 am12:00 pm

Monotone estimation problems (MEP) have a simple formulation: We have a data point $x$ from a universe $D$ and we are interested in estimating $f(x)$ for a nonnegative function $f$. However, we do not observe $x$ but instead see the output $S(x,u)$ of a blurry scope applied to $x$ with resolution $u \sim U[0,1]$: The lower $u$ is, the more information we have on $x$ and hence on $f(x)$. We are interested in estimators that are unbiased, nonnegative, and optimal (admissible).

In this talk, we will characterize feasibility, present constructions of estimators, and motivate the study of MEPs. One important application of MEPs is coordinated weighted sampling. Sample coordination was introduced decades ago (Brewer et al 1972) for efficient surveys of populations that change over time and later applied for scalable summarization and analytics of large data sets (it is an instance of LSH and MinHash sketches are a special case). Coordinated samples are effective summaries of matrix data (logs, transactions, feature vectors collected across time or servers), generalized coverage functions, and graph-structure kernels. We will see that estimators for some fundamental statistics can be expressed as a sum of MEPs: Change in activity across time or servers, similarity and influence of nodes in a network, and coverage of set of items. The quality of our MEP estimators is critical to the overall performance tradeoffs of these applications.

2:00 pm2:40 pm

No abstract available.

3:50 pm4:30 pm

Predictions about the future are a crucial part of the decision making process in many real-world online problems.  However, analysis of online algorithms has little to say about how to use predictions, and how properties of prediction errors impact algorithm design.  In this talk, I'll describe recent results exploring the power of predictions in online convex optimization and how properties of prediction noise can impact the structure of optimal algorithms.