Playlist: 5 videos
Joint IFML/Data-Driven Decision Processes Workshop
0:31:15
Surbhi Goel (Microsoft Research and University of Pennsylvania)
https://simons.berkeley.edu/talks/stochastic-optimization-under-distributional-drift
Joint IFML/Data-Driven Decision Processes Workshop
Over the past few years, Transformers have revolutionized deep learning, leading to advances in natural language processing and beyond. These models discard recurrence and convolutions, in favor of "self-attention" which directly and globally models interactions within the input context. Despite their success, currently there is limited understanding of why they work. In this talk, I will present our recent results on rigorously quantifying the statistical and representational properties of Transformers which shed light on their ability to capture long range dependencies efficiently. First, I will show how bounded-norm self-attention layers can represent arbitrary sparse functions of the input sequence, with sample complexity scaling only logarithmically with the context length, akin to sparse regression. Subsequently, I will briefly show how this ability of self-attention to compute sparse functions along with its ability to compute averages can be used to construct Transformers that exactly replicate the dynamics of a recurrent model of computation depth $T$ with only $o(T)$ depth. I will conclude the talk with experimental results on synthetic tasks based on learning Boolean functions and automata. Based on joint works with Jordan T. Ash, Ben L. Edelman, Sham M. Kakade, Akshay Krishnamurthy, Bingbin Liu, and Cyril Zhang.
Visit talk page
https://simons.berkeley.edu/talks/stochastic-optimization-under-distributional-drift
Joint IFML/Data-Driven Decision Processes Workshop
Over the past few years, Transformers have revolutionized deep learning, leading to advances in natural language processing and beyond. These models discard recurrence and convolutions, in favor of "self-attention" which directly and globally models interactions within the input context. Despite their success, currently there is limited understanding of why they work. In this talk, I will present our recent results on rigorously quantifying the statistical and representational properties of Transformers which shed light on their ability to capture long range dependencies efficiently. First, I will show how bounded-norm self-attention layers can represent arbitrary sparse functions of the input sequence, with sample complexity scaling only logarithmically with the context length, akin to sparse regression. Subsequently, I will briefly show how this ability of self-attention to compute sparse functions along with its ability to compute averages can be used to construct Transformers that exactly replicate the dynamics of a recurrent model of computation depth $T$ with only $o(T)$ depth. I will conclude the talk with experimental results on synthetic tasks based on learning Boolean functions and automata. Based on joint works with Jordan T. Ash, Ben L. Edelman, Sham M. Kakade, Akshay Krishnamurthy, Bingbin Liu, and Cyril Zhang.
0:34:35
Raghu Meka (UCLA)
https://simons.berkeley.edu/talks/power-preconditioning-sparse-linear-regression
Joint IFML/Data-Driven Decision Processes Workshop
Sparse linear regression is a fundamental problem in high-dimensional statistics, but strikingly little is known about how to solve it efficiently without restrictive conditions on the design matrix. This talk will focus on the (correlated) random design setting, where the covariates are independently drawn from a multivariate Gaussian and the ground-truth signal is sparse. Information theoretically, one can achieve strong error bounds with O(klogn) samples for arbitrary covariance matrices and sparsity k; however, no efficient algorithms are known to match these guarantees even with o(n) samples in general. As for hardness, computational lower bounds are only known with worst-case design matrices. Random-design instances are known which are hard for the Lasso, but these instances can generally be solved by Lasso after a simple change-of-basis (i.e. preconditioning). In the talk, we will discuss new upper and lower bounds clarifying the power of preconditioning in sparse linear regression. First, we show that the preconditioned Lasso can solve a large class of sparse linear regression problems nearly optimally: it succeeds whenever the dependency structure of the covariates, in the sense of the Markov property, has low treewidth -- even if the covariance matrix is highly ill-conditioned. Second, we construct (for the first time) random-design instances which are provably hard for an optimally preconditioned Lasso. Based on joint works with Jonathan Kelner, Frederic Koehler, Dhruv Rohatgi.
Visit talk page
https://simons.berkeley.edu/talks/power-preconditioning-sparse-linear-regression
Joint IFML/Data-Driven Decision Processes Workshop
Sparse linear regression is a fundamental problem in high-dimensional statistics, but strikingly little is known about how to solve it efficiently without restrictive conditions on the design matrix. This talk will focus on the (correlated) random design setting, where the covariates are independently drawn from a multivariate Gaussian and the ground-truth signal is sparse. Information theoretically, one can achieve strong error bounds with O(klogn) samples for arbitrary covariance matrices and sparsity k; however, no efficient algorithms are known to match these guarantees even with o(n) samples in general. As for hardness, computational lower bounds are only known with worst-case design matrices. Random-design instances are known which are hard for the Lasso, but these instances can generally be solved by Lasso after a simple change-of-basis (i.e. preconditioning). In the talk, we will discuss new upper and lower bounds clarifying the power of preconditioning in sparse linear regression. First, we show that the preconditioned Lasso can solve a large class of sparse linear regression problems nearly optimally: it succeeds whenever the dependency structure of the covariates, in the sense of the Markov property, has low treewidth -- even if the covariance matrix is highly ill-conditioned. Second, we construct (for the first time) random-design instances which are provably hard for an optimally preconditioned Lasso. Based on joint works with Jonathan Kelner, Frederic Koehler, Dhruv Rohatgi.
0:34:16
Tselil Schramm (Stanford University)
https://simons.berkeley.edu/talks/testing-thresholds-high-dimensional-random-geometric-graphs
Joint IFML/Data-Driven Decision Processes Workshop
A graph is said to be a (1-dimensional) expander if the second eigenvalue of its adjacency matrix is bounded away from 1, or almost-equivalently, if it has no sparse vertex cuts. There are several natural ways to generalize the notion of expansion to hypergraphs/simplicial complexes, but one such way is 2-dimensional spectral expansion, in which the expansion of vertex neighborhoods (remarkably) witnesses global expansion. While 1-dimensional expansion is known to be achieved by, e.g., random regular graphs, very few constructions of sparse 2-dimensional expanders are known, and at present all are algebraic. It is an open question whether sparse 2-dimensional expanders are "abundant" and "natural" or "rare." In this talk, we'll give some evidence towards abundance: we show that the set of triangles in a random geometric graph on a high-dimensional sphere yields an expanding simplicial complex of arbitrarily small polynomial degree.
Visit talk page
https://simons.berkeley.edu/talks/testing-thresholds-high-dimensional-random-geometric-graphs
Joint IFML/Data-Driven Decision Processes Workshop
A graph is said to be a (1-dimensional) expander if the second eigenvalue of its adjacency matrix is bounded away from 1, or almost-equivalently, if it has no sparse vertex cuts. There are several natural ways to generalize the notion of expansion to hypergraphs/simplicial complexes, but one such way is 2-dimensional spectral expansion, in which the expansion of vertex neighborhoods (remarkably) witnesses global expansion. While 1-dimensional expansion is known to be achieved by, e.g., random regular graphs, very few constructions of sparse 2-dimensional expanders are known, and at present all are algebraic. It is an open question whether sparse 2-dimensional expanders are "abundant" and "natural" or "rare." In this talk, we'll give some evidence towards abundance: we show that the set of triangles in a random geometric graph on a high-dimensional sphere yields an expanding simplicial complex of arbitrarily small polynomial degree.
0:31:56
Sid Banerjee (Cornell University)
https://simons.berkeley.edu/talks/compensated-coupling-or-why-future-best-guide-present
Joint IFML/Data-Driven Decision Processes Workshop
What makes online decision-making different from other decision-making/optimization problems? While it seems clear that the unique features are the sequential nature of taking actions and uncertainty in future outcomes, most techniques for solving such problems tend to obfuscate these features - so are these the best ways to think about these settings? I will present the compensated coupling: a simple paradigm for reasoning about and designing online decision-making policies, based on a sample-pathwise accounting of their performance compared to some benchmark policy. This approach generalizes many standard results used in studying Markov decision processes and reinforcement learning, but also gives us new policies which are much simpler and more effective than existing heuristics. For a large class of widely-studied control problems including online resource-allocation, dynamic pricing, generalized assignment, online bin packing, and bandits with knapsacks, I will illustrate how these new algorithms achieve constant regret (i.e., additive loss compared to the hindsight optimal which is independent of the horizon and state-space) under a wide range of conditions. Time permitting, I will try and describe how we can use this technique to incorporate side information and historical data in these settings, and achieve constant regret with as little as a single data trace.
Visit talk page
https://simons.berkeley.edu/talks/compensated-coupling-or-why-future-best-guide-present
Joint IFML/Data-Driven Decision Processes Workshop
What makes online decision-making different from other decision-making/optimization problems? While it seems clear that the unique features are the sequential nature of taking actions and uncertainty in future outcomes, most techniques for solving such problems tend to obfuscate these features - so are these the best ways to think about these settings? I will present the compensated coupling: a simple paradigm for reasoning about and designing online decision-making policies, based on a sample-pathwise accounting of their performance compared to some benchmark policy. This approach generalizes many standard results used in studying Markov decision processes and reinforcement learning, but also gives us new policies which are much simpler and more effective than existing heuristics. For a large class of widely-studied control problems including online resource-allocation, dynamic pricing, generalized assignment, online bin packing, and bandits with knapsacks, I will illustrate how these new algorithms achieve constant regret (i.e., additive loss compared to the hindsight optimal which is independent of the horizon and state-space) under a wide range of conditions. Time permitting, I will try and describe how we can use this technique to incorporate side information and historical data in these settings, and achieve constant regret with as little as a single data trace.
0:30:0
Siva Theja Maguluri (Georgia Institute of Technology)
https://simons.berkeley.edu/talks/sample-complexity-policy-based-methods-under-policy-sampling-and-linear-function-approximation
Joint IFML/Data-Driven Decision Processes Workshop
In this work, we study policy-space methods for solving the reinforcement learning (RL) problem. We first consider the setting where the model is known (MDP setting) and show that the Natural Policy Gradient (NPG) algorithm (and other variants) have linear (geometric) convergence without the need of any regularization. In contrast to optimization approaches used in the literature, our approach is based on approximate dynamic programing. We then consider a linear function approximation variant of it and establish its convergence guarantees. Finally, we consider the RL setting where a critic is deployed for policy evaluation when the model is unknown. We consider a critic that is based on TD learning, and uses off-policy sampling with linear function approximation. This leads to the infamous deadly-triad issue. We propose a generic algorithm framework of a single time-scale multi-step TD-learning with generalized importance sampling ratios that enables us to overcome the high variance issue in off-policy learning, and establish its finite-sample guarantees. We show that this leads to an overall sample complexity of O(epsilon^-2).
Visit talk page
https://simons.berkeley.edu/talks/sample-complexity-policy-based-methods-under-policy-sampling-and-linear-function-approximation
Joint IFML/Data-Driven Decision Processes Workshop
In this work, we study policy-space methods for solving the reinforcement learning (RL) problem. We first consider the setting where the model is known (MDP setting) and show that the Natural Policy Gradient (NPG) algorithm (and other variants) have linear (geometric) convergence without the need of any regularization. In contrast to optimization approaches used in the literature, our approach is based on approximate dynamic programing. We then consider a linear function approximation variant of it and establish its convergence guarantees. Finally, we consider the RL setting where a critic is deployed for policy evaluation when the model is unknown. We consider a critic that is based on TD learning, and uses off-policy sampling with linear function approximation. This leads to the infamous deadly-triad issue. We propose a generic algorithm framework of a single time-scale multi-step TD-learning with generalized importance sampling ratios that enables us to overcome the high variance issue in off-policy learning, and establish its finite-sample guarantees. We show that this leads to an overall sample complexity of O(epsilon^-2).
0:28:51
Weina Wang (CMU)
https://simons.berkeley.edu/talks/exploration-load-balancing-under-unknown-service-rates
Joint IFML/Data-Driven Decision Processes Workshop
Today’s computing systems consist of servers that offer highly heterogeneous service rates due to the diverse machine types and the dynamic computing environment. In such systems, to guarantee low latency, load-balancing algorithms need to consider not only the amount of work on servers but also their service rates. However, because of the dynamic nature of the computing environment, the service rates cannot be obtained beforehand and thus have to be learned on the fly. In this talk, we consider a load-balancing system that consists of multiple servers with unknown service rates. Each incoming job needs to be assigned to one of the servers upon arrival. Our goal is to develop learning-integrated job-assigning algorithms that perform comparably to their counterparts under known service rates. Although some facets of this problem resemble the well-studied multi-armed bandit problem, our findings reveal that the exploration component is fundamentally different. In particular, we develop a meta-algorithm that integrates learning into a large class of load-balancing algorithms. The proposed algorithm uses almost no exploration, yet it achieves a constant regret in the accumulated latency of jobs. A similar free exploration phenomenon has been observed in a prior paper, but in a different setting with a single server. This is joint work with Yifei Huang and Zhiyuan Tang, both from Tsinghua University.
Visit talk page
https://simons.berkeley.edu/talks/exploration-load-balancing-under-unknown-service-rates
Joint IFML/Data-Driven Decision Processes Workshop
Today’s computing systems consist of servers that offer highly heterogeneous service rates due to the diverse machine types and the dynamic computing environment. In such systems, to guarantee low latency, load-balancing algorithms need to consider not only the amount of work on servers but also their service rates. However, because of the dynamic nature of the computing environment, the service rates cannot be obtained beforehand and thus have to be learned on the fly. In this talk, we consider a load-balancing system that consists of multiple servers with unknown service rates. Each incoming job needs to be assigned to one of the servers upon arrival. Our goal is to develop learning-integrated job-assigning algorithms that perform comparably to their counterparts under known service rates. Although some facets of this problem resemble the well-studied multi-armed bandit problem, our findings reveal that the exploration component is fundamentally different. In particular, we develop a meta-algorithm that integrates learning into a large class of load-balancing algorithms. The proposed algorithm uses almost no exploration, yet it achieves a constant regret in the accumulated latency of jobs. A similar free exploration phenomenon has been observed in a prior paper, but in a different setting with a single server. This is joint work with Yifei Huang and Zhiyuan Tang, both from Tsinghua University.