Playlist: 25 videos

### Multi-Agent Reinforcement Learning and Bandit Learning

0:59:20

Sergiu Hart (Hebrew University of Jerusalem)

https://simons.berkeley.edu/talks/calibeating-beating-forecasters-their-own-game

Multi-Agent Reinforcement Learning and Bandit Learning

Visit talk page
https://simons.berkeley.edu/talks/calibeating-beating-forecasters-their-own-game

Multi-Agent Reinforcement Learning and Bandit Learning

0:55:56

Chi Jin (Princeton University)

https://simons.berkeley.edu/talks/v-learning-simple-efficient-decentralized-algorithm-multiagent-rl

Multi-Agent Reinforcement Learning and Bandit Learning

A major challenge of multiagent reinforcement learning (MARL) is the curse of multiagents, where the size of the joint action space scales exponentially with the number of agents. This remains to be a bottleneck for designing efficient MARL algorithms even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithms---V-learning, which provably learns Nash equilibria (in the two-player zero-sum setting), correlated equilibria and coarse correlated equilibria (in the multiplayer general-sum setting) in a number of samples that only scales with max_i Ai, where Ai is the number of actions for the ith player. This is in sharp contrast to the size of the joint action space which is \prod_i Ai. V-learning (in its basic form) is a new class of single-agent RL algorithms that convert any adversarial bandit algorithm with suitable regret guarantees into a RL algorithm. Similar to the classical Q-learning algorithm, it performs incremental updates to the value functions. Different from Q-learning, it only maintains the estimates of V-values instead of Q-values. This key difference allows V-learning to achieve the claimed guarantees in the MARL setting by simply letting all agents run V-learning independently.

Visit talk page
https://simons.berkeley.edu/talks/v-learning-simple-efficient-decentralized-algorithm-multiagent-rl

Multi-Agent Reinforcement Learning and Bandit Learning

A major challenge of multiagent reinforcement learning (MARL) is the curse of multiagents, where the size of the joint action space scales exponentially with the number of agents. This remains to be a bottleneck for designing efficient MARL algorithms even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithms---V-learning, which provably learns Nash equilibria (in the two-player zero-sum setting), correlated equilibria and coarse correlated equilibria (in the multiplayer general-sum setting) in a number of samples that only scales with max_i Ai, where Ai is the number of actions for the ith player. This is in sharp contrast to the size of the joint action space which is \prod_i Ai. V-learning (in its basic form) is a new class of single-agent RL algorithms that convert any adversarial bandit algorithm with suitable regret guarantees into a RL algorithm. Similar to the classical Q-learning algorithm, it performs incremental updates to the value functions. Different from Q-learning, it only maintains the estimates of V-values instead of Q-values. This key difference allows V-learning to achieve the claimed guarantees in the MARL setting by simply letting all agents run V-learning independently.

0:53:1

Sham Kakade (Harvard and MSR)

https://simons.berkeley.edu/talks/what-statistical-complexity-reinforcement-learning

Multi-Agent Reinforcement Learning and Bandit Learning

A fundamental question in the theory of reinforcement learning is what (representational or structural) conditions govern our ability to generalize and avoid the curse of dimensionality. With regards to supervised learning, these questions are well understood theoretically: practically, we have overwhelming evidence on the value of representational learning (say through modern deep networks) as a means for sample efficient learning, and, theoretically, there are well-known complexity measures (e.g. the VC dimension and Rademacher complexity) that govern the statistical complexity of learning. Providing an analogous theory for reinforcement learning is far more challenging, where even characterizing any structural conditions which support sample efficient generalization is far less well understood. This talk will highlight recent advances towards characterizing when generalization is possible in reinforcement learning (both in online and offline settings), focusing on both necessary and sufficient conditions. In particular, we will introduce a new complexity measure, the Decision-Estimation Coefficient, that is proven to be necessary (and, essentially, sufficient) for sample-efficient interactive learning.

Visit talk page
https://simons.berkeley.edu/talks/what-statistical-complexity-reinforcement-learning

Multi-Agent Reinforcement Learning and Bandit Learning

A fundamental question in the theory of reinforcement learning is what (representational or structural) conditions govern our ability to generalize and avoid the curse of dimensionality. With regards to supervised learning, these questions are well understood theoretically: practically, we have overwhelming evidence on the value of representational learning (say through modern deep networks) as a means for sample efficient learning, and, theoretically, there are well-known complexity measures (e.g. the VC dimension and Rademacher complexity) that govern the statistical complexity of learning. Providing an analogous theory for reinforcement learning is far more challenging, where even characterizing any structural conditions which support sample efficient generalization is far less well understood. This talk will highlight recent advances towards characterizing when generalization is possible in reinforcement learning (both in online and offline settings), focusing on both necessary and sufficient conditions. In particular, we will introduce a new complexity measure, the Decision-Estimation Coefficient, that is proven to be necessary (and, essentially, sufficient) for sample-efficient interactive learning.

0:35:35

Haipeng Luo (University of Southern California)

https://simons.berkeley.edu/talks/no-regret-learning-time-varying-zero-sum-games

Multi-Agent Reinforcement Learning and Bandit Learning

Learning from repeated play in a fixed two-player zero-sum game is a classic problem in game theory and online learning. This talk focuses on a natural yet underexplored variant of this problem where the game payoff matrix changes over time, possibly in an adversarial manner. In the first part of the talk, I will discuss what the appropriate performance measures are for this problem (and argue that some measures from prior works might be unreasonable). In the second part of the talk, I will present a new parameter-free algorithm that simultaneously enjoys favorable guarantees under three different performance measures. These guarantees are adaptive to different non-stationarity measures of the payoff matrices and, importantly, recover the best known results when the payoff matrix is fixed. I will conclude the talk by discussing some future directions on the extensions to multi-player bandits and reinforcement learning.

Visit talk page
https://simons.berkeley.edu/talks/no-regret-learning-time-varying-zero-sum-games

Multi-Agent Reinforcement Learning and Bandit Learning

Learning from repeated play in a fixed two-player zero-sum game is a classic problem in game theory and online learning. This talk focuses on a natural yet underexplored variant of this problem where the game payoff matrix changes over time, possibly in an adversarial manner. In the first part of the talk, I will discuss what the appropriate performance measures are for this problem (and argue that some measures from prior works might be unreasonable). In the second part of the talk, I will present a new parameter-free algorithm that simultaneously enjoys favorable guarantees under three different performance measures. These guarantees are adaptive to different non-stationarity measures of the payoff matrices and, importantly, recover the best known results when the payoff matrix is fixed. I will conclude the talk by discussing some future directions on the extensions to multi-player bandits and reinforcement learning.

0:28:0

Eric Mazumdar (Caltech)

https://simons.berkeley.edu/talks/policy-gradients-general-sum-dynamic-games-when-do-they-even-converge

Multi-Agent Reinforcement Learning and Bandit Learning

In this talk I will present work showing that agents using simple policy gradient algorithms in arguably the simplest class of continuous action- and state-space multi-agent control problem: general-sum linear quadratic games, have no guarantees of asymptotic convergence, and that proximal point and extra-gradients will not solve these issues. I will then focus in on zero-sum LQ games in which stronger convergence guarantees are possible when agents use independent policy gradients with a finite timescale separation.

Visit talk page
https://simons.berkeley.edu/talks/policy-gradients-general-sum-dynamic-games-when-do-they-even-converge

Multi-Agent Reinforcement Learning and Bandit Learning

In this talk I will present work showing that agents using simple policy gradient algorithms in arguably the simplest class of continuous action- and state-space multi-agent control problem: general-sum linear quadratic games, have no guarantees of asymptotic convergence, and that proximal point and extra-gradients will not solve these issues. I will then focus in on zero-sum LQ games in which stronger convergence guarantees are possible when agents use independent policy gradients with a finite timescale separation.

0:53:41

Vidya Muthukumar (Georgia Institute of Technology)

https://simons.berkeley.edu/talks/complexity-infinite-horizon-general-sum-stochastic-games-turn-based-and-simultaneous-play

Multi-Agent Reinforcement Learning and Bandit Learning

A common variant of infinite-horizon stochastic games are turn-based, in the sense that only one agent can take an action at a state at a given time. We initiate an algorithmic study of such turn-based stochastic games (TBSG) with both discounted and averaged general-sum rewards, and ask whether it is possible to compute a Nash equilibrium (NE) of a TBSG in time that is polynomial in the number of states (S) and the number of actions per state (A). On the one hand, we show that non-stationary NE are computable in poly(S,A) time and stationary NE are computable in poly(A) time. On the other hand, we show that computing a multiplayer approximate stationary NE is PPAD-complete in S. Our results imply PPAD-hardness for computing stationary coarse correlated equilibria in general-sum simultaneous stochastic games. This constitutes a non-standard separation in complexity between zero-sum and general-sum settings for infinite-horizon stochastic games. Beyond these results, we provide a number of additional characterizations of related stochastic games. For example, we show that computing a stationary NE of an infinite-horizon simultaneous stochastic game is in PPAD despite the inherent non-convexity in utilities. We also identify some special cases of general-sum TBSG for which pure stationary NE always exist and are computable in poly(S,A) time. Underlying all of our results are new structural properties for stochastic games that may be of independent interest.

Visit talk page
https://simons.berkeley.edu/talks/complexity-infinite-horizon-general-sum-stochastic-games-turn-based-and-simultaneous-play

Multi-Agent Reinforcement Learning and Bandit Learning

A common variant of infinite-horizon stochastic games are turn-based, in the sense that only one agent can take an action at a state at a given time. We initiate an algorithmic study of such turn-based stochastic games (TBSG) with both discounted and averaged general-sum rewards, and ask whether it is possible to compute a Nash equilibrium (NE) of a TBSG in time that is polynomial in the number of states (S) and the number of actions per state (A). On the one hand, we show that non-stationary NE are computable in poly(S,A) time and stationary NE are computable in poly(A) time. On the other hand, we show that computing a multiplayer approximate stationary NE is PPAD-complete in S. Our results imply PPAD-hardness for computing stationary coarse correlated equilibria in general-sum simultaneous stochastic games. This constitutes a non-standard separation in complexity between zero-sum and general-sum settings for infinite-horizon stochastic games. Beyond these results, we provide a number of additional characterizations of related stochastic games. For example, we show that computing a stationary NE of an infinite-horizon simultaneous stochastic game is in PPAD despite the inherent non-convexity in utilities. We also identify some special cases of general-sum TBSG for which pure stationary NE always exist and are computable in poly(S,A) time. Underlying all of our results are new structural properties for stochastic games that may be of independent interest.

0:54:35

Noah Golowich (MIT)

https://simons.berkeley.edu/talks/complexity-markov-equilibrium-stochastic-games

Multi-Agent Reinforcement Learning and Bandit Learning

We show that computing approximate stationary Markov coarse correlated equilibria (CCE) in general-sum stochastic games is computationally intractable, even when there are two players, the game is turn-based, the discount factor is an absolute constant, and the approximation is an absolute constant. Our intractability results stand in sharp contrast to normal-form games where exact CCEs are efficiently computable. A fortiori, our results imply that there are no efficient algorithms for learning stationary Markov CCE policies in multi-agent reinforcement learning (MARL), even when the interaction is two-player and turn-based, and both the discount factor and the desired approximation of the learned policies is an absolute constant. In turn, these results stand in sharp contrast to single-agent reinforcement learning (RL) where near-optimal stationary Markov policies can be efficiently learned. Complementing our intractability results for stationary Markov CCEs, we provide a decentralized algorithm (assuming shared randomness among players) for learning a nonstationary Markov CCE policy with polynomial time and sample complexity in all problem parameters. Previous work for learning Markov CCE policies all required exponential time and sample complexity in the number of players. Joint work with Costis Daskalakis and Kaiqing Zhang.

Visit talk page
https://simons.berkeley.edu/talks/complexity-markov-equilibrium-stochastic-games

Multi-Agent Reinforcement Learning and Bandit Learning

We show that computing approximate stationary Markov coarse correlated equilibria (CCE) in general-sum stochastic games is computationally intractable, even when there are two players, the game is turn-based, the discount factor is an absolute constant, and the approximation is an absolute constant. Our intractability results stand in sharp contrast to normal-form games where exact CCEs are efficiently computable. A fortiori, our results imply that there are no efficient algorithms for learning stationary Markov CCE policies in multi-agent reinforcement learning (MARL), even when the interaction is two-player and turn-based, and both the discount factor and the desired approximation of the learned policies is an absolute constant. In turn, these results stand in sharp contrast to single-agent reinforcement learning (RL) where near-optimal stationary Markov policies can be efficiently learned. Complementing our intractability results for stationary Markov CCEs, we provide a decentralized algorithm (assuming shared randomness among players) for learning a nonstationary Markov CCE policy with polynomial time and sample complexity in all problem parameters. Previous work for learning Markov CCE policies all required exponential time and sample complexity in the number of players. Joint work with Costis Daskalakis and Kaiqing Zhang.

0:42:15

Elad Hazan (Princeton University and Google Research)

https://simons.berkeley.edu/talks/regret-minimization-approach-mutli-agent-control-and-rl

Multi-Agent Reinforcement Learning and Bandit Learning

We'll start by describing a new paradigm in reinforcement learning called nonstochastic control, how it relates to existing frameworks, and survey efficient gradient-based methods for regret minimization in this model. We then proceed to describe recent work on multi-agent learning based on regret minimization methods that reach an equilibrium. We'll conclude with remaining challenges and potential directions for further research.

Visit talk page
https://simons.berkeley.edu/talks/regret-minimization-approach-mutli-agent-control-and-rl

Multi-Agent Reinforcement Learning and Bandit Learning

We'll start by describing a new paradigm in reinforcement learning called nonstochastic control, how it relates to existing frameworks, and survey efficient gradient-based methods for regret minimization in this model. We then proceed to describe recent work on multi-agent learning based on regret minimization methods that reach an equilibrium. We'll conclude with remaining challenges and potential directions for further research.

0:52:30

Tamer Başar (University of Illinois Urbana-Champaign)

https://simons.berkeley.edu/talks/multi-agent-reinforcement-learning-high-population-regime

Multi-Agent Reinforcement Learning and Bandit Learning

I will discuss some recent results on learning approximate Nash equilibrium policies in nonzero-sum stochastic dynamic games using the framework of mean-field games (MFGs). Following a general introduction, I will focus, for concrete results, on the structured setting of discrete-time infinite-horizon linear-quadratic-Gaussian dynamic games, where the players (agents) are partitioned into finitely-many populations connected by a network of known structure. Each population has a high number of agents, which are indistinguishable, but there is no indistinguishability across different populations. It is possible to characterize the Nash equilibrium (NE) of the game when the number of agents in each population goes to infinity, the so-called mean-field equilibrium (MFE), with local state information for each agent (thus making scalability not an issue), which can then be shown to lead to an approximate NE when the population sizes are finite, with a precise quantification of the approximation as a function of population sizes. The main focus of the talk, however, will be the model-free versions of such games, for which I will introduce a learning algorithm, based on zero-order stochastic optimization, for computation of the MFE, along with guaranteed convergence. The algorithm exploits the affine structure of both the equilibrium controller (for each population) and the equilibrium MF trajectory by decomposing the learning task into learning first the linear terms, and then the affine terms. One can also obtain a finite-sample bound quantifying the estimation error as a function of the number of samples. The talk will conclude with discussion of some extensions of the setting and future research directions.

Visit talk page
https://simons.berkeley.edu/talks/multi-agent-reinforcement-learning-high-population-regime

Multi-Agent Reinforcement Learning and Bandit Learning

I will discuss some recent results on learning approximate Nash equilibrium policies in nonzero-sum stochastic dynamic games using the framework of mean-field games (MFGs). Following a general introduction, I will focus, for concrete results, on the structured setting of discrete-time infinite-horizon linear-quadratic-Gaussian dynamic games, where the players (agents) are partitioned into finitely-many populations connected by a network of known structure. Each population has a high number of agents, which are indistinguishable, but there is no indistinguishability across different populations. It is possible to characterize the Nash equilibrium (NE) of the game when the number of agents in each population goes to infinity, the so-called mean-field equilibrium (MFE), with local state information for each agent (thus making scalability not an issue), which can then be shown to lead to an approximate NE when the population sizes are finite, with a precise quantification of the approximation as a function of population sizes. The main focus of the talk, however, will be the model-free versions of such games, for which I will introduce a learning algorithm, based on zero-order stochastic optimization, for computation of the MFE, along with guaranteed convergence. The algorithm exploits the affine structure of both the equilibrium controller (for each population) and the equilibrium MF trajectory by decomposing the learning task into learning first the linear terms, and then the affine terms. One can also obtain a finite-sample bound quantifying the estimation error as a function of the number of samples. The talk will conclude with discussion of some extensions of the setting and future research directions.

0:29:51

Ann Nowe (Vrije Universiteit Brussel)

https://simons.berkeley.edu/talks/learning-automata-building-blocks-marl

Multi-Agent Reinforcement Learning and Bandit Learning

In this talk I will show that Learning Automata (LA), and more precisely Reward in Action update schemes are interesting building blocks for Multi-agent RL, both in bandit settings as well as stateful RL. Based on the theorem of Narendra and Wheeler we have convergence guarantees in n-person non-zero sum games. However, LA have also shown to be robust in more relaxed settings, such as queueing systems, where updates happen asynchronously and the feedback sent to the agents is delayed.

Visit talk page
https://simons.berkeley.edu/talks/learning-automata-building-blocks-marl

Multi-Agent Reinforcement Learning and Bandit Learning

In this talk I will show that Learning Automata (LA), and more precisely Reward in Action update schemes are interesting building blocks for Multi-agent RL, both in bandit settings as well as stateful RL. Based on the theorem of Narendra and Wheeler we have convergence guarantees in n-person non-zero sum games. However, LA have also shown to be robust in more relaxed settings, such as queueing systems, where updates happen asynchronously and the feedback sent to the agents is delayed.