Playlist: 25 videos

Multi-Agent Reinforcement Learning and Bandit Learning

0:32:06
Doina Precup (McGill University / DeepMind Montreal)
https://simons.berkeley.edu/talks/tbd-393
Multi-Agent Reinforcement Learning and Bandit Learning
0:38:25
Kaiqing Zhang (MIT)
https://simons.berkeley.edu/talks/tbd-398
Multi-Agent Reinforcement Learning and Bandit Learning
0:38:05
Zhuoran Yang (UC Berkeley)
https://simons.berkeley.edu/talks/sequential-information-design-markov-persuasion-process-and-its-efficient-reinforcement
Multi-Agent Reinforcement Learning and Bandit Learning

In today's economy, it is increasingly important for Internet platforms to consider sequential information design in order to align their long-term interests with the incentives of gig service providers. In this talk, I will introduce a novel model of sequential information design, namely Markov persuasion processes (MPPs), in which a sender with an informational advantage seeks to persuade a stream of myopic receivers to take actions that maximize the sender's cumulative utility in a finite-horizon Markovian environment with varying prior and utility functions. Planning in MPPs thus faces the unique challenge of finding a signaling policy that is simultaneously persuasive to the myopic receivers and induces the optimal long-term cumulative utility for the sender. In the online setting where the model is unknown, I will introduce a provably efficient reinforcement learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of the optimism and pessimism principles and enjoys a sublinear regret upper bound.
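To illustrate the persuasiveness constraint that such a signaling policy must respect at every step, here is a minimal one-shot Bayesian persuasion sketch with a binary state and a threshold receiver. It is a toy sketch, not the MPP model or the OP4 algorithm; the prior, the receiver's threshold, and the function name are illustrative assumptions.

```python
def optimal_signaling(prior_good, threshold=0.5):
    """Classic one-shot Bayesian persuasion with a binary state.

    The sender wants the receiver to take an action; the myopic receiver takes
    it only if the posterior probability of the 'good' state is at least
    `threshold`. Returns the probability of recommending the action in each
    state and the sender's expected utility (the probability the action is taken).
    """
    if prior_good >= threshold:
        # The receiver already acts under the prior; no persuasion is needed.
        return {"good": 1.0, "bad": 1.0}, 1.0
    # Always recommend the action in the good state; in the bad state, recommend
    # just often enough that the posterior after a recommendation equals the
    # threshold, so the obedience (persuasiveness) constraint binds exactly.
    q_bad = prior_good * (1 - threshold) / (threshold * (1 - prior_good))
    prob_action = prior_good + (1 - prior_good) * q_bad
    return {"good": 1.0, "bad": q_bad}, prob_action


scheme, utility = optimal_signaling(prior_good=0.3)
print(scheme, utility)  # threshold 0.5 and prior 0.3 give utility 0.3 / 0.5 = 0.6
```

The scheme recommends the action whenever the state is good and, in the bad state, just often enough that the receiver's posterior exactly meets the threshold; this binding obedience constraint mirrors, in a single step, the persuasiveness requirement that an MPP signaling policy must satisfy.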
0:57:50
Kalesha Bullard (DeepMind)
https://simons.berkeley.edu/talks/multi-agent-reinforcement-learning-towards-zero-shot-communication
Multi-Agent Reinforcement Learning and Bandit Learning
0:51:35
Gabriele Farina (Carnegie Mellon University)
https://simons.berkeley.edu/talks/geometry-correlated-strategies-multiplayer-extensive-form-games-positive-and-parametric
Multi-Agent Reinforcement Learning and Bandit Learning

While extensive-form games (EFGs) can be converted into normal-form games (NFGs), doing so comes at the cost of an exponential blowup of the strategy space. So, progress on NFGs and EFGs has historically followed separate tracks, with the EFG community often having to catch up with advances (e.g., last-iterate convergence and predictive regret bounds) from the larger NFG community. In this talk I will show that the Optimistic Multiplicative Weights Update (OMWU) algorithm -- the premier learning algorithm for NFGs -- can be simulated on the normal-form equivalent of an EFG using a kernel trick, with per-iteration time linear in the size of the game tree. The resulting algorithm, Kernelized OMWU (KOMWU), applies more broadly to all convex games whose strategy space is a polytope with 0/1 integral vertices, as long as the kernel can be evaluated efficiently. In the particular case of EFGs, KOMWU closes several standing gaps between NFG and EFG learning by enabling direct, black-box transfer to EFGs of desirable properties of learning dynamics that were so far known to be achievable only in NFGs. Specifically, KOMWU gives the first algorithm that simultaneously guarantees last-iterate convergence, lower dependence on the size of the game tree than all prior algorithms, and \tilde{O}(1) regret when followed by all players.

Based on joint work with Christian Kroer, Chung-Wei Lee, and Haipeng Luo. ArXiv Link: https://arxiv.org/abs/2202.00237
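The kernelized construction is beyond a short snippet, but the underlying OMWU dynamic is easy to state. Below is a minimal sketch of plain OMWU on a small normal-form zero-sum game (rock-paper-scissors); the step size, iteration count, and example game are illustrative assumptions, and this is not the kernelized KOMWU algorithm for EFGs described above.

```python
import numpy as np

# Rock-paper-scissors payoffs for the row player (illustrative example game).
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def omwu_step(x, gain, prev_gain, eta):
    """One Optimistic Multiplicative Weights Update step (written with gains):
    reweight each action by exp(eta * (2 * g_t - g_{t-1})) and renormalize."""
    logits = np.log(x) + eta * (2 * gain - prev_gain)
    w = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return w / w.sum()

eta = 0.1
x = np.ones(3) / 3   # row player's mixed strategy
y = np.ones(3) / 3   # column player's mixed strategy
gx_prev, gy_prev = np.zeros(3), np.zeros(3)

for _ in range(2000):
    gx, gy = A @ y, -A.T @ x      # each player's expected gain per action
    x = omwu_step(x, gx, gx_prev, eta)
    y = omwu_step(y, gy, gy_prev, eta)
    gx_prev, gy_prev = gx, gy

print(x, y)  # both iterates should be close to the uniform minimax strategy
```

The optimistic correction 2*g_t - g_{t-1} is the only difference from plain multiplicative weights, and it is what gives OMWU its last-iterate convergence in two-player zero-sum games.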
0:40:25
Marc Lanctot (DeepMind)
https://simons.berkeley.edu/talks/general-game-theoretic-multiagent-reinforcement-learning
Multi-Agent Reinforcement Learning and Bandit Learning

Regret-minimizing agents in self-play have been used with much success to learn approximately minimax-optimal strategies, scaling to large hold’em poker games and to super-human performance in very large multiplayer games. This prescriptive approach has guided the development of algorithms for two-player zero-sum games, and similarly for fully cooperative games. What about the fully general case: what could a prescriptive agenda look like there? Is there an agent-centric criterion that can be optimized without relying on outside authorities or third parties? In this talk, I will briefly survey recent approaches to game-theoretic multiagent reinforcement learning in general games, and then focus on ideas that could help answer these open questions.
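As a small, concrete instance of the self-play setup mentioned above, the sketch below runs regret matching in self-play on rock-paper-scissors; the time-averaged strategies approach the (uniform) minimax strategy. The choice of regret matching, the example game, and the iteration count are illustrative assumptions rather than the specific methods surveyed in the talk.

```python
import numpy as np

# Rock-paper-scissors payoffs for the row player (illustrative example game).
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def regret_matching(regrets):
    """Play each action with probability proportional to its positive cumulative
    regret; fall back to uniform when no action has positive regret."""
    pos = np.maximum(regrets, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.ones(len(regrets)) / len(regrets)

n = A.shape[0]
reg_x, reg_y = np.zeros(n), np.zeros(n)
avg_x, avg_y = np.zeros(n), np.zeros(n)

for _ in range(20000):
    x, y = regret_matching(reg_x), regret_matching(reg_y)
    ux, uy = A @ y, -A.T @ x       # expected payoff of each pure action
    reg_x += ux - x @ ux           # regret relative to the mixed strategy played
    reg_y += uy - y @ uy
    avg_x += x
    avg_y += y

# In two-player zero-sum games, the average strategies of no-regret learners
# converge to a minimax equilibrium (here, the uniform strategy).
print(avg_x / avg_x.sum(), avg_y / avg_y.sum())
```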
0:34:11
Simon Du (University of Washington)
https://simons.berkeley.edu/talks/when-offline-two-player-zero-sum-markov-game-solvable
Multi-Agent Reinforcement Learning and Bandit Learning

We study what dataset assumptions permit solving offline two-player zero-sum Markov games. In stark contrast to the offline single-agent Markov decision process, we show that the single-strategy concentration assumption is insufficient for learning the Nash equilibrium (NE) strategy in offline two-player zero-sum Markov games. On the other hand, we propose a new assumption, named unilateral concentration, and design a pessimism-type algorithm that is provably efficient under it. In addition, we show that the unilateral concentration assumption is necessary for learning an NE strategy. Furthermore, our algorithm can achieve minimax sample complexity, without any modification, in two widely studied settings: datasets satisfying the uniform concentration assumption and turn-based Markov games. Our work serves as an important initial step toward understanding offline multi-agent reinforcement learning.
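To give a feel for the pessimism principle in the simplest possible setting, the sketch below works with a one-shot (matrix game) simplification of the offline problem: it forms empirical payoff estimates from a fixed dataset, subtracts a count-based confidence penalty, and computes the row player's maximin strategy by linear programming. This is only an illustrative sketch under those simplifying assumptions, not the algorithm from the talk; the penalty form, the constant beta, and the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def pessimistic_maximin(samples, n_rows, n_cols, beta=1.0):
    """Pessimism sketch for an offline zero-sum matrix game.

    samples: iterable of (row_action, col_action, reward) tuples collected offline.
    Builds an empirical payoff matrix, subtracts a count-based penalty
    beta / sqrt(count) (a lower confidence bound for the row player), and returns
    the row player's maximin strategy and value for the pessimistic matrix.
    """
    sums = np.zeros((n_rows, n_cols))
    counts = np.zeros((n_rows, n_cols))
    for i, j, r in samples:
        sums[i, j] += r
        counts[i, j] += 1
    mean = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    penalty = beta / np.sqrt(np.maximum(counts, 1.0))
    penalty[counts == 0] = beta       # unseen action pairs get the largest penalty
    A_pess = mean - penalty           # pessimistic payoffs for the row player

    # Maximin LP: maximize v subject to x^T A_pess[:, j] >= v for every column j,
    # encoded for linprog as minimizing -v with inequality rows [-A_pess[:, j], 1].
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    A_ub = np.hstack([-A_pess.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_rows], res.x[-1]

# Illustrative offline dataset for a 2x2 matching-pennies-style game.
data = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 0.0), (1, 1, 1.0)] * 25
strategy, value = pessimistic_maximin(data, n_rows=2, n_cols=2)
print(strategy, value)
```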
0:44:36
Sylvain Sorin (Sorbonne Universite)
https://simons.berkeley.edu/talks/variants-and-invariants-no-regret-algorithms
Multi-Agent Reinforcement Learning and Bandit Learning

We describe some properties of no-regret algorithms in discrete and continuous time.
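The abstract is brief, so as a concrete reference point here is a minimal discrete-time sketch of one canonical no-regret algorithm, exponential weights (Hedge), whose continuous-time analogue is the replicator dynamic. The step size and the random loss sequence are illustrative assumptions.

```python
import numpy as np

def hedge(losses, eta=0.1):
    """Exponential weights (Hedge), a canonical discrete-time no-regret algorithm.

    losses: array of shape (T, K) with the loss of each of K actions per round.
    Returns the played distributions and the final external regret.
    """
    T, K = losses.shape
    log_w = np.zeros(K)
    plays, incurred = [], 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        plays.append(p)
        incurred += p @ losses[t]      # expected loss of the mixed play
        log_w -= eta * losses[t]       # multiplicative-weights update
    regret = incurred - losses.sum(axis=0).min()  # vs. the best fixed action
    return np.array(plays), regret

rng = np.random.default_rng(0)
losses = rng.random((5000, 4))   # illustrative i.i.d. loss sequence
_, regret = hedge(losses)
print(regret / 5000)             # the average regret shrinks toward zero
```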
0:36:25
Alex Peysakhovich (Meta AI)
https://simons.berkeley.edu/talks/what-does-machine-learning-offer-game-theory-and-vice-versa
Multi-Agent Reinforcement Learning and Bandit Learning

One of the major reasons that deep learning for supervised learning has been so successful is that deep networks allow us to tame high dimensional input spaces and transform them into simpler, lower dimensional objects that we can work with. This talk argues that a similar approach using ML techniques to simplify scenarios to their "low dimensional game theoretic representation" can yield progress in building agents that cooperate and coordinate, as well as allow us to apply mechanism design ideas at large scale.
0:40:34
Mark Sellke (Stanford)
https://simons.berkeley.edu/talks/multi-player-bandits-no-collisions
Multi-Agent Reinforcement Learning and Bandit Learning

In the stochastic multi-player bandit problem, m > 1 players cooperate to maximize their total reward on a set of K > m bandit arms. However, the players cannot communicate online and are penalized (e.g., they receive no reward) if they collide by pulling the same arm at the same time. This problem was introduced in the context of wireless radio communication and serves as a natural model for decentralized online decision-making. There have been many results for different versions of this model, most of which rely on a small number of collisions in order to communicate implicitly. However, it was later realized that nearly optimal T^{1/2} regret is possible with no collisions (and hence no communication) at all. I will discuss this construction, as well as our recent characterization of the Pareto-optimal trade-offs for gap-dependent regret without communication. Based on collaborations with Sebastien Bubeck, Thomas Budzinski, and Allen Liu.
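To make the model concrete, the toy simulation below implements the collision rule (colliding players receive zero reward) together with a naive staggered explore-then-commit strategy that uses no online communication. This is an illustrative sketch only: the strategy, arm means, horizon, and exploration length are assumptions, and it is not the no-collision T^{1/2} construction or the Pareto-optimal trade-off analysis from the talk.

```python
import numpy as np

def simulate(means, m, T, explore=2000, seed=0):
    """Toy multi-player bandit with the collision rule from the model above:
    players pulling the same arm in the same round all receive zero reward.

    Illustrative strategy (NOT the talk's algorithm): each player knows its own
    index, explores in a staggered round-robin (which is collision-free), then
    commits to the arm whose empirical mean ranks at its index. With enough
    exploration the rankings agree across players and no further collisions occur.
    """
    rng = np.random.default_rng(seed)
    K = len(means)
    pulls = np.zeros((m, K))
    rewards = np.zeros((m, K))
    total = 0.0
    commit = None
    for t in range(T):
        if t < explore:
            arms = [(t + i) % K for i in range(m)]    # staggered, hence no collisions
        else:
            if commit is None:
                est = rewards / np.maximum(pulls, 1)   # each player's own estimates
                commit = [int(np.argsort(-est[i])[i]) for i in range(m)]
            arms = commit
        for i, a in enumerate(arms):
            collided = arms.count(a) > 1
            r = 0.0 if collided else rng.binomial(1, means[a])
            pulls[i, a] += 1
            rewards[i, a] += r
            total += r
    best = sum(sorted(means, reverse=True)[:m]) * T    # best collision-free allocation
    return best - total                                # realized group regret

print(simulate(means=[0.9, 0.8, 0.5, 0.3], m=2, T=20000))
```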