Playlist: 25 videos

Multi-Agent Reinforcement Learning and Bandit Learning

0:53:35
Ioannis Panageas (UC Irvine)
https://simons.berkeley.edu/talks/tbd-399
Multi-Agent Reinforcement Learning and Bandit Learning

Potential games are arguably one of the most important and widely studied classes of normal-form games. They define the archetypal setting of multi-agent coordination, as all agent utilities are perfectly aligned with each other via a common potential function. Can this intuitive framework be transplanted to the setting of Markov Games? What are the similarities and differences between multi-agent coordination with and without state dependence? We present a novel definition of Markov Potential Games (MPG) that generalizes prior attempts at capturing complex stateful multi-agent coordination. Counter-intuitively, insights from normal-form potential games do not carry over, as MPGs can consist of settings where state-games can be zero-sum games. In the opposite direction, Markov games where every state-game is a potential game are not necessarily MPGs. Nevertheless, MPGs showcase standard desirable properties such as the existence of deterministic Nash policies. In our main technical result, we prove fast convergence of independent policy gradient (and its stochastic variant) to Nash policies by adapting recent gradient dominance property arguments developed for single-agent MDPs to multi-agent learning settings.
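For concreteness, here is a minimal sketch of independent projected policy gradient on a tiny single-state potential game, the degenerate special case of an MPG; the payoff matrix, step size, and horizon are illustrative assumptions, and this is not the talk's algorithm or convergence analysis.

```python
# A minimal sketch (not the paper's algorithm or analysis): independent
# projected policy gradient on a 2-player normal-form potential game, the
# single-state special case of an MPG. Payoff matrix, step size, and
# horizon are illustrative assumptions.
import numpy as np

# Common potential: both players want to coordinate; each receives phi[a1, a2].
phi = np.array([[1.0, 0.0],
                [0.0, 2.0]])

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

x = np.array([0.6, 0.4])   # player 1's mixed strategy
y = np.array([0.6, 0.4])   # player 2's mixed strategy
eta = 0.1                  # step size (assumed)

for _ in range(500):
    # Each player independently ascends the gradient of its own utility
    # x^T phi y, then projects back onto the simplex.
    grad_x = phi @ y
    grad_y = phi.T @ x
    x = project_simplex(x + eta * grad_x)
    y = project_simplex(y + eta * grad_y)

print("player 1 policy:", np.round(x, 3))
print("player 2 policy:", np.round(y, 3))
# Both policies concentrate on the second action, the potential-maximizing Nash.
```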
0:59:05
Drew Fudenberg (MIT)
https://simons.berkeley.edu/talks/learning-and-equilibrium-refinements
Multi-Agent Reinforcement Learning and Bandit Learning

The learning in games literature interprets equilibrium strategy profiles as the long-run average behavior of agents who are selected at random to play the game. In normal-form games we expect that as the agents accumulate evidence about play of the game they will develop accurate beliefs, so that the stationary points of the process correspond to the Nash equilibria. There is no reason to expect learning by myopic agents to lead to Nash equilibrium in general games, as agents may not experiment enough to learn the consequences of deviating from the equilibrium path. The focus here is on settings where the agents are patient, so they do have an incentive to experiment, and stationary points must be Nash equilibria. However, extensive-form games typically have many Nash equilibria, and not all of them seem equally plausible. This talk discusses the restrictions that learning models impose on Nash equilibria and how these differ from the restrictions of classical equilibrium refinements.
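As a toy illustration of the normal-form learning story above, here is a minimal sketch of belief-based learning (fictitious play) in a 2x2 coordination game; the payoffs and horizon are illustrative assumptions, and the sketch does not touch the extensive-form refinement questions the talk is about.

```python
# A minimal sketch of belief-based learning (fictitious play) in a 2x2
# coordination game: agents accumulate evidence about the opponent's play,
# and their empirical behavior settles at a Nash equilibrium. Payoffs and
# horizon are illustrative assumptions, not taken from the talk.
import numpy as np

A = np.array([[2.0, 0.0],   # row player's payoffs
              [0.0, 1.0]])
B = A.copy()                # symmetric coordination game

counts_row = np.ones(2)     # row player's counts of the column player's actions
counts_col = np.ones(2)     # column player's counts of the row player's actions

for t in range(1000):
    belief_about_col = counts_row / counts_row.sum()
    belief_about_row = counts_col / counts_col.sum()
    a_row = int(np.argmax(A @ belief_about_col))      # best response to belief
    a_col = int(np.argmax(B.T @ belief_about_row))
    counts_row[a_col] += 1
    counts_col[a_row] += 1

print("empirical play of column player:", counts_row / counts_row.sum())
print("empirical play of row player:   ", counts_col / counts_col.sum())
# Both empirical frequencies concentrate on the (first, first) Nash profile.
```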
0:55:51
Amy Greenwald (Brown University)
https://simons.berkeley.edu/talks/no-regret-learning-extensive-form-games
Multi-Agent Reinforcement Learning and Bandit Learning

The convergence of \Phi-regret-minimization algorithms in self-play to \Phi-equilibria is well understood in normal-form games (NFGs), where \Phi is the set of deviation strategies. This talk investigates the analogous relationship in extensive-form games (EFGs). While the primary choices for \Phi in NFGs are internal and external regret, the space of possible deviations in EFGs is much richer. We restrict attention to a class of deviations known as behavioral deviations, inspired by von Stengel and Forges' deviation player, which they introduced when defining extensive-form correlated equilibria (EFCE). We then propose extensive-form regret minimization (EFR), a regret-minimizing learning algorithm whose complexity scales with the complexity of \Phi, and which converges in self-play to EFCE when \Phi is the set of behavioral deviations. Von Stengel and Forges, Zinkevich et al., and Celli et al. all weaken the deviation player in various ways, and then derive corresponding efficient equilibrium-finding algorithms. These weakenings (and others) can be seamlessly encoded into EFR at runtime, by simply defining an appropriate \Phi. The result is a class of efficient \Phi-equilibrium finding algorithms for EFGs.
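For intuition about the NFG baseline, here is a minimal sketch of the simplest instance of \Phi-regret minimization: regret matching against external deviations, run in self-play, whose average play approaches the corresponding \Phi-equilibrium. The game, horizon, and random seed are illustrative assumptions; the talk's EFR algorithm for behavioral deviations in EFGs is substantially richer.

```python
# A minimal sketch of Phi-regret minimization in a normal-form game when Phi
# is the set of external (constant) deviations: regret matching in self-play.
# Matching-pennies payoffs, horizon, and seed are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])        # matching pennies, row player's payoffs
payoffs = [A, -A]                  # zero-sum: column player receives -A

regrets = [np.zeros(2), np.zeros(2)]
avg_play = [np.zeros(2), np.zeros(2)]
T = 20000

def regret_matching_policy(r):
    """Play actions in proportion to their positive cumulative regret."""
    pos = np.maximum(r, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(len(r), 1.0 / len(r))

for t in range(T):
    pis = [regret_matching_policy(r) for r in regrets]
    acts = [rng.choice(2, p=pi) for pi in pis]
    # Counterfactual payoffs: what each action would have earned this round.
    u_row = payoffs[0][:, acts[1]]
    u_col = payoffs[1][acts[0], :]
    regrets[0] += u_row - u_row[acts[0]]   # external-regret update
    regrets[1] += u_col - u_col[acts[1]]
    for i in range(2):
        avg_play[i] += pis[i]

print("average policies:", [np.round(p / T, 3) for p in avg_play])
# Average play approaches (1/2, 1/2) for both players, the unique equilibrium.
```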
0:50:56
Na Li (Harvard University)
https://simons.berkeley.edu/talks/tbd-400
Multi-Agent Reinforcement Learning and Bandit Learning

Multiagent reinforcement learning has received growing interest across a variety of problem settings and applications. We will first present our recent work on learning decentralized policies in networked multiagent systems under a cooperative setting. Specifically, we propose a Scalable Actor Critic (SAC) framework that exploits the network structure and finds a local, decentralized policy that is an O(ρ^κ)-approximation of a first-order stationary point of the global objective for some ρ∈(0,1). Motivated by the question of characterizing the performance of these stationary points, we look into the case where states can be shared among agents but agents still need to take actions following decentralized policies. We show that even when agents have identical interests, the first-order stationary points correspond only to Nash equilibria. This observation naturally leads to the use of the stochastic game framework to characterize the performance of policy gradient methods for decentralized policies in multiagent MDP systems.

Joint work with Guannan Qu, Adam Wierman, Runyu Zhang, Zhaolin Ren
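A minimal sketch, under assumed payoffs and initialization, of the point that first-order stationary points of decentralized policy gradient need only be Nash equilibria even when agents have identical interests; this is not the talk's Scalable Actor Critic framework or its networked setting.

```python
# A minimal sketch: two agents run independent softmax policy gradient on a
# one-state team game with a common reward. Started near the inferior
# equilibrium, they remain there, illustrating that a first-order stationary
# point can be a Nash equilibrium rather than the team optimum. Payoffs, step
# size, and initialization are illustrative assumptions (not the talk's SAC).
import numpy as np

R = np.array([[1.0, 0.0],
              [0.0, 3.0]])        # common reward; (2,2) is the team optimum

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

theta1 = np.array([2.0, 0.0])     # both agents initialized to favor action 1
theta2 = np.array([2.0, 0.0])
eta = 0.5

for _ in range(2000):
    p1, p2 = softmax(theta1), softmax(theta2)
    # Exact policy gradients of E[R] = p1^T R p2 with respect to the logits.
    q1 = R @ p2                   # agent 1's action values given agent 2
    q2 = R.T @ p1
    theta1 += eta * p1 * (q1 - p1 @ q1)
    theta2 += eta * p2 * (q2 - p2 @ q2)

print("agent 1 policy:", np.round(softmax(theta1), 3))
print("agent 2 policy:", np.round(softmax(theta2), 3))
# Both policies stay concentrated on action 1: a Nash equilibrium of value 1,
# not the value-3 optimum, i.e. a stationary point that is "only" Nash.
```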
0:55:05
Dorsa Sadigh (Stanford University)
https://simons.berkeley.edu/talks/role-conventions-adaptive-human-ai-interaction
Multi-Agent Reinforcement Learning and Bandit Learning

Today I will be discussing some of the challenges and lessons learned in partner modeling for decentralized multi-agent coordination. We will start by discussing the role of representation learning in learning effective conventions and latent partner strategies, and how one can leverage the learned conventions within a reinforcement learning loop to achieve coordination, collaboration, and influencing. We will then extend the notion of influencing beyond optimizing for long-horizon objectives, and analyze how strategies that stabilize latent partner representations can be effective in reducing non-stationarity and achieving a more desirable learning outcome. Finally, we will formalize the problem of decentralized multi-agent coordination as a collaborative multi-armed bandit with partial observability, and demonstrate that partner modeling strategies are effective approaches for achieving logarithmic regret.
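As a point of reference for the regret claim, here is a minimal sketch of the classic single-agent building block, UCB1 on a stochastic bandit, whose pseudo-regret grows logarithmically; the arm means, horizon, and confidence width are illustrative assumptions, and the talk's decentralized, partially observed, partner-modeling setting is not captured here.

```python
# A minimal sketch of a logarithmic-regret building block: UCB1 on a standard
# stochastic multi-armed bandit. Arm means, horizon, and confidence width are
# illustrative assumptions; the talk's collaborative, partially observed
# multi-agent setting with partner modeling is more involved.
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])        # Bernoulli arm means (assumed)
T = 10000

counts = np.zeros(len(means))
sums = np.zeros(len(means))
regret = 0.0

for t in range(1, T + 1):
    if t <= len(means):
        arm = t - 1                      # pull each arm once to initialize
    else:
        ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
        arm = int(np.argmax(ucb))        # optimism in the face of uncertainty
    reward = rng.binomial(1, means[arm])
    counts[arm] += 1
    sums[arm] += reward
    regret += means.max() - means[arm]   # pseudo-regret of this pull

print(f"pseudo-regret after {T} rounds: {regret:.1f}")   # grows ~ log(T)
```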