Abstract
We discuss the solution of multistage decision problems using methods that are based on the idea of policy iteration (PI for short), i.e., start from some base policy and generate an improved policy. Rollout
is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program.
In this paper we focus on rollout and PI-like methods for multiagent problems, where the control consists of multiple components each selected (conceptually) by a separate agent. We discuss an approach, whereby at every stage, the agents sequentially (one-at-a-time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees an improved performance relative to the base policy.
We first develop our agent-by-agent policy improvement approach for finite horizon problems, and then we extend it to exact and approximate PI for discounted and other infinite horizon problems. We prove that the cost improvement property steers the algorithm towards convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents.