Abstract

We prove that the classic policy-iteration method (Howard 1960), including the Simplex method (Dantzig 1947) with the most-negative-reduced-cost pivoting rule, is a strongly polynomial-time algorithm for solving the Markov decision problem (MDP) with any fixed discount factor. Furthermore, the computational complexity of the policy-iteration method (including the Simplex method) is superior to that of the only known strongly polynomial-time interior-point algorithm for solving this problem. The result is surprising, since the Simplex method with the same pivoting rule was shown to be exponential for solving general linear programming (LP) problems; the Simplex (or simple policy-iteration) method with the smallest-index pivoting rule was shown to be exponential for solving an MDP, regardless of the discount factor; and the policy-iteration method was recently shown to be exponential for solving undiscounted MDPs. We also extend the result to solving MDPs with sub-stochastic and transient state-transition probability matrices.
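
For concreteness, the following is a minimal sketch of the policy-iteration method on a discounted MDP, written in NumPy; the array layout, function name, and the choice of Howard's all-state improvement step are illustrative assumptions, not taken from the paper. The Simplex variant discussed above corresponds to switching the action in only one state per iteration, chosen by the most-negative-reduced-cost rule.

```python
import numpy as np

def policy_iteration(P, r, gamma, max_iters=1000):
    """Howard's policy iteration for a discounted MDP (illustrative sketch).

    P: transition probabilities, shape (A, S, S); P[a, s, t] = Pr(t | s, a).
    r: rewards, shape (A, S); r[a, s] = immediate reward of action a in state s.
    gamma: fixed discount factor in [0, 1).
    Returns a policy (one action index per state) and its value vector.
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)        # start from an arbitrary policy
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[policy, np.arange(S)]     # rows: transitions under the policy
        r_pi = r[policy, np.arange(S)]
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy one-step lookahead in every state
        # (switching a single best state instead gives simple policy iteration).
        q = r + gamma * (P @ v)            # shape (A, S)
        new_policy = q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break                          # no improving switch: policy is optimal
        policy = new_policy
    return policy, v
```

The paper's result says that, for any fixed gamma, the loop above terminates after a number of iterations polynomial in the numbers of states and actions alone, independent of the numerical values of the rewards and transition probabilities.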

Video Recording