Primal-Dual π Learning: Sample Complexity and Sublinear Run Time for Ergodic Markov Decision Problems
Consider the problem of approximating the optimal policy of a Markov decision process (MDP) by sampling state transitions. In contrast to existing reinforcement learning methods that are based on successive approximations to the nonlinear Bellman equation, we propose a Primal-Dual π Learning method in light of the linear duality between the value and policy. The …
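
For readers unfamiliar with the "linear duality between the value and policy" mentioned above, the following is a sketch of the standard textbook linear-programming formulation of an average-reward (ergodic) MDP. It is not taken from the truncated abstract, and the notation is an assumption chosen for illustration: P_a is the transition matrix under action a, r_a the reward vector, v̄ the optimal average reward, h a differential value vector, and μ_a a stationary state-action distribution. The paper's own notation and exact formulation may differ.

```latex
% Assumed notation (not from the source): states i, actions a,
% transition matrix P_a, reward vector r_a, average reward \bar v,
% differential value vector h, state-action distribution \mu_a.
\begin{align*}
\text{(primal, value)}\quad
  & \min_{\bar v,\, h}\ \bar v
    \quad \text{s.t.}\quad \bar v\,\mathbf{1} + (I - P_a)\,h \ \ge\ r_a
    \quad \forall a, \\
\text{(dual, policy)}\quad
  & \max_{\mu \ge 0}\ \sum_a \mu_a^{\top} r_a
    \quad \text{s.t.}\quad \sum_a (I - P_a^{\top})\,\mu_a = 0,
    \quad \sum_a \mathbf{1}^{\top}\mu_a = 1, \\
\text{(policy recovery)}\quad
  & \pi(a \mid i) = \frac{\mu_a(i)}{\sum_{a'} \mu_{a'}(i)}
    \quad \text{whenever } \sum_{a'} \mu_{a'}(i) > 0 .
\end{align*}
```

Both programs are linear, in contrast to the nonlinear Bellman equation: the primal variables encode the value and the dual variables encode a policy through the recovery formula. Primal-dual methods typically work on the saddle point of the associated Lagrangian, where each term involving P_a can in principle be estimated from sampled state transitions.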