Optimizing interpretable policies for Markov Decision Processes (MDPs) can be computationally intractable for large-scale MDPs, e.g., for monotone policies, the optimal interpretable policy depends on the initial state distribution, precluding standard dynamic programming techniques. Previous work has proposed Monotone Policy Iteration (MPI) to produce a feasible solution for warm starting a Mixed Integer Linear Program (MILP) that finds an optimal monotone policy. However, this prior work did not investigate the convergence and optimality of this algorithm, nor did it investigate the impact of state ordering rules, i.e., the order in which policy improvement steps are performed in MPI. In this study, we analytically characterize the convergence and optimality of the MPI algorithm. To test MPI numerically, we formulate a machine maintenance problem wherein the optimal policy is guaranteed to be monotone. We also propose and investigate 19 state ordering rules for MPI based on each state's value function, initial probability, and stationary distribution. Our analytical study reveals that both the convergence of the MPI algorithm and its optimality upon convergence depend on the state ordering rule. Computational results on perturbations of the machine maintenance problem reveal that the best-performing state ordering rules are those that prioritize states with near-optimal values. Specifically, these state ordering rules can yield policies with an optimality gap of less than 1% and enable near-optimal interpretable policies to be found in under 300 seconds for an MDP with 40 actions and 400 states. In contrast, the MILP formulation could not find an optimal solution within 1800 seconds.
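To make the idea of state-ordered policy improvement concrete, the following is a minimal sketch of a monotone policy iteration loop. It is not the paper's exact MPI algorithm; the array shapes (`P[a, s, s']`, `R[s, a]`), the all-zeros starting policy, and the way monotonicity is enforced (bounding each state's action by the current actions of its neighboring states) are illustrative assumptions. The `state_order` argument plays the role of a state ordering rule.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma):
    """Solve (I - gamma * P_pi) v = r_pi for a deterministic policy."""
    n = P.shape[1]
    P_pi = np.array([P[policy[s], s, :] for s in range(n)])
    r_pi = np.array([R[s, policy[s]] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def monotone_policy_iteration(P, R, gamma, state_order, max_iter=100):
    """Hypothetical sketch: improve states one at a time in `state_order`,
    restricting candidate actions so the policy stays nondecreasing in the state index."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)  # trivially monotone starting policy (assumption)
    for _ in range(max_iter):
        v = evaluate_policy(P, R, policy, gamma)
        changed = False
        for s in state_order:
            # Monotonicity bounds induced by the current policy at neighboring states.
            lo = policy[s - 1] if s > 0 else 0
            hi = policy[s + 1] if s < n_states - 1 else n_actions - 1
            q = R[s, lo:hi + 1] + gamma * P[lo:hi + 1, s, :] @ v
            best = lo + int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                changed = True
        if not changed:  # no state changed its action: the loop has converged
            break
    return policy
```

Under this sketch, a state ordering rule is simply a permutation of the state indices, e.g., `state_order = np.argsort(-v_hat)` to prioritize states by an estimate of their value, mirroring the value-based rules studied in the paper.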
Citation
X. Gong, S.J. Lee, G.-G. P. Garcia (2023). Analysis of Monotone Policy Iteration for Interpretable Policies in Markov Decision Processes: Impact of State Ordering Rules.