Optimizing interpretable policies for Markov Decision Processes (MDPs) can be computationally intractable for large-scale MDPs; for example, the optimal monotone policy depends on the initial state distribution, precluding standard dynamic programming techniques. Previous work proposed Monotone Policy Iteration (MPI) to produce a feasible solution for warm starting a Mixed Integer Linear Program (MILP) that finds an optimal monotone policy. However, that work investigated neither the convergence and optimality of the algorithm nor the impact of state ordering rules, i.e., the order in which policy improvement steps are performed in MPI. In this study, we analytically characterize the convergence and optimality of the MPI algorithm, introduce a modified MPI (MMPI) algorithm, and show that MMPI improves upon MPI. To test MMPI numerically, we conduct experiments in two settings: 1) perturbations of a machine maintenance problem wherein the optimal policy is guaranteed to be monotone or near-monotone and 2) randomly generated MDPs. We propose and investigate 19 state ordering rules for MMPI based on each state's value function, initial probability, and stationary distribution. Computational results reveal a trade-off between computational time and optimality gap: in the structured machine maintenance setting, the fastest state ordering rules still yield high-quality policies, whereas the trade-off is more pronounced in the random MDP setting. Across both settings, the random state ordering rule performs best in terms of optimality gap (less than approximately 5% on average) at the expense of computational time.
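To make the ideas concrete, the sketch below illustrates a generic monotone policy-improvement sweep over states visited in a user-supplied order. It is not the authors' MPI/MMPI implementation; it assumes a finite MDP in which states and actions are indexed so that "monotone" means the action index is nondecreasing in the state index, and the names (`evaluate`, `monotone_sweep`, `P`, `R`, `order`) are illustrative choices rather than notation from the paper.

```python
# Hedged sketch only: a monotone policy-improvement sweep, not the paper's algorithm.
# Assumptions: P[a] is the |S| x |S| transition matrix of action a, R[s, a] is the
# reward, gamma is the discount factor, and the initial policy is already monotone
# (action indices nondecreasing in the state index).
import numpy as np

def evaluate(policy, P, R, gamma):
    """Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi."""
    n = len(policy)
    P_pi = np.array([P[policy[s]][s] for s in range(n)])
    r_pi = np.array([R[s, policy[s]] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def monotone_sweep(policy, P, R, gamma, order):
    """Visit states in `order`; give each the best action that keeps the
    policy monotone given the currently assigned actions of the other states."""
    V = evaluate(policy, P, R, gamma)
    new_policy = list(policy)
    m = R.shape[1]
    for s in order:
        lo = max(new_policy[:s], default=0)          # actions fixed at lower states
        hi = min(new_policy[s + 1:], default=m - 1)  # actions fixed at higher states
        feasible = range(lo, hi + 1)                 # choices preserving monotonicity
        q = {a: R[s, a] + gamma * P[a][s] @ V for a in feasible}
        new_policy[s] = max(q, key=q.get)
    return new_policy

# Tiny example: 3 states, 2 actions, all-zeros (trivially monotone) start,
# with states visited in natural order; other ordering rules just change `order`.
rng = np.random.default_rng(0)
P = [rng.dirichlet(np.ones(3), size=3) for _ in range(2)]
R = rng.random((3, 2))
policy = [0, 0, 0]
for _ in range(20):  # repeat sweeps until the policy stabilizes
    updated = monotone_sweep(policy, P, R, 0.9, order=range(3))
    if updated == policy:
        break
    policy = updated
print(policy)
```

Under these assumptions, the only knob a state ordering rule changes is the `order` argument, which is where heuristics based on value functions, initial probabilities, or stationary distributions would plug in.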
Citation
S.J. Lee, X. Gong, G.-G. P. Garcia (2024). Modified Monotone Policy Iteration for Interpretable Policies in Markov Decision Processes and the Impact of State Ordering Rules.