LAMP implements primal mirror prox updates by tracking an averaged dual sequence, reducing storage complexity from \({O}(nm)\) to \({O}(n+m)\) while preserving dense, GPU-friendly reductions.
Consequently, LAMP preserves the last-iterate \(\widetilde{{O}}( nm\varepsilon^{-1})\) arithmetic complexity of conservatively parameterized primal-dual mirror prox. We further analyze LAMP as a direct optimal transport solver in a more performant parameter regime, providing a last-iterate sub-optimality certificate dependent on infeasibility and an explicit \({O}(1/t)\) term. Moreover, we give a computable sufficient condition for best-iterate convergence to a saddle-point.
Numerical experiments with an optimized CUDA implementation show that LAMP outperforms first-order baselines in several high-accuracy (entropic) optimal transport problems. LAMP is further shown to scale up to problems with \(n=m=2^{18}\) marginal supports, which were previously beyond the reach of primal-dual first-order methods.