X.Y. Han – Optimization Online

A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models

Published: 2025/12/02, Updated: 2026/04/26

Categories: Convex and Nonsmooth Optimization, Optimization in Data Science, Stochastic Programming
Tags: artificial intelligence, auxiliary loss free load balancing, deepseek, load balancing, online convex optimization, online optimization, primal-dual algorithms, sparse mixture of experts

In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of costly GPUs and for the thorough training of architecture …
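To make the routing idea in this abstract concrete, here is a minimal sketch of top-k expert selection with a per-expert bias that is nudged toward balanced loads. It is not taken from the paper: the sign-based bias update, the function names (`route_top_k`, `expert_loads`, `update_bias`), the step size, and the synthetic skewed scores are all assumptions chosen for illustration.

```python
import numpy as np

def route_top_k(scores, bias, k):
    """Return a (tokens, k) array of expert indices chosen by biased score."""
    biased = scores + bias  # assumption: the bias only influences selection
    return np.argsort(-biased, axis=1)[:, :k]

def expert_loads(assignments, num_experts):
    """Count how many token slots each expert received."""
    return np.bincount(assignments.ravel(), minlength=num_experts)

def update_bias(bias, loads, step):
    """Hypothetical update: raise the bias of underloaded experts, lower overloaded ones."""
    target = loads.mean()
    return bias + step * np.sign(target - loads)

# Toy simulation with synthetic scores that favour higher-indexed experts.
rng = np.random.default_rng(0)
num_tokens, num_experts, k = 512, 8, 2
skew = np.linspace(0.0, 2.0, num_experts)
bias = np.zeros(num_experts)
for _ in range(200):
    scores = rng.normal(size=(num_tokens, num_experts)) + skew
    loads = expert_loads(route_top_k(scores, bias, k), num_experts)
    bias = update_bias(bias, loads, step=0.01)
```

In this sketch no extra loss term is involved; only the selection bias changes between batches, which is the sense in which such a scheme would be called auxiliary-loss-free.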

Survey Descent: A Multipoint Generalization of Gradient Descent for Nonsmooth Optimization

Published: 2021/11/30, Updated: 2022/09/27

Categories: Convex Optimization, Nonsmooth Optimization
Tags: active sets, convex optimization, gradient descent, linear convergence, max functions, minimax optimization, multipoint method, nonsmooth optimization, optimal methods, survey descent

For strongly convex objectives that are smooth, the classical theory of gradient descent ensures linear convergence relative to the number of gradient evaluations. An analogous nonsmooth theory is challenging. Even when the objective is smooth at every iterate, the corresponding local models are unstable and the number of cutting planes invoked by traditional remedies is …
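For reference, the classical smooth result mentioned in this abstract can be stated as follows. This is the standard textbook bound, not a statement from the paper; the symbols $\mu$ (strong convexity modulus), $L$ (gradient Lipschitz constant), and the step size $1/L$ are notation assumed here.

```latex
% Gradient descent with constant step 1/L on a \mu-strongly convex,
% L-smooth function f with minimum value f^*:
x_{k+1} = x_k - \tfrac{1}{L}\,\nabla f(x_k),
\qquad
f(x_k) - f^* \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^*\bigr).
```

Each iteration costs one gradient evaluation, so the error contracts by a fixed factor per evaluation, which is what linear convergence means here.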

Disk matrices and the proximal mapping for the numerical radius

Published: 2020/04/29

Categories: Convex Optimization, Semi-definite Programming
Tags: field of values, numerical radius, partial smoothness, proximal mapping, semidefinite programming

Optimal matrices for problems involving the matrix numerical radius often have fields of values that are disks, a phenomenon associated with partial smoothness. Such matrices are highly structured: we experiment in particular with the proximal mapping for the radius, which often maps n-by-n random matrix inputs into a particular manifold of disk matrices that has …
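The objects named in this abstract have standard definitions, collected below for convenience. The notation ($W$, $r$, $\operatorname{prox}_r$) and the use of the Frobenius norm in the proximal mapping are assumptions made here, not details confirmed by the truncated abstract.

```latex
% Field of values and numerical radius of A \in \mathbb{C}^{n \times n}:
W(A) = \{\, x^{*} A x \;:\; x \in \mathbb{C}^{n},\ \|x\|_2 = 1 \,\},
\qquad
r(A) = \max\{\, |z| \;:\; z \in W(A) \,\}.

% Proximal mapping of r (assuming the Frobenius norm):
\operatorname{prox}_{r}(A) = \operatorname*{arg\,min}_{X \in \mathbb{C}^{n \times n}}
\Bigl\{\, r(X) + \tfrac{1}{2}\,\|X - A\|_F^{2} \,\Bigr\}.
```

In this notation, a disk matrix is one whose field of values $W(A)$ is a disk in the complex plane.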