A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models

In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing … Read more

Survey Descent: A Multipoint Generalization of Gradient Descent for Nonsmooth Optimization

For strongly convex objectives that are smooth, the classical theory of gradient descent ensures linear convergence relative to the number of gradient evaluations. An analogous nonsmooth theory is challenging. Even when the objective is smooth at every iterate, the corresponding local models are unstable and the number of cutting planes invoked by traditional remedies is … Read more

Disk matrices and the proximal mapping for the numerical radius

Optimal matrices for problems involving the matrix numerical radius often have fields of values that are disks, a phenomenon associated with partial smoothness. Such matrices are highly structured: we experiment in particular with the proximal mapping for the radius, which often maps n-by-n random matrix inputs into a particular manifold of disk matrices that has … Read more