A Taxonomy of Multi-Objective Alignment Techniques for Large Language Models

Aligning large language models (LLMs) with human preferences has evolved from single-objective reward maximization to sophisticated multi-objective optimization. Real-world deployment requires balancing competing objectives (helpfulness, harmlessness, honesty, instruction-following, and task-specific capabilities) that often conflict. This survey provides a systematic taxonomy of multi-objective alignment techniques, organizing the rapidly growing literature into four categories: (1) Reward Decomposition approaches that factorize monolithic rewards into interpretable components, (2) Multi-Objective Reinforcement Learning methods that explicitly navigate Pareto frontiers, (3) Constraint-Based Alignment techniques that enforce hard constraints on safety and format, and (4) Direct Preference Optimization variants that bypass reward modeling entirely. We analyze 47 representative methods across dimensions including optimization strategy, feedback source, computational cost, and Pareto efficiency. Our analysis reveals that while single-objective methods dominate current practice, multi-objective approaches consistently outperform them when objectives genuinely conflict. We identify key open problems, including automatic objective discovery, dynamic preference adaptation, and theoretical foundations for multi-objective alignment. This taxonomy serves as a roadmap for researchers and practitioners navigating the increasingly complex landscape of LLM alignment.
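To make the contrast with single-objective reward maximization concrete, the sketch below illustrates the simplest strategy under the reward decomposition category: linear scalarization, where separate per-objective reward scores are combined into a single training signal via a preference weight vector. The objective names, scores, and the `scalarize` helper are illustrative assumptions for this sketch, not an implementation of any specific method surveyed here.

```python
# Minimal sketch: linear scalarization of decomposed rewards (assumed values).

# Hypothetical per-objective reward scores for one model response,
# e.g. produced by separate reward models for each alignment objective.
objective_rewards = {
    "helpfulness": 0.82,
    "harmlessness": 0.64,
    "honesty": 0.71,
}

# A deployment- or user-specific preference vector over the same objectives.
preference_weights = {
    "helpfulness": 0.5,
    "harmlessness": 0.3,
    "honesty": 0.2,
}


def scalarize(rewards: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-objective rewards into one scalar via a normalized weighted sum."""
    total_weight = sum(weights.values())
    return sum(weights[name] * rewards[name] for name in rewards) / total_weight


# The resulting scalar can be fed to a standard single-objective optimizer,
# but a fixed weight vector recovers only one point on the Pareto frontier.
print(scalarize(objective_rewards, preference_weights))
```

Varying the weight vector and re-optimizing traces out different Pareto-optimal trade-offs, which is what the multi-objective RL methods in category (2) aim to do more directly.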
