Reward-free alignment for conflicting objectives
6 Pith papers cite this work. Polarity classification is still indexing.
abstract
Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.
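The abstract's remark that weighted loss methods may fail to find an update direction improving every objective can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the two gradient vectors, the equal weights, and the helper names are hypothetical. The second direction uses the standard two-objective min-norm (MGDA-style) combination as a stand-in, not the clipped conflict-averse update proposed in the paper, whose details are not given here.

```python
import numpy as np

def weighted_sum_direction(grads, weights):
    """Descent direction from a fixed-weight sum of the objective gradients."""
    return -(weights @ grads)

def min_norm_direction(g1, g2):
    """Negative of the min-norm point in the convex hull of two gradients.

    This is the closed-form two-objective MGDA-style solution, used here only
    as a stand-in for a conflict-resolving update.
    """
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:
        return -g1  # identical gradients: any convex combination is the same
    # Minimizer of ||a*g1 + (1-a)*g2||^2 over a in [0, 1].
    a = np.clip((g2 @ (g2 - g1)) / denom, 0.0, 1.0)
    return -(a * g1 + (1.0 - a) * g2)

def first_order_changes(grads, direction):
    """First-order change of each objective for a small step along direction."""
    return grads @ direction

# Hypothetical loss gradients of two objectives; they conflict (g1 . g2 < 0).
grads = np.array([[1.0, 0.0],
                  [-3.0, 1.0]])
weights = np.array([0.5, 0.5])

ws_changes = first_order_changes(grads, weighted_sum_direction(grads, weights))
mn_changes = first_order_changes(grads, min_norm_direction(grads[0], grads[1]))

print("weighted sum:", ws_changes)
print("min-norm:    ", mn_changes)
```

In this toy case the equal-weight sum is dominated by the larger gradient, so the first objective's loss increases to first order while the second decreases. The min-norm direction decreases both, which is the kind of simultaneous improvement the abstract refers to.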
citation summary
fields: cs.LG (6)
years: 2026 (6)
verdicts: UNVERDICTED (6)
roles: baseline (1)
polarities: baseline (1)
citing papers explorer
- Efficient Exploration for Iterative Nash Preference Optimization
  An explicitly exploratory iterative NLHF method achieves O(sqrt(T)) regret for Nash equilibria under general preference models, removing the exponential KL dependence that plagues standard iterative approaches.
- Chebyshev Center-Based Direction Selection for Multi-Objective Optimization and Training PINNs
  Update direction selection for PINN training is cast as a Chebyshev-center problem in the dual cone, yielding an efficient dual formulation with nonconvex convergence guarantees and automatic recovery of scale robustness and simultaneous descent.
- Two-Fidelity Best-Action Identification for Stochastic Minimax Tree
  2FFS is a two-fidelity tree-search algorithm for stochastic minimax BAI that proves fixed-confidence correctness, finite stopping, and polynomial cost bounds while using fewer samples than baselines in experiments.
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
  Larger models succeed on rare and complex tasks by reducing gradient interference from common tasks, allowing rare-task features to accumulate, as shown via synthetic task mixtures and OLMo pretraining from 4M to 4B parameters.
- RVPO: Risk-Sensitive Alignment via Variance Regularization
  RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment.
- SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models
  SAW uses coefficient of variation to dynamically reweight objectives in MORL for LLMs, improving training efficiency and performance on tool-calling and summarization tasks under GRPO and GDPO.