arXiv: 2510.01167 · v2 · pith:NW76EOA2new · submitted 2025-10-01 · 💻 cs.LG · cs.AI · cs.CL

Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

classification: cs.LG, cs.AI, cs.CL
keywords: across, alignment, inference, multi-objective, non-verifiable, verifiable, control

Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into a single objective. We ask what it would take to simultaneously align a model across several kinds of domains: those with verifiable rewards, those with non-verifiable subjective preferences, and those involving complex interactive scenarios. In such multi-objective setups, individual objectives are often at odds with each other, resulting in inefficient training and limited user control during inference. To address these issues, we propose Multi-Action-Head ALignment with PRM-guided DecOding (MAHALO), a unified framework that standardizes PRM training across verifiable and non-verifiable settings for step-level supervision, performs vectorized multi-objective alignment with Multi-Action-Head DPO, and enables controllable inference through objective-specific weighting and PRM-guided decoding. Experiments across math reasoning, human values alignment, and multi-turn tutoring show that MAHALO improves multiple objectives simultaneously with limited interference, while remaining generalizable and adaptable across domains and offering flexible user control at inference time. Our code is available at: https://github.com/pearls-lab/multiobj-align.
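
The abstract does not spell out how objective-specific weighting or PRM-guided decoding are computed, so the following is only an illustrative sketch and not the authors' implementation. It assumes, hypothetically, that each objective has its own head producing next-token logits, that user weights are normalized and applied as a weighted sum, and that a PRM returns one score per objective for each candidate step. The names `combine_head_logits`, `softmax`, and `prm_guided_pick` are invented for this example.

```python
import math


def combine_head_logits(head_logits, weights):
    """Weighted sum of per-objective logits (assumed combination rule).

    head_logits: dict mapping objective name -> list of logits over the vocabulary.
    weights: dict mapping objective name -> non-negative user weight.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    vocab_size = len(next(iter(head_logits.values())))
    combined = [0.0] * vocab_size
    for name, logits in head_logits.items():
        w = weights.get(name, 0.0) / total  # normalize so weights sum to 1
        for i, value in enumerate(logits):
            combined[i] += w * value
    return combined


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def prm_guided_pick(candidates, prm_scores, weights):
    """Return the candidate step with the highest weighted PRM score.

    prm_scores: dict mapping candidate -> dict of per-objective scores
    (assumed to come from hypothetical per-objective process reward models).
    """
    def score(candidate):
        return sum(weights.get(k, 0.0) * v for k, v in prm_scores[candidate].items())
    return max(candidates, key=score)


# Toy example: two objectives, a two-token vocabulary, two candidate steps.
heads = {"math": [2.0, 0.0], "values": [0.0, 2.0]}
probs = softmax(combine_head_logits(heads, {"math": 1.0, "values": 1.0}))
steps = {"a": {"math": 0.9, "values": 0.1}, "b": {"math": 0.2, "values": 0.8}}
choice = prm_guided_pick(["a", "b"], steps, {"math": 1.0, "values": 0.0})
```

With equal weights the two toy heads cancel out and both tokens are equally likely; shifting all weight to `math` makes the PRM-guided selection prefer step `a`. The point of the sketch is only that changing the weight vector at inference time changes the output without retraining.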

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Many Preferences, Few Policies: Towards Scalable Language Model Personalization

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    PALM produces a small portfolio of LLMs that contains a near-optimal model for any user preference weight vector, with theoretical bounds on portfolio size and approximation quality.

  2. SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    SURF derives weight sampling rules from the arc-length CDF of the scalarization path to uniformly traverse the Pareto front in multi-objective optimization.