pith. machine review for the scientific record.

arxiv: 2507.21183 · v5 · submitted 2025-07-27 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Authors on Pith: no claims yet
classification: 💻 cs.LG · cs.AI · cs.CL
keywords: optimization · preference · mappo · maximum · posteriori · prior · variants · alignment
Original abstract

As the era of large language models (LLMs) unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a methodology for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. Building on the paradigm of Direct Preference Optimization (DPO) and its variants, which treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO integrates prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants but also enhances alignment by mitigating the oversimplified binary classification of responses. MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. Moreover, MaPPO can be used as a plugin for DPO variants, including the widely used SimPO, IPO, and CPO, yielding consistent improvements. Extensive empirical evaluations across different model sizes and model series on three standard benchmarks (MT-Bench, AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
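The abstract describes DPO's MLE objective and MaPPO's MaP extension in words only, so a minimal sketch may help fix ideas. In the code below, `dpo_loss` is the standard published DPO objective; `map_style_loss`, the `prior_r_*` inputs, and the additive placement of the prior margin are illustrative assumptions about how a prior reward estimate could enter a MaP objective without introducing new hyperparameters, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: maximum-likelihood fit of a Bradley-Terry
    preference model over the policy's implicit rewards."""
    # Implicit reward margin: how much more the policy (relative to the
    # frozen reference model) prefers the chosen over the rejected response.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

def map_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                   prior_r_w, prior_r_l, beta=0.1):
    """Hypothetical MaP-style variant (an illustrative guess, not the
    paper's objective): fold a prior reward estimate for each response
    into the preference margin, so the target is no longer a hard binary
    win/lose label. Reuses beta, so no new hyperparameter is added."""
    policy_margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    prior_margin = prior_r_w - prior_r_l  # prior belief about the reward gap
    # MaP = likelihood x prior; in log space the prior enters additively.
    return -F.logsigmoid(beta * (policy_margin + prior_margin)).mean()

# Toy usage with summed per-response log-probabilities (hypothetical values):
logp_w, logp_l = torch.tensor([-12.3]), torch.tensor([-14.1])
ref_w, ref_l = torch.tensor([-12.0]), torch.tensor([-13.5])
prior_w, prior_l = torch.tensor([0.8]), torch.tensor([0.2])
print(map_style_loss(logp_w, logp_l, ref_w, ref_l, prior_w, prior_l))
```

In this reading, adding the prior margin inside the sigmoid shifts the decision boundary toward the prior's preferred response, which is one plausible way to soften the hard binary labeling the abstract criticizes.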

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

    cs.LG · 2026-03 · unverdicted · novelty 7.0

    ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.

  2. Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training

    cs.CR · 2026-03 · unverdicted · novelty 5.0

    Sol2Vy transfers vulnerability detection from Solidity to Vyper in zero-shot fashion, outperforming prior methods on reentrancy, weak randomness, and unchecked transfers.

  3. Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

    cs.LG · 2026-05 · unverdicted · novelty 3.0

    This work advances reinforcement learning for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.