pith. machine review for the scientific record.

arxiv: 2507.21183 · v5 · submitted 2025-07-27 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Authors on Pith: no claims yet
classification: 💻 cs.LG · cs.AI · cs.CL
keywords: optimization · preference · mappo · maximum · posteriori · prior · variants · alignment
Original abstract

As the era of large language models (LLMs) unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a methodology for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. Building on the paradigm of Direct Preference Optimization (DPO) and its variants, which treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO integrates prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants but also enhances alignment by mitigating the oversimplified binary classification of responses. MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. Moreover, MaPPO can be used as a plugin for DPO variants, including the widely used SimPO, IPO, and CPO, yielding consistent improvements. Extensive empirical evaluations across different model sizes and model series on three standard benchmarks (MT-Bench, AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
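The abstract describes DPO's MLE objective and MaPPO's MaP extension in words only, so a minimal sketch may help fix ideas. In the code below, `dpo_loss` is the standard published DPO objective; `map_style_loss`, the `prior_r_*` inputs, and the additive placement of the prior margin are illustrative assumptions about how a prior reward estimate could enter a MaP objective without introducing new hyperparameters, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: maximum-likelihood fit of a Bradley-Terry
    preference model over the policy's implicit rewards."""
    # Implicit reward margin: how much more the policy (relative to the
    # frozen reference model) prefers the chosen over the rejected response.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

def map_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                   prior_r_w, prior_r_l, beta=0.1):
    """Hypothetical MaP-style variant (an illustrative guess, not the
    paper's objective): fold a prior reward estimate for each response
    into the preference margin, so the target is no longer a hard binary
    win/lose label. Reuses beta, so no new hyperparameter is added."""
    policy_margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    prior_margin = prior_r_w - prior_r_l  # prior belief about the reward gap
    # MaP = likelihood x prior; in log space the prior enters additively.
    return -F.logsigmoid(beta * (policy_margin + prior_margin)).mean()

# Toy usage with summed per-response log-probabilities (hypothetical values):
logp_w, logp_l = torch.tensor([-12.3]), torch.tensor([-14.1])
ref_w, ref_l = torch.tensor([-12.0]), torch.tensor([-13.5])
prior_w, prior_l = torch.tensor([0.8]), torch.tensor([0.2])
print(map_style_loss(logp_w, logp_l, ref_w, ref_l, prior_w, prior_l))
```

In this reading, adding the prior margin inside the sigmoid shifts the decision boundary toward the prior's preferred response, which is one plausible way to soften the hard binary labeling the abstract criticizes.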

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

    cs.LG · 2026-03 · unverdicted · novelty 7.0

    ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.

  2. Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training

    cs.CR · 2026-03 · unverdicted · novelty 5.0

    Sol2Vy transfers vulnerability detection from Solidity to Vyper in zero-shot fashion, outperforming prior methods on reentrancy, weak randomness, and unchecked transfers.

  3. Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

    cs.LG · 2026-05 · unverdicted · novelty 3.0

    This work advances reinforcement learning for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.