AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models

Faqiang Qian; Kang An; Liangjian Wen; Mengya Gao; Weikun Zhang; Xuhui Zheng; Yichao Wu; Yong Dai; Ziliang Wang

arxiv: 2509.25148 · v2 · pith:WEDH3WNYnew · submitted 2025-09-29 · 💻 cs.AI

classification 💻 cs.AI

keywords AAPA, preference, alignment, anchoring, discriminator, expert, post-training, adversarial

Post-training alignment of large language models often combines supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) from preference or verifiable feedback. SFT provides a useful behavioral anchor but can overfit to static demonstrations, whereas RL encourages exploration but may drift from expert behavior or exploit imperfect rewards. We propose AAPA (Adversarially Anchored Preference Alignment), a plug-in framework that augments existing post-training objectives with a sentence-level adversarial anchoring signal. AAPA compares policy rollouts with offline, pre-collected expert responses using a fixed lightweight discriminator, and therefore requires neither online teacher inference nor discriminator co-training during policy optimization. The same anchoring term can be added to SFT, GRPO, and CHORD while preserving their original training pipelines. Experiments on instruction-following benchmarks show that AAPA consistently improves the corresponding base objectives across model scales. In particular, the staged AAPA configuration improves over a strong GRPO baseline by 5.77% on Qwen3-0.6B and 3.75% on Qwen3-4B. Further analyses on response length, log-probability distributions, and discriminator variants suggest that adversarial anchoring provides a stable semantic grounding signal for preference optimization. Code is available at https://github.com/IsFaqq/AAPA.
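
The abstract does not give the form of the anchoring term, so the following is only an illustrative sketch of one way a fixed discriminator's score could be folded into a GRPO-style reward. It is not the authors' implementation. The function names, the `log D(y)` bonus, the `weight` coefficient, and the example numbers are all assumptions made here for illustration; the discriminator is represented simply as a list of precomputed probabilities that each rollout resembles an expert response.

```python
import math
from statistics import mean, pstdev


def anchored_rewards(base_rewards, disc_probs, weight=0.1, eps=1e-6):
    """Add a hypothetical anchoring bonus, weight * log D(y), to each base reward.

    disc_probs are scores in (0, 1] from a frozen discriminator (assumed here),
    so the bonus is never positive and penalizes rollouts that look unlike
    expert responses.
    """
    return [
        r + weight * math.log(max(p, eps))
        for r, p in zip(base_rewards, disc_probs)
    ]


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: task rewards plus made-up discriminator scores.
base = [1.0, 1.0, 0.0, 0.0]
disc = [0.9, 0.2, 0.8, 0.1]

shaped = anchored_rewards(base, disc, weight=0.5)
adv = group_advantages(shaped)
```

In this toy setup, two rollouts with the same task reward receive different advantages once the anchoring bonus is added, with the one the discriminator rates as closer to expert text ranked higher. Because the discriminator scores are precomputed inputs, nothing in the sketch trains or updates the discriminator, which mirrors the abstract's statement that no co-training is needed.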

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs
cs.AI · 2026-06 · unverdicted · novelty 4.0

E³RL uses dynamic thresholds on epistemic entropy from autoregressive cross-entropy to enable erasable RL in LLM reasoning, reporting 5.349% and 6.514% gains on AIME for 4B and 8B models over prior SOTA.