DPO Meets PPO: Reinforced Token Optimization for RLHF. arXiv preprint arXiv:2404.18922
7 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED (7)
Representative citing papers
citing papers explorer
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.
- LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO
  LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.
- Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection
  VEPO improves RL for visual reasoning by multiplicatively coupling visual sensitivity with token entropy, outperforming entropy-only baselines by 2.28 points at 7B and 3.15 points at 3B scale.
- LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition
  LC-ERD frames LLM self-alignment as latent structure mining via a Variational Logic Potential and Multi-Agent Value Decomposition to provide granular, logic-consistent supervision.
- Leveraging RAG for Training-Free Alignment of LLMs
  RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.
- Data Agent: Learning to Select Data via End-to-End Dynamic Optimization
  Data Agent learns a co-evolving sample selection policy end-to-end that accelerates training by over 50% on ImageNet-1k and MMLU with no performance loss.
- SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training
  SCOPE-RL adds a regularization term built from high-temperature positive samples to quantitatively control entropy dynamics and maintain exploration in RL post-training of reasoning LLMs.