Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

2025 · arXiv 2503.12854

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.

Enhancing LLM Metacognition via Cognitive Pairwise Training

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Short GRPO warm-up followed by offline DPO on informative rollouts matches or beats full GRPO on math reasoning benchmarks at substantially lower compute cost.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization

cs.IR · 2026-05-22 · unverdicted · novelty 5.0

TPMM-DPO applies trajectory-aware learned-weight merging of prior policy models to stabilize iterative DPO against preference noise accumulation.

LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models

cs.CL · 2026-04-26 · unverdicted · novelty 5.0

LegalDrill uses diagnosis-driven synthesis and self-reflective verification to create high-quality training data that improves small language models' legal reasoning without expert annotations.

Sample-efficient LLM Optimization with Reset Replay

cs.LG · 2025-08-08 · unverdicted · novelty 5.0

LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.
