IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.
Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
7 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: 7 unverdicted
Representative citing papers
CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.
Short GRPO warm-up followed by offline DPO on informative rollouts matches or beats full GRPO on math reasoning benchmarks at substantially lower compute cost.
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
TPMM-DPO applies trajectory-aware learned-weight merging of prior policy models to stabilize iterative DPO against preference noise accumulation.
LegalDrill uses diagnosis-driven synthesis and self-reflective verification to create high-quality training data that improves small language models' legal reasoning without expert annotations.
LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.
IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning