DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
arXiv preprint arXiv:2512.05111 , year=
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.
citing papers explorer
-
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
-
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
-
LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.