A derived scaling law R(N) = 1/(1 + c(N-1)N^{-β}) fits answer diversity and correctness across 44 LLM multi-agent conditions with R² > 0.99, classifying regimes by β and showing only heterogeneous teams escape hard-ceiling saturation.
DSDR: Dual-scale diversity regularization for exploration in LLM reasoning.arXiv preprint arXiv:2602.19895
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5representative citing papers
RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
REFT improves Pass@1/8/64 in RLVR by uniform first-token sampling from top-N candidates across 0.5B-7B models and multiple difficulty levels.
EVE-Agent adds an evidence verifier to the proposer-solver loop that rewards spans by marginal accuracy gain, producing self-generated but inspectable training examples for search agents.
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.
citing papers explorer
-
The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size
A derived scaling law R(N) = 1/(1 + c(N-1)N^{-β}) fits answer diversity and correctness across 44 LLM multi-agent conditions with R² > 0.99, classifying regimes by β and showing only heterogeneous teams escape hard-ceiling saturation.
-
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
-
Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR
REFT improves Pass@1/8/64 in RLVR by uniform first-token sampling from top-N candidates across 0.5B-7B models and multiple difficulty levels.
-
EVE-Agent: Evidence-Verifiable Self-Evolving Agents
EVE-Agent adds an evidence verifier to the proposer-solver loop that rewards spans by marginal accuracy gain, producing self-generated but inspectable training examples for search agents.
-
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.