DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
Efficient rl training for reasoning models via length-aware optimization
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
SxS Interleaved Reasoning learns when to disclose partial reasoning during generation and improves accuracy versus content-latency trade-offs on math and science benchmarks.
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
citing papers explorer
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning
SxS Interleaved Reasoning learns when to disclose partial reasoning during generation and improves accuracy versus content-latency trade-offs on math and science benchmarks.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.