{"total":16,"items":[{"citing_arxiv_id":"2606.24320","ref_index":268,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ZONOS2 Technical Report","primary_cat":"cs.SD","submitted_at":"2026-06-23T08:57:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21943","ref_index":231,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning","primary_cat":"cs.LG","submitted_at":"2026-06-20T08:20:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20295","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Token-Operations-Oriented Inference Optimization Techniques for Large Models","primary_cat":"cs.SE","submitted_at":"2026-06-18T14:33:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper introduces a four-layer technical architecture for token-operations-oriented inference optimization in large models and reviews key technologies and industry status at each layer.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"instruction-following model 
first generates a high-level solution outline and a reasoning model subsequently expands it into a detailed solution, achieving a 22.3% reduction in token consumption across three major benchmark datasets [87]. The NAT framework incorporates token budget as a primary optimization objective during reinforcement learning and achieves performance comparable to full-sequence GRPO training while using only 50% of the tokens [88]. The survey Stop Overthinking, published by Rice University, provides a systematic review of this emerging field and categorizes existing approaches into three major directions: model-based optimization, reasoning-output-based optimization, and input-prompt-based optimization [89]. However, most existing efficient reasoning methods adopt a one-size-fits-all compression strategy, uniformly reducing"},{"citing_arxiv_id":"2606.18089","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning","primary_cat":"cs.LG","submitted_at":"2026-06-16T15:55:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a hierarchical latent selection model showing SFT supplies raw module materials in compound traces while RL decomposes them to identify atomic modules and enable recombination for new reasoning configurations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03077","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Libra: Efficient Resource Management for Agentic RL Post-Training","primary_cat":"cs.LG","submitted_at":"2026-06-02T03:09:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Libra optimizes GPU allocation across rollout and training in 
agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward convergence on 48 A800 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01249","ref_index":243,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Trust Region On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-31T14:04:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22211","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CLORE: Content-Level Optimization for Reasoning Efficiency","primary_cat":"cs.AI","submitted_at":"2026-05-21T09:16:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19358","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-19T04:41:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CES applies conditional bidirectional entropy control on top of DAPO to improve 
accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09806","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-10T23:05:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"[11] Junyi Chen, Chuheng Du, Renyuan Liu, Shuochao Yao, Dingtian Yan, Jiang Liao, Shengzhong Liu, Fan Wu, and Guihai Chen. Tokenflow: Responsive llm text streaming serving under request burst via preemptive scheduling.arXiv preprint arXiv:2510.02758, 2025. 10 [12] Daman Arora and Andrea Zanette. Training language models to reason efficiently.arXiv preprint arXiv:2502.04463, 2025. [13] Violet Xiang, Chase Blagden, Rafael Rafailov, Nathan Lile, Sang Truong, Chelsea Finn, and Nick Haber. Just enough thinking: Efficient reasoning with adaptive length penalties reinforcement learning.arXiv preprint arXiv:2506.05256, 2025. [14] Pranjal Aggarwal and Sean Welleck. 
L1: Controlling how long a reasoning model thinks with reinforcement learning."},{"citing_arxiv_id":"2605.08441","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards","primary_cat":"cs.LG","submitted_at":"2026-05-08T20:03:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1) optimizes one of two natural degrees of freedom: either deciding which prompts should receive rollouts [40, 43, 23] or deciding how long each rollout should continue [37]. Imposing a fixed length cap on every rollout is strictly suboptimal: aggressive truncation rewards the policy for committing to its first guess and silently eliminates the long chains of thought that underlie frontier reasoning performance [36, 41, 40]. In this work we coordinate the prompt-level decision of how many rollouts to draw with the within-rollout decision of when to stop them, all under a single shared compute budget. This simple coordination delivers substantial gains over either dimension alone. 
Classical statistics offers a clean solution to precisely this two-decision budget-allocation problem."},{"citing_arxiv_id":"2605.05365","ref_index":227,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ZAYA1-8B Technical Report","primary_cat":"cs.AI","submitted_at":"2026-05-06T18:44:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27039","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling","primary_cat":"cs.CL","submitted_at":"2026-04-29T17:09:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"length reward as r^{len}_t = −(1−γ), t = 0, ..., L−1, (20) with terminal reward r^{len}_L = 0. The corresponding discounted return from state s_t is then G^{len}_t = ∑_{i=0}^{L−t} γ^i r^{len}_{t+i} = −(1−γ^{L−t}). (21) Therefore the length value function under policy π is V^{len}_π(s_t) ≜ E_π[G^{len}_t ∣ s_t]. (22) This is exactly the quantity LenVM is trained to estimate. In this sense, LenVM is not merely correlated with generation length. 
It is a value function for a well-defined token-level objective in which each additional decoding step incurs a constant discounted cost. This interpretation is useful because it makes explicit what happens if LenVM is inserted directly into RL"},{"citing_arxiv_id":"2603.08659","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning","primary_cat":"cs.CL","submitted_at":"2026-03-09T17:37:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CODA uses rollout-based difficulty signals to drive two gates that penalize verbosity on easy instances and promote deliberation on hard ones, cutting token use over 60% on simple tasks while maintaining accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.09953","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning","primary_cat":"cs.CL","submitted_at":"2026-02-10T16:40:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ATTNPO guides process-supervised RL with intrinsic attention signals to shorten reasoning traces while raising accuracy on nine benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.11340","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-01-16T14:38:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NCoTS treats chain-of-thought reasoning as a search problem 
and uses a dual-factor heuristic to find paths that are over 3.5% more accurate and 22% shorter on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.19995","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Schoenfeld's Anatomy of Mathematical Reasoning by Language Models","primary_cat":"cs.CL","submitted_at":"2025-12-23T02:44:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ThinkARM abstracts LLM reasoning traces into Schoenfeld episodes and shows that exploration steps correlate with correctness while efficiency methods selectively suppress evaluative feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}