{"total":11,"items":[{"citing_arxiv_id":"2605.23074","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-21T22:13:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PathCal calibrates reasoning paths by type-aware soft rebalancing of reflection-marker logits at uncertain states, yielding better efficiency-performance trade-offs on six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12207","ref_index":38,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Not How Many, But Which: Parameter Placement in Low-Rank Adaptation","primary_cat":"cs.LG","submitted_at":"2026-05-12T14:46:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[36] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021. [37] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022. [38] Uijeong Jang, Jason D Lee, and Ernest K Ryu. Lora training in the ntk regime has no spurious local minima. In International Conference on Machine Learning, pages 21306-21328. PMLR, 2024. [39] Maxwell Jia. Aime problem set 2024, 2024. URL https://huggingface.co/datasets/ Maxwell-Jia/AIME_2024. [40] Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei"},{"citing_arxiv_id":"2605.11854","ref_index":24,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-12T09:39:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effective capabilities.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"9 are tuned via a grid search over {0.1,0.2,0.3} and {1,2} , respectively, for each task. Specific hyperparameters, training details, and evaluation setups are detailed in Appendix. Evaluation Tasks.We conduct experiments on five main tasks grouped into three categories: (1) Mathematical reasoning: GSM8K [ 22] and MATH500 [ 23]; (2)Code generation: HumanEval [24] and MBPP [25]; (3)Instruction following: IFEval [ 26], which evaluates the model's ability to follow verifiable instructions. Note that for each trained model, we evaluate both its in-domain and out-of-distribution (OOD) performance. For instance, for a model trained on the code generation dataset, HumanEval and MBPP serve as in-domain evaluations, while the remaining three tasks are"},{"citing_arxiv_id":"2605.11290","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ReAD: Reinforcement-Guided Capability Distillation for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-11T22:17:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Zhenyuan Yang, Haoran Lu, Xilin Gong, Yufang Liu, Terry Ma, Wei Ruan, Ali Abbasi, Jing Zhang, Tao Wang, Ehsan Latif, Wei Liu, Wei Zhang, Soheil Kolouri, Xiaoming Zhai, Dajiang Zhu, Wenxuan Zhong, Tianming Liu, and Ping Ma. Knowledge distillation and dataset distillation of large language models: Emerging trends, challenges, and future directions.arXiv preprint arXiv:2504.14772, 2025. [11] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021. [12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. 10 [13] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander"},{"citing_arxiv_id":"2605.10405","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization","primary_cat":"cs.LG","submitted_at":"2026-05-11T11:43:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"practitioners often encounter the challenge of choosing the best-performing model and configuration- such as prompts and hyperparameters-for a specific task. A common practice for that purpose is to compare the performances of the models on fixed benchmarks using a score function. For example, consider the task of mathematical problem solving, as studied in benchmarks such as MMLU [ 4] and MATH [5]. In this case, a score function can be a binary function that indicates whether the model's solution is correct. Unfortunately, exhaustively evaluating every model on every example is resource-intensive: a single evaluation on the GAIA benchmark [6] can cost up to $2,829 [7], and the recent Holistic Agent Leaderboard evaluation spent roughly $40,000 to compare 9 models across 9"},{"citing_arxiv_id":"2604.10701","ref_index":49,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-12T15:54:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23629","ref_index":18,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Emergent Slow Thinking in LLMs as Inverse Tree Freezing","primary_cat":"cs.AI","submitted_at":"2025-09-28T04:10:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RLVR drives a concept network in LLMs through nucleation and freezing into inverse trees that support slow thinking, and intervening with brief SFT at peak frustration outperforms standard RLVR while post-freeze SFT causes forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.05015","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SPaCe: Unlocking Sample-Efficient Large Language Models Training With Self-Pace Curriculum Learning","primary_cat":"cs.LG","submitted_at":"2025-08-07T03:50:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPaCe uses semantic clustering to shrink training sets and a multi-armed bandit to adaptively select samples, matching or beating baselines on reasoning benchmarks with up to 100x fewer examples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.01937","ref_index":44,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RewardBench 2: Advancing Reward Model Evaluation","primary_cat":"cs.CL","submitted_at":"2025-06-02T17:54:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.15134","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2025-05-21T05:39:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.17452","ref_index":12,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving","primary_cat":"cs.CL","submitted_at":"2023-09-29T17:59:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}