{"total":10,"items":[{"citing_arxiv_id":"2605.21630","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis","primary_cat":"cs.AI","submitted_at":"2026-05-20T18:40:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MindLoom synthesizes frontier-level reasoning data by decomposing solutions into thought mode chains, training a retrieval model for mode selection, composing new problems with distribution-aligned sampling, and applying rollout-based difficulty labeling for fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15549","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-15T02:42:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CTF4Nuclear proposes a common task framework for benchmarking ML methods on nuclear engineering datasets using 12 metrics and a new sparse-measurement system monitoring paradigm.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05949","ref_index":24,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System","primary_cat":"cs.AI","submitted_at":"2026-05-07T09:57:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAS-Algorithm is a multi-agent workflow that improves AI acceptance rates on algorithmic problems by 6.48% on average, outperforming parameter-efficient 
fine-tuning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"AlphaCode[7], has demonstrated the feasibility of LLM for code generation tasks. Now, people are gradually extending code generation to the entire software engineering workflow[18, 19], using LLM from the agent's perspective for code generation[20], debugging[21, 22], and repairing[23]. Algorithmic Problems. Classic benchmarks for algorithmic scenarios also include MATH[24], CodeContests[25], APPS[26], and classic methods also include DeepCoder[27], AlphaEvolve[28]. Multi-Agent Systems. Multi-agent frameworks such as CAMEL[29], MetaGPT[30], AutoGen[31], and AgentVerse[32] have gained increasing attention, and are applied across diverse scenarios[33, 34, 35], including code generation[36, 22] and software engineering tasks[37, 38]."},{"citing_arxiv_id":"2509.16591","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature","primary_cat":"cs.CL","submitted_at":"2025-09-20T09:30:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.03800","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"STELLA: A Multimodal LLM for Protein Functional Annotation via Unified Sequence-Structure Encoding","primary_cat":"q-bio.BM","submitted_at":"2025-06-04T10:09:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"STELLA aligns ESM3 bimodal sequence-structure 
encodings with Llama-3.1-8B text modeling to claim state-of-the-art results on protein functional description prediction and enzyme-catalyzed reaction prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.24864","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-05-30T17:59:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.09686","ref_index":50,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-01-16T17:37:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"like MATH-401 [177] evaluate pure arithmetic capabilities through 401 carefully structured expressions, while MultiArith [116] and AddSub [51] assess the ability to translate simple word problems into mathematical operations (such as addition or subtraction). 
Moving to elementary and high school levels, comprehensive datasets such as GSM8K [24] and MATH [50] present more sophisticated multi-step reasoning challenges, with GSM8K offering 8.5K grade school problems and MATH providing 12.5K problems across various mathematical domains with graduated difficulty levels. The evaluation of advanced mathematical capabilities is primarily conducted through competition and specialized test datasets. Collections like CHAMP [92] and ARB [5] present competition-level"},{"citing_arxiv_id":"2501.09775","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong","primary_cat":"cs.CL","submitted_at":"2025-01-16T10:27:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reasoning before answering MCQs increases LLM confidence more for incorrect answers and degrades calibration on a 57-subject benchmark across seven models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.21787","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Large Language Monkeys: Scaling Inference Compute with Repeated Sampling","primary_cat":"cs.LG","submitted_at":"2024-07-31T17:57:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"[26] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 
Measuring mathematical problem solving with the math dataset, 2021. [27] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically, 2017. URL https://arxiv.org/abs/1712.00409. 15 [28] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre."},{"citing_arxiv_id":"2405.14782","ref_index":279,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Lessons from the Trenches on Reproducible Evaluation of Language Models","primary_cat":"cs.CL","submitted_at":"2024-05-23T16:50:49+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}