{"total":13,"items":[{"citing_arxiv_id":"2606.27981","ref_index":255,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ToxiREX: A Dataset on Toxic REasoning in ConteXt","primary_cat":"cs.CL","submitted_at":"2026-06-26T11:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08169","ref_index":39,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning","primary_cat":"cs.RO","submitted_at":"2026-06-06T13:33:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CLASP combines TP-KMPs with VLMs for language-guided skill selection, covariance-weighted composition, and active learning requests, reporting 73.3-100% success on a 7-DoF manipulator.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06840","ref_index":130,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces","primary_cat":"cs.CL","submitted_at":"2026-06-05T02:32:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard 
distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02211","ref_index":51,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Consistency Training while Mitigating Obfuscation via Rate Matching","primary_cat":"cs.CL","submitted_at":"2026-06-01T13:10:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19723","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges","primary_cat":"cs.CL","submitted_at":"2026-05-19T11:56:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A literature survey synthesizing benchmarks, architectures, training strategies, and evaluation methods for mathematical reasoning in LLMs, based on roughly 120 papers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10663","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents","primary_cat":"cs.AI","submitted_at":"2026-05-11T14:43:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution 
tasks in ALFWorld and Mind2Web.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"reusable experience patterns directly into model parameters, it achieves remarkable performance gains over standard baselines on both seen and unseen tasks, even in the absence of test-time experience accumulation. Our code is available at https://github.com/Fanzy27/Evolving-RL. 1 Introduction Large language models (LLMs) have demonstrated remarkable capabilities across a broad range of tasks, including complex reasoning [9, 17, 11, 28] and autonomous agent decision-making [26, 8, 35]. However, once trained, LLMs are largely static: they lack the ability to continually adapt themselves to the complex out-of-distribution environments and tasks encountered during deployment. This fundamental limitation has motivated a growing body of research into test-time self-evolution [3, 31,"},{"citing_arxiv_id":"2605.09100","ref_index":16,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression","primary_cat":"cs.CL","submitted_at":"2026-05-09T18:15:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Jie Feng, Chen Gao, and Yong Li. Toward large reasoning models: A survey of reinforced reasoning with large language models. Patterns, 6(10):101370, 2025. ISSN 2666-3899. doi: https://doi.org/10.1016/j.patter.2025.101370. URL https://www.sciencedirect.com/science/article/pii/S2666389925002181. 
[16] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural"},{"citing_arxiv_id":"2605.08486","ref_index":25,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Teachers' Perceived Benefits and Risks of AI Across Fifty-Five Countries: An Audit of LLM Alignment and Steerability","primary_cat":"cs.CY","submitted_at":"2026-05-08T21:03:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Teachers' views on AI benefits and risks vary widely across 55 countries, but LLMs compress these differences, overestimate both sides, and show little improvement from country prompting or better reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Transparency (Seoul, Republic of Korea) (FAccT '22). Association for Computing Machinery, New York, NY, USA, 1859-1876. doi:10.1145/3531146.3533233 [24] Zhuoren Jiang, Biao Huang, Jianan Ge, Chenxi Lin, Yueqian Xu, and Jianxing Yu. 2025. Simulating social perception with large language models: perceptions of China's common prosperity. Journal of Chinese Governance (2025), 1-29. [25] Alexander John Karran, Patrick Charland, Joé Trempe-Martineau, Ana Ortiz de Guinea Lopez de Arana, Anne-Marie Lesage, Sylvain Senecal, and Pierre-Majorique Leger. 2025. Multi-stakeholder perspective on responsible artificial intelligence and acceptability in education. npj Science of Learning 10, 1 (2025), 44. 
[26] Mehdi Khamassi, Marceau Nahon, and Raja Chatila."},{"citing_arxiv_id":"2605.07776","ref_index":15,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Tracing Uncertainty in Language Model \"Reasoning\"","primary_cat":"cs.LG","submitted_at":"2026-05-08T14:16:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Representations. OpenReview, 2025. doi: 10.48550/arXiv.2504.19483. [14] S. C. Hora. Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management. Reliability Engineering & System Safety, 54(2):217-223, 1996. ISSN 0951-8320. doi: 10.1016/S0951-8320(96)00077-4. Treatment of Aleatory and Epistemic Uncertainty. [15] J. Huang and K. C.-C. Chang. Towards Reasoning in Large Language Models: A Survey. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 1049-1065, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.67. [16] E. 
Hüllermeier and W."},{"citing_arxiv_id":"2507.15707","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?","primary_cat":"cs.CL","submitted_at":"2025-07-21T15:15:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LLM accuracy on reasoning tasks differs significantly by question type, with step-by-step reasoning accuracy often uncorrelated to final answer selection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09567","ref_index":290,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-03-12T17:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"space to save computational resources and improve inference performance. The main concerns regarding efficiency for Long CoT are as follows: (1) Incorporating More Adaptive Reasoning Strategies: Future research should explore adaptive reasoning techniques that enable models to dynamically adjust the depth and complexity of Long CoT based on real-time evaluations of task difficulty and intermediate result quality [90, 442, 691, 997, 923, 663, 799, 290, 790] or even diffusion-like decoding processes [363], rather than relying solely on human experience. 
(2) Leveraging efficient reasoning format: Another promising direction involves integrating multimodal, latent space, or other efficient reasoning formats to express logic more effectively [125, 662, 800]. For example, abstract geometric images or indescribable sounds, which require extensive"},{"citing_arxiv_id":"2411.15594","ref_index":53,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Survey on LLM-as-a-Judge","primary_cat":"cs.CL","submitted_at":"2024-11-23T16:03:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"specialized knowledge structures. Wang et al. [153] present BIORAG, enhancing vector retrieval with hierarchical knowledge structures and a self-aware evaluated retriever. Li et al. [77] introduce DALK, combining an LLM with a continuously evolving Alzheimer's Disease knowledge graph, using self-aware knowledge retrieval for noise filtering. Jeong et al. [53] propose Self-BioRAG, adapting RAG principles to biomedical applications Liu et al. [92], with LLMs selecting the best evidence for answer generation. Within NLP, especially for tasks such as text generation, reasoning and retrieval, LLM-as-a-Judge enables flexible, scalable, and human-aligned evaluation. 
However, the open-endedness and diversity of NLP tasks (such as dialog or story generation) mean that the requirements for judgment"},{"citing_arxiv_id":"2312.08935","ref_index":63,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations","primary_cat":"cs.AI","submitted_at":"2023-12-14T13:41:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}