{"total":16,"items":[{"citing_arxiv_id":"2605.19276","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenCompass: A Universal Evaluation Platform for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-19T02:50:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16517","ref_index":5,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Customizing an LLM for Enterprise Software Engineering","primary_cat":"cs.SE","submitted_at":"2026-05-15T18:11:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Gemini for Google, customized via continued pre-training on proprietary Google engineering data, delivers measurable productivity gains in a large internal developer study.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13171","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics","primary_cat":"cs.AI","submitted_at":"2026-05-13T08:33:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18801","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM 
Performance","primary_cat":"cs.AI","submitted_at":"2026-05-11T11:44:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The authors propose creating data probes—synthetic sequences from defined random processes—to reveal how data properties drive LLM behavior across workflow stages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09860","ref_index":12,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-11T01:43:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684-1704, 2025. [10] François Chollet. On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019. [11] Francois Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. Arc prize 2024: Technical report.arXiv preprint arXiv:2412.04604, 2024. [12] François Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. Arc prize 2025: Technical report, 2026. URLhttps://arxiv.org/abs/2601.10904. [13] Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition.Journal of artificial intelligence research, 13:227-303, 2000. 
[14] Richard E Fikes, Peter E Hart, and Nils J Nilsson."},{"citing_arxiv_id":"2605.08498","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-08T21:28:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"producing direct contamination evidence on AIME 2024. GSM-SYMBOLIC [44] shows that even surface symbolic re-skinning of GSM8K [14] can move accuracy by tens of points. More recently, MATHDUELS [62] lets difficulty co-evolve with participants by casting evaluation as self-play between authors and solvers, FRONTIERCS [42] curates open-ended tasks from expert programmers, and the ARC-AGI series [11, 12] pursues memorization-resistant puzzles by design. Together, these efforts mark a shift away from one-shot benchmarks and toward evaluation infrastructure that is expected to keep producing harder problems as models improve and inspire our benchmark in this way. 
Tool-use evaluation benchmarks.Tool use is now a frontier capability in its own right."},{"citing_arxiv_id":"2605.02442","ref_index":159,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measuring AI Reasoning: A Guide for Researchers","primary_cat":"cs.AI","submitted_at":"2026-05-04T10:42:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18839","ref_index":162,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models","primary_cat":"cs.LG","submitted_at":"2026-04-20T21:06:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05568","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Tools and Persons: Who Are They? 
Classifying Robots and AI Agents for Proportional Governance","primary_cat":"cs.ET","submitted_at":"2026-04-07T08:08:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A CPST-based taxonomy sorts autonomous systems into Confined Actors, Socially-Aware Interactors, and CPST-Integrated Agents to enable proportional governance from enhanced liability to qualified personhood.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.27134","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Factorization Regret mediates compositional generalization in latent space","primary_cat":"cs.LG","submitted_at":"2026-03-28T04:46:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The Journal of Neuroscience, 36:7817-7828, 2016. URL https://api.semanticscholar.org/CorpusID:9673546. [17] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019. [18] Francois Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. Arc prize 2024: Technical report. ArXiv, abs/2412.04604, 2024. URL https://api.semanticscholar.org/CorpusID:274581906. [19] Yogita Chudasama and Trevor William Robbins. Dissociable contributions of the orbitofrontal and infralimbic cortex to pavlovian autoshaping and discrimination reversal learning: Further evidence for the functional heterogeneity of the rodent frontal cortex. The Journal of Neuroscience, 23:8771-8780, 2003. URL https://api.semanticscholar. 
org/CorpusID:8018743."},{"citing_arxiv_id":"2510.07632","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models","primary_cat":"cs.AI","submitted_at":"2025-10-09T00:00:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces group matching score for better evaluation of compositional reasoning and Test-Time Matching (TTM) algorithm for unsupervised self-improvement in multimodal models, achieving SOTA gains including surpassing GPT-4.1 and estimated human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23108","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Artificial Phantasia: Emergent Mental Imagery in Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-09-27T04:36:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.21734","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hierarchical Reasoning Model","primary_cat":"cs.AI","submitted_at":"2025-06-26T19:39:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples without pre-training or CoT 
supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.11831","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems","primary_cat":"cs.AI","submitted_at":"2025-05-17T04:34:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARC-AGI-2 adds a larger, more complex set of tasks to the original ARC-AGI benchmark to give finer-grained measurement of fluid intelligence in AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09567","ref_index":135,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-03-12T17:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":",Self-Critique [645], SelfCheck [549], Self-Verification [848], Critic-CoT [1109], Verifier [141], STaR[1012],ReST [225],Critic [216], ReFT [745], AceCoder [1014], DeepSeek-R1 [227], Critic-RM [991], o1-coder [1076], RM [536], Logic-RL [886], Self-Contrast [1062],AGSER [484],DeepSeek-Math [658],etc. 
Process Feedbacke.g.,ReAct [956],Reflexion [669],Math-Minos [196],Math-Shepherd [792], ER-PRM [1036],Eurus [1001],PAD [219], PA Vs [651], CTRL [891], QwQ [731], Skywork-o1[570], AceMath [500], PRIME [143],AURORA [710], RewardAgent [594], PDS [910], COT STEP [135], Step-DPO [365], ORPS [996],etc. Hybrid Feedbacke.g.,Step-KTO [454], Zhang et al. [1078],etc. Refinement (§5.2) Prompt-based Refine-ment Generation e.g.,Reflexion [669], SelfCheck [549], Self-Critique [645], Self-Refine [539], Refiner [590], MCTSr[1033], ReST-MCTS* [1032],LLM2 [930], GLoRe [238],Yang et al. [950],RR-MP [240], PAD [578], De-CRIM [186],StepCo [868], BackMATH [1055], ReARTeR [703], START [289], PHP [1101],etc."},{"citing_arxiv_id":"2501.14249","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Humanity's Last Exam","primary_cat":"cs.LG","submitted_at":"2025-01-24T05:27:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We emphasize that future models should not only do better in terms of accuracy, but also strive to be compute-optimal. 5 Discussion Future Model Performance.While current LLMs achieve very low accuracy on HLE, recent history shows benchmarks are quickly saturated - with models dramatically progressing from near-zero to near-perfect performance in a short timeframe [13, 45]. Given the rapid pace of AI development, it is plausible that models could exceed 50% accuracy on HLE by the end of 2025. 
High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or \"artificial general intelligence."}],"limit":50,"offset":0}