{"total":22,"items":[{"citing_arxiv_id":"2605.13634","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Europe and the Geopolitics of AGI: The Need for a Preparedness Plan","primary_cat":"cs.CY","submitted_at":"2026-05-13T15:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"AGI may arrive by 2030-2040 and reshape global power balances, requiring Europe to close gaps in compute, talent retention, industrial adoption, and unified policy responses through a coordinated preparedness agenda.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12673","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack","primary_cat":"cs.AI","submitted_at":"2026-05-12T19:22:45+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12015","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces","primary_cat":"cs.CR","submitted_at":"2026-05-12T12:03:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10906","ref_index":8,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DataMaster: Data-Centric Autonomous AI Research","primary_cat":"cs.LG","submitted_at":"2026-05-11T17:46:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GPQA gain over the base instruct model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08904","ref_index":142,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07073","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TeamBench: Evaluating Agent Coordination under Enforced Role Separation","primary_cat":"cs.AI","submitted_at":"2026-05-08T00:48:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07039","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents","primary_cat":"cs.LG","submitted_at":"2026-05-07T23:38:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02661","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AcademiClaw: When Students Set Challenges for AI Agents","primary_cat":"cs.AI","submitted_at":"2026-05-04T14:40:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00347","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-01T02:05:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22230","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On Benchmark Hacking in ML Contests: Modeling, Insights and Design","primary_cat":"econ.GN","submitted_at":"2026-04-24T05:07:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In a game-theoretic model of ML contests, low-type contestants engage in benchmark hacking while high-types focus on creative effort, with more skewed rewards improving overall outcomes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19341","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluation-driven Scaling for Scientific Discovery","primary_cat":"cs.LG","submitted_at":"2026-04-21T11:24:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17406","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale","primary_cat":"cs.AI","submitted_at":"2026-04-19T12:26:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14116","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration","primary_cat":"cs.AI","submitted_at":"2026-04-15T17:38:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12290","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization","primary_cat":"cs.AI","submitted_at":"2026-04-14T05:02:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limited success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12102","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks","primary_cat":"cs.AI","submitted_at":"2026-04-13T22:22:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Spatial Atlas implements compute-grounded reasoning via a structured scene graph engine and deterministic computations to deliver competitive accuracy on spatial QA and Kaggle ML benchmarks while preserving interpretability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10718","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?","primary_cat":"cs.AI","submitted_at":"2026-04-12T16:28:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09791","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pioneer Agent: Continual Improvement of Small Language Models in Production","primary_cat":"cs.AI","submitted_at":"2026-04-10T18:13:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06169","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"In-Place Test-Time Training","primary_cat":"cs.LG","submitted_at":"2026-04-07T17:59:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05912","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks","primary_cat":"cs.CL","submitted_at":"2026-04-07T14:15:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04532","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation","primary_cat":"cs.CL","submitted_at":"2026-04-06T08:54:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09567","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-03-12T17:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024. [68] Hyeong Soo Chang. On the convergence rate of mcts for the optimal value estimation in markov decision processes. IEEE Transactions on Automatic Control, pages 1-6, February 2025. doi: 10.1109/TAC.2025.3538807. URL https://ieeexplore.ieee.org/ document/10870057. [69] Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025. [70] Andong Chen, Yuchen Song, Wenxin Zhu, Kehai Chen, Muyun Yang, Tiejun Zhao, et al. Evaluating o1-like llms: Unlocking reasoning for translation through comprehensive analysis."},{"citing_arxiv_id":"2501.14249","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Humanity's Last Exam","primary_cat":"cs.LG","submitted_at":"2025-01-24T05:27:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"reference point for assessing these capabilities, we publicly release a large number of2,500 questions from HLE to enable this precise measurement, while maintaining a private test set to assess potential model overfitting. 2 Related Work LLM Benchmarks.Benchmarks are important tools for tracking the rapid advancement of LLM capabilities, including scientific [11, 13, 22, 30, 31, 45, 49, 55, 63] and mathematical reasoning [14, 18-20, 23, 32, 46, 52], code generation [7, 10-12, 21, 27, 62], and general-purpose human assistance [1, 8, 9, 26, 41, 43, 44, 49, 56]. Due to their objectivity and ease of automated scoring at scale, evaluations commonly include multiple-choice and short-answer questions [16, 43, 53, 54, 60], with benchmarks such as MMLU [22] also spanning a broad"}],"limit":50,"offset":0}