{"total":20,"items":[{"citing_arxiv_id":"2606.30182","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MirrorCode: AI can rebuild entire programs from behavior alone","primary_cat":"cs.AI","submitted_at":"2026-06-29T11:57:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MirrorCode benchmark shows current AI models achieving up to 56% success reimplementing 25 diverse full programs from behavior alone, including a 16,000-line bioinformatics toolkit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11926","ref_index":160,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Generalist Autonomous Research via Hypothesis-Tree Refinement","primary_cat":"cs.CL","submitted_at":"2026-06-10T10:57:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27492","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems","primary_cat":"cs.SE","submitted_at":"2026-05-26T16:28:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAMP evaluates 15 models on production-like serial workflows and reports completion rates collapsing from 100% to 20% with none finishing the full pipeline and costs varying by three orders of magnitude.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20744","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale","primary_cat":"cs.LG","submitted_at":"2026-05-20T05:46:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19156","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Far Are We From True Auto-Research?","primary_cat":"cs.AI","submitted_at":"2026-05-18T22:20:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18661","ref_index":222,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI for Auto-Research: Roadmap & User Guide","primary_cat":"cs.AI","submitted_at":"2026-05-18T17:08:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"represented in a shared and updateable workspace, then phase handoffs remain fragile despite the presence of many tools. 7.1.4 Multi-Agent and Community-Scale Systems Multi-agent systems distribute research tasks across specialized agents, such as researchers, engineers, reviewers, analyzers, writers, or simulated community members. FreePhDLabor [100], SciMaster [19], EvoScientist [127], UniScientist [99], Medical AI Scientist [222], AiScientist-LH [23], FARS [8], and AutoResearchClaw [117] illustrate different forms of multi-agent orchestration. Related community-scale systems such as VirSci [193], AgentRxiv [170], and ResearchTown [242] further simulate aspects of scientific collaboration, including idea exchange, manuscript writing, review, and revision. The motivation for multi-agent architectures is that research requires heterogeneous expertise and adversarial"},{"citing_arxiv_id":"2605.17373","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics","primary_cat":"cs.LG","submitted_at":"2026-05-17T10:30:38+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14445","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale","primary_cat":"cs.LG","submitted_at":"2026-05-14T06:39:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13950","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:00:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02050","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Principles and Guidelines for Randomized Controlled Trials in AI Evaluation","primary_cat":"cs.CY","submitted_at":"2026-05-03T20:37:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The authors adapt established RCT validity principles from other fields into a standardized framework with 33 guidelines tailored to AI evaluation contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24966","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Risk Reporting for Developers' Internal AI Model Use","primary_cat":"cs.CY","submitted_at":"2026-04-27T20:07:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14116","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration","primary_cat":"cs.AI","submitted_at":"2026-04-15T17:38:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"variants [5] and synthesize post-training objectives [28]. These methods remain constrained by predefined search spaces or focus on optimizing isolated components. In contrast, our work explores a more open-ended setting, directly automating the entire LLM training lifecycle. AI for Data Construction.Recent studies extensively utilize LLMs for data synthesis [31], evolutionary refinement [48], and quality filtering [26]. To facilitate these tasks, dedicated frameworks [3, 24] have been 3 TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration developed to ensure reproducible and scalable data engineering. However, these approaches typically employ LLMs as discrete tools within predetermined protocols. In contrast, we propose a holistic and fully automatic"},{"citing_arxiv_id":"2604.12290","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization","primary_cat":"cs.AI","submitted_at":"2026-04-14T05:02:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limited success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.09514","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies","primary_cat":"cs.CL","submitted_at":"2026-02-10T08:12:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.12826","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scheming Ability in LLM-to-LLM Strategic Interactions","primary_cat":"cs.CL","submitted_at":"2025-10-11T04:42:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.11473","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety","primary_cat":"cs.AI","submitted_at":"2025-07-15T16:43:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.06261","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","primary_cat":"cs.CL","submitted_at":"2025-07-07T17:36:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.10517","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"KernelBench: Can LLMs Write Efficient GPU Kernels?","primary_cat":"cs.LG","submitted_at":"2025-02-14T19:30:53+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KernelBench shows that even the best current LLMs generate correct and faster-than-baseline GPU kernels in fewer than 20 percent of realistic ML workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.14249","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Humanity's Last Exam","primary_cat":"cs.LG","submitted_at":"2025-01-24T05:27:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"agents against human experts, 2024. URLhttps://arxiv.org/abs/2411.15114. [58] xAI. Grok-2 beta release, 2024. URLhttps://x.ai/blog/grok-2. [59] F. Yan, H. Mao, C. C.-J. Ji, T. Zhang, S. G. Patil, I. Stoica, and J. E. Gonzalez. Berkeley function call- ing leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_ leaderboard.html, 2024. [60] Z. Yang, P. Qi, S. Zhang, Y . Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018. URL https://arxiv.org/abs/ 1809.09600. [61] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024. URLhttps://arxiv."},{"citing_arxiv_id":"2412.04984","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Frontier Models are Capable of In-context Scheming","primary_cat":"cs.AI","submitted_at":"2024-12-06T12:09:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}