{"total":11,"items":[{"citing_arxiv_id":"2606.11926","ref_index":135,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Generalist Autonomous Research via Hypothesis-Tree Refinement","primary_cat":"cs.CL","submitted_at":"2026-06-10T10:57:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27328","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Governed Evolution of Agent Runtimes through Executable Operational Cognition","primary_cat":"cs.SE","submitted_at":"2026-05-26T17:36:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Introduces HarnessMutation as a governed mechanism for lifecycle-aware runtime adaptation in agent systems, modeling evolution as a bounded observable process over persistent operational memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20743","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction","primary_cat":"cs.CV","submitted_at":"2026-05-20T05:46:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18747","ref_index":110,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Code as Agent Harness","primary_cat":"cs.CL","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Table 2: Representative systems where code serves as an action interface. Method Mechanism Action Paradigm Key Innovation AutoHarness [109] Harness Gen. Action validation Synthesizes code harnesses that mediate model actions and filter invalid environment interactions SayCan [9] Skill Selec. Affordance-based Links LLM plans to physical feasibility KnowNo [110] Skill Selec. Conformal prediction Calibrates planner uncertainty for ambiguous instructions SkillVLA [111] Skill Selec. Bimanual grounding Extends grounding to combinatorial skill reuse BOSS [112] Skill Selec. Skill bootstrapping Synthesizes new executable skill chains via guided practice LLM-Guided Traj. [113] Skill Selec. 
Trajectory generation Generates diverse manipulation trajectories and executable success"},{"citing_arxiv_id":"2605.10754","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents","primary_cat":"cs.AI","submitted_at":"2026-05-11T15:53:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"settings. Nevertheless, existing systems exhibit significant deficiencies in their revision capabilities [16]. The absence of robust closed-loop feedback constitutes a fundamental architectural bottleneck, systematically constraining adaptability to out-of-distribution tasks and novel operating environments. The emerging literature on agent harnesses [23] addresses this limitation at the engineering level: A harness enforces closed-loop execution by intercepting tool calls, structuring intermediate results, and re-injecting them into the model's context in a form that necessitates subsequent deliberation and action. 
From the cybernetics perspective, the performance gains attributable to harnesses demonstrates that the feedback-closure implementation is critical for stable performance."},{"citing_arxiv_id":"2605.09186","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic MIP Research: Accelerated Constraint Handler Generation","primary_cat":"cs.AI","submitted_at":"2026-05-09T21:53:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM agents in a solver-aware harness recover global constraints from MIP formulations, generate executable propagation-only handlers for SCIP, and solve five additional MIPLIB 2017 instances.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08741","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-09T07:06:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Supervised fine-tuning (SFT) imitates static demonstrations, but does not teach the model to adapt its procedure. Reinforcement learning ∗Equal contribution. †Corresponding author. arXiv:2605.08741v1 [cs.CL] 9 May 2026 (RL) optimizes task-level rewards, but often provides sparse supervision that weakly identifies which procedural behaviors matter. 
On-policy distillation (OPD) [19, 30, 1], which trains on the student's own trajectories under dense token-level supervision from a teacher, is therefore a natural vehicle for internalizing the behavior of an inference-time harness. This leaves a central question: can the step-by-step procedure induced by such a harness be absorbed into the model parameters? We study this question through on-policy harness self-distillation (OPHSD), a self-distillation method"},{"citing_arxiv_id":"2605.08520","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration","primary_cat":"cs.LG","submitted_at":"2026-05-08T22:04:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The same design also applies to ACE and Meta-Harness. 1 Introduction A growing line of recent work enables LLM agents to evolve themselves. Instead of updating model weights, these systems iteratively refine the non-parametric components that govern their behavior, including system prompts [2, 28, 29], context and memory [34, 22, 33], harness code [16, 14] and generated programs [20, 13, 3]. This emerging paradigm of test-time self-evolution [6] fundamentally relaxes the access requirements of weight-space adaptation: it requires neither the labeled trajectories used by supervised fine-tuning nor the gradient updates required by reinforcement learning. 
By having an LLM reflect on full execution traces rather than optimize against scalar rewards, this paradigm"},{"citing_arxiv_id":"2604.22937","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs","primary_cat":"cs.CL","submitted_at":"2026-04-24T18:22:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18576","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs","primary_cat":"cs.AI","submitted_at":"2026-04-20T17:57:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BLF achieves state-of-the-art binary forecasting on ForecastBench by using linguistic belief states updated in tool-use loops, hierarchical multi-trial logit averaging, and hierarchical Platt scaling calibration.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"not individually significant on AIBQ2 alone, but the direction agrees with the much larger and significant FB effects in Fig. 3. Table 12: Metaculus Baseline Score (MS) comparison on FB A∪B. ∗ FB leaderboard; † partial overlap. 95% bootstrap CIs in brackets. 
MS↑ Method Market Data All BLF+crowd+emp+cal 88.3 [81,94] 39.9 [35,45] 64.1 [59,69] Cassi∗ 81.5 [75,87] 33.1 [28,38] 57.3 [53,61] GPT-5 ZS+freeze∗ 78.5 [69,86] 35.4 [31,40] 56.9 [52,62] Grok 4.20∗† 72.2 [57,84] 37.2 [32,43] 54.7 [47,62] Foresight-32B∗† 81.4 [70,90] 27.6 [19,36] 54.5 [47,61] Crowd+emp (no LLM) 80.8 [74,87] 29.4 [25,34] 55.1 [51,59] ZS+crowd+emp 75.5 [69,83] 9.0 [−4,21] 42.2 [36,49] Table 13: Brier Score (BS) comparison on FB A∪B (n=791 resolution dates from 400 questions). ∗ FB leaderboard; † partial overlap. 95% bootstrap CIs in brackets."},{"citing_arxiv_id":"2604.18292","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence","primary_cat":"cs.AI","submitted_at":"2026-04-20T14:01:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[58] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025. [59] Ryan Lopopolo. Harness engineering: leveraging codex in an agent-first world. https://openai.com/index/harness-engineering/, Feb 2026. OpenAI Engineering Blog. Accessed: 2026-04-06. [60] Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, and Kevin P. Murphy. Autoharness: improving llm agents by automatically synthesizing a code harness, 2026. URL https://arxiv.org/abs/2603.03329. 
[61] Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang."}],"limit":50,"offset":0}