{"total":21,"items":[{"citing_arxiv_id":"2605.11665","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Nautilus: From One Prompt to Plug-and-Play Robot Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T07:26:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11388","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Deep Reasoning in General Purpose Agents via Structured Meta-Cognition","primary_cat":"cs.CL","submitted_at":"2026-05-12T01:21:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10913","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace","primary_cat":"cs.AI","submitted_at":"2026-05-11T17:50:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"partial","one_line_summary":"Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10870","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory","primary_cat":"cs.AI","submitted_at":"2026-05-11T17:20:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10754","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents","primary_cat":"cs.AI","submitted_at":"2026-05-11T15:53:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10365","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values","primary_cat":"cs.AI","submitted_at":"2026-05-11T11:09:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across 
models, and bend under harnesses and skill steering.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"cc/paper_files/paper/2024/file/5a7c947568c1b1328ccc5230172e1e7c-Paper-Conference.pdf. [41] Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, 2026. URL https://arxiv.org/abs/2603.28052. [42] Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Haitao Zheng. Natural-language agent harnesses. CoRR, abs/2603.25723, 2026. doi: 10.48550/ARXIV.2603.25723. URL https://doi.org/10.48550/arXiv.2603.25723. [43] Shalom H Schwartz. Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. In Advances in experimental social psychology, volume 25,"},{"citing_arxiv_id":"2605.09998","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Continual Harness: Online Adaptation for Self-Improving Foundation Agents","primary_cat":"cs.LG","submitted_at":"2026-05-11T05:21:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09650","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Workspace Optimization: How to Train Your 
Agent","primary_cat":"cs.AI","submitted_at":"2026-05-10T16:52:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Workspace optimization evolves an agent's external workspace using multi-agent systems, with DreamTeam raising ARC-AGI-3 scores from 36% to 38.4% while using 31% fewer actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09186","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agentic MIP Research: Accelerated Constraint Handler Generation","primary_cat":"cs.AI","submitted_at":"2026-05-09T21:53:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM agents in a solver-aware harness recover global constraints from MIP formulations, generate executable propagation-only handlers for SCIP, and solve five additional MIPLIB 2017 instances.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08520","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration","primary_cat":"cs.LG","submitted_at":"2026-05-08T22:04:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08083","ref_index":20,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLMs Improving LLMs: Agentic Discovery for Test-Time 
Scaling","primary_cat":"cs.CL","submitted_at":"2026-05-08T17:59:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03808","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agentic-imodels: Evolving agentic interpretability tools via autoresearch","primary_cat":"cs.AI","submitted_at":"2026-05-05T14:35:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03042","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration","primary_cat":"cs.SE","submitted_at":"2026-05-04T18:10:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25850","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agentic Harness Engineering: 
Observability-Driven Automatic Evolution of Coding-Agent Harnesses","primary_cat":"cs.CL","submitted_at":"2026-04-28T16:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20801","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Synthesizing Multi-Agent Harnesses for Vulnerability Discovery","primary_cat":"cs.CR","submitted_at":"2026-04-22T17:27:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20938","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HARBOR: Automated Harness Optimization","primary_cat":"cs.LG","submitted_at":"2026-04-22T13:45:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding 
agent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18576","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs","primary_cat":"cs.AI","submitted_at":"2026-04-20T17:57:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BLF achieves state-of-the-art binary forecasting on ForecastBench by using linguistic belief states updated in tool-use loops, hierarchical multi-trial logit averaging, and hierarchical Platt scaling calibration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18543","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ClawEnvKit: Automatic Environment Generation for Claw-Like Agents","primary_cat":"cs.AI","submitted_at":"2026-04-20T17:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lower cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13630","ref_index":2,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment","primary_cat":"cs.CR","submitted_at":"2026-04-15T08:59:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SafeHarness is a lifecycle-integrated security architecture for LLM agents that cuts unsafe behavior rate by 38% and attack success 
rate by 42% via four coordinated layers while keeping task utility intact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13151","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Exploration and Exploitation Errors Are Measurable for Language Model Agents","primary_cat":"cs.AI","submitted_at":"2026-04-14T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A policy-agnostic metric and controllable 2D grid environments with task DAGs enable measurement of exploration and exploitation errors in language model agents from observed actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05912","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks","primary_cat":"cs.CL","submitted_at":"2026-04-07T14:15:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}