{"total":29,"items":[{"citing_arxiv_id":"2605.12943","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforced Collaboration in Multi-Agent Flow Networks","primary_cat":"cs.LG","submitted_at":"2026-05-13T03:26:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MANGO optimizes multi-agent LLM workflows via flow networks, RL, and textual gradients, delivering up to 12.8% higher performance and 47.4% better efficiency while generalizing to new domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12376","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows","primary_cat":"cs.AI","submitted_at":"2026-05-12T16:42:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ProfiliTable is a profiling-driven multi-agent system that builds semantic context through exploration and closed-loop refinement to produce more reliable tabular data transformations than prior LLM approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10516","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability","primary_cat":"cs.AI","submitted_at":"2026-05-11T13:06:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing 
failures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10052","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Swarm Skills: A Portable, Self-Evolving Multi-Agent System Specification for Coordination Engineering","primary_cat":"cs.CL","submitted_at":"2026-05-11T06:26:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Swarm Skills is a distributable specification for multi-agent workflows that includes roles, execution bounds, and a self-evolution algorithm to automatically improve coordination strategies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08831","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System","primary_cat":"cs.RO","submitted_at":"2026-05-09T09:36:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AssemPlanner is a ReAct-based multi-agent system that autonomously generates production plans from natural language inputs by integrating scheduling, knowledge, line balancing, and scene graph feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07462","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment","primary_cat":"cs.CL","submitted_at":"2026-05-08T09:10:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit 
data, suggesting limited unique harm but flagging tail risks like secret leaks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03195","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?","primary_cat":"cs.AI","submitted_at":"2026-05-04T22:24:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A fine-tuned 4B model matches or exceeds frontier LLMs in terminal execution subagent tasks for coding agents, reducing main agent token usage by 30% with no performance loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00382","ref_index":118,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Social Bias in LLM-Generated Code: Benchmark and Mitigation","primary_cat":"cs.SE","submitted_at":"2026-05-01T04:06:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04097","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness","primary_cat":"q-bio.NC","submitted_at":"2026-04-30T20:48:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27974","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting","primary_cat":"cs.CV","submitted_at":"2026-04-30T15:03:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FineState-Bench and FineState-Metrics show LVLMs achieve only 22.8% average exact-state success in GUI interactions, with visual diagnostic hints improving results by up to 14.9 points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27209","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves","primary_cat":"cs.SE","submitted_at":"2026-04-29T21:28:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26275","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agentic AI in the Software Development Lifecycle: Architecture, Empirical Evidence, and the Reshaping of Software Engineering","primary_cat":"cs.SE","submitted_at":"2026-04-29T04:06:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Agentic AI systems are shifting software engineering from 
line-level code generation to delegated repository-scale execution under supervision, with SWE-bench performance rising from 1.96% to 78.4% and productivity gains of 13.6-55.8%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23580","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement","primary_cat":"cs.RO","submitted_at":"2026-04-26T07:37:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23338","ref_index":108,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework","primary_cat":"cs.CR","submitted_at":"2026-04-25T14:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23088","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Code Broker: A Multi-Agent System for Automated Code Quality 
Assessment","primary_cat":"cs.SE","submitted_at":"2026-04-25T00:53:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Code Broker deploys a five-agent hierarchy that combines LLM semantic analysis with static linting to generate actionable Python code quality reports.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19926","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CreativeGame:Toward Mechanic-Aware Creative Game Generation","primary_cat":"cs.AI","submitted_at":"2026-04-21T19:16:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CreativeGame enables iterative HTML5 game generation via mechanic-guided planning, lineage memory, runtime validation, and programmatic rewards to produce inspectable version-to-version mechanic evolution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19211","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation","primary_cat":"cs.AI","submitted_at":"2026-04-21T08:15:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClawNet digitizes human collaborative relationships into a network of identity-governed AI agents that collaborate on behalf of their owners through a central orchestrator enforcing binding and verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18133","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled 
Futures","primary_cat":"cs.AI","submitted_at":"2026-04-20T12:00:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey comparing classical multi-agent systems with large foundation model-enabled multi-agent systems, showing how the latter enables semantic-level collaboration and greater adaptability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17419","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ARMove: Learning to Predict Human Mobility through Agentic Reasoning","primary_cat":"cs.MA","submitted_at":"2026-04-19T12:59:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ARMove is a transferable framework for human mobility prediction that combines agentic LLM reasoning, feature management, and large-small model synergy to outperform baselines on several metrics while improving interpretability and robustness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13103","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review","primary_cat":"cs.SE","submitted_at":"2026-04-10T13:49:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprepared for deployable fair 
systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07769","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models","primary_cat":"cs.SE","submitted_at":"2026-04-09T03:48:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08601","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OpenKedge: Governing Agentic Mutation with Execution-Bound Safety and Evidence Chains","primary_cat":"cs.AI","submitted_at":"2026-04-07T22:51:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OpenKedge redefines AI agent state mutations as a governed process using intent proposals, policy-evaluated execution contracts, and cryptographic evidence chains to enable safe, auditable agentic behavior.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05952","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration","primary_cat":"cs.AI","submitted_at":"2026-04-07T14:46:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A deep research agent incorporates progressive confidence estimation and 
calibration to produce trustworthy reports with transparent confidence scores on claims.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04060","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks","primary_cat":"cs.CR","submitted_at":"2026-04-05T11:06:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01151","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Detecting Multi-Agent Collusion Through Multi-Agent Interpretability","primary_cat":"cs.AI","submitted_at":"2026-04-01T17:08:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.13657","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Why Do Multi-Agent LLM Systems Fail?","primary_cat":"cs.AI","submitted_at":"2025-03-17T19:04:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited 
performance gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"URL http://dx.doi.org/10.1007/s11704-024-40231-1. [5] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023. URL https://arxiv.org/abs/2307.07924. [6] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai software"},{"citing_arxiv_id":"2309.07864","ref_index":110,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Rise and Potential of Large Language Model Based Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2023-09-14T17:12:03+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"models undergo pre-training on large-scale corpora and demonstrate the capacity for few-shot and zero-shot generalization, allowing for seamless transfer between tasks without the need to update parameters [41; 105; 106; 107]. LLM-based agents have been applied to various real-world scenarios, such as software development [108; 109] and scientific research [110]. 
Due to their natural language comprehension and generation capabilities, they can interact with each other seamlessly, giving rise to collaboration and competition among multiple agents [108; 109; 111; 112]. Furthermore, research suggests that allowing multiple agents to coexist can lead to the emergence of social phenomena [22]. 2.3 Why is LLM suitable as the primary component of an Agent's brain?"},{"citing_arxiv_id":"2308.07201","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate","primary_cat":"cs.CL","submitted_at":"2023-08-14T15:13:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.19118","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate","primary_cat":"cs.CL","submitted_at":"2023-05-30T15:25:45+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}