{"total":59,"items":[{"citing_arxiv_id":"2607.01595","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Safe and Adaptive Cloud Healing: Verifying LLM-Generated Recovery Plans with a Neural-Symbolic World Model","primary_cat":"cs.AI","submitted_at":"2026-07-02T01:45:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PASE is a neuro-symbolic self-healing system that synthesizes LLM recovery plans, verifies them in simulation, and uses DRL to optimize prompts, claiming over 40% faster recovery on cloud fault data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30931","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoPoLL: Robust Panel of LLM Judges","primary_cat":"cs.AI","submitted_at":"2026-06-29T21:34:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoPoLL applies the geometric median to aggregate scores from LLM judge panels, yielding finite-sample error bounds and empirical robustness against biased contamination up to 50% rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30602","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MESA: Prioritizing Vulnerable Communication Channels for Securing Multi-Agent Systems","primary_cat":"cs.CR","submitted_at":"2026-06-29T17:40:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MESA ranks MAS communication edges by vulnerability via graph-theoretic metrics and dynamic probes, achieving mean Spearman ρ=+0.60 correlation with empirical per-edge attack success and 3x interception gain when monitoring the top 
10%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30556","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?","primary_cat":"cs.CL","submitted_at":"2026-06-29T16:51:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Poller reduces LLM-human disagreement in evaluating Chinese poetry understanding by having LLMs role-play as authors, with reported error reductions of 94.55% and 89.53% on rhetorical techniques and defamiliarization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29746","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DEEPMED Search: An Open-Source Agentic Platform for Medical Deep Research with Introspective Verification","primary_cat":"cs.AI","submitted_at":"2026-06-29T03:42:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DEEPMED Search is an open-source platform with source-adaptive routing and introspective multi-agent verification for generating citation-backed medical research reports.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20161","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation","primary_cat":"cs.CV","submitted_at":"2026-06-18T12:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARTEMIS combines a debate-and-judge vision-language agent with SAM2 propagation and reliability-aware robust learning 
to improve video polyp segmentation from points, scribbles, or limited dense labels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19494","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hidden Anchors in Multi-Agent LLM Deliberation","primary_cat":"cs.AI","submitted_at":"2026-06-17T18:29:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multi-agent LLM deliberation is modeled with recoverable hidden anchors that allow opinions to escape the convex hull of initial beliefs, unlike classical consensus models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12748","ref_index":226,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agent-based models for the evolution of morphological alternation patterns","primary_cat":"cs.CL","submitted_at":"2026-06-10T23:26:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-agent simulations with naturalistic lexicons and phonological rules show scale-free networks and Bernoulli adoption produce more plausible morphologies, evaluated by an LLM historical linguist debate system and tested via historical case studies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09249","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MAGIS: Evidence-Based Multi-Agent Reasoning for Interpretable Strabismus Clinical Decision-Making","primary_cat":"cs.CV","submitted_at":"2026-06-08T09:21:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MAGIS applies multi-agent reasoning with dual-evidence constrained context 
and corrective verification to raise weighted F1 from 72.0% to 91.3% on a strabismus benchmark while improving report consistency, alignment, and completeness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08367","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy","primary_cat":"cs.MA","submitted_at":"2026-06-06T22:59:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emergence World is a model-agnostic multi-agent simulation platform integrating live data, 120+ tools, persistent memory, and democratic governance, illustrated by a 15-day study showing divergent outcomes across five LLM models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06462","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmark Everything Everywhere All at Once","primary_cat":"cs.AI","submitted_at":"2026-06-04T17:52:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Benchmark Agent is an autonomous agentic system that constructs benchmarks for LLMs and MLLMs via query analysis, subtask design, annotation and quality control, yielding 15 benchmarks with minimal human input.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05670","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do More Agents Help? 
Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows","primary_cat":"cs.AI","submitted_at":"2026-06-04T03:50:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Under controlled identical protocols, only one of six multi-agent LLM systems marginally exceeds a single-agent baseline on benchmark-balanced accuracy while the rest trail and cost more; a runtime workflow reaches 66.72% on GAIA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03650","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks","primary_cat":"cs.CL","submitted_at":"2026-06-02T13:41:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00405","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Talking Words to Sharing Thoughts: Scalable Multi-LLM Aggregation via Structured Message Passing","primary_cat":"cs.GT","submitted_at":"2026-05-29T22:47:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A bipartite factor graph with message-passing protocol and asymmetric damping aggregates multi-LLM predictions, cutting token use by 97% and API calls by 6X while outperforming baselines on MMLU, MMLU-Pro, GPQA, and 
MedMCQA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27914","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm","primary_cat":"cs.CL","submitted_at":"2026-05-27T03:41:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00093","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why","primary_cat":"cs.CL","submitted_at":"2026-05-25T07:31:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on label-rate drift, and a reporting checklist is provided.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18890","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits","primary_cat":"physics.soc-ph","submitted_at":"2026-05-17T00:21:53+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Minor perturbations in 
persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social simulations require per-claim robustness audits via the new TRAILS taxonomy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15706","ref_index":23,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-15T07:54:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DMoA is a differentiable multi-agent framework for LLMs that uses recurrent context-aware routing and predictive entropy for test-time adaptation, claiming SOTA results on 9 benchmarks with efficiency and robustness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11453","ref_index":6,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies","primary_cat":"cs.MA","submitted_at":"2026-05-12T03:11:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect empirical rank correlation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Across this literature, topology is treated as a design parameter chosen by trial and error. 
We sit upstream of any specific prompting strategy and ask what the communication graph alone tells us about failure modes, before any inference is run. Evaluation of LLM systems. HELM [10] and BIG-Bench [11] measure single-model capability; AgentBench [9] and ChatEval [6] target multi-agent protocols. These instruments are outcome-oriented and post hoc. We pursue a pre-inference diagnostic derived from the graph itself, complementary to outcome-based evaluation rather than a substitute for it. Spectral analysis of message-passing systems. Spectral approaches to information flow on graphs are well established in graph signal processing and graph neural networks [26, 27], where they"},{"citing_arxiv_id":"2605.11376","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-12T01:04:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of open problems, emphasizing task allocation, reasoning through debates, layered context management, and memory design. Our work is complementary: while theirs outlines conceptual challenges, LLM-X contributes a concrete substrate and reproducible evaluation environment. Collaboration and Debate. Recent studies have leveraged multi-agent debate or role-play to enhance reasoning and factuality [3, 5, 13, 15]. Other frameworks such as AutoGen [32], MetaGPT [8], and collaborative environments like Chatarena [33] showcase structured multi-agent interactions. 
These works high- light the potential of LLM-to-LLM communication but often lack a standardized communication substrate. LLM-X differs by intro- ducing a schema-validated protocol with explicit policy controls,"},{"citing_arxiv_id":"2605.10171","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews","primary_cat":"cs.CL","submitted_at":"2026-05-11T08:20:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"function configured to retrieve pairs exhibiting po- tential semantic divergence. This high-recall fil- tering ensures that the system retains subtle con- tradictions while discarding clear agreements, pro- viding a unified candidate set for intensity assess- ment. Across review pairs, extracted evidence is accumulated into anAspect-Specific Evidence Pool E={E a1, . . . ,EaM }, where Eam = [ (i,j) E (i,j) am .(2) Deliberative Intensity Agent (DIA):A Delib- erative Intensity Agent (DIA) serves as the core reasoning unit for assigning graded contradiction intensity scores. Given an aspect-aligned evidence pair (e(j) 1 , e(j) 2 )∈ E aj, the agent functions as a probabilistic mapping that predicts a discrete in- tensity label αj ∈ {0,1,2,3} (following the rubric of contradiction intensity 5) and generates a sup- porting explanation/reason for the assigned label ρj: (αj, ρj) =g DIA (e(j) 1 , e(j) 2 , ri, rj),(3) where ri and rj denote the full review contexts. 
Conditioning on the full context enables the agent to interpret localized evidence spans within the broader evaluative discourse of each reviewer, dis- tinguishing genuine conflict from rhetorical differ- ences. IMPACT employs two DIAs (DIA-A and DIA-B) which share a functional specification but may be instantiated using diverse underlying LLMs to encourage reasoning variance. Intensity Agreement Checker:The Intensity Agreement Checker functions as a deterministic control gate. It compares the agents' initial in- dependent predictions, αA j and αB j , to determine whether they agree (i.e., αA j =α B j ). If agreement holds, the shared intensity label is accepted directly and propagated to downstream components with- out further interaction. Conversely, in the event of disagreement, the deliberation protocol is triggered and managed by the Disagreement Orchestrator. Disagreement Orchestrator:The Disagreement Orchestrator (DO) manages structured interaction 5Here, label 0 denotes \"no valid contradiction\" (i"},{"citing_arxiv_id":"2605.09278","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium","primary_cat":"cs.AI","submitted_at":"2026-05-10T03:04:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"consistently outperforms existing safeguards, remains robust under adversarial agents, and incurs negligible inference overhead. 
1 Introduction Multi-agent debate (MAD) systems built on large language models (LLMs) have shown strong performance on complex reasoning [17, 35, 42, 70], embodied action [57, 71], and planning [24, 33, 38] tasks, where agents iteratively discuss, critique, and refine each other's outputs [9, 36, 43]. To support interactions beyond a single round, recent MAD systems add a shared memory that persists intermediate reasoning, past actions, and episodic trajectories across rounds [3, 69, 89, 94]. While shared memory boosts long-horizon reasoning, it also opens a critical vulnerability: a corrupted memory state, which can subsequently contaminate all downstream memory-augmented reasoning"},{"citing_arxiv_id":"2605.08904","ref_index":104,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08715","ref_index":4,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-09T05:55:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset 
AFTraj-2K.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"Monitoring reasoning models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926, 2025. [3] Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi- agent llm systems fail?arXiv preprint arXiv:2503.13657, 2025. [4] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023. [5] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al."},{"citing_arxiv_id":"2605.07069","ref_index":20,"ref_count":3,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems","primary_cat":"cs.MA","submitted_at":"2026-05-08T00:30:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Agentic AI needs social theory as structural priors in the MASS framework to model emergent dynamics from multi-agent interactions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"multi-agent llm systems fail?arXiv preprint arXiv:2503.13657, 2025. [19] Damon Centola. The spread of behavior in an online social network experiment.science, 329 (5996):1194-1197, 2010. [20] Shelly Chaiken and Alison Ledgerwood. A theory of heuristic and systematic information processing.Handbook of theories of social psychology, 1:246-266, 2012. [21] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 
Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023. [22] Nuo Chen, Yicheng Tong, Yuzhe Yang, Yufei He, Xueyi Zhang, Zou Qingyun, Qian Wang, and Bingsheng He. Diversity collapse in multi-agent llm systems: Structural coupling and"},{"citing_arxiv_id":"2605.06161","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:49:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Our three-principle framework fills this gap by testing rubric-side invariance specifically in the agent safety domain. Evaluation framework design. Recent LLM-as-a-Judge frameworks span multi-dimensional rubric-based scoring [17], fine-tuned rubric-followers [21, 22], juries [43] aggregating diverse models to reduce single-judge bias, multi-agent debate [4], sub-judgment decomposition [36], and length-controlled scoring [10] debiasing against verbosity. These designs target how a verdict is computed; none asks whether the verdict is invariant to how the evaluation policy is worded. EvalCards [9] standardize evaluation benchmark documentation. 
Our Judge Card extends this direction to judge models, reporting invariance properties, not dataset properties."},{"citing_arxiv_id":"2605.03143","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Pact: A Choreographic Language for Agentic Ecosystems","primary_cat":"cs.PL","submitted_at":"2026-05-04T20:32:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pact is a choreographic language extended with game-theoretic operations that maps every protocol to a formal game for reasoning about agent decisions and solving for decision policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27132","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TRUST: A Framework for Decentralized AI Service v.0.1","primary_cat":"cs.AI","submitted_at":"2026-04-29T19:32:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while proving a Safety-Profitability Theorem that rewards honest auditors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26679","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria","primary_cat":"cs.HC","submitted_at":"2026-04-29T13:49:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building 
methods, allowing iterative revisions with examples and history, and keeping transparent how human judgments become automated rules.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21446","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AI-Gram: When Visual Agents Interact in a Social Network","primary_cat":"cs.AI","submitted_at":"2026-04-23T09:05:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Autonomous visual AI agents spontaneously form image reply chains, maintain stable individual styles, and produce richer style-diverse conversations than single agents can achieve alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19589","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TeamFusion: Supporting Open-ended Teamwork with Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-04-21T15:40:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TeamFusion uses per-member proxy agents and iterative structured discussions to generate more representative and consensual team deliverables than direct aggregation in open-ended tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18327","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PARM: Pipeline-Adapted Reward Model","primary_cat":"cs.AI","submitted_at":"2026-04-20T14:29:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and 
solving accuracy on optimization benchmarks and showing transfer to GSM8K.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Example outputs at each stage are provided, including candidate formulations and solutions, as well as the final selected (gold) formulation and solution. Conceptually, pipeline-based LLM systems can be viewed as a subset of more general multi-agent or tool-augmented frameworks, where multiple intelligent components collaborate to solve complex problems [18]-[20]. In such settings, the boundaries between pipeline, workflow orchestration, and multi-agent systems become fluid: pipelines emphasize sequential, stage-wise processing, while agent-based and tool-augmented approaches may involve more dynamic, interactive, or parallel coordination among components. Workflow orchestration frameworks further generalize this idea by enabling"},{"citing_arxiv_id":"2604.17503","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology","primary_cat":"cs.AI","submitted_at":"2026-04-19T15:46:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Organizing LLM agents into structured collaboration graphs has progressed from static, role-fixed pipelines [12,24,33,40,46] toward dynamically optimizable topologies. 
Early graph-based frameworks showed that encoding human workflows into DAG-structured agent networks consistently outperforms single-agent baselines [13,34], while debate-style topologies demonstrated the value of diverse agent perspectives [4,5,9,21,26]. A pivotal shift came with jointly learnable topologies: GPTSwarm [58] used RL to co-optimize node prompts and edge connectivity; G-Designer [50] introduced a variational graph auto-encoder for query-adaptive topology prediction; and MASS [55] revealed that prompt and topology search are mutually reinforcing. Preceding these learnable methods, DyLAN [29]"},{"citing_arxiv_id":"2604.10389","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection","primary_cat":"cs.CL","submitted_at":"2026-04-12T00:30:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08362","ref_index":10,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces","primary_cat":"cs.CL","submitted_at":"2026-04-09T15:26:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces OmniBehavior benchmark from real-world data and shows LLMs exhibit hyper-activity, persona homogenization, and utopian bias in behavior simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03656","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Retrieval: Modeling Confidence Decay and Deterministic Agentic Platforms 
in Generative Engine Optimization","primary_cat":"cs.AI","submitted_at":"2026-04-04T09:17:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Deterministic multi-agent intent routing can reduce hallucinations in generative engines to near zero by limiting LLMs to intent routers and handing off tasks to specialized agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02863","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EMS: Multi-Agent Voting via Efficient Majority-then-Stopping","primary_cat":"cs.AI","submitted_at":"2026-04-03T08:29:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recent advances in large language models (LLMs) have stimulated increasing interest in multi-agent systems (MAS) [1, 2, 4], where multiple LLM-based agents collaborate to solve complex tasks. Representative frameworks such as AutoGen [9] enable structured collaboration among multiple agents through role assignment, communication, and task decomposition. Other interaction-based approaches, including multi-agent debate [17, 12], further encourage agents to critique and refine one another's intermediate reasoning, thereby improving the quality of final outputs. 
While these studies demonstrate the effectiveness of collaborative reasoning for improving task performance, the efficiency of multi-agent inference, particularly the reduction of redundant agent calls during decision"},{"citing_arxiv_id":"2604.09679","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate","primary_cat":"cs.MA","submitted_at":"2026-04-03T06:58:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"majority voting [8, 14, 15] or weighted averaging strategies [16, 17] to aggregate the responses generated by multiple agents independently. However, it lacks the interaction among different agents, failing to resolve shared biases or deeper cognitive conflicts. Multi-Agent Debate (MAD) involves iteratively critiquing and refining intermediate solutions to facilitate the exchange of thoughts among agents [18, 15, 19]. However, many MAD methods [20, 21] employ a debate process with fixed interaction topologies and a predetermined number of rounds for all tasks, resulting in token redundancy and inaccuracies due to overfitting the debate. To enhance the efficiency of MAD, some recent studies [22, 3] aim to generate optimized intra-round topologies by refining the communication structure."},{"citing_arxiv_id":"2604.02674","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do Agent Societies Develop Intellectual Elites? 
The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-04-03T03:08:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent llm systems fail? arXiv preprint arXiv:2503.13657, 2025. [7] Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and Krishna Gummadi. Measuring user influence in twitter: The million follower fallacy. In Proceedings of the international AAAI conference on web and social media, volume 4, pages 10-17, 2010. [8] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023. [9] Harrison Chase and LangChain Inc. Langgraph: Building stateful, multi-agent applications with llms, 2024. 
[10] Justin Chen, Swarnadeep Saha, and Mohit Bansal."},{"citing_arxiv_id":"2603.27771","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Emergent Social Intelligence Risks in Generative Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-03-29T17:10:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"agent collectives, despite no instruction to do so, spontaneously reproduce familiar failure patterns from human societies. 1. Introduction Multi-agent systems (MAS) built from modern generative models are increasingly capable of coordinating, competing, and negotiating over shared resources and structured workflows to solve complex tasks [43, 111]. As a result, MAS are rapidly expanding across a wide range of downstream applications [17, 57, 1, 122, 130]. With the growing social competence of these systems, agents can now perform complex interaction patterns such as buyer-seller negotiation [135], collaborative task execution [75], and large-scale information propagation [64]. 
As MAS increasingly resemble interacting societies of agents rather than isolated tools [55], assessing the safety and trustworthiness"},{"citing_arxiv_id":"2604.16314","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation","primary_cat":"cs.SE","submitted_at":"2026-02-06T11:43:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.22297","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate","primary_cat":"cs.CL","submitted_at":"2026-01-29T20:21:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.05106","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Token-Level LLM Collaboration via FusionRoute","primary_cat":"cs.AI","submitted_at":"2026-01-08T16:53:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior 
collaboration and merging methods on reasoning and generation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.15408","ref_index":80,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization","primary_cat":"cs.CL","submitted_at":"2025-11-19T13:05:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MAGIC-HMO is a multi-agent framework that treats Chinese short-form creative NLG as heterogeneous multi-objective optimization over personalized constraints plus explanation reliability and outperforms baselines on a baby-naming benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.10287","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models","primary_cat":"cs.LG","submitted_at":"2025-11-13T13:18:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.01188","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction","primary_cat":"cs.CL","submitted_at":"2025-11-03T03:29:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ZoFia is a zero-shot fake news 
detection framework that uses hierarchical entity salience retrieval followed by multi-LLM adversarial debate to improve robustness over single-model approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.20325","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs","primary_cat":"cs.CL","submitted_at":"2025-08-28T00:07:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.09788","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit","primary_cat":"cs.MA","submitted_at":"2025-07-13T21:00:27+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.04565","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems","primary_cat":"cs.MA","submitted_at":"2025-06-05T02:34:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key 
challenges for future research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Multi-agent systems encompass any architecture involving two or more autonomous agents-LLMs that interact within a shared environment to achieve individual or collective goals. Communication strategy defines the protocols and methods through which LLM agents exchange information, coordinate actions, and negotiate meanings within a multi-agent setting. For example, ChatEval [16], a multi-agent debate framework that uses multiple LLMs with diverse roles to collaboratively evaluate generated text, aiming to simulate the quality and depth of human evaluation. ChatEval enables multiple LLM agents, each with a distinct role prompt (e.g., critic, scientist), to engage in structured discussion through designed communication strategies (one-by-one, simultaneous, or"},{"citing_arxiv_id":"2503.13657","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Why Do Multi-Agent LLM Systems Fail?","primary_cat":"cs.AI","submitted_at":"2025-03-17T19:04:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023. 14 [61] Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. [62] LangChain. Langgraph, 2024. URL https://www.langchain.com/langgraph. [63] Anthropic. Building effective agents, 2024. 
URL https://www.anthropic.com/research/building-effective-agents. [64] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models."}],"limit":50,"offset":0}