arxiv: 2605.01750 · v2 · submitted 2026-05-03 · 💻 cs.MA · cs.AI

Recognition: unknown

Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation

Yiheng Yao , Chelsea Zou , Robert D. Hawkins

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:11 UTC · model grok-4.3

classification 💻 cs.MA cs.AI

keywords multi-agent negotiationdynamic groundingLLM coordination failuresPareto optimalitygrounding repairresource allocationmulti-turn dialogue

0 comments

The pith

LLM agent pairs fail to reach Pareto-optimal allocations in multi-turn negotiation even when each agent identifies the optimum alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current large language models can solve a resource allocation problem correctly when working in isolation but consistently miss the jointly optimal outcome when two agents must negotiate over several turns. Failures arise from specific breakdowns in maintaining mutual understanding: agents lose track of conversation history, anchor to initial proposals, default to equal splits instead of coordinated gains, and mix up references to earlier statements. Baseline comparisons rule out simple explanations such as weak individual reasoning or lack of information sharing, pointing instead to the difficulty of forming, committing to, and executing a shared plan across interaction turns. This gap matters because many practical AI deployments require agents to coordinate over extended dialogues rather than in single exchanges.

Core claim

In an iterated multi-turn negotiation game where two agents allocate shared resources to private projects, LLM dyads fail to reach verifiable jointly optimal outcomes across models. Although single agents identify the Pareto-optimal allocations correctly, pairs exhibit four recurrent failure modes: loss of shared interaction history, stubborn anchoring to early proposals, preference for equal splits over reward-maximizing coordination, and referential binding errors across turns. The coordination gap persists after baselines demonstrate that individual reasoning limits and insufficient information exchange do not account for the shortfall, locating the bottleneck in dynamic grounding—the on-

What carries the argument

Dynamic grounding: the collaborative, multi-turn process of establishing and repairing mutual belief sufficient for joint plan formation, commitment, and execution.

If this is right

Negotiation benchmarks must include multi-turn repair rather than one-shot static tasks.
Model performance on coordination tasks depends on mechanisms for tracking joint history and commitments.
Equal-split defaults and anchoring suggest that reward-maximizing behavior requires explicit training for dynamic plan maintenance.
Referential errors across turns indicate the need for better cross-turn reference resolution in interactive settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same grounding failures may appear in longer-horizon tasks such as collaborative planning or tool use between agents.
Providing agents with an external shared ledger of proposals and commitments could serve as a minimal intervention to test the grounding hypothesis.
The pattern suggests that scaling model size alone may not close the gap without architectural changes for maintaining mutual context.

Load-bearing premise

The observed coordination failures are caused by dynamic grounding breakdowns rather than limits in single-agent reasoning or raw information transfer, as indicated by the baseline results.

What would settle it

An experiment in which agents equipped with explicit shared memory of all prior turns or a structured commitment protocol reach the Pareto-optimal allocation on a high fraction of trials.

Figures

Figures reproduced from arXiv: 2605.01750 by Chelsea Zou, Robert D. Hawkins, Yiheng Yao.

**Figure 1.** Figure 1: Illustration of the resource allocation game agents play. Each agent has a private set of projects with different requirements and makes purchases from a common pool of resources after exchanging up to 5 messages each. If joint resource purchases exceed capacity, an overdraw occurs and no rewards are given to either agent. The game is iterated over 4 rounds with the same or different partner and projects. … view at source ↗

**Figure 2.** Figure 2: Joint efficiency for self-play (solid) and cross-play (hatched) dyads across compatibility ratios. Evaluating cross-play pairings reveals that heterogeneous dyads consistently outperform self-play under competitive conditions ( view at source ↗

**Figure 3.** Figure 3: Value of cheap talk across compatibility ratios, aggregated over all models and conditions. Filled dots show cheap-talk performance; hollow dots show the no-talk baseline, and the shaded region represents the gain from communication. All three metrics are oriented so that higher is better: we report 1 − overdraw rate (fraction of rounds without supply violation), joint efficiency, and optimum rate. Cheap t… view at source ↗

**Figure 4.** Figure 4: Stable-shifting gap by model across M/C ratios. Positive values indicate stable outperforms shifting. GPT-5 Mini and Qwen 3.5 Flash benefit from shared history, but not Sonnet 4.5. points for Qwen 3.5 Flash and 6.3 points for GPT-5 Mini (averaged across M/C ratios; view at source ↗

**Figure 5.** Figure 5: Rate at which an early decision is reached prior to the 5-turn conversation limit across rounds by model type, partner, and project conditions. 5.3 Strategy taxonomy We classify negotiation behaviors using rule-based extraction (Appendix E) view at source ↗

**Figure 6.** Figure 6: Failure mode breakdown for suboptimal rounds by compatibility ratio. No Prior Context captures round-1 suboptimality before any shared history is established; Failed Improvement captures rounds 2+ where the allocation changed but remained suboptimal. These two buckets together account for the majority of failures across all conditions. Exploratory LLM-assisted annotations further decompose these buckets (§… view at source ↗

**Figure 7.** Figure 7: Proposer amnesia (game dbd45fed, Round 3). Agent B proposes “I take stone×2, wood×8; you take stone×8, wood×2.” Agent A agrees. Agent B’s thinking trace at decision time contains no reference to this agreement and submits wood×10. 38 view at source ↗

**Figure 7.** Figure 7: Judge-label enrichment in actionable failure regions. Bars show percentage-point differences in calibrated judge-label prevalence between the focal condition and its comparison set. Positive values indicate labels disproportionately associated with the focal failure condition, suggesting candidate targets for future intervention rather than causal effects. Our main quantitative results are based on objecti… view at source ↗

**Figure 8.** Figure 8: Self-commitment abandonment (game 70d67fb2, Round 1). Agent A proposes 6 stone + 2–3 gold, Agent B confirms. Agent A then announces “I’ll take 10 stone” and submits stone×10, causing overdraw (joint stone = 14 vs. supply = 10). 39 view at source ↗

**Figure 9.** Figure 9: Self-commitment abandonment (game 70d67fb2, Round 1). Agent A proposes 6 stone + 2–3 gold, Agent B confirms. Agent A then announces “I’ll take 10 stone” and submits stone×10, causing overdraw (joint stone = 14 vs. supply = 10). 40 [PITH_FULL_IMAGE:figures/full_fig_p040_9.png] view at source ↗

read the original abstract

Grounding is the collaborative process of establishing mutual belief sufficient for a communicative goal. While static grounding maps language to a shared context, dynamic grounding requires agents to negotiate meaning across turns. Current multi-agent Large Language Model (LLM) benchmarks largely emphasize static, one-shot tasks, overlooking whether agents can repair grounding breakdowns through interaction. We introduce an iterated multi-turn negotiation game where two agents allocate shared resources to private projects with verifiable jointly optimal outcomes. Although individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across models. We identify four failure modes: (1) loss of shared interaction history, (2) stubborn anchoring to early proposals, (3) defaulting to equal splits over reward-maximizing coordination, and (4) referential binding errors across turns. Our baselines show that the coordination gap is not explained by individual reasoning limits or insufficient information exchange alone. Instead, the bottleneck lies in dynamic grounding: joint plan formation, commitment, and execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's iterated negotiation benchmark usefully flags four dynamic grounding failure modes in LLM dyads, but the baselines do not cleanly separate those from general multi-turn context tracking problems.

read the letter

The core observation is that LLM pairs fail to reach jointly optimal resource allocations in this multi-turn game even though single agents solve it easily in isolation. The work documents four recurring breakdowns: agents drop shared history, anchor on early offers, default to equal splits instead of coordinated ones, and mix up references across turns. That pattern is the main new thing here, and it moves the field past one-shot static tasks toward something that actually requires ongoing repair of mutual understanding. The setup itself is clean and the failure modes are described plainly enough to be usable for follow-up work on prompting or memory mechanisms. Credit for that. The soft spot is the causal claim. The baselines compare individual Pareto identification to dyad play, but the dyad version adds turn-taking, proposal history, and progressive context length that the isolation condition never tests. Without controls that hold context management constant while varying only grounding demands, it is hard to know whether the gap is specific to dynamic grounding or just another instance of LLMs struggling with long interactions. The abstract gives no numbers, error bars, or protocol details, so the size and consistency of the effect stay hard to judge from the summary alone. This is worth reading for anyone building multi-agent LLM systems that need sustained coordination. The benchmark idea is solid and the observations are concrete, so it deserves a serious referee even if the interpretation of the results needs tightening with better controls.

Referee Report

2 major / 2 minor

Summary. The paper introduces an iterated multi-turn negotiation game where two LLM agents allocate shared resources to private projects with verifiable jointly optimal (Pareto) outcomes. It claims that individual agents can identify these optima in isolation, but agent dyads consistently fail across models due to breakdowns in dynamic grounding (joint plan formation, commitment, and execution). Four specific failure modes are identified: loss of shared interaction history, stubborn anchoring to early proposals, defaulting to equal splits, and referential binding errors across turns. Baselines are presented to argue that the coordination gap cannot be explained by individual reasoning limits or insufficient information exchange alone.

Significance. If the empirical results hold under tighter controls, the work would highlight a key limitation in multi-agent LLM systems: the difficulty of maintaining and repairing shared context over multiple turns. This has implications for designing agents capable of sustained collaboration, negotiation, and coordination, moving beyond one-shot static benchmarks. The concrete failure modes offer actionable insights for future model improvements or prompting strategies.

major comments (2)

[Baselines] Baselines section: the individual Pareto-identification baseline is described as an isolation task, while dyad play requires maintaining and repairing shared state across multiple turns. This leaves untested confounds such as progressive context dilution, loss of proposal history, or increased cognitive load from turn-taking, so the claim that the gap is specifically due to dynamic grounding (rather than general multi-turn LLM limitations) is not yet isolated.
[Abstract and §3] Abstract and §3 (Experimental Results): claims that dyads 'consistently fail' and that baselines 'show' the coordination gap is not explained by individual limits rest on unshown quantitative results, error bars, statistical tests, or detailed experimental protocol. Without these, the magnitude, reliability, and replicability of the central empirical observation cannot be assessed.

minor comments (2)

[Failure Modes] The four failure modes are enumerated clearly but would benefit from concrete dialogue excerpts or prevalence statistics from the runs to illustrate each mode.
[Introduction] Early definitions of 'static grounding' versus 'dynamic grounding' could be sharpened with a brief formal distinction or reference to prior literature on grounding in dialogue.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We have carefully considered the major comments and provide point-by-point responses below. We believe the suggested revisions will significantly improve the clarity and rigor of our empirical claims.

read point-by-point responses

Referee: [Baselines] Baselines section: the individual Pareto-identification baseline is described as an isolation task, while dyad play requires maintaining and repairing shared state across multiple turns. This leaves untested confounds such as progressive context dilution, loss of proposal history, or increased cognitive load from turn-taking, so the claim that the gap is specifically due to dynamic grounding (rather than general multi-turn LLM limitations) is not yet isolated.

Authors: We acknowledge the validity of this concern. Our current individual baseline is indeed a single-turn isolation task, which does not fully capture the multi-turn dynamics present in the dyad setting. To better isolate the dynamic grounding failures, we will add a new baseline in the revised manuscript where a single LLM agent is prompted to simulate the entire multi-turn negotiation process internally, maintaining its own history. This will help control for general multi-turn limitations such as context dilution. We will report the results of this baseline alongside the existing ones in Section 4. This constitutes a partial revision as it strengthens the isolation but may require further experiments in future work. revision: partial
Referee: [Abstract and §3] Abstract and §3 (Experimental Results): claims that dyads 'consistently fail' and that baselines 'show' the coordination gap is not explained by individual limits rest on unshown quantitative results, error bars, statistical tests, or detailed experimental protocol. Without these, the magnitude, reliability, and replicability of the central empirical observation cannot be assessed.

Authors: We agree that the main text should present the key quantitative evidence more prominently. The detailed results, including success rates for dyads vs. baselines across models (with standard errors from 50 runs per condition), statistical tests (paired t-tests showing p < 0.01 for the coordination gap), and the full experimental protocol (including model versions, temperature settings, prompt templates, and game parameters) are currently in the appendix. In the revision, we will add a table summarizing these metrics to the main Experimental Results section (§3) and expand the protocol description in the main text. This will allow readers to assess the claims directly. We will also include the data and code for replicability. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical failure-mode analysis with independent baselines

full rationale

The paper reports experimental results on LLM dyads in an iterated negotiation task, documenting four failure modes and using baselines to rule out individual reasoning limits and insufficient information exchange. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps exist. The central claim (coordination gap due to dynamic grounding) rests on direct comparisons between isolated-agent performance and dyad performance, which are falsifiable via the reported runs and not reducible to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two unverified domain assumptions: that isolated agents can solve the Pareto-optimal allocation and that the game possesses verifiable jointly optimal outcomes. No free parameters or invented entities are introduced.

axioms (2)

domain assumption Individual agents can identify Pareto-optimal allocations in isolation
Explicitly stated as a contrast to the dyad failures.
domain assumption The negotiation game has verifiable jointly optimal outcomes
Required for the claim that dyads fail to reach them.

pith-pipeline@v0.9.0 · 5477 in / 1232 out tokens · 46932 ms · 2026-05-14T21:11:39.190275+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

[1]

Journal of Personality and Social Psychology , volume=

Pooling of Unshared Information in Group Decision Making: Biased Information Sampling During Discussion , author=. Journal of Personality and Social Psychology , volume=

work page
[2]

Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent

Li, Yuxuan and Naito, Aoi and Shirado, Hirokazu , journal=. Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent

work page
[3]

International Conference on Learning Representations , year=

Evaluating Language Model Agency through Negotiations , author=. International Conference on Learning Representations , year=

work page
[4]

Eisenstein, Jacob and Huot, Fantine and Fisch, Adam and Berant, Jonathan and Lapata, Mirella , journal=

work page
[5]

Haris Aziz and Bart de Keijzer

Abdelnabi, Sahar and Gomaa, Amr and Sivaprasad, Sarath and Sch. Cooperation, Competition, and Maliciousness:. arXiv preprint arXiv:2309.17234 , year=

work page arXiv
[6]

arXiv preprint arXiv:2601.13295 , year=

CooperBench: Why Coding Agents Cannot be Your Teammates Yet , author=. arXiv preprint arXiv:2601.13295 , year=

work page arXiv
[7]

How Well Can

Bianchi, Federico and Chia, Patrick John and Yuksekgonul, Mert and Tagliabue, Jacopo and Jurafsky, Dan and Zou, James , journal=. How Well Can

work page
[8]

arXiv preprint arXiv:2406.18872 , year=

Efficacy of Language Model Self-Play in Non-Zero-Sum Games , author=. arXiv preprint arXiv:2406.18872 , year=

work page arXiv
[9]

Grounding `Grounding' in

Chandu, Khyathi Raghavi and Bisk, Yonatan and Black, Alan W , booktitle=. Grounding `Grounding' in

work page
[10]

Proceedings of the LUHME , year=

Building Common Ground in Dialogue: A Survey , author=. Proceedings of the LUHME , year=

work page
[11]

and Manning, Benjamin S

Qian, Crystal and Zhu, Kehang and Horton, John J. and Manning, Benjamin S. and Tsai, Vivian and Wexler, James , journal=. Strategic Tradeoffs Between Humans and

work page
[12]

Communication Enables Cooperation in

Madmoun, Hachem , journal=. Communication Enables Cooperation in

work page
[13]

Game-theoretic

Hua, Wenyue and Liu, Ollie and Li, Lingyao and Amayuelas, Alfonso and Chen, Julie and Jiang, Lucas and Jin, Mingyu and Fan, Lizhou and Sun, Fei and Wang, William and Wang, Xintong and Zhang, Yongfeng , journal=. Game-theoretic

work page
[14]

Evolving General Cooperation with a

Kleiman-Weiner, Max and Vient. Evolving General Cooperation with a. Proceedings of the National Academy of Sciences , year=

work page
[15]

Perspectives on Socially Shared Cognition , pages=

Grounding in Communication , author=. Perspectives on Socially Shared Cognition , pages=. 1991 , publisher=

work page 1991
[16]

Linguistics and Philosophy , volume=

Common Ground , author=. Linguistics and Philosophy , volume=

work page
[17]

Science , volume=

Judgment under Uncertainty: Heuristics and Biases , author=. Science , volume=

work page
[18]

Cognition , volume=

Referring as a Collaborative Process , author=. Cognition , volume=

work page
[19]

A Computational Theory of Grounding in Natural Dialogue , author=

work page
[20]

A Strategy of Win-Stay, Lose-Shift that Outperforms Tit-for-Tat in the

Nowak, Martin and Sigmund, Karl , journal=. A Strategy of Win-Stay, Lose-Shift that Outperforms Tit-for-Tat in the

work page
[21]

Findings of the Association for Computational Linguistics: ACL 2023 , year=

Discovering Language Model Behaviors with Model-Written Evaluations , author=. Findings of the Association for Computational Linguistics: ACL 2023 , year=

work page 2023
[22]

International Conference on Learning Representations , year=

Towards Understanding Sycophancy in Language Models , author=. International Conference on Learning Representations , year=

work page
[23]

How Does Cognitive Bias Affect Large Language Models?

Takenami, Yoshiki and Huang, Yin Jou and Murawaki, Yugo and Chu, Chenhui , booktitle=. How Does Cognitive Bias Affect Large Language Models?

work page
[24]

arXiv preprint arXiv:2406.04692 , year=

Mixture-of-Agents Enhances Large Language Model Capabilities , author=. arXiv preprint arXiv:2406.04692 , year=

work page arXiv
[25]

, title =

Clark, Herbert H. , title =

work page
[26]

arXiv preprint arXiv:2510.06182 , year=

Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context , author=. arXiv preprint arXiv:2510.06182 , year=

work page arXiv
[27]

Psychological Review , volume=

From partners to populations: A hierarchical Bayesian account of coordination and convention , author=. Psychological Review , volume=. 2023 , publisher=

work page 2023
[28]

, title =

Bratman, Michael E. , title =. The Philosophical Review , year =

work page
[29]

http://www.nber.org/papers/w34919

Araujo, Douglas KG and Uhlig, Harald. How does AI Distribute the pie? Large Language Models and the Ultimatum Game. 2026. doi:10.3386/w34919 , URL = "http://www.nber.org/papers/w34919", abstract =

work page doi:10.3386/w34919 2026