pith. machine review for the scientific record.

arxiv: 2605.14537 · v1 · submitted 2026-05-14 · 💻 cs.AI

Recognition: no theorem link

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:50 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-agent benchmark · LLM evaluation · strategic reasoning · bidding and bargaining · bluffing · imperfect information · resource allocation · agent failure modes

The pith

In the Cattle Trade benchmark, strategic coherence (spending efficiency and adaptive bidding) predicts final rank better than spending volume.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cattle Trade, a benchmark game of 50 to 60 turns that combines auctions, hidden trade offers, bargaining, bluffing, and resource allocation to test LLMs as agents in a competitive multi-agent environment. It evaluates seven language models and three code agents over 242 games to determine which behaviors lead to higher performance. The key finding is that strategic coherence, particularly efficient use of resources and bids adapted to the phase of the game, is more strongly associated with final rank than total spending or any single subskill. Two simple heuristic agents outperform most of the tested LLMs, and the logs surface recurring LLM failure modes such as overbidding and weak adaptation to opponents. The result argues for benchmarks that assess how well agents combine multiple skills under conflicting incentives rather than testing each skill in isolation.

Core claim

Cattle Trade integrates auctions, trade challenges, bargaining, and bluffing into a single long-horizon multi-agent game with imperfect information and resource constraints. Evaluations across seven LLMs and three deterministic code agents in 242 games show that spending efficiency, resource discipline, and phase-adaptive bidding correlate more strongly with rank than spending volume or individual subskills. Two heuristic code agents rank higher than most LLMs, while behavioural analysis reveals LLM failure modes including overbidding, self-bidding, bankrupt trade challenge initiation, and limited opponent-state adaptation.
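
Read literally, the coherence metrics named above can be computed directly from per-agent logs. A minimal sketch in Python, following the definitions given in the Figure 6 caption below; the log schema and field names are assumptions, not the paper's actual code:

    import math

    def strategic_coherence(log):
        """Three of the Figure 6 axes from one agent's per-game log.

        `log` is a hypothetical dict of aggregates; the paper's real
        log schema may differ.
        """
        # Spending efficiency: points scored per coin of gross outflow.
        efficiency = log["score"] / max(log["coins_spent"], 1)

        # Auction discipline: 1 - (overbid rate + self-bid rate) / 2,
        # per the Figure 6 caption.
        discipline = 1.0 - 0.5 * (log["overbid_rate"] + log["self_bid_rate"])

        # Phase timing: log of late-game over early-game bid
        # aggressiveness; > 0 means the agent bids harder late.
        phase_timing = math.log(
            log["late_bid_aggressiveness"] / log["early_bid_aggressiveness"])

        return {"spending_efficiency": efficiency,
                "auction_discipline": discipline,
                "phase_timing": phase_timing}

Figure 6 then percentile-normalises each such axis across the ten agents, so the radar plots compare relative rather than absolute performance.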

What carries the argument

The integrated Cattle Trade game mechanics that require agents to deploy bidding, bargaining, bluffing, and resource management jointly over many turns in an adversarial setting.

Load-bearing premise

The assumption that results from this specific Cattle Trade game design reflect broader agentic competence in strategic reasoning under imperfect information instead of being due to the particular game rules or turn structure.

What would settle it

Testing the same agents in a modified version of the game with different resource mechanics or fewer turns and observing whether the ranking by strategic coherence metrics stays the same.
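
A minimal version of that check, assuming a tournament harness run_tournament(agents, rules) that returns mean final rank per agent (the harness API is hypothetical, not the paper's): rerun the same agent pool under a rule variant and test whether the ordering survives, for instance with Kendall's τ.

    from scipy.stats import kendalltau

    # run_tournament is an assumed harness entry point, not the paper's API.
    baseline = run_tournament(agents, rules={"turn_range": (50, 60)})
    shortened = run_tournament(agents, rules={"turn_range": (25, 30)})

    order = sorted(baseline)  # fix a common agent order
    tau, p = kendalltau([baseline[a] for a in order],
                        [shortened[a] for a in order])
    print(f"rank stability under shorter games: tau={tau:+.2f} (p={p:.3f})")

A τ near 1 would support the claim that coherence metrics track something rule-independent; a τ near 0 would suggest the rankings are artifacts of the specific mechanics.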

Figures

Figures reproduced from arXiv: 2605.14537 by Clemens Müller, Robert Müller.

Figure 1
Figure 1. (a) Turn structure and hidden-offer trade challenge. Card counts are public; offer values are hidden, enabling 0-value card bluffs. (b) Ranking over 98 canonical games sorted by TrueSkill µ; median ± std in the right column. TrackerAgent and SetRaceAgent beat six and five of seven LLMs, respectively. Only Gemini 3 Flash clea… view at source ↗
Figure 2
Figure 2. Tournament policy dynamics (98 canonical games: 70 pure-LLM + 28 mixed comp1). Agents ordered by TrueSkill posterior mean µ; error bars on the top-left panel show ±3σ (99.7% interval). Top row: TrueSkill, win rate, and per-agent mean wealth across turns. Bottom row: per-agent mean quartets completed and mean score across turns (each line is the mean over ∼50 games for that agent; shaded band is the bootstr… view at source ↗
Figure 3
Figure 3. (a) LLM robustness across 7 code-agent compositions (exp2 all7, 168 mixed games, 4 games per (LLM, composition) cell; Sonnet’s 4 comp1 games omitted due to partial coverage). Axes C1–C7 are compositions; values are per-composition win-rate percentiles across the LLMs. Outward = beats more peers on that mix; large convex polygon = robust across mixes, spiky polygon = composition-sensitive. (b) Economic pro… view at source ↗
Figure 4
Figure 4. Engine-level metrics (98 canonical games, 7 LLMs + 3 code agents): auction participation, … view at source ↗
Figure 5
Figure 5. Governance-level analysis (98 canonical games, 7 LLMs + 3 code agents): seat-position … view at source ↗
Figure 6
Figure 6. Strategic-profile radar across six orthogonal axes (98 canonical games, all 10 agents). Each axis is percentile-normalised across the ten agents, so outward means better: competitive strength (TrueSkill µ), spending efficiency (points per coin), auction discipline (1 − ½(overbid + self-bid)), TC proficiency (TC challenger win rate), phase timing (log late/early bid-aggressiveness ratio), and quartet thr… view at source ↗
Figure 7
Figure 7. Strategic-profile radar per agent (98 canonical games). Same six axes as Figure 6; each … view at source ↗
Figure 8
Figure 8. Gross-outflow cost per quartet per animal (98 canonical games). Rows: 7 LLMs then 3 code agents; columns: 10 animal types in ascending value order. Cell value: total coins spent by that agent on that animal divided by quartets completed; n: quartets completed (cells with n=0 are blanked, since the ratio is undefined). This is a first-order view; it does not account for TC inflows or the multiplicative scor… view at source ↗
Figure 9
Figure 9. Explicit cost decomposition per animal (98 canonical games). … view at source ↗
Figure 10
Figure 10. Economic detail (98 canonical games). (a) Per-game capital efficiency distribution, η = score/gross outflow. (b) Pooled per-TC net coin ∆ per agent. Mass right of zero = agent pockets money on average; mass left = agent pays to win animals in TCs. The main-body scatter (Figure 3b) aggregates these two marginals into per-agent medians with IQR. Token usage per game varies roughly 20×: G3-F generates about … view at source ↗
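
The TrueSkill ordering behind Figures 1 and 2 can be reproduced in spirit with the open-source trueskill package. A sketch under assumed inputs (each logged game reduced to a list of (agent, finishing rank) pairs; environment parameters are the package defaults, not necessarily the paper's):

    import trueskill

    env = trueskill.TrueSkill(draw_probability=0.0)  # assume no drawn games
    ratings = {name: env.create_rating() for name in agent_names}

    for game in games:  # game: list of (agent_name, finishing_rank), 0 = winner
        groups = [(ratings[name],) for name, _ in game]
        updated = env.rate(groups, ranks=[rank for _, rank in game])
        for (name, _), (new_rating,) in zip(game, updated):
            ratings[name] = new_rating

    # Order agents by posterior mean mu, as in Figure 2's left panel.
    for name, r in sorted(ratings.items(), key=lambda kv: -kv[1].mu):
        print(f"{name}: mu={r.mu:.1f}, sigma={r.sigma:.2f}")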
read the original abstract

We introduce Cattle Trade, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50–60 turns. Unlike prior agent benchmarks that test these abilities in isolation, Cattle Trade evaluates whether agents integrate them across a competitive, multi-agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi-agent environments with conflicting incentives, uncertainty, and economic dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Cattle Trade, a multi-agent benchmark for evaluating LLMs in integrated strategic reasoning under imperfect information. The game combines auctions, hidden-offer trade challenges, bargaining, bluffing, and resource allocation over 50-60 turns. Across 242 games with seven LLMs and three deterministic code agents, the paper reports that metrics of strategic coherence (spending efficiency, resource discipline, phase-adaptive bidding) correlate more strongly with final rank than spending volume or isolated subskills, that two heuristic code agents outperform most LLMs, and that LLMs exhibit recurring failure modes such as overbidding and weak opponent adaptation.

Significance. If the central associations hold under additional statistical controls, the benchmark offers a valuable integrated testbed for multi-agent economic reasoning that goes beyond isolated capability evaluations. The provision of full behavioral logs (bids, offers, card selections) is a strength that enables post-hoc analysis of failure modes. The finding that simple heuristics can outperform current LLMs in this setting highlights a concrete gap in current agentic systems.

major comments (3)
  1. [Results] Results section (behavioral metrics analysis): the claim that spending efficiency, resource discipline, and phase-adaptive bidding are associated with rank more strongly than spending volume lacks reported correlation coefficients, p-values, confidence intervals, or controls for multiple comparisons and game-to-game variance, undermining the strength of the central empirical claim.
  2. [Evaluation] Evaluation setup (242 games across 10 agents): no details are provided on how game parameters (card distributions, payoff matrices, turn limits) were selected, nor are sensitivity tests or ablations on rule variations reported; this leaves open the possibility that observed rank associations and code-agent superiority are artifacts of the specific 50-60 turn mechanics rather than generalizable strategic competence.
  3. [Abstract and Evaluation] Abstract and §4 (agent comparisons): the statement that two heuristic code agents outperform most tested LLMs is presented without per-agent win-rate tables, variance across repeated matches, or statistical tests comparing LLM vs. heuristic performance distributions.
minor comments (3)
  1. [Methods] Clarify the precise definitions and formulas used to compute spending efficiency and resource discipline in the methods or appendix.
  2. [Evaluation] Add a table summarizing the seven LLMs (model names, sizes, temperatures) and the three code agents' exact heuristics.
  3. [Results] Ensure all figures showing behavioral traces include axis labels, legends, and sample sizes.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where the empirical presentation can be strengthened with additional quantitative detail and robustness checks. We address each major comment below and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Results] Results section (behavioral metrics analysis): the claim that spending efficiency, resource discipline, and phase-adaptive bidding are associated with rank more strongly than spending volume lacks reported correlation coefficients, p-values, confidence intervals, or controls for multiple comparisons and game-to-game variance, undermining the strength of the central empirical claim.

    Authors: We agree that the current presentation of the association between strategic coherence metrics and rank is qualitative and would benefit from explicit statistics. In the revised manuscript we will add Pearson and Spearman correlation coefficients (with 95% confidence intervals) between each behavioral metric and final rank, report p-values, apply Bonferroni or FDR correction for multiple comparisons, and include a mixed-effects regression controlling for game-to-game variance as a random effect. These additions will be placed in a new subsection of the Results. revision: yes

  2. Referee: [Evaluation] Evaluation setup (242 games across 10 agents): no details are provided on how game parameters (card distributions, payoff matrices, turn limits) were selected, nor are sensitivity tests or ablations on rule variations reported; this leaves open the possibility that observed rank associations and code-agent superiority are artifacts of the specific 50-60 turn mechanics rather than generalizable strategic competence.

    Authors: The game parameters were selected to create a balanced multi-stage economic environment that integrates auctions, bargaining, and resource constraints while remaining computationally tractable for LLM agents; card distributions follow a uniform random draw with fixed total value, payoff matrices are derived from standard Vickrey and Nash bargaining solutions, and the 50-60 turn horizon was chosen to allow multiple phases of play without excessive context length. In the revision we will add an explicit subsection detailing these choices with references to the underlying economic models. We will also report sensitivity results for two key variations (turn limit reduced to 30 and payoff scaling factor of 0.5) to demonstrate that the relative ordering of agents is robust. revision: yes

  3. Referee: [Abstract and Evaluation] Abstract and §4 (agent comparisons): the statement that two heuristic code agents outperform most tested LLMs is presented without per-agent win-rate tables, variance across repeated matches, or statistical tests comparing LLM vs. heuristic performance distributions.

    Authors: We acknowledge that the current text relies on aggregate statements without granular tables or inferential statistics. The revised version will include a new table in §4 showing per-agent mean rank, win rate, and standard deviation across the 242 games, broken down by LLM versus heuristic category. We will add Mann-Whitney U tests (with effect sizes) comparing the full performance distributions of the two best heuristics against the LLM group, together with a note on the number of repeated matches per agent pair. revision: yes
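
For concreteness, the statistics promised in responses 1 and 3 could look roughly as follows. This is a sketch of standard scipy/statsmodels usage, not the authors' code; df (a per-agent, per-game metric table) and ranks_of are assumed stand-ins for the paper's logged data.

    from scipy.stats import spearmanr, mannwhitneyu
    from statsmodels.stats.multitest import multipletests

    # Response 1: rank correlations with FDR control across the metric family.
    metrics = ["spending_efficiency", "resource_discipline",
               "phase_adaptive_bidding", "spending_volume"]
    rhos, pvals = [], []
    for m in metrics:
        rho, p = spearmanr(df[m], df["final_rank"])
        rhos.append(rho)
        pvals.append(p)
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    for m, rho, p, ok in zip(metrics, rhos, p_adj, reject):
        print(f"{m}: rho={rho:+.2f}, p_fdr={p:.4f}, significant={ok}")

    # Response 3: heuristics vs. LLMs with an effect size.
    heur = ranks_of(["TrackerAgent", "SetRaceAgent"])  # lower rank = better
    llms = ranks_of(llm_agent_names)
    u, p = mannwhitneyu(heur, llms, alternative="less")
    r_rb = 1 - 2 * u / (len(heur) * len(llms))  # rank-biserial effect size
    print(f"Mann-Whitney U={u:.0f}, p={p:.4f}, rank-biserial r={r_rb:+.2f}")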

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct observational reporting

full rationale

The paper introduces a new multi-agent game benchmark and reports empirical results from running 242 games with 10 agents. No mathematical derivations, equations, parameter fitting, or predictive claims appear in the text. Behavioral metrics are computed directly from logged actions and correlated with observed ranks; these associations are presented as findings rather than as outputs of any model that was fitted to the same data. No self-citations are used to justify uniqueness theorems or ansatzes. The analysis is therefore self-contained and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a new benchmark without new free parameters, mathematical axioms beyond standard domain assumptions in AI evaluation, or invented entities; the contribution rests on the design of the game and the empirical comparison.

axioms (1)
  • domain assumption Multi-agent economic games with imperfect information can reveal integrated strategic capabilities in AI agents.
    The benchmark design and interpretation of results rest on this premise to justify evaluating joint deployment of skills.

pith-pipeline@v0.9.0 · 5525 in / 1395 out tokens · 53165 ms · 2026-05-15T01:50:27.270880+00:00 · methodology

discussion (0)

