pith. machine review for the scientific record.

arxiv: 2605.12411 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI · cs.CL · cs.MA

Recognition: no theorem link

Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

Eilam Shapira, Moshe Tennenholtz, Roi Reichart

Pith reviewed 2026-05-13 05:05 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.MA
keywords AI agent prediction · text-tabular modeling · LLM-as-Observer · bargaining games · negotiation · few-shot adaptation · counterpart modeling · decision forecasting

The pith

A text-tabular model with LLM hidden states as features predicts unfamiliar AI agents' decisions from a small number of past interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether one AI agent can forecast the next move of an unfamiliar counterpart agent after seeing only a handful of prior exchanges. It casts the task as target-adaptive text-tabular prediction: each decision point becomes a table row that merges game state, offer history, and dialogue text, while earlier complete games of the same target serve as labeled examples in the prompt. A tabular foundation model is trained on agents driven by frontier LLMs and evaluated on held-out scaffolded agents; the version that adds an LLM-as-Observer representation, which extracts hidden states rather than generated answers, beats both direct prompting baselines and simpler feature sets. If the result holds, it shows that decision-relevant signals exist in LLM internal states that are not surfaced by ordinary few-shot prompting.

Core claim

Formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation to specific agents. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game-plus-text baselines. Within the tabular model, Observer features improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14 percent at K=16. These gains demonstrate that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.

What carries the argument

LLM-as-Observer: a small frozen LLM that reads the current decision-time state and dialogue, discards its generated answer, and supplies its hidden state as an additional decision-oriented feature inside a tabular foundation model that also encodes game-state features and LLM text representations.
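The encoder-not-predictor pattern can be sketched as follows. The "frozen LLM" here is a deterministic hash-based stand-in, not a real model; a real implementation would pool transformer activations instead. What the sketch preserves is the key move: the observer's generated answer is discarded and only its hidden state survives as features.

```python
import hashlib

def frozen_observer(text: str, dim: int = 8) -> tuple[str, list[float]]:
    """Stand-in for a small frozen LLM: returns (generated_answer, hidden_state).
    The 'hidden state' is a deterministic hash-derived vector; in practice it
    would be a pooled activation from a frozen transformer."""
    digest = hashlib.sha256(text.encode()).digest()
    hidden = [b / 255.0 for b in digest[:dim]]  # pooled "activation" vector
    answer = "accept"  # whatever the observer would have generated
    return answer, hidden

def observer_features(state_text: str, dialogue: str) -> list[float]:
    """Discard the observer's answer; keep only its hidden state as features."""
    _answer, hidden = frozen_observer(state_text + "\n" + dialogue)
    return hidden

# The observer vector is concatenated with structured game-state features
# to form one row of the tabular model's input.
row_features = [1.0, 5000.0]  # e.g., round index, current offer
row_features += observer_features("round 1, offer $5,000", "Bob: I can do $5,000.")
```

The design choice the paper tests is exactly this split: the same frozen LLM used here as an encoder performs worse when asked to predict directly.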

If this is right

  • Observer features add predictive value beyond structured game state and raw text representations.
  • The same tabular architecture works for both discrete response prediction and continuous offer prediction.
  • Performance improves as the number of adaptation examples K increases up to at least 16.
  • Direct prompting of an LLM to act as a predictor is inferior to embedding its hidden states into a tabular model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be applied to predict actions of commercial AI assistants in procurement or customer-service settings if their interaction logs are available.
  • If the same hidden-state signals appear in human decision data, the framework might transfer to modeling human counterparts from short interaction histories.
  • Agents could be designed to obscure their internal states specifically to defeat this style of observer-based prediction.

Load-bearing premise

Controlled bargaining and negotiation games with scaffolded agents capture the decision-making patterns of real-world AI agents whose prompts, control logic, and rule-based fallbacks remain hidden.

What would settle it

Run the trained model on agents whose prompts and fallback rules are not constructed via the same scaffolding procedure used in the study, then measure whether AUC and offer-error gains disappear.
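Such a check could be scored with the paper's own metrics. A minimal sketch, assuming nothing beyond the definitions of the metrics themselves: ROC AUC via the Mann-Whitney pair count for response prediction, mean absolute error for offers, and a comparison between scaffolded and out-of-scaffold test agents (the numbers below are hypothetical).

```python
def auc(labels, scores):
    """ROC AUC via the Mann-Whitney U count (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_abs_error(truth, pred):
    """Mean absolute offer-prediction error."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

# Hypothetical scores: same model on scaffolded vs. out-of-scaffold agents.
scaffold_auc = auc([1, 0, 1, 0], [0.9, 0.2, 0.8, 0.4])
outside_auc = auc([1, 0, 1, 0], [0.7, 0.6, 0.5, 0.4])
gap = scaffold_auc - outside_auc  # a large gap would suggest scaffold-specific gains
```

If the Observer advantage over the game+text baseline shrinks on the out-of-scaffold agents while the baseline holds steady, the load-bearing premise above is in trouble.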

Figures

Figures reproduced from arXiv: 2605.12411 by Eilam Shapira, Moshe Tennenholtz, Roi Reichart.

Figure 1. Alice (seller) and Bob (buyer) negotiate via free-text offers. Following Bob's $5,000 …
Figure 2. Three approaches for predicting decisions of a target agent.
Figure 3. Observer gain over the game+text features baseline by relative depth. Observer gains are …
Figure 4. Schematic of the multimodal tabular row at a single decision point. The row concatenates …
Original abstract

AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart's LLM, prompts, control logic, and rule-based fallbacks are hidden, while each decision can have monetary consequences. We ask whether an agent can predict an unfamiliar counterpart's next decision from a few interactions. To avoid real-world logging confounds, we study this problem in controlled bargaining and negotiation games, formulating it as target-adaptive text-tabular prediction: each decision point is a table row combining structured game state, offer history, and dialogue, while $K$ previous games of the same target agent, i.e., the counterpart being modeled, are provided in the prompt as labeled adaptation examples. Our model is built on a tabular foundation model that represents rows using game-state features and LLM-based text representations, and adds LLM-as-Observer as an additional representation: a small frozen LLM reads the decision-time state and dialogue; its answer is discarded, and its hidden state becomes a decision-oriented feature, making the LLM an encoder rather than a direct few-shot predictor. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game+text features baselines. Within this tabular model, Observer features contribute beyond the other feature schemes: at $K=16$, they improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14%. These results show that formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation, and that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes formulating prediction of unfamiliar AI agents' decisions in bargaining and negotiation games as a target-adaptive text-tabular task. A tabular foundation model is trained on 13 frontier-LLM agents using game-state features, text representations, and LLM-as-Observer hidden states (with K prior games as labeled adaptation examples in the prompt); it is evaluated on 91 held-out scaffolded agents. The full model outperforms direct LLM-as-Predictor prompting and game+text baselines, with Observer features adding ~4 AUC points to response prediction and reducing offer-prediction error by 14% at K=16.

Significance. If the empirical gains hold, the work would contribute to multi-agent systems by showing that hidden LLM representations can surface decision-relevant signals beyond direct prompting or structured features alone. Credit is due for the clean separation of the frozen LLM as an encoder (rather than few-shot predictor), the target-adaptive formulation, and the held-out agent split that avoids obvious circularity.

major comments (2)
  1. [Abstract] The reported gains (approximately 4 AUC points and a 14% error reduction at K=16) are presented without statistical significance tests, error bars, details of the 13+91 agent split, data-exclusion criteria, or controls for confounds. These omissions make it impossible to assess whether the central performance claims are robust.
  2. [Abstract and Evaluation section] The quantitative results rest on testing against 91 scaffolded agents whose explicit control logic is known by construction. This setup may allow Observer features to exploit deterministic regularities that would not exist for frontier LLMs whose prompts, internal states, and rule-based fallbacks remain hidden, which is the motivating use case. The paper acknowledges the controlled setting "to avoid real-world logging confounds," yet presents the gains as evidence for the general problem; this mismatch is load-bearing for the claimed applicability.
minor comments (2)
  1. [Methods] Clarify in the methods how the K adaptation examples are exactly formatted and whether any leakage from the target agent's identity occurs through the tabular features.
  2. [Experimental Setup] Add explicit definitions or a table for all feature schemes (game, text, Observer) and baselines to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The reported gains (approximately 4 AUC points and a 14% error reduction at K=16) are presented without statistical significance tests, error bars, details of the 13+91 agent split, data-exclusion criteria, or controls for confounds. These omissions make it impossible to assess whether the central performance claims are robust.

    Authors: We agree that these details are necessary for assessing robustness. In the revised manuscript, we will add statistical significance tests (e.g., paired tests across multiple random seeds) for the reported AUC and error improvements, include error bars (standard deviation over runs) in all quantitative results, provide a full description of the 13+91 agent split including how the scaffolded agents were generated and any exclusion criteria, and discuss controls for confounds such as game parameterization and agent diversity. These additions will be incorporated into both the abstract and the evaluation section. revision: yes
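The promised paired tests could take several forms; one stdlib-only possibility, not necessarily what the authors will use, is an exact two-sided sign-flip permutation test on per-seed metric differences between the Observer model and the baseline.

```python
from itertools import product

def sign_flip_p_value(diffs):
    """Exact two-sided sign-flip permutation test on paired differences
    (model A minus model B, one difference per random seed). Under the null
    hypothesis of no difference, each sign is equally likely to flip."""
    observed = abs(sum(diffs))
    count = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            count += 1
    return count / total

# Hypothetical per-seed AUC gains of the Observer model over the baseline.
gains = [0.041, 0.038, 0.045, 0.036, 0.040]
p = sign_flip_p_value(gains)  # all gains positive, so p hits its floor of 2/2**5
```

With only five seeds the smallest attainable p-value is 2/32 = 0.0625, which is itself an argument for reporting more seeds than is customary.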

  2. Referee: [Abstract and Evaluation section] The quantitative results rest on testing against 91 scaffolded agents whose explicit control logic is known by construction. This setup may allow Observer features to exploit deterministic regularities that would not exist for frontier LLMs whose prompts, internal states, and rule-based fallbacks remain hidden, which is the motivating use case. The paper acknowledges the controlled setting "to avoid real-world logging confounds," yet presents the gains as evidence for the general problem; this mismatch is load-bearing for the claimed applicability.

    Authors: We acknowledge the gap between the controlled scaffolded setting and fully opaque frontier LLMs. The scaffolded agents were explicitly designed to replicate observed decision patterns from frontier LLMs in the same games, enabling isolation of modeling effects without logging confounds. We will revise the abstract and evaluation section to more precisely frame the results as evidence for the target-adaptive formulation and the value of Observer features within controlled approximations of the motivating scenario. We will also expand the limitations discussion to explicitly note that deterministic regularities may be more exploitable here than in hidden-state cases, and outline future directions such as proxy evaluations on additional LLM variants. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical train/test split on held-out agents with frozen encoder

Full rationale

The paper's core claim is an empirical result: a tabular model trained on interaction data from 13 frontier-LLM agents is evaluated on 91 held-out scaffolded agents, with performance gains attributed to Observer features extracted from a frozen LLM. No equations, derivations, or first-principles steps are presented that reduce predictions to fitted parameters by construction. The adaptation examples are explicitly labeled data from the target agent, the Observer LLM is frozen and not trained on the prediction task, and the test agents are disjoint from training. This setup contains no self-definitional loops, fitted-input-as-prediction, or load-bearing self-citations; the reported AUC and error improvements are standard held-out evaluation metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard supervised learning assumptions plus the domain premise that LLM hidden states encode decision-relevant information beyond surface text; no explicit free parameters or new invented entities are introduced beyond the modeling framework itself.

axioms (2)
  • domain assumption Hidden states from a frozen LLM reading decision-time state and dialogue contain predictive signals for the agent's next action that are not accessible via direct prompting.
    This premise directly motivates the LLM-as-Observer design and the claim that it contributes beyond other features.
  • domain assumption Controlled bargaining and negotiation games with scaffolded agents provide a valid proxy for real-world interactions with hidden LLM prompts and logic.
    Invoked to justify studying the problem in these games while avoiding real-world logging confounds.

pith-pipeline@v0.9.0 · 5629 in / 1423 out tokens · 66564 ms · 2026-05-13T05:05:55.929426+00:00 · methodology


Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 4 internal anchors

  1. [1]

    Cooperation, competition, and maliciousness: LLM-stakeholders interactive negotiation

    Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schönherr, and Mario Fritz. Cooperation, competition, and maliciousness: LLM-stakeholders interactive negotiation. InAdvances in Neural Information Processing Systems: Datasets and Benchmarks Track, volume 37, 2024

  2. [2]

    A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems

    Stefano V Albrecht and Subramanian Ramamoorthy. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. InProceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems, pages 1155–1156, 2013

  3. [3]

    Autonomous agents modelling other agents: A compre- hensive survey and open problems.Artificial Intelligence, 258:66–95, 2018

    Stefano V Albrecht and Peter Stone. Autonomous agents modelling other agents: A compre- hensive survey and open problems.Artificial Intelligence, 258:66–95, 2018

  4. [4]

    Belief and truth in hypothesised behaviours.Artificial Intelligence, 235:63–94, 2016

    Stefano V Albrecht, Jacob W Crandall, and Subramanian Ramamoorthy. Belief and truth in hypothesised behaviours.Artificial Intelligence, 235:63–94, 2016

  5. [5]

    Project Deal: A Claude-run marketplace experiment

    Anthropic. Project Deal: A Claude-run marketplace experiment. Anthropic research blog, 2026. https://www.anthropic.com/features/project-deal

  6. [6]

    TabSTAR: A tabular foundation model for tabular data with text fields

    Alan Arazi, Eilam Shapira, and Roi Reichart. TabSTAR: A tabular foundation model for tabular data with text fields. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=FrXHdcTEzE

  7. [7]

    MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

    Alan Arazi, Eilam Shapira, Shoham Grunblat, Mor Ventura, Elad Hoffer, Gioia Blayer, David Holzmüller, Lennart Purucker, Gaël Varoquaux, Frank Hutter, and Roi Reichart. MulTaBench: Benchmarking multimodal tabular learning with text and image.arXiv preprint arXiv:2605.10616, 2026

  8. [8]

    Tim Baarslag, Mark J. C. Hendrikx, Koen V . Hindriks, and Catholijn M. Jonker. Learning about the opponent in automated bilateral negotiation: A comprehensive survey of opponent modeling techniques.Autonomous Agents and Multi-Agent Systems, 30(5):849–898, 2016. doi: 10.1007/s10458-015-9309-1

  9. [9]

    When ethics and payoffs diverge: LLM agents in morally charged social dilemmas.arXiv preprint arXiv:2505.19212, 2025

    Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Rada Mihalcea, Bernhard Schölkopf, and Zhijing Jin. When ethics and payoffs diverge: LLM agents in morally charged social dilemmas.arXiv preprint arXiv:2505.19212, 2025

  10. [10]

    ElecTwit: A framework for studying persuasion in multi-agent social systems

    Michael Bao. ElecTwit: A framework for studying persuasion in multi-agent social systems. arXiv preprint arXiv:2601.00994, 2026

  11. [11]

    Probing classifiers: Promises, shortcomings, and advances

    Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 48(1):207–219, 2022. doi: 10.1162/coli_a_00422

  12. [12]

    How well can LLMs negotiate? NegotiationArena platform and analysis

    Federico Bianchi, Patrick John Chia, Mert Yuksekgonul, Jacopo Tagliabue, Dan Jurafsky, and James Zou. How well can LLMs negotiate? NegotiationArena platform and analysis. InProceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 3935–3951. PMLR, 2024. 10

  13. [13]

    Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems 33 (NeurIPS 2020), volume 33, pages 1877–1901, 2020

  14. [14]

    Grounding strategic conversation: Using negotiation dialogues to predict trades in a win-lose game

    Anaïs Cadilhac, Nicholas Asher, Farah Benamara, and Alex Lascarides. Grounding strategic conversation: Using negotiation dialogues to predict trades in a win-lose game. InProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 357–368, 2013

  15. [15]

    I want to break free! persuasion and anti-social behavior of LLMs in multi-agent settings with social hierarchy.arXiv preprint arXiv:2410.07109, 2024

    Gian Maria Campedelli, Nicolò Penzo, Massimo Stefan, Roberto Dessì, Marco Guerini, Bruno Lepri, and Jacopo Staiano. I want to break free! persuasion and anti-social behavior of LLMs in multi-agent settings with social hierarchy.arXiv preprint arXiv:2410.07109, 2024

  16. [16]

    Opponent modeling in negotiation dialogues by related data adaptation

    Kushal Chawla, Gale Lucas, Jonathan May, and Jonathan Gratch. Opponent modeling in negotiation dialogues by related data adaptation. InFindings of the Association for Computa- tional Linguistics: NAACL 2022, pages 661–674, Seattle, United States, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-naacl.50

  17. [17]

    PowerNet: Multi-agent deep reinforcement learning for scalable powergrid control.IEEE Transactions on Power Systems, 37(2):1587–1599, 2022

    Dong Chen, Kaian Chen, Zhaojian Li, Tianshu Chu, Rui Yao, Feng Qiu, and Kaixiang Lin. PowerNet: Multi-agent deep reinforcement learning for scalable powergrid control.IEEE Transactions on Power Systems, 37(2):1587–1599, 2022

  18. [18]

    Put your money where your mouth is: Evaluating strategic planning and execution of LLM agents in an auction arena.arXiv preprint arXiv:2310.05746, 2023

    Jiangjie Chen, Siyu Yuan, Rong Ye, Bodhisattwa Prasad Majumder, and Kyle Richardson. Put your money where your mouth is: Evaluating strategic planning and execution of LLM agents in an auction arena.arXiv preprint arXiv:2310.05746, 2023

  19. [19]

    Coehoorn and Nicholas R

    Robert M. Coehoorn and Nicholas R. Jennings. Learning on opponent’s preferences to make effective multi-issue negotiation trade-offs. InProceedings of the 6th International Conference on Electronic Commerce, pages 59–68, 2004

  20. [20]

    What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties

    Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. InProceedings of the 56th Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia, 2018. Associat...

  21. [21]

    Scalable multiagent driving policies for reducing traffic congestion

    Jiaxun Cui, William Macke, Harel Yedidsion, Aastha Goyal, Daniel Urieli, and Peter Stone. Scalable multiagent driving policies for reducing traffic congestion. InProceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2021

  22. [22]

    A survey on in-context learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, Miami, Florida, USA, 2024. Association for Computational Linguistics....

  23. [23]

    Attention graph for multi-robot social navigation with deep reinforcement learning

    Erwan Escudie, Laetitia Matignon, and Jacques Saraydaryan. Attention graph for multi-robot social navigation with deep reinforcement learning. InProceedings of the 23rd International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2024. Extended Abstract

  24. [24]

    Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, and Marius Hobbhahn

    Sara Fish, Yannai A. Gonczarowski, and Ran I. Shorrer. Algorithmic collusion by large language models.arXiv preprint arXiv:2404.00806, 2024

  25. [25]

    Inside-Out: Hidden factual knowledge in LLMs

    Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart. Inside-Out: Hidden factual knowledge in LLMs. In Conference on Language Modeling, 2025

  26. [26]

    A framework for sequential planning in multi-agent settings.Journal of Artificial Intelligence Research, 24:49–79, 2005

    Piotr J Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multi-agent settings.Journal of Artificial Intelligence Research, 24:49–79, 2005

  27. [27]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 11

  28. [28]

    Chawla, Olaf Wiest, and Xiangliang Zhang

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V . Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. InProceedings of the 33rd International Joint Conference on Artificial Intelligence, pages 8048–8057, 2024. doi: 10.24963/ijcai.2024/890

  29. [29]

    Decoupling strategy and generation in negotiation dialogues

    He He, Derek Chen, Anusha Balakrishnan, and Percy Liang. Decoupling strategy and generation in negotiation dialogues. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2333–2343, Brussels, Belgium, 2018. Association for Computational Linguistics

  30. [30]

    Lan- guage of bargaining

    Mourad Heddaya, Solomon Dworkin, Chenhao Tan, Rob V oigt, and Alexander Zentefis. Lan- guage of bargaining. InProceedings of the 61st Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 13161–13185. Association for Computational Linguistics, 2023

  31. [31]

    John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota, 2019. Association for Computation...

  32. [32]

    TabPFN: A transformer that solves small tabular classification problems in a second

    Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

  33. [33]

    Accurate predictions on small data with a tabular foundation model.Nature, 637:319–326, 2025

    Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637:319–326, 2025

  34. [34]

    BERT has more to offer: BERT layers combination yields better sentence embeddings

    MohammadSaleh Hosseini, Munawara Munia, and Latifur Khan. BERT has more to offer: BERT layers combination yields better sentence embeddings. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 15419–15431, 2023

  35. [35]

    LLM economist: Large population models and mechanism design in multi-agent generative simulacra.arXiv preprint arXiv:2507.15815, 2025

    Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, and Chi Jin. LLM economist: Large population models and mechanism design in multi-agent generative simulacra.arXiv preprint arXiv:2507.15815, 2025

  36. [36]

    CARTE: Pretraining and transfer for tabular learning

    Myung Jun Kim, Leo Grinsztajn, and Gael Varoquaux. CARTE: Pretraining and transfer for tabular learning. InProceedings of the 41st International Conference on Machine Learning, pages 23843–23866, 2024

  37. [37]

    LLM embeddings for deep learning on tabular data.arXiv preprint arXiv:2502.11596, 2025

    Boshko Koloski, Andrei Margeloiu, Xiangjian Jiang, Blaž Škrlj, Nikola Simidjievski, and Mateja Jamnik. LLM embeddings for deep learning on tabular data.arXiv preprint arXiv:2502.11596, 2025

  38. [38]

    Are LLMs effective negotiators? systematic evaluation of the multifaceted capabilities of LLMs in negotiation dialogues

    Deuksin Kwon, Emily Weiss, Tara Kulshrestha, Kushal Chawla, Gale Lucas, and Jonathan Gratch. Are LLMs effective negotiators? systematic evaluation of the multifaceted capabilities of LLMs in negotiation dialogues. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 5391–5413, 2024

  39. [39]

    De- tecting winning arguments with large language models and persuasion strategies

    Tiziano Labruna, Arkadiusz Modzelewski, Giorgio Satta, and Giovanni Da San Martino. De- tecting winning arguments with large language models and persuasion strategies. InFindings of the Association for Computational Linguistics: EACL 2026, pages 1888–1915, Rabat, Morocco,

  40. [40]

    doi: 10.18653/v1/2026.findings-eacl.97

    Association for Computational Linguistics. doi: 10.18653/v1/2026.findings-eacl.97

  41. [41]

    Emergence of linguistic communication from referential games with symbolic and pixel input

    Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Emergence of linguistic communication from referential games with symbolic and pixel input. InProceedings of the 6th International Conference on Learning Representations (ICLR), 2018

  42. [42]

    Theory of mind for multi-agent collaboration via large language models

    Huao Li, Yu Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Charles Lewis, and Katia Sycara. Theory of mind for multi-agent collaboration via large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 180–192, 2023. 12

  43. [43]

    Spontaneous giving and calculated greed in language models

    Yuxuan Li and Hirokazu Shirado. Spontaneous giving and calculated greed in language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5271–5286, 2025

  44. [44]

    Communication enables cooperation in LLM agents: A comparison with curriculum-based approaches

    Hachem Madmoun and Salem Lahlou. Communication enables cooperation in LLM agents: A comparison with curriculum-based approaches. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 307–321, Rabat, Morocco, 2026. Association for Computational Linguistics. doi: 10.1865...

  45. [45]

    Albrecht

    Reuth Mirsky, Ignacio Carlucho, Arrasy Rahman, Elliot Fosong, William Macke, Mohan Sridharan, Peter Stone, and Stefano V . Albrecht. A survey of ad hoc teamwork research. In Multi-Agent Systems. EUMAS 2022, volume 13442 ofLecture Notes in Computer Science, pages 275–293. Springer, 2022. doi: 10.1007/978-3-031-20614-6_16

  46. [46]

    Towards benchmarking foundation models for tabular data with text

    Martin Mráz, Breenda Das, Anshul Gupta, Lennart Purucker, and Frank Hutter. Towards benchmarking foundation models for tabular data with text. InFoundation Models for Structured Data Workshop at ICML, 2025

  47. [47]

    Adaptive theory of mind for LLM-based multi-agent coordination

    Chunjiang Mu, Ya Zeng, Qiaosheng Zhang, Kun Shao, Chen Chu, Hao Guo, Danyang Jia, Zhen Wang, and Shuyue Hu. Adaptive theory of mind for LLM-based multi-agent coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29608–29616,

  48. [48]

    doi: 10.1609/aaai.v40i35.40204

  49. [49]

    A survey of opponent modeling in adversarial domains.Journal of Artificial Intelligence Research, 73:277–327, 2022

    Samer B Nashed and Shlomo Zilberstein. A survey of opponent modeling in adversarial domains.Journal of Artificial Intelligence Research, 73:277–327, 2022

  50. [50]

    Learn- ing theory of mind via dynamic traits attribution

    Dung Nguyen, Phuoc Nguyen, Hung Le, Kien Do, Svetha Venkatesh, and Truyen Tran. Learn- ing theory of mind via dynamic traits attribution. InProceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 954–962, 2022

  51. [51]

    Memory- augmented theory of mind network

    Dung Nguyen, Phuoc Nguyen, Hung Le, Kien Do, Svetha Venkatesh, and Truyen Tran. Memory- augmented theory of mind network. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11630–11637, 2023. doi: 10.1609/aaai.v37i10.26374

  52. [52]

    LLMs know more than they show: On the intrinsic representation of LLM hallucinations

    Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. LLMs know more than they show: On the intrinsic representation of LLM hallucinations. InThe Thirteenth International Conference on Learning Representations, 2025

  53. [53]

    Deep contextualized word representations

    Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227–2237, New Orleans, Louisiana, 2018. Associatio...

  54. [54]

    Priyanshu Priya, Rishikant Chigrupaatii, Mauajama Firdaus, and Asif Ekbal. GENTEEL-NEGOTIATOR: LLM-enhanced mixture-of-expert-based reinforcement learning approach for polite negotiation dialogue. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25010–25018, 2025

  55. [55]

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. TabICL: A tabular foundation model for in-context learning on large data. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 50817–50847. PMLR, 2025

  56. [56]

    Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew Botvinick. Machine theory of mind. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4218–4227. PMLR, 2018

  57. [57]

    João G. Ribeiro, Gonçalo Rodrigues, Alberto Sardinha, and Francisco S. Melo. TEAMSTER: Model-based reinforcement learning for ad hoc teamwork. Artificial Intelligence, 324:104013, 2023. doi: 10.1016/j.artint.2023.104013

  59. [59]

    David M. Rothschild, Markus Mobius, Jake M. Hofman, Eleanor W. Dillon, Daniel G. Goldstein, Nicole Immorlica, Sonia Jaffe, Brendan Lucier, Aleksandrs Slivkins, and Matthew Vogel. The agentic economy. arXiv preprint arXiv:2505.15799, 2025

  60. [60]

    Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica, 50(1):97–109, 1982

  61. [61]

    Eilam Shapira, Omer Madmon, Roi Reichart, and Moshe Tennenholtz. Can LLMs replace economic choice prediction labs? The case of language-based persuasion games. arXiv preprint arXiv:2401.17435, 2024

  62. [62]

    Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, and Moshe Tennenholtz. GLEE: A unified framework and benchmark for language-based economic environments. arXiv preprint arXiv:2410.05254, 2024

  63. [63]

    Eilam Shapira, Omer Madmon, Reut Apel, Moshe Tennenholtz, and Roi Reichart. Human choice prediction in language-based persuasion games: Simulation-based off-policy evaluation. Transactions of the Association for Computational Linguistics, 13:980–1006, 2025

  64. [64]

    Eilam Shapira, Moshe Tennenholtz, and Roi Reichart. Alignment makes language models normative, not descriptive. arXiv preprint arXiv:2603.17218, 2026

  65. [65]

    Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, and Alexander J. Smola. Benchmarking multimodal AutoML for tabular data with text fields. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021

  66. [66]

    Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Nikul Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. In Proceedings of the 42nd International Conference on Machine Learning, pages 55854–55875, 2025

  67. [67]

    Peter Stone, Gal A. Kaminka, Sarit Kraus, and Jeffrey S. Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, pages 1504–1509, 2010. doi: 10.1609/aaai.v24i1.7529

  68. [68]

    Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems 29 (NIPS), 2016

  69. [69]

    Amir Taubenfeld, Yaniv Dover, Roi Reichart, and Ariel Goldstein. Systematic biases in LLM simulations of debates. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 251–267, 2024

  70. [70]

    Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, 2019. doi: 10.18653/v1/P19-1452

  71. [71]

    Caroline Wang, Arrasy Rahman, Ishan Durugkar, Elad Liebman, and Peter Stone. N-agent ad hoc teamwork. In Advances in Neural Information Processing Systems, 2024

  72. [72]

    Ziyi Wang, Carmine Ventre, and Maria Polukarov. Multi-agent reinforcement learning for market making: Competition without collusion. In Proceedings of the 6th ACM International Conference on AI in Finance (ICAIF), 2025

  73. [73]

    Yusen Wu, Yiran Liu, and Xiaotie Deng. MALLES: A multi-agent LLMs-based economic sandbox with consumer preference alignment. arXiv preprint arXiv:2603.17694, 2026

  74. [74]

    Tian Xia, Zhiwei He, Tong Ren, Yibo Miao, Zhuosheng Zhang, Yang Yang, and Rui Wang. Measuring bargaining abilities of LLMs: A benchmark and a buyer-enhancement method. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3579–3602, 2024

  75. [75]

    Yang Xiao, Jiashuo Wang, Qiancheng Xu, Changhe Song, Chunpu Xu, Yi Cheng, Wenjie Li, and Pengfei Liu. Towards dynamic theory of mind: Evaluating LLM adaptation to temporal evolution of human states. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24036–24057, 2025

  76. [76]

    Sixiong Xie, Zhuofan Shi, Haiyang Shen, Yun Ma, Gang Huang, and Xiang Jing. M3-BENCH: Process-aware evaluation of LLM agents’ social behaviors in mixed-motive games. arXiv preprint arXiv:2601.08462, 2026

  77. [77]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  78. [78]

    Runzhe Yang, Jingxiao Chen, and Karthik Narasimhan. Improving dialog systems for negotiation with personality modeling. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 681–693. Association for Computational Linguistics, 2021

  79. [79]

    Zhining Zhang, Chuanyang Jin, Mung Yao Jia, and Tianmin Shu. AutoToM: Automated Bayesian inverse planning and model discovery for open-ended theory of mind. arXiv preprint arXiv:2502.15676, 2025

  80. [80]

    Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Xiongkuo Min, and Guangtao Zhai. Market-Bench: Benchmarking large language models on economic and trade competition. arXiv preprint arXiv:2604.05523, 2026

Showing first 80 references.