pith. machine review for the scientific record.

arxiv: 2605.12411 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI · cs.CL · cs.MA

Recognition: no theorem link

Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

Eilam Shapira, Moshe Tennenholtz, Roi Reichart

Pith reviewed 2026-05-13 05:05 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.MA
keywords AI agent prediction · text-tabular modeling · LLM-as-Observer · bargaining games · negotiation · few-shot adaptation · counterpart modeling · decision forecasting

The pith

A text-tabular model with LLM hidden states as features predicts unfamiliar AI agents' decisions from a small number of past interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether one AI agent can forecast the next move of an unfamiliar counterpart agent after seeing only a handful of prior exchanges. It casts the task as target-adaptive text-tabular prediction: each decision point becomes a table row that merges game state, offer history, and dialogue text, while earlier complete games of the same target serve as labeled examples in the prompt. A tabular foundation model is trained on agents driven by frontier LLMs and evaluated on held-out scaffolded agents; the version that adds an LLM-as-Observer representation, which extracts hidden states rather than generated answers, beats both direct prompting baselines and simpler feature sets. If the result holds, it shows that decision-relevant signals exist in LLM internal states that are not surfaced by ordinary few-shot prompting.

Core claim

Formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation to specific agents. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game-plus-text baselines. Within the tabular model, Observer features improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14 percent at K=16. These gains demonstrate that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.

What carries the argument

LLM-as-Observer: a small frozen LLM that reads the current decision-time state and dialogue, discards its generated answer, and supplies its hidden state as an additional decision-oriented feature inside a tabular foundation model that also encodes game-state features and LLM text representations.
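The encoder-not-predictor pattern can be sketched as follows. The "frozen LLM" here is a deterministic hash-based stand-in, not a real model; a real implementation would pool transformer activations instead. What the sketch preserves is the key move: the observer's generated answer is discarded and only its hidden state survives as features.

```python
import hashlib

def frozen_observer(text: str, dim: int = 8) -> tuple[str, list[float]]:
    """Stand-in for a small frozen LLM: returns (generated_answer, hidden_state).
    The 'hidden state' is a deterministic hash-derived vector; in practice it
    would be a pooled activation from a frozen transformer."""
    digest = hashlib.sha256(text.encode()).digest()
    hidden = [b / 255.0 for b in digest[:dim]]  # pooled "activation" vector
    answer = "accept"  # whatever the observer would have generated
    return answer, hidden

def observer_features(state_text: str, dialogue: str) -> list[float]:
    """Discard the observer's answer; keep only its hidden state as features."""
    _answer, hidden = frozen_observer(state_text + "\n" + dialogue)
    return hidden

# The observer vector is concatenated with structured game-state features
# to form one row of the tabular model's input.
row_features = [1.0, 5000.0]  # e.g., round index, current offer
row_features += observer_features("round 1, offer $5,000", "Bob: I can do $5,000.")
```

The design choice the paper tests is exactly this split: the same frozen LLM used here as an encoder performs worse when asked to predict directly.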

If this is right

  • Observer features add predictive value beyond structured game state and raw text representations.
  • The same tabular architecture works for both discrete response prediction and continuous offer prediction.
  • Performance improves as the number of adaptation examples K increases up to at least 16.
  • Direct prompting of an LLM to act as a predictor is inferior to embedding its hidden states into a tabular model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be applied to predict actions of commercial AI assistants in procurement or customer-service settings if their interaction logs are available.
  • If the same hidden-state signals appear in human decision data, the framework might transfer to modeling human counterparts from short interaction histories.
  • Agents could be designed to obscure their internal states specifically to defeat this style of observer-based prediction.

Load-bearing premise

Controlled bargaining and negotiation games with scaffolded agents capture the decision-making patterns of real-world AI agents whose prompts, control logic, and rule-based fallbacks remain hidden.

What would settle it

Run the trained model on agents whose prompts and fallback rules are not constructed via the same scaffolding procedure used in the study, then measure whether AUC and offer-error gains disappear.
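Such a check could be scored with the paper's own metrics. A minimal sketch, assuming nothing beyond the definitions of the metrics themselves: ROC AUC via the Mann-Whitney pair count for response prediction, mean absolute error for offers, and a comparison between scaffolded and out-of-scaffold test agents (the numbers below are hypothetical).

```python
def auc(labels, scores):
    """ROC AUC via the Mann-Whitney U count (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_abs_error(truth, pred):
    """Mean absolute offer-prediction error."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

# Hypothetical scores: same model on scaffolded vs. out-of-scaffold agents.
scaffold_auc = auc([1, 0, 1, 0], [0.9, 0.2, 0.8, 0.4])
outside_auc = auc([1, 0, 1, 0], [0.7, 0.6, 0.5, 0.4])
gap = scaffold_auc - outside_auc  # a large gap would suggest scaffold-specific gains
```

If the Observer advantage over the game+text baseline shrinks on the out-of-scaffold agents while the baseline holds steady, the load-bearing premise above is in trouble.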

Figures

Figures reproduced from arXiv: 2605.12411 by Eilam Shapira, Moshe Tennenholtz, Roi Reichart.

Figure 1. Alice (seller) and Bob (buyer) negotiate via free-text offers. Following Bob's $5,000 …
Figure 2. Three approaches for predicting decisions of a target agent.
Figure 3. Observer gain over the game+text features baseline by relative depth. Observer gains are …
Figure 4. Schematic of the multimodal tabular row at a single decision point. The row concatenates …
Original abstract

AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart's LLM, prompts, control logic, and rule-based fallbacks are hidden, while each decision can have monetary consequences. We ask whether an agent can predict an unfamiliar counterpart's next decision from a few interactions. To avoid real-world logging confounds, we study this problem in controlled bargaining and negotiation games, formulating it as target-adaptive text-tabular prediction: each decision point is a table row combining structured game state, offer history, and dialogue, while $K$ previous games of the same target agent, i.e., the counterpart being modeled, are provided in the prompt as labeled adaptation examples. Our model is built on a tabular foundation model that represents rows using game-state features and LLM-based text representations, and adds LLM-as-Observer as an additional representation: a small frozen LLM reads the decision-time state and dialogue; its answer is discarded, and its hidden state becomes a decision-oriented feature, making the LLM an encoder rather than a direct few-shot predictor. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game+text features baselines. Within this tabular model, Observer features contribute beyond the other feature schemes: at $K=16$, they improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14%. These results show that formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation, and that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes formulating prediction of unfamiliar AI agents' decisions in bargaining and negotiation games as a target-adaptive text-tabular task. A tabular foundation model is trained on 13 frontier-LLM agents using game-state features, text representations, and LLM-as-Observer hidden states (with K prior games as labeled adaptation examples in the prompt); it is evaluated on 91 held-out scaffolded agents. The full model outperforms direct LLM-as-Predictor prompting and game+text baselines, with Observer features adding ~4 AUC points to response prediction and reducing offer-prediction error by 14% at K=16.

Significance. If the empirical gains hold, the work would contribute to multi-agent systems by showing that hidden LLM representations can surface decision-relevant signals beyond direct prompting or structured features alone. Credit is due for the clean separation of the frozen LLM as an encoder (rather than few-shot predictor), the target-adaptive formulation, and the held-out agent split that avoids obvious circularity.

major comments (2)
  1. [Abstract] The reported gains (approximately 4 AUC points and a 14% error reduction at K=16) are presented without statistical significance tests, error bars, details of the 13+91 agent split, data-exclusion criteria, or controls for confounds. These omissions make it impossible to assess whether the central performance claims are robust.
  2. [Abstract and Evaluation section] The quantitative results rest on testing against 91 scaffolded agents whose explicit control logic is known by construction. This setup may allow Observer features to exploit deterministic regularities that would not exist for frontier LLMs whose prompts, internal states, and rule-based fallbacks remain hidden, which is the motivating use case. The paper acknowledges the controlled setting "to avoid real-world logging confounds," yet presents the gains as evidence for the general problem; this mismatch is load-bearing for the claimed applicability.
minor comments (2)
  1. [Methods] Clarify in the methods how the K adaptation examples are exactly formatted and whether any leakage from the target agent's identity occurs through the tabular features.
  2. [Experimental Setup] Add explicit definitions or a table for all feature schemes (game, text, Observer) and baselines to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The reported gains (approximately 4 AUC points and a 14% error reduction at K=16) are presented without statistical significance tests, error bars, details of the 13+91 agent split, data-exclusion criteria, or controls for confounds. These omissions make it impossible to assess whether the central performance claims are robust.

    Authors: We agree that these details are necessary for assessing robustness. In the revised manuscript, we will add statistical significance tests (e.g., paired tests across multiple random seeds) for the reported AUC and error improvements, include error bars (standard deviation over runs) in all quantitative results, provide a full description of the 13+91 agent split including how the scaffolded agents were generated and any exclusion criteria, and discuss controls for confounds such as game parameterization and agent diversity. These additions will be incorporated into both the abstract and the evaluation section. revision: yes
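The promised paired tests could take several forms; one stdlib-only possibility, not necessarily what the authors will use, is an exact two-sided sign-flip permutation test on per-seed metric differences between the Observer model and the baseline.

```python
from itertools import product

def sign_flip_p_value(diffs):
    """Exact two-sided sign-flip permutation test on paired differences
    (model A minus model B, one difference per random seed). Under the null
    hypothesis of no difference, each sign is equally likely to flip."""
    observed = abs(sum(diffs))
    count = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            count += 1
    return count / total

# Hypothetical per-seed AUC gains of the Observer model over the baseline.
gains = [0.041, 0.038, 0.045, 0.036, 0.040]
p = sign_flip_p_value(gains)  # all gains positive, so p hits its floor of 2/2**5
```

With only five seeds the smallest attainable p-value is 2/32 = 0.0625, which is itself an argument for reporting more seeds than is customary.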

  2. Referee: [Abstract and Evaluation section] The quantitative results rest on testing against 91 scaffolded agents whose explicit control logic is known by construction. This setup may allow Observer features to exploit deterministic regularities that would not exist for frontier LLMs whose prompts, internal states, and rule-based fallbacks remain hidden, which is the motivating use case. The paper acknowledges the controlled setting "to avoid real-world logging confounds," yet presents the gains as evidence for the general problem; this mismatch is load-bearing for the claimed applicability.

    Authors: We acknowledge the gap between the controlled scaffolded setting and fully opaque frontier LLMs. The scaffolded agents were explicitly designed to replicate observed decision patterns from frontier LLMs in the same games, enabling isolation of modeling effects without logging confounds. We will revise the abstract and evaluation section to more precisely frame the results as evidence for the target-adaptive formulation and the value of Observer features within controlled approximations of the motivating scenario. We will also expand the limitations discussion to explicitly note that deterministic regularities may be more exploitable here than in hidden-state cases, and outline future directions such as proxy evaluations on additional LLM variants. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical train/test split on held-out agents with frozen encoder

Full rationale

The paper's core claim is an empirical result: a tabular model trained on interaction data from 13 frontier-LLM agents is evaluated on 91 held-out scaffolded agents, with performance gains attributed to Observer features extracted from a frozen LLM. No equations, derivations, or first-principles steps are presented that reduce predictions to fitted parameters by construction. The adaptation examples are explicitly labeled data from the target agent, the Observer LLM is frozen and not trained on the prediction task, and the test agents are disjoint from training. This setup contains no self-definitional loops, fitted-input-as-prediction, or load-bearing self-citations; the reported AUC and error improvements are standard held-out evaluation metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard supervised learning assumptions plus the domain premise that LLM hidden states encode decision-relevant information beyond surface text; no explicit free parameters or new invented entities are introduced beyond the modeling framework itself.

axioms (2)
  • domain assumption Hidden states from a frozen LLM reading decision-time state and dialogue contain predictive signals for the agent's next action that are not accessible via direct prompting.
    This premise directly motivates the LLM-as-Observer design and the claim that it contributes beyond other features.
  • domain assumption Controlled bargaining and negotiation games with scaffolded agents provide a valid proxy for real-world interactions with hidden LLM prompts and logic.
    Invoked to justify studying the problem in these games while avoiding real-world logging confounds.

pith-pipeline@v0.9.0 · 5629 in / 1423 out tokens · 66564 ms · 2026-05-13T05:05:55.929426+00:00 · methodology


Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 4 internal anchors

  1. [1]

    Cooperation, competition, and maliciousness: LLM-stakeholders interactive negotiation

    Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schönherr, and Mario Fritz. Cooperation, competition, and maliciousness: LLM-stakeholders interactive negotiation. InAdvances in Neural Information Processing Systems: Datasets and Benchmarks Track, volume 37, 2024

  2. [2]

    A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems

    Stefano V Albrecht and Subramanian Ramamoorthy. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. InProceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems, pages 1155–1156, 2013

  3. [3]

    Autonomous agents modelling other agents: A compre- hensive survey and open problems.Artificial Intelligence, 258:66–95, 2018

    Stefano V Albrecht and Peter Stone. Autonomous agents modelling other agents: A compre- hensive survey and open problems.Artificial Intelligence, 258:66–95, 2018

  4. [4]

    Belief and truth in hypothesised behaviours.Artificial Intelligence, 235:63–94, 2016

    Stefano V Albrecht, Jacob W Crandall, and Subramanian Ramamoorthy. Belief and truth in hypothesised behaviours.Artificial Intelligence, 235:63–94, 2016

  5. [5]

    Project Deal: A Claude-run marketplace experiment

    Anthropic. Project Deal: A Claude-run marketplace experiment. Anthropic research blog, 2026. https://www.anthropic.com/features/project-deal

  6. [6]

    TabSTAR: A tabular foundation model for tabular data with text fields

    Alan Arazi, Eilam Shapira, and Roi Reichart. TabSTAR: A tabular foundation model for tabular data with text fields. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=FrXHdcTEzE

  7. [7]

    MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

    Alan Arazi, Eilam Shapira, Shoham Grunblat, Mor Ventura, Elad Hoffer, Gioia Blayer, David Holzmüller, Lennart Purucker, Gaël Varoquaux, Frank Hutter, and Roi Reichart. MulTaBench: Benchmarking multimodal tabular learning with text and image.arXiv preprint arXiv:2605.10616, 2026

  8. [8]

    Tim Baarslag, Mark J. C. Hendrikx, Koen V . Hindriks, and Catholijn M. Jonker. Learning about the opponent in automated bilateral negotiation: A comprehensive survey of opponent modeling techniques.Autonomous Agents and Multi-Agent Systems, 30(5):849–898, 2016. doi: 10.1007/s10458-015-9309-1

  9. [9]

    When ethics and payoffs diverge: LLM agents in morally charged social dilemmas.arXiv preprint arXiv:2505.19212, 2025

    Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Rada Mihalcea, Bernhard Schölkopf, and Zhijing Jin. When ethics and payoffs diverge: LLM agents in morally charged social dilemmas.arXiv preprint arXiv:2505.19212, 2025

  10. [10]

    ElecTwit: A framework for studying persuasion in multi-agent social systems

    Michael Bao. ElecTwit: A framework for studying persuasion in multi-agent social systems. arXiv preprint arXiv:2601.00994, 2026

  11. [11]

    Probing classifiers: Promises, shortcomings, and advances

    Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 48(1):207–219, 2022. doi: 10.1162/coli_a_00422

  12. [12]

    How well can LLMs negotiate? NegotiationArena platform and analysis

    Federico Bianchi, Patrick John Chia, Mert Yuksekgonul, Jacopo Tagliabue, Dan Jurafsky, and James Zou. How well can LLMs negotiate? NegotiationArena platform and analysis. InProceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 3935–3951. PMLR, 2024. 10

  13. [13]

    Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems 33 (NeurIPS 2020), volume 33, pages 1877–1901, 2020

  14. [14]

    Grounding strategic conversation: Using negotiation dialogues to predict trades in a win-lose game

    Anaïs Cadilhac, Nicholas Asher, Farah Benamara, and Alex Lascarides. Grounding strategic conversation: Using negotiation dialogues to predict trades in a win-lose game. InProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 357–368, 2013

  15. [15]

    I want to break free! persuasion and anti-social behavior of LLMs in multi-agent settings with social hierarchy.arXiv preprint arXiv:2410.07109, 2024

    Gian Maria Campedelli, Nicolò Penzo, Massimo Stefan, Roberto Dessì, Marco Guerini, Bruno Lepri, and Jacopo Staiano. I want to break free! persuasion and anti-social behavior of LLMs in multi-agent settings with social hierarchy.arXiv preprint arXiv:2410.07109, 2024

  16. [16]

    Opponent modeling in negotiation dialogues by related data adaptation

    Kushal Chawla, Gale Lucas, Jonathan May, and Jonathan Gratch. Opponent modeling in negotiation dialogues by related data adaptation. InFindings of the Association for Computa- tional Linguistics: NAACL 2022, pages 661–674, Seattle, United States, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-naacl.50

  17. [17]

    PowerNet: Multi-agent deep reinforcement learning for scalable powergrid control.IEEE Transactions on Power Systems, 37(2):1587–1599, 2022

    Dong Chen, Kaian Chen, Zhaojian Li, Tianshu Chu, Rui Yao, Feng Qiu, and Kaixiang Lin. PowerNet: Multi-agent deep reinforcement learning for scalable powergrid control.IEEE Transactions on Power Systems, 37(2):1587–1599, 2022

  18. [18]

    Put your money where your mouth is: Evaluating strategic planning and execution of LLM agents in an auction arena.arXiv preprint arXiv:2310.05746, 2023

    Jiangjie Chen, Siyu Yuan, Rong Ye, Bodhisattwa Prasad Majumder, and Kyle Richardson. Put your money where your mouth is: Evaluating strategic planning and execution of LLM agents in an auction arena.arXiv preprint arXiv:2310.05746, 2023

  19. [19]

    Coehoorn and Nicholas R

    Robert M. Coehoorn and Nicholas R. Jennings. Learning on opponent’s preferences to make effective multi-issue negotiation trade-offs. InProceedings of the 6th International Conference on Electronic Commerce, pages 59–68, 2004

  20. [20]

    What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties

    Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. InProceedings of the 56th Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia, 2018. Associat...

  21. [21]

    Scalable multiagent driving policies for reducing traffic congestion

    Jiaxun Cui, William Macke, Harel Yedidsion, Aastha Goyal, Daniel Urieli, and Peter Stone. Scalable multiagent driving policies for reducing traffic congestion. InProceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2021

  22. [22]

    A survey on in-context learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, Miami, Florida, USA, 2024. Association for Computational Linguistics....

  23. [23]

    Attention graph for multi-robot social navigation with deep reinforcement learning

    Erwan Escudie, Laetitia Matignon, and Jacques Saraydaryan. Attention graph for multi-robot social navigation with deep reinforcement learning. InProceedings of the 23rd International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2024. Extended Abstract

  24. [24]

    Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, and Marius Hobbhahn

    Sara Fish, Yannai A. Gonczarowski, and Ran I. Shorrer. Algorithmic collusion by large language models.arXiv preprint arXiv:2404.00806, 2024

  25. [25]

    Inside-Out: Hidden factual knowledge in LLMs

    Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart. Inside-Out: Hidden factual knowledge in LLMs. In Conference on Language Modeling, 2025

  26. [26]

    A framework for sequential planning in multi-agent settings.Journal of Artificial Intelligence Research, 24:49–79, 2005

    Piotr J Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multi-agent settings.Journal of Artificial Intelligence Research, 24:49–79, 2005

  27. [27]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 11

  28. [28]

    Chawla, Olaf Wiest, and Xiangliang Zhang

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V . Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. InProceedings of the 33rd International Joint Conference on Artificial Intelligence, pages 8048–8057, 2024. doi: 10.24963/ijcai.2024/890

  29. [29]

    Decoupling strategy and generation in negotiation dialogues

    He He, Derek Chen, Anusha Balakrishnan, and Percy Liang. Decoupling strategy and generation in negotiation dialogues. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2333–2343, Brussels, Belgium, 2018. Association for Computational Linguistics

  30. [30]

    Lan- guage of bargaining

    Mourad Heddaya, Solomon Dworkin, Chenhao Tan, Rob V oigt, and Alexander Zentefis. Lan- guage of bargaining. InProceedings of the 61st Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 13161–13185. Association for Computational Linguistics, 2023

  31. [31]

    John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota, 2019. Association for Computation...

  32. [32]

    TabPFN: A transformer that solves small tabular classification problems in a second

    Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

  33. [33]

    Accurate predictions on small data with a tabular foundation model.Nature, 637:319–326, 2025

    Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637:319–326, 2025

  34. [34]

    BERT has more to offer: BERT layers combination yields better sentence embeddings

    MohammadSaleh Hosseini, Munawara Munia, and Latifur Khan. BERT has more to offer: BERT layers combination yields better sentence embeddings. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 15419–15431, 2023

  35. [35]

    LLM economist: Large population models and mechanism design in multi-agent generative simulacra.arXiv preprint arXiv:2507.15815, 2025

    Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, and Chi Jin. LLM economist: Large population models and mechanism design in multi-agent generative simulacra.arXiv preprint arXiv:2507.15815, 2025

  36. [36]

    CARTE: Pretraining and transfer for tabular learning

    Myung Jun Kim, Leo Grinsztajn, and Gael Varoquaux. CARTE: Pretraining and transfer for tabular learning. InProceedings of the 41st International Conference on Machine Learning, pages 23843–23866, 2024

  37. [37]

    LLM embeddings for deep learning on tabular data.arXiv preprint arXiv:2502.11596, 2025

    Boshko Koloski, Andrei Margeloiu, Xiangjian Jiang, Blaž Škrlj, Nikola Simidjievski, and Mateja Jamnik. LLM embeddings for deep learning on tabular data.arXiv preprint arXiv:2502.11596, 2025

  38. [38]

    Are LLMs effective negotiators? systematic evaluation of the multifaceted capabilities of LLMs in negotiation dialogues

    Deuksin Kwon, Emily Weiss, Tara Kulshrestha, Kushal Chawla, Gale Lucas, and Jonathan Gratch. Are LLMs effective negotiators? systematic evaluation of the multifaceted capabilities of LLMs in negotiation dialogues. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 5391–5413, 2024

  39. [39]

    De- tecting winning arguments with large language models and persuasion strategies

    Tiziano Labruna, Arkadiusz Modzelewski, Giorgio Satta, and Giovanni Da San Martino. De- tecting winning arguments with large language models and persuasion strategies. InFindings of the Association for Computational Linguistics: EACL 2026, pages 1888–1915, Rabat, Morocco,

  40. [40]

    doi: 10.18653/v1/2026.findings-eacl.97

    Association for Computational Linguistics. doi: 10.18653/v1/2026.findings-eacl.97

  41. [41]

    Emergence of linguistic communication from referential games with symbolic and pixel input

    Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Emergence of linguistic communication from referential games with symbolic and pixel input. InProceedings of the 6th International Conference on Learning Representations (ICLR), 2018

  42. [42]

    Theory of mind for multi-agent collaboration via large language models

    Huao Li, Yu Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Charles Lewis, and Katia Sycara. Theory of mind for multi-agent collaboration via large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 180–192, 2023. 12

  43. [43]

    Spontaneous giving and calculated greed in language models

    Yuxuan Li and Hirokazu Shirado. Spontaneous giving and calculated greed in language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5271–5286, 2025

  44. [44]

    Communication enables cooperation in LLM agents: A comparison with curriculum-based approaches

    Hachem Madmoun and Salem Lahlou. Communication enables cooperation in LLM agents: A comparison with curriculum-based approaches. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 307–321, Rabat, Morocco, 2026. Association for Computational Linguistics. doi: 10.1865...

  45. [45]

    Albrecht

    Reuth Mirsky, Ignacio Carlucho, Arrasy Rahman, Elliot Fosong, William Macke, Mohan Sridharan, Peter Stone, and Stefano V . Albrecht. A survey of ad hoc teamwork research. In Multi-Agent Systems. EUMAS 2022, volume 13442 ofLecture Notes in Computer Science, pages 275–293. Springer, 2022. doi: 10.1007/978-3-031-20614-6_16

  46. [46]

    Towards benchmarking foundation models for tabular data with text

    Martin Mráz, Breenda Das, Anshul Gupta, Lennart Purucker, and Frank Hutter. Towards benchmarking foundation models for tabular data with text. InFoundation Models for Structured Data Workshop at ICML, 2025

  47. [47]

    Adaptive theory of mind for LLM-based multi-agent coordination

    Chunjiang Mu, Ya Zeng, Qiaosheng Zhang, Kun Shao, Chen Chu, Hao Guo, Danyang Jia, Zhen Wang, and Shuyue Hu. Adaptive theory of mind for LLM-based multi-agent coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29608–29616,

  48. [48]

    doi: 10.1609/aaai.v40i35.40204

  49. [49]

    A survey of opponent modeling in adversarial domains.Journal of Artificial Intelligence Research, 73:277–327, 2022

    Samer B Nashed and Shlomo Zilberstein. A survey of opponent modeling in adversarial domains.Journal of Artificial Intelligence Research, 73:277–327, 2022

  50. [50]

    Learn- ing theory of mind via dynamic traits attribution

    Dung Nguyen, Phuoc Nguyen, Hung Le, Kien Do, Svetha Venkatesh, and Truyen Tran. Learn- ing theory of mind via dynamic traits attribution. InProceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 954–962, 2022

  51. [51]

    Memory- augmented theory of mind network

    Dung Nguyen, Phuoc Nguyen, Hung Le, Kien Do, Svetha Venkatesh, and Truyen Tran. Memory- augmented theory of mind network. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11630–11637, 2023. doi: 10.1609/aaai.v37i10.26374

  52. [52]

    LLMs know more than they show: On the intrinsic representation of LLM hallucinations

    Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. LLMs know more than they show: On the intrinsic representation of LLM hallucinations. InThe Thirteenth International Conference on Learning Representations, 2025

  53. [53]

    Deep contextualized word representations

    Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227–2237, New Orleans, Louisiana, 2018. Associatio...

  54. [54]

    Priyanshu Priya, Rishikant Chigrupaatii, Mauajama Firdaus, and Asif Ekbal. GENTEEL-NEGOTIATOR: LLM-enhanced mixture-of-expert-based reinforcement learning approach for polite negotiation dialogue. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25010–25018, 2025

  55. [55]

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. TabICL: A tabular foundation model for in-context learning on large data. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 50817–50847. PMLR, 2025

  56. [56]

    Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew Botvinick. Machine theory of mind. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4218–4227. PMLR, 2018

  57. [57]

    João G. Ribeiro, Gonçalo Rodrigues, Alberto Sardinha, and Francisco S. Melo. TEAMSTER: Model-based reinforcement learning for ad hoc teamwork. Artificial Intelligence, 324:104013, 2023. doi: 10.1016/j.artint.2023.104013

  59. [59]

    David M. Rothschild, Markus Mobius, Jake M. Hofman, Eleanor W. Dillon, Daniel G. Goldstein, Nicole Immorlica, Sonia Jaffe, Brendan Lucier, Aleksandrs Slivkins, and Matthew Vogel. The agentic economy. arXiv preprint arXiv:2505.15799, 2025

  60. [60]

    Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica, 50(1):97–109, 1982

  61. [61]

    Eilam Shapira, Omer Madmon, Roi Reichart, and Moshe Tennenholtz. Can LLMs replace economic choice prediction labs? The case of language-based persuasion games. arXiv preprint arXiv:2401.17435, 2024

  62. [62]

    Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, and Moshe Tennenholtz. GLEE: A unified framework and benchmark for language-based economic environments. arXiv preprint arXiv:2410.05254, 2024

  63. [63]

    Eilam Shapira, Omer Madmon, Reut Apel, Moshe Tennenholtz, and Roi Reichart. Human choice prediction in language-based persuasion games: Simulation-based off-policy evaluation. Transactions of the Association for Computational Linguistics, 13:980–1006, 2025

  64. [64]

    Eilam Shapira, Moshe Tennenholtz, and Roi Reichart. Alignment makes language models normative, not descriptive. arXiv preprint arXiv:2603.17218, 2026

  65. [65]

    Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, and Alexander J. Smola. Benchmarking multimodal AutoML for tabular data with text fields. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021

  66. [66]

    Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Nikul Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. In Proceedings of the 42nd International Conference on Machine Learning, pages 55854–55875, 2025

  67. [67]

    Peter Stone, Gal A. Kaminka, Sarit Kraus, and Jeffrey S. Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, pages 1504–1509, 2010. doi: 10.1609/aaai.v24i1.7529

  68. [68]

    Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems 29 (NIPS), 2016

  69. [69]

    Amir Taubenfeld, Yaniv Dover, Roi Reichart, and Ariel Goldstein. Systematic biases in LLM simulations of debates. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 251–267, 2024

  70. [70]

    Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, 2019. doi: 10.18653/v1/P19-1452

  71. [71]

    Caroline Wang, Arrasy Rahman, Ishan Durugkar, Elad Liebman, and Peter Stone. N-agent ad hoc teamwork. In Advances in Neural Information Processing Systems, 2024

  72. [72]

    Ziyi Wang, Carmine Ventre, and Maria Polukarov. Multi-agent reinforcement learning for market making: Competition without collusion. In Proceedings of the 6th ACM International Conference on AI in Finance (ICAIF), 2025

  73. [73]

    Yusen Wu, Yiran Liu, and Xiaotie Deng. MALLES: A multi-agent LLMs-based economic sandbox with consumer preference alignment. arXiv preprint arXiv:2603.17694, 2026

  74. [74]

    Tian Xia, Zhiwei He, Tong Ren, Yibo Miao, Zhuosheng Zhang, Yang Yang, and Rui Wang. Measuring bargaining abilities of LLMs: A benchmark and a buyer-enhancement method. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3579–3602, 2024

  75. [75]

    Yang Xiao, Jiashuo Wang, Qiancheng Xu, Changhe Song, Chunpu Xu, Yi Cheng, Wenjie Li, and Pengfei Liu. Towards dynamic theory of mind: Evaluating LLM adaptation to temporal evolution of human states. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24036–24057, 2025

  76. [76]

    Sixiong Xie, Zhuofan Shi, Haiyang Shen, Yun Ma, Gang Huang, and Xiang Jing. M3-BENCH: Process-aware evaluation of LLM agents’ social behaviors in mixed-motive games. arXiv preprint arXiv:2601.08462, 2026

  77. [77]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  78. [78]

    Runzhe Yang, Jingxiao Chen, and Karthik Narasimhan. Improving dialog systems for negotiation with personality modeling. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 681–693. Association for Computational Linguistics, 2021

  79. [79]

    Zhining Zhang, Chuanyang Jin, Mung Yao Jia, and Tianmin Shu. AutoToM: Automated Bayesian inverse planning and model discovery for open-ended theory of mind. arXiv preprint arXiv:2502.15676, 2025

  80. [80]

    Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Xiongkuo Min, and Guangtao Zhai. Market-Bench: Benchmarking large language models on economic and trade competition. arXiv preprint arXiv:2604.05523, 2026

Showing first 80 references.