pith. machine review for the scientific record.

arxiv: 2604.23897 · v1 · submitted 2026-04-26 · 💻 cs.AI · econ.GN · q-fin.EC

Recognition: unknown

MarketBench: Evaluating AI Agents as Market Participants

Andrey Fradkin, Rohit Krishnan

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:00 UTC · model grok-4.3

classification 💻 cs.AI · econ.GN · q-fin.EC
keywords AI agents · market coordination · self-calibration · LLM evaluation · SWE-bench · auctions · token usage · task allocation

The pith

AI agents misjudge their own chances of task success and their costs, so auctions built on these self-reports allocate work differently than full information would.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MarketBench to test whether AI agents can give accurate signals about how likely they are to finish a task and how much it will cost them. It applies the benchmark to six recent LLMs on a 93-task subset of software engineering problems. The models turn out to be miscalibrated, both over- and under-estimating success rates and token consumption. Auctions run on these self-reports produce different task assignments than auctions that use the models' actual performance records. Adding capability details from earlier runs to the prompt improves calibration somewhat but leaves a clear gap to the full-information benchmark. The results identify self-assessment as the main obstacle to using markets to coordinate AI agents.
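
The abstract does not show what this added context looks like. Below is a minimal sketch, assuming the prior-run information is condensed into a short capability card prepended to the prompt; the field names and wording are illustrative, not taken from the paper.

```python
from statistics import median

def capability_card(history):
    """Condense a model's prior runs into a short prompt preamble.
    `history`: list of dicts with keys 'resolved' (bool) and
    'tokens' (int) from earlier, held-out runs (names illustrative)."""
    n = len(history)
    rate = sum(h["resolved"] for h in history) / n
    med = median(h["tokens"] for h in history)
    return (f"Track record on {n} comparable software tasks: "
            f"{rate:.0%} resolved, median {med:.0f} tokens per attempt. "
            f"Weigh this record when estimating your own success "
            f"probability and token cost for the task below.")
```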

Core claim

Large language models are miscalibrated when they estimate both their probability of completing software tasks and the tokens they will use. Auctions formed from these self-reports produce allocations that diverge from those that would result under full information about each model's true performance. Supplying the models with capability data drawn from prior experiments raises their calibration modestly yet still leaves a measurable gap to the full-information benchmark. Market-based scaffolding built around the same models is also measured and inherits the same calibration limits.

What carries the argument

MarketBench, a procedure that elicits self-reported success probabilities and token costs from LLMs on software tasks, then compares the resulting auction allocations against those produced by ground-truth performance data.
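
The excerpt does not include the elicitation prompt itself. Here is a minimal sketch of the kind of structured self-report such a procedure could collect, where `call_model` stands in for any chat-completion client and the prompt wording is hypothetical:

```python
import json

SELF_REPORT_PROMPT = """Before attempting the software task below, report:
- "p_success": your probability (0.0-1.0) of producing a patch that passes the tests
- "tokens": your estimated total token usage (prompt + completion)
Answer with a single JSON object and nothing else.

Task:
{task}
"""

def elicit_self_report(call_model, task_text):
    """Collect a pre-task self-assessment from an LLM.
    `call_model(prompt) -> str` is a stand-in for any LLM client;
    assumes the model answers with bare JSON."""
    raw = call_model(SELF_REPORT_PROMPT.format(task=task_text))
    report = json.loads(raw)
    p = min(max(float(report["p_success"]), 0.0), 1.0)  # clamp to [0, 1]
    cost = max(int(report["tokens"]), 0)                # non-negative tokens
    return p, cost
```

Parsing and clamping matter in practice: a model that answers in prose rather than JSON, or reports a probability outside [0, 1], would otherwise silently corrupt the calibration statistics downstream.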

If this is right

  • Auctions built from LLM self-reports will diverge from full-information allocations on these tasks.
  • Adding prior-experiment capability information improves calibration but only modestly narrows the gap.
  • Self-assessment remains a bottleneck that prevents efficient market coordination of AI agents.
  • Market scaffolding can be implemented with current LLMs but will carry forward the same miscalibration effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • External verification of capabilities could substitute for self-reports in market designs.
  • Training or prompting techniques focused specifically on self-calibration may be needed beyond simple context addition.
  • The same self-assessment limits are likely to appear when AI agents must price or bid on non-software tasks.
  • Markets might still function if they incorporate repeated interaction and reputation rather than one-shot self-reports.

Load-bearing premise

The self-reported probabilities and costs measured on this 93-task software subset accurately reflect how the same models would assess and perform when participating in broader market-style coordination.

What would settle it

A direct side-by-side run of the same 93 tasks in which auction outcomes using only self-reports are compared with outcomes that use the models' actual success rates and token counts to check whether the allocation gap remains.
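
One concrete way to run that comparison, sketched under stated assumptions: the abstract does not name the auction format, so the code below uses a simple winner-per-task rule with an illustrative task value `value` and token price `lam`; the paper's actual mechanism may differ.

```python
import numpy as np

def allocate(p, c, value=1.0, lam=1e-5):
    """Give each task to the agent with the highest expected surplus:
    value * P(success) minus priced token cost.
    p, c: (n_agents, n_tasks) arrays of probabilities and token counts."""
    surplus = value * np.asarray(p, float) - lam * np.asarray(c, float)
    return surplus.argmax(axis=0)  # winning agent index per task

def allocation_gap(p_hat, c_hat, p_true, c_true, **kw):
    """Fraction of tasks whose winner changes when self-reports
    replace the models' measured success rates and token counts."""
    return float(np.mean(allocate(p_hat, c_hat, **kw)
                         != allocate(p_true, c_true, **kw)))
```

`allocation_gap` returns 0.0 when self-reports and ground truth pick the same winner on every task; any positive value is the divergence the paper reports.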

read the original abstract

Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration. These LLMs are miscalibrated on both success probability and token usage, and auctions built from these self-reports diverge from a full-information allocation. A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to a full-information benchmark. We also document the performance of a market-based scaffolding with these LLMs. Our results point to self-assessment as a key bottleneck for market-style coordination of AI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces MarketBench, a benchmark for evaluating whether AI agents can provide informative self-assessments of task success probability and cost to participate in markets. Using a 93-task subset of SWE-bench Lite and six LLMs, the authors report miscalibration on both success probability and token usage, show that auctions constructed from these self-reports diverge from a full-information allocation, and demonstrate that adding prior-experiment capability information to the context yields modest calibration gains. They also document performance of a market-based scaffolding approach and conclude that self-assessment constitutes a key bottleneck for market-style coordination of AI agents.

Significance. If the empirical results prove robust, the work identifies a concrete limitation in current LLMs that could impede economic coordination mechanisms for multi-agent systems. The benchmark supplies a reproducible testbed for tracking progress on self-calibration, and the intervention experiment offers initial evidence on mitigation. The framing around market participation is novel within the AI-agent literature and could stimulate further research on agent economics.

major comments (3)
  1. [Benchmark construction] Benchmark construction (description of the 93-task subset): the paper does not specify the selection criteria or stratification used to choose the 93 tasks from SWE-bench Lite, nor does it report whether the subset preserves the difficulty distribution or success-rate statistics of the full benchmark. This choice is load-bearing for the claim that the observed miscalibration is representative rather than an artifact of the particular tasks.
  2. [Auction results] Auction divergence results (results on self-report-based vs. full-information allocations): the full-information allocation is computed from ground-truth execution outcomes, but the manuscript provides no sensitivity analysis to the specific auction format employed or to how costs would be endogenized in a real market. Because the central claim concerns market mechanisms, this gap weakens the inference that self-report miscalibration would produce comparable inefficiency under realistic market clearing.
  3. [Conclusion] Generalization to market coordination (conclusion and discussion): the experiments are confined to static, single-shot software-engineering tasks with exogenous token costs. No evidence is presented on whether comparable miscalibration or auction divergence appears when tasks are interdependent, when agents can observe one another, or when costs are determined endogenously. This assumption is load-bearing for the headline implication that self-assessment is a bottleneck for market-style coordination.
minor comments (3)
  1. [Abstract] The abstract refers to 'auctions built from these self-reports' without naming the auction mechanism; the mechanism should be stated in the abstract or introduction for immediate clarity.
  2. [Results] Calibration plots and tables lack explicit reporting of sample sizes per LLM, confidence intervals on the miscalibration metrics, or the precise definition of 'token usage' cost (prompt vs. completion tokens); one way to report such intervals is sketched after this list.
  3. [Intervention experiment] The intervention experiment (adding prior-experiment information) would benefit from a control condition that adds unrelated information of equal length to isolate the effect of capability data.
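
On minor comment 2: the excerpt does not say which miscalibration metric the paper uses, so the sketch below illustrates interval reporting with the Brier score and a percentile bootstrap over the 93 tasks, reported per LLM. Both choices are ours, not the authors'.

```python
import numpy as np

def brier(p_hat, y):
    """Mean squared error between self-reported success
    probabilities p_hat and binary outcomes y (1 = resolved)."""
    p_hat, y = np.asarray(p_hat, float), np.asarray(y, float)
    return float(np.mean((p_hat - y) ** 2))

def brier_with_ci(p_hat, y, n_boot=10_000, alpha=0.05, seed=0):
    """Point estimate plus a percentile-bootstrap confidence
    interval, resampling the n tasks with replacement."""
    rng = np.random.default_rng(seed)
    p_hat, y = np.asarray(p_hat, float), np.asarray(y, float)
    n = len(y)
    idx = rng.integers(0, n, size=(n_boot, n))   # bootstrap task indices
    stats = np.mean((p_hat[idx] - y[idx]) ** 2, axis=1)
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return brier(p_hat, y), (float(lo), float(hi))
```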

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity, robustness, and scope discussion.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction (description of the 93-task subset): the paper does not specify the selection criteria or stratification used to choose the 93 tasks from SWE-bench Lite, nor does it report whether the subset preserves the difficulty distribution or success-rate statistics of the full benchmark. This choice is load-bearing for the claim that the observed miscalibration is representative rather than an artifact of the particular tasks.

    Authors: We agree that explicit details on task selection are necessary to support claims of representativeness. The 93 tasks were sampled to maintain a distribution of difficulty levels comparable to SWE-bench Lite, but this was not documented sufficiently. In the revised manuscript we will add a dedicated subsection describing the selection criteria, any stratification applied, and comparative statistics (e.g., success-rate distributions) between the subset and the full benchmark. revision: yes

  2. Referee: [Auction results] Auction divergence results (results on self-report-based vs. full-information allocations): the full-information allocation is computed from ground-truth execution outcomes, but the manuscript provides no sensitivity analysis to the specific auction format employed or to how costs would be endogenized in a real market. Because the central claim concerns market mechanisms, this gap weakens the inference that self-report miscalibration would produce comparable inefficiency under realistic market clearing.

    Authors: We acknowledge that the current results rely on a single auction format and exogenous costs. While the core demonstration is the divergence caused by miscalibration rather than a comprehensive market simulation, we will add a sensitivity analysis using at least one alternative format (e.g., second-price) and expand the discussion of cost assumptions. Fully endogenizing costs and running exhaustive market-clearing simulations lies beyond the scope of this benchmark paper but will be noted as future work (a minimal second-price sketch follows these responses). revision: partial

  3. Referee: [Conclusion] Generalization to market coordination (conclusion and discussion): the experiments are confined to static, single-shot software-engineering tasks with exogenous token costs. No evidence is presented on whether comparable miscalibration or auction divergence appears when tasks are interdependent, when agents can observe one another, or when costs are determined endogenously. This assumption is load-bearing for the headline implication that self-assessment is a bottleneck for market-style coordination.

    Authors: The referee is correct that the experiments are limited to static, single-shot tasks. This controlled setting was chosen to isolate self-assessment capabilities; we do not claim direct generalization to interdependent or dynamic markets. In the revised discussion we will more explicitly state these scope limitations, outline how MarketBench could be extended to multi-agent or endogenous-cost settings, and temper the conclusion to reflect that self-assessment is shown to be a bottleneck in the evaluated regime rather than proven for all market-style coordination. revision: partial
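
As a concrete example of the alternative format mentioned in response 2, here is a minimal per-task sealed-bid second-price (Vickrey) reverse auction. The bid rule, the token price `lam`, and the normalization by self-reported success probability are all illustrative assumptions, not the paper's design.

```python
import numpy as np

def vickrey_reverse_auction(p_hat, c_hat, lam=1e-5):
    """Per-task sealed-bid second-price reverse auction.
    Each agent bids its self-assessed cost of delivering the task,
    normalized by its self-reported success probability; the lowest
    bidder wins and is paid the second-lowest bid.
    p_hat, c_hat: (n_agents, n_tasks) self-reports, n_agents >= 2."""
    eps = 1e-9
    bids = lam * np.asarray(c_hat, float) / np.clip(np.asarray(p_hat, float), eps, 1.0)
    order = np.argsort(bids, axis=0)   # agents sorted by bid, per task
    winners = order[0]                 # lowest bidder per task
    prices = np.take_along_axis(bids, order[1:2], axis=0)[0]  # 2nd-lowest bid
    return winners, prices
```

Vickrey pricing makes truthful cost bidding a dominant strategy given the bids, but the bids here are built from self-reported probabilities, so even a truthful mechanism inherits the miscalibration the paper measures.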

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external benchmark

full rationale

The paper's core results derive from direct measurement of LLM self-reports against ground-truth outcomes on a fixed external subset of SWE-bench Lite (93 tasks), followed by construction of auctions using those reports versus a full-information allocation computed from actual execution results. No equations or claims reduce the reported miscalibration, auction divergence, or intervention effect to a fitted parameter or self-referential definition. The full-information benchmark is independent of the self-reports, and no self-citations or ansatzes are invoked as load-bearing premises. The evaluation is therefore self-contained against the external benchmark data.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The paper relies on the assumption that self-assessment of success probability and cost is key for market participation, and uses existing benchmarks without introducing new entities. The abstract provides no explicit free parameters or detailed axioms beyond this high-level motivation.

free parameters (2)
  • Choice of 93-task subset from SWE-bench Lite
    The specific subset selection may influence results but is not detailed in the abstract.
  • Selection of six LLMs
    Choice of models affects the demonstration of miscalibration.
axioms (1)
  • domain assumption: Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly.
    Stated in the abstract as the foundational justification for the benchmark.

pith-pipeline@v0.9.0 · 5462 in / 1444 out tokens · 75433 ms · 2026-05-08T06:00:41.590179+00:00 · methodology

discussion (0)

