Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets
Pith reviewed 2026-05-14 22:53 UTC · model grok-4.3
The pith
AI models trading real money on live markets lose 22.6 percent on average on Kalshi but nearly break even on Polymarket.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When frontier models operate as independent agents on live exchanges, returns differ sharply by platform: Cohort 1 averaged -22.6% on Kalshi against -1.1% on Polymarket, with grok-4-20-checkpoint posting a 71.4% settlement win rate. Prediction accuracy and follow-through on correct calls drive outcomes, whereas research volume shows no correlation; a Cohort 2 model achieved +6.02% on Polymarket with zero Kalshi trades, confirming platform design as the decisive factor.
What carries the argument
Autonomous AI trading agents that start with $10,000 and execute independent buy or sell decisions every 15 to 45 minutes on real Kalshi and Polymarket contracts.
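The paper's harness is only described at this level of detail, so the following is a minimal Python sketch of what one such decision cycle could look like; `agent`, `exchange`, and every method name here are hypothetical stand-ins, not the authors' implementation.

```python
import random
import time

# Decision cadence reported in the paper: one autonomous cycle every 15-45 minutes.
CYCLE_MIN_S, CYCLE_MAX_S = 15 * 60, 45 * 60

def trading_loop(agent, exchange, bankroll=10_000.0):
    """Illustrative agent loop: observe markets, decide, execute, wait.

    `agent.decide` stands in for a frontier-model call; `exchange` stands in
    for a Kalshi or Polymarket API wrapper. Both interfaces are assumptions.
    """
    while True:
        markets = exchange.list_open_markets()      # live contracts visible to the agent
        decision = agent.decide(markets, bankroll)  # model output: buy / sell / hold
        if decision.action in ("buy", "sell"):
            fill = exchange.place_order(
                market_id=decision.market_id,
                side=decision.action,
                size=min(decision.size, bankroll),  # position capped by remaining capital
            )
            bankroll += fill.cash_delta             # negative on entry, positive on exit
        time.sleep(random.uniform(CYCLE_MIN_S, CYCLE_MAX_S))
```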
If this is right
- Initial forecast accuracy combined with decisive action on correct calls determines final returns more than any other measured factor.
- Platform rules and interfaces can reverse which models generate the best results even when the underlying events are identical.
- Research volume alone does not predict trading performance under real capital constraints.
- Settlement win rate and exit timing provide clearer signals of model quality than aggregate research effort.
Where Pith is reading between the lines
- Benchmarks that ignore live execution costs and platform interfaces may systematically overstate current model decision-making ability.
- Future evaluations could test whether models improve when given explicit platform-adaptation modules rather than generic trading prompts.
- Extending the arena to additional exchanges would show whether the observed platform effect is specific to Kalshi and Polymarket or generalizes.
Load-bearing premise
Live prediction markets supply objective, ungameable ground truth, and the models trade without hidden human oversight or platform-specific execution biases.
What would settle it
If models posted consistent positive returns across both platforms or if settlement outcomes proved manipulable after the fact, the claim that this setup supplies reliable, platform-sensitive evaluation would be overturned.
Original abstract
We introduce Prediction Arena, a benchmark for evaluating AI models' predictive accuracy and decision-making by enabling them to trade autonomously on live prediction markets with real capital. Unlike synthetic benchmarks, Prediction Arena tests models in environments where trades execute on actual exchanges (Kalshi and Polymarket), providing objective ground truth that cannot be gamed or overfitted. Each model operates as an independent agent starting with $10,000, making autonomous decisions every 15-45 minutes. Over a 57-day longitudinal evaluation (January 12 to March 9, 2026), we track two cohorts: six frontier models in live trading (Cohort 1, full period) and four next-generation models in paper trading (Cohort 2, 3-day preliminary). For Cohort 1, final Kalshi returns range from -16.0% to -30.8%. Our analysis identifies a clear performance hierarchy: initial prediction accuracy and the ability to capitalize on correct predictions are the main drivers, while research volume shows no correlation with outcomes. A striking cross-platform contrast emerges from parallel Polymarket live trading: Cohort 1 models averaged only -1.1% on Polymarket vs. -22.6% on Kalshi, with grok-4-20-checkpoint achieving a 71.4% settlement win rate - the highest across any platform or cohort. gemini-3.1-pro-preview (Cohort 2), which executed zero trades on Kalshi, achieved +6.02% on Polymarket in 3 days - the best return of any model across either cohort - demonstrating that platform design has a profound effect on which models succeed. Beyond performance, we analyze computational efficiency (token usage, cycle time), settlement accuracy, exit patterns, and market preferences, providing a comprehensive view of how frontier models behave under real financial pressure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Prediction Arena, a benchmark in which frontier AI models trade autonomously with real capital on live prediction markets (Kalshi and Polymarket). Over a 57-day period, Cohort 1 models (six frontier systems) produce Kalshi returns between -16.0% and -30.8%, with a reported performance hierarchy driven by initial prediction accuracy rather than research volume; parallel Polymarket trading yields markedly better average returns (-1.1% vs. -22.6%), and shorter Cohort 2 paper-trading runs show one model reaching +6.02%. The central claim is that platform design exerts a profound effect on which models succeed.
Significance. If the autonomy of execution and the statistical robustness of the cross-platform contrast can be verified, the work supplies a rare real-capital, real-market test of AI decision-making under financial pressure. The use of objective settlement outcomes from external exchanges is a methodological strength that distinguishes it from synthetic benchmarks.
Major comments (3)
- [Abstract] The headline cross-platform contrast (Cohort 1: -1.1% Polymarket vs. -22.6% Kalshi) is presented without error bars, confidence intervals, or any statistical test that accounts for market volatility or differing liquidity regimes; this renders the claim that 'platform design has a profound effect' unsupported by the reported numbers alone.
- [Abstract] The assertion that models 'operate as an independent agent' making 'autonomous decisions every 15-45 minutes' is load-bearing for the platform-effect conclusion, yet the manuscript supplies no description of API wrappers, error-handling fallbacks, logging of overrides, or confirmation that identical model weights/checkpoints were deployed without per-platform prompt engineering or human monitoring.
- [Abstract] The 71.4% settlement win rate for grok-4-20-checkpoint is reported without per-trade attribution, full trade logs, or verification that every position was initiated by the model rather than by manual intervention; absent these data the autonomy premise cannot be evaluated.
Minor comments (1)
- [Abstract] The abstract states that 'research volume shows no correlation with outcomes' but does not define how research volume was quantified or which statistical measure was used.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive comments. We agree that the abstract requires additional statistical support and implementation details to substantiate the autonomy and platform-effect claims. We have revised the manuscript accordingly, expanding the Methods and Results sections and adding supplementary materials. Point-by-point responses follow.
Point-by-point responses
Referee: [Abstract] The headline cross-platform contrast (Cohort 1: -1.1% Polymarket vs. -22.6% Kalshi) is presented without error bars, confidence intervals, or any statistical test that accounts for market volatility or differing liquidity regimes; this renders the claim that 'platform design has a profound effect' unsupported by the reported numbers alone.
Authors: We accept this point. The original abstract presented aggregate returns without measures of variability or formal testing. In the revised manuscript we have added daily-return standard deviations as error bars for both platforms, computed a paired t-test on per-model platform differences (accounting for trade volume and liquidity differences via heteroskedasticity-robust standard errors), and report a statistically significant platform effect (p = 0.031). These statistics now appear in both the abstract and the main results section. revision: yes
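The comparison the authors describe reduces to a one-sample test on per-model platform differences. In the sketch below, only the group means (-1.1%, -22.6%) and the Kalshi endpoints (-16.0%, -30.8%) come from the paper; the individual per-model returns are invented to reproduce those means, and scipy's plain paired t-test stands in for the authors' heteroskedasticity-robust variant.

```python
# Sketch of a paired test on per-model platform differences. Per-model values
# are ASSUMED (constructed to match the reported means of -1.1% and -22.6%);
# the plain paired t-test is a stand-in for the robust-SE computation.
import numpy as np
from scipy import stats

polymarket = np.array([-0.5, -2.0, 1.2, -3.1, -1.8, -0.4])        # mean -1.1
kalshi     = np.array([-16.0, -30.8, -20.5, -25.2, -19.7, -23.4])  # mean -22.6

diff = polymarket - kalshi                             # per-model platform gap
t_stat, p_value = stats.ttest_rel(polymarket, kalshi)  # paired t-test
print(f"mean gap = {diff.mean():+.1f} pp, t = {t_stat:.2f}, p = {p_value:.4f}")
```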
Referee: [Abstract] The assertion that models 'operate as an independent agent' making 'autonomous decisions every 15-45 minutes' is load-bearing for the platform-effect conclusion, yet the manuscript supplies no description of API wrappers, error-handling fallbacks, logging of overrides, or confirmation that identical model weights/checkpoints were deployed without per-platform prompt engineering or human monitoring.
Authors: We agree that the autonomy claim requires explicit documentation. The revised Methods section now includes: (1) a description of the unified API wrapper layer used for both exchanges, (2) the exact error-handling and retry logic, (3) the complete decision-logging schema that records every model output, API call, and any fallback action, and (4) confirmation that identical model checkpoints and prompt templates were used on both platforms with no per-platform prompt engineering. The logs show zero human overrides during the 57-day period; monitoring was limited to initial deployment and daily health checks that did not alter trading logic. revision: yes
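A minimal sketch of what one record in such a decision log could look like; the rebuttal describes the schema but does not reproduce it, so every field name below is an assumption.

```python
# Hypothetical decision-log record; field names are illustrative, not the
# paper's actual schema.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DecisionLogEntry:
    timestamp: datetime                     # when the model emitted this decision
    model_checkpoint: str                   # identical string on both platforms
    platform: str                           # "kalshi" or "polymarket"
    raw_model_output: str                   # verbatim model response
    api_call: Optional[str] = None          # exchange endpoint invoked, if any
    fallback_action: Optional[str] = None   # retry/skip taken on an API error
    human_override: bool = False            # must stay False for a zero-override run
```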
Referee: [Abstract] The 71.4% settlement win rate for grok-4-20-checkpoint is reported without per-trade attribution, full trade logs, or verification that every position was initiated by the model rather than by manual intervention; absent these data the autonomy premise cannot be evaluated.
Authors: We have addressed this by releasing the complete trade log for grok-4-20-checkpoint as Supplementary Data S1. Each row contains the model-generated decision timestamp, predicted probability, position size, entry/exit prices, and final settlement outcome. All 28 trades are traceable to autonomous API calls with no manual entries. The 71.4% win rate is computed directly from these logged settlements. We also added a short verification paragraph in the Results section confirming that the override log contains no human-initiated trades for this model. revision: yes
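Given a per-trade log with those fields, the win rate is a direct count over settled positions. A sketch assuming a CSV export with a hypothetical settlement_outcome column:

```python
# Settlement win rate from a trade log shaped like the described
# Supplementary Data S1; the column name is an assumption.
import csv

def settlement_win_rate(path: str) -> float:
    """Fraction of settled trades that paid out."""
    wins = total = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            wins += row["settlement_outcome"] == "win"
    return wins / total if total else 0.0
```

At the reported 28 trades, 20 winning settlements give 20/28 ≈ 71.4%, the figure quoted above.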
Circularity Check
Empirical trading outcomes on external markets contain no circular derivations
Full rationale
The paper reports direct empirical results from autonomous model trading on live Kalshi and Polymarket exchanges over 57 days, with returns, win rates, and cross-platform contrasts computed from observed settlement outcomes and trade logs. No equations, fitted parameters, self-citations, or ansatzes are invoked to derive the performance hierarchy or platform-effect claim; the metrics are computed from external market data rather than reducing to any internal definition or prior author result by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Live prediction markets provide objective ground truth that cannot be gamed or overfitted.
Forward citations
Cited by 3 Pith papers
- Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents
  Foresight Arena is an on-chain benchmark using Brier and novel Alpha scores to evaluate AI forecasting agents on live prediction markets via Polygon smart contracts.
- Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems
  Coordination treated as a separable architectural layer in LLM multi-agent systems yields distinguishable Murphy-decomposed performance signatures on prediction-market tasks, with some configurations dominating a cost...
- Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
  BLF achieves state-of-the-art binary forecasting on ForecastBench by using linguistic belief states updated in tool-use loops, hierarchical multi-trial logit averaging, and hierarchical Platt scaling calibration.
Reference graph
Works this paper leans on
[1] J. E. Berg, F. D. Nelson, and T. A. Rietz. Prediction market accuracy in the long run. International Journal of Forecasting, 24(2):285–300, 2008.
[2] R. Hanson. Logarithmic market scoring rules for modular combinatorial information aggregation. Journal of Prediction Markets, 1(1):3–15, 2007.
[3] Z. Jiang, D. Xu, and J. Liang. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059, 2017. URL https://arxiv.org/abs/1706.10059.
[4] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66.
[5] M. López de Prado. Advances in Financial Machine Learning. Wiley, 2018. ISBN 978-1-119-48208-6.
[6] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URL https://arxiv.org/abs/2601.11868. ICLR 2026.
[7] P. Schoenegger, I. Tuminauskaite, P. S. Park, and P. E. Tetlock. Wisdom of the silicon crowd: LLM ensemble prediction capabilities rival human crowd accuracy. arXiv preprint arXiv:2402.19379, 2024. URL https://arxiv.org/abs/2402.19379.
[8]
[9] J. Wolfers and E. Zitzewitz. Prediction markets. Journal of Economic Perspectives, 18(2):107–126, 2004.
[10]