ForecastBench: A dynamic benchmark of AI forecasting capabilities. arXiv preprint arXiv:2409.19839, 2024.
7 Pith papers cite this work.
citation summary
years: 2026 (7)
roles: background (1)
polarities: background (1)
citing papers explorer
- PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
  Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
- StakeBench: Evaluating Language Understanding Grounded in Market Commitment
  StakeBench is a new benchmark using market-derived supervision from resolved prediction markets to test LLMs on commitment detection, side identification, action anticipation, and odds projection, revealing partial success on sides but structural failures on higher tasks.
- OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking
  OracleProto is a reproducible framework that uses model-cutoff alignment, temporal masking, and leakage detection to create low-leakage benchmarks for LLM native forecasting from past events.
- Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents
  Foresight Arena is an on-chain benchmark using Brier and novel Alpha scores to evaluate AI forecasting agents on live prediction markets via Polygon smart contracts.
- Energy-Arena: A Dynamic Benchmark for Operational Energy Forecasting
  Energy-Arena is a dynamic, forward-looking benchmarking platform that standardizes ex-ante submissions and rolling ex-post evaluations for operational energy forecasting to improve transparency and comparability.
- CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction
  CT Open is a new live platform with an automated LLM-powered decontamination pipeline that supplies uncontaminated benchmarks for predicting clinical trial outcomes.
- Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems
  Coordination treated as a separable architectural layer in LLM multi-agent systems yields distinguishable Murphy-decomposed performance signatures on prediction-market tasks, with some configurations dominating a cost-quality Pareto frontier.