pith. machine review for the scientific record.

arxiv: 2605.15188 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

FutureSim: Replaying World Events to Evaluate Adaptive Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:08 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords AI agents · adaptive agents · event forecasting · benchmark · simulation · world events · test-time adaptation

The pith

FutureSim evaluates AI agents by replaying real historical events in order and shows that even the best agent achieves only 25 percent accuracy on future predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a simulation environment called FutureSim that replays actual world events chronologically, including news articles and resolving questions, to test how well AI agents can adapt and forecast without access to future information. Agents must predict outcomes over a three-month period starting in 2026, using only information available up to their knowledge cutoff. A sympathetic reader would care because AI systems are increasingly placed in dynamic real-world settings where they must update beliefs as new data arrives over long time spans. The evaluation finds a clear performance gap, with top agents reaching 25 percent accuracy while others underperform a simple no-prediction baseline on Brier skill score. This setup allows studying capabilities like long-horizon adaptation and reasoning under uncertainty in a grounded way.

Core claim

FutureSim builds grounded simulations that replay real-world events in the order they occurred, allowing agents to forecast world events beyond their knowledge cutoff while interacting with chronological real news articles and resolving questions. When frontier agents are tested over January to March 2026, the benchmark reveals a clear separation in capabilities, with the best agent's accuracy at 25 percent and many agents showing worse Brier skill scores than making no prediction at all.
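
Both headline numbers rest on the Brier skill score, so it is worth being concrete about how an agent ends up "worse than making no prediction at all." Below is a minimal sketch in Python; the flat 0.5 reference forecast used as the no-prediction baseline is an assumption for illustration, since the paper's exact reference is not stated on this page.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes."""
    probs, outcomes = np.asarray(probs, dtype=float), np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def brier_skill_score(probs, outcomes, reference_probs=None):
    """BSS = 1 - BS / BS_ref. A score of 0 matches the reference forecast;
    negative means worse. The flat 0.5 reference is an assumption -- the
    paper's exact no-prediction baseline is not given on this page."""
    if reference_probs is None:
        reference_probs = np.full(len(outcomes), 0.5)
    return 1.0 - brier_score(probs, outcomes) / brier_score(reference_probs, outcomes)

# Toy usage: an overconfident, poorly calibrated agent falls below the baseline.
outcomes = [1, 0, 0, 1, 0]
agent    = [0.9, 0.8, 0.7, 0.2, 0.6]
print(brier_skill_score(agent, outcomes))  # ~ -0.71, i.e. worse than no prediction
```

With this convention a skill score of 0 means no better than the baseline, and the overconfident toy agent lands below it, which is the failure mode reported for several frontier agents.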

What carries the argument

FutureSim, a simulation that replays real news articles arriving and questions resolving over a simulated period in chronological order without future knowledge leakage.
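
To make that machinery concrete, here is a minimal sketch of a chronological replay loop. Only the two enforced actions named in the figure captions, submit() and next_day(), come from the source; the data layout, the search interface, and the date handling are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# A minimal sketch of a chronological-replay environment in the spirit of
# FutureSim. Field names and the search interface are illustrative assumptions;
# only the two enforced actions, submit() and next_day(), come from the source.

@dataclass
class ReplayEnv:
    articles: dict[date, list[str]]           # real news keyed by publication date
    resolutions: dict[str, tuple[date, int]]  # qid -> (resolution date, outcome)
    current: date = date(2026, 1, 1)
    end: date = date(2026, 3, 31)
    forecasts: dict[str, float] = field(default_factory=dict)

    def search(self, query: str) -> list[str]:
        # Only articles published up to the current simulation date are visible,
        # so no future information can leak into the agent's context.
        visible = [a for d, batch in self.articles.items() if d <= self.current
                   for a in batch]
        return [a for a in visible if query.lower() in a.lower()]

    def submit(self, qid: str, prob: float) -> None:
        # Submit or revise a probability for a question that has not yet resolved.
        res_date, _ = self.resolutions[qid]
        if res_date > self.current:
            self.forecasts[qid] = prob

    def next_day(self) -> list[str]:
        # Advance the simulation and report questions that resolved today,
        # giving the agent feedback it can use to adapt later forecasts.
        self.current += timedelta(days=1)
        return [q for q, (d, _) in self.resolutions.items() if d == self.current]
```

The load-bearing part is that search() clamps visibility to the current simulation date and next_day() releases resolutions only as the clock advances, so any forecasting skill beyond the knowledge cutoff has to come from adapting to the replayed stream.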

Load-bearing premise

That replaying real historical events chronologically without future knowledge leakage accurately measures an agent's adaptive capabilities in open-ended real-world settings.

What would settle it

Running the same questions with agents given full access to future information would settle it: if accuracy remained low even then, the low scores could not be attributed to a failure of adaptation, and the benchmark would not be isolating adaptation from prior knowledge.
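
One way to operationalize that test is to run the same agent under several context regimes and compare accuracy across them. The sketch below is hypothetical and not the paper's harness; the agent.predict interface, the question fields, and the condition names are assumptions.

```python
# A hedged sketch of the control comparison described above: run one agent under
# three context regimes and compare accuracy. The agent.predict(...) interface,
# the context helpers, and the question fields are hypothetical, not the paper's.

def run_condition(agent, questions, context_fn):
    correct = 0
    for q in questions:
        pred = agent.predict(q.text, context=context_fn(q))
        correct += int(pred == q.outcome)
    return correct / len(questions)

def compare_controls(agent, questions, replay_articles, future_articles):
    return {
        # prior knowledge only: a ceiling on how much training-corpus overlap explains
        "no_context":  run_condition(agent, questions, lambda q: []),
        # the benchmark condition: only articles dated up to the question's horizon
        "replay":      run_condition(agent, questions, replay_articles),
        # oracle condition: the agent also sees articles from after resolution
        "full_future": run_condition(agent, questions, future_articles),
    }
```

If "full_future" accuracy stayed near the "replay" and "no_context" numbers, the benchmark's low scores would say little about adaptation; if it rose sharply, the replay restriction is doing the work it is supposed to do.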

Figures

Figures reproduced from arXiv: 2605.15188 by Ameya Prabhu, Arvindh Arun, Jonas Geiping, Maksym Andriushchenko, Moritz Hardt, Nikhil Chandak, Shashwat Goel, Steffen Staab.

Figure 1
Figure 1. In FutureSim, agents have to keep updating their predictions about future world events by searching over an evolving news corpus up to the current simulation date. We evaluate all models in their recommended harness at maximum reasoning effort over 3 seeds. Models consume over 10M unique tokens and perform 500-4000+ tool calls over the course of the simulation. We see a clear separation in capabilities, … view at source ↗
Figure 2
Figure 2. In FutureSim, agents are evaluated in a dynamic forecasting environment. They can search news up to the current simulation date (access to future information is restricted), gather feedback from resolved questions, and choose when to update their predictions. The environment only enforces two actions: submit() required to submit or revise predictions for a question and next_day() to advance the simulation b… view at source ↗
Figure 3
Figure 3. Agent performance on FutureSim. GPT 5.5 by far performs the best in both accuracy and Brier skill score. (Top) Open-weight models achieve significantly higher accuracy in our modified harness, improving consistently over the course of the simulation. (Bottom) Except for GPT 5.5, all models start at a negative Brier skill score, and while they fail to improve significantly by default, they cross over to a p… view at source ↗
Figure 4
Figure 4. Comparisons of frontier agent predictions to human crowd aggregates on Polymarket. (Left) We find that GPT 5.5 leads the real market aggregate for some questions, including the Super Bowl market, which traded 700M in total volume. That said, it performs relatively poorly on some other markets, with Claude Opus 4.6 closely tracking GPT 5.5 predictions but usually slightly worse. (Right) We zoom in on the mar… view at source ↗
Figure 5
Figure 5. (Left) Comparing test-time adaptation across agents. We start different models in our improved harness at the lowest-performing agent's (Qwen 3.6 Plus) initial prediction set, to maximize scope for improvement over time. We find agents get anchored to the initial predictions, failing to adapt them sufficiently to even reach the no-prediction baseline of 0 Brier skill score, even when their own capabilities… view at source ↗
Figure 6
Figure 6. Benefits from search. We evaluate GPT 5.5 at xhigh reasoning effort in four different settings, showing the large benefits of agentic search (green vs red) and utilizing the evolving context corpus in FutureSim. Unlike many existing search benchmarks (Yang et al., 2018; Wei et al., 2025), in FutureSim, the search is not for past facts knowable perfectly from the accessible documents. Rather, (i) the docume… view at source ↗
Figure 7
Figure 7. Benefits from scaling test-time compute. We run GPT 5.5 in all available reasoning efforts, finding more inference compute leads to better accuracy on FutureSim. These results show FutureSim can support research on reasoning-intensive sequential search agents (Jin et al., 2025), as well as better underlying search tools (Khattab & Zaharia, 2020; Shao et al., 2025) for the dynamic, and uniquely Bayesian s… view at source ↗
Figure 8
Figure 8. Multi-agent dynamics. When we run multiple copies of DeepSeek V3.2 agents simultaneously, we see agent predictions start moving toward the aggregate, unlike independent single agent runs where predictions diverge over time. We are excited to support research on performance-cost scaling for test-time compute paradigms like parallel aggregation (Venkatraman et al., 2026), multi-agent systems (Tran & Kiela… view at source ↗
Figure 9
Figure 9. Number of actions. We report the total number of actions for each model during the simulation. The results show how long-horizon our benchmark is, with GPT 5.5 in Codex, the best-performing agent, taking around 4,000 actions across runs ranging over multiple context window compaction calls. We find that the number of actions taken by different agents is correlated with the test-time adaptation improvement r… view at source ↗
Figure 10
Figure 10. Test-time adaptation on the subset where all models make prediction updates. We once again start with Qwen 3.6 Plus predictions on all questions, and then restrict the analysis to the subset of 46 questions where all models submit at least one forecast. Across both accuracy and Brier skill score, we see consistent trends in test-time adaptation with the main paper plot shown over the full question pool, sh… view at source ↗
Figure 11
Figure 11. Benefits from search. We evaluate GPT 5.5 at xhigh reasoning effort in four different settings to isolate the benefits of agentic search over updating context in FutureSim, this time measuring Brier skill score. Consistent with the accuracy trend, we find large improvements from daily context updates (blue line) compared to when no articles beyond the first date are added during the simulation (orange line),… view at source ↗
Figure 12
Figure 12. Effect of scaling test-time compute on Brier skill score. We run GPT 5.5 in all five available reasoning efforts to see how additional inference compute changes Brier skill score on FutureSim. We find higher reasoning effort consistently leads to better Brier skill score, although the effect plateaus for this model after reasoning effort high. Notably, reasoning effort “none” has extremely poor Brier skil… view at source ↗
Figure 13
Figure 13. Accuracy and Brier skill score trajectories for the multi-agent experiment described in Section 5.5. The multi-agent runs seem to start at and maintain higher accuracy than independent single agent runs for all three DeepSeek v3.2 agents. While deeper exploration of this phenomenon is left to future work due to cost reasons, one hypothesis for why the initial predictions (before any inter-agen… view at source ↗
read the original abstract

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes FutureSim, a benchmark that replays real historical news articles in chronological order to evaluate frontier AI agents' ability to adapt and forecast world events over a three-month period (January–March 2026) beyond their training cutoffs. It reports a best-agent accuracy of 25% with several agents showing Brier skill scores worse than a no-prediction baseline, and presents ablations on long-horizon adaptation, search, memory, and uncertainty reasoning.

Significance. If the evaluation design successfully isolates test-time adaptation from pre-trained knowledge, FutureSim would offer a grounded, reproducible way to measure open-ended real-world forecasting capabilities over long horizons, filling a gap left by static benchmarks or synthetic environments. The chronological replay approach and use of native agent harnesses are strengths that could support falsifiable claims about adaptation.

major comments (2)
  1. [Abstract / Evaluation] The headline claim of capability separation (25% accuracy, Brier scores worse than null) assumes performance derives from adaptation to the replay stream rather than pre-trained knowledge of 2026 events. No control is reported in which the replay is replaced by a static prompt or empty context; without this, the observed differences could reflect training-corpus overlap instead of adaptive behavior.
  2. [Abstract] Concrete numerical results (25% accuracy, Brier skill comparisons) are presented without accompanying methodological details on event selection criteria, question resolution process, statistical significance testing, or how agent predictions are elicited and scored. These omissions make the support for the central claims unverifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our evaluation design and presentation. We have revised the manuscript to strengthen the claims regarding adaptation versus pre-trained knowledge and to provide fuller methodological details. Below we respond point by point.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] The headline claim of capability separation (25% accuracy, Brier scores worse than null) assumes performance derives from adaptation to the replay stream rather than pre-trained knowledge. No control is reported in which the replay is replaced by a static prompt or empty context; without this, the observed differences could reflect training-corpus overlap instead of adaptive behavior.

    Authors: We agree that an explicit control isolating the contribution of the chronological replay is necessary to support claims of test-time adaptation. In the revised manuscript we have added this control: agents receive the same initial setup and questions but with an empty or static context instead of the live news replay stream. Results show a substantial drop in both accuracy and Brier skill score under the no-replay condition, consistent with the interpretation that observed performance reflects adaptation to incoming information rather than leakage from pre-training data. We have updated the abstract, evaluation section, and added a new subsection describing the control. revision: yes

  2. Referee: [Abstract] Concrete numerical results (25% accuracy, Brier skill comparisons) are presented without accompanying methodological details on event selection criteria, question resolution process, statistical significance testing, or how agent predictions are elicited and scored. These omissions make the support for the central claims unverifiable.

    Authors: We accept that the original abstract was too terse on methodology. The revised version now includes a concise methods paragraph summarizing: (i) event selection (real-world news items with verifiable post-hoc outcomes drawn from public sources), (ii) question resolution (binary or probabilistic outcomes determined by official records after the simulated period ends), (iii) statistical testing (bootstrap confidence intervals and paired significance tests reported in the main text), and (iv) prediction elicitation and scoring (standardized prompts within each agent's native harness, evaluated via Brier score and accuracy). A new dedicated Methods section expands on these points with full procedural details. revision: yes
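
The statistical testing described here maps naturally onto a paired bootstrap over questions. The sketch below is an assumption about the procedure rather than the paper's exact implementation; metric can be accuracy or the Brier-based score, and the arrays hold per-question forecasts from two agents on the same question set.

```python
import numpy as np

# A sketch of a paired bootstrap: resample questions with replacement and
# recompute the metric difference between two agents on each resample. The data
# layout and metric signature are assumptions, not the paper's exact procedure.

def paired_bootstrap_ci(metric, preds_a, preds_b, outcomes,
                        n_boot=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    preds_a, preds_b, outcomes = map(np.asarray, (preds_a, preds_b, outcomes))
    n = len(outcomes)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # same resampled questions for both agents
        diffs[i] = metric(preds_a[idx], outcomes[idx]) - metric(preds_b[idx], outcomes[idx])
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi  # an interval excluding 0 suggests a reliable difference
```

Pairing matters here: resampling the same questions for both agents removes between-question variance from the comparison, which is what makes the test paired.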

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or reported results

full rationale

The paper defines FutureSim as an external simulation that replays verifiable real-world news articles and event resolutions in chronological order from January to March 2026. The headline empirical claims (best-agent accuracy of 25%, multiple agents below null Brier skill) are obtained by executing frontier models in their native harnesses and computing standard metrics on the resulting forecasts. No equations, parameter fits, or self-citations are used to derive these numbers; the separation is an observed outcome of the external evaluation. Ablations on adaptation, search, memory, and uncertainty are described as diagnostic tools rather than load-bearing premises that reduce to the paper's own inputs. The evaluation chain is therefore grounded in independent real-world events and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the approach depends on real-world data replay and standard scoring metrics.

pith-pipeline@v0.9.0 · 5516 in / 1087 out tokens · 72801 ms · 2026-05-15T03:08:02.403868+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

  1. [1]

    Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch

    URL https://arxiv.org/abs/2502.15840. Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Sy0GnUxCb. Noam Brown, Anton Bakhtin, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel F...

  2. [2]

    Preethi Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, and Seraphina Goldfarb-Tarrant

    URL https://proceedings.mlr.press/v119/perdomo20a.html. Preethi Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, and Seraphina Goldfarb-Tarrant. Lost in simulation: LLM-simulated users are unreliable proxies for human users in agentic evaluations, 2026. URL https://arxiv.org/abs/2601.17087. Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennig...

  3. [3]

    just give the model shell and tool access

    We provide the prompt in Appendix E.4. • We remove questions that models can answer confidently using search capped to June 2025, as they are too stale or easy. We also remove questions that models still fail to answer even with full web search access as of April 2026, as this might be due to label noise. This approach makes our evaluation quite distinct ...

  4. [4]

    Context consumption feedback: After each tool call, the agent receives feedback about remaining context budget and approximate context occupancy. This is useful because the task spans thousands of turns, and without explicit budget awareness, agents often spend a lot of context browsing or performing repeated file reads, leaving too little room for final r...

  5. [5]

    The goal is to make memory writing and retrieval deliberate actions rather than accidental byproducts of shell usage

    Structured memory tools: Instead of asking the agent to maintain free-form notes arbitrarily in its workspace, we expose external memory through explicit tool calls with named entries and bounded fields. The goal is to make memory writing and retrieval deliberate actions rather than accidental byproducts of shell usage. This structure also makes it easier ...

  6. [6]

    Per-question memory: In addition to global notes, the harness maintains memory entries attached to individual questions. This is motivated by the fact that forecasting requires a mix of cross-question lessons and question-specific evidence: a general lesson about overconfidence should be stored differently from a candidate list or event-specific rationale ...

  7. [7]

    During this phase, the agent is encouraged to compress what it learned into persistent notes and leave a cleaner state for the next day

    Forced memory phase: When the agent ends a day, or when the context budget becomes too tight, the harness enters an explicit memory-update phase before actually advancing. During this phase, the agent is encouraged to compress what it learned into persistent notes and leave a cleaner state for the next day. The motivation is to prevent a common failure ...

  8. [8]

    Which country will the silver medalist in the women’s downhill alpine skiing event at the Milan-Cortina Winter Olympics represent by 8 February 2026?

    Procedural forecasting scaffolding: The prompt encourages a concrete workflow: inspect the active questions, prioritize the ones most worth updating, search for relevant evidence, submit forecasts, update memory, and only then proceed to the next day. This scaffolding is intentionally lightweight: it does not tell the model what the answer is, but it does ...

  9. [14]

    Unknown",

    No Placeholders: "Unknown", "TBD", and "Other" hurt your score. ## AVAILABLE DATA You have access to a news article database, which is updated daily through a search tool, that you can use to find evidence for your forecasts. You can access the market.csv file (READ-ONLY) in your workspace containing <num_questions> questions (<num_active> active/unreso...

  10. [15]

    Questions resolving the next day (filter `market.csv` by `resolution_date` == tomorrow) -- make sure your prediction is up-to-date before calling next_day.

  11. [16]

    Questions without predictions (if any)

  12. [17]

    Questions where today's news search reveals new information

  13. [18]

    Questions approaching resolution date that you haven't checked recently

  14. [19]

    Columns: qid (str), question (str), last_updated (str), memory (str), category (str) <prior_memory_location_or_empty_memory_note> Inspect `mem_df` by reading <prior_mem_csv>

    Skip questions where there is no new evidence ## YOUR MEMORY Current meta-insights with their indices: <meta_insight_index> `mem_df` holds your per-question notes (reasoning, evidence, calibration) -- 1 row per question. Columns: qid (str), question (str), last_updated (str), memory (str), category (str) <prior_memory_location_or_empty_memory_note> Inspect...

  15. [20]

    Accuracy + Calibration: assign calibrated probabilities that reflect true likelihoods

  16. [21]

    Time-Weighted Score: forecasts made earlier matter, but updating is rewarded when new evidence arrives

  17. [22]

    Prediction-Count Incentive: unanswered active questions receive zero contribution

  18. [23]

    End-of-Session Metrics are shown after each session

  19. [24]

    Max Outcomes: submit at most <max_outcomes_per_question> outcomes per question

  20. [25]

    Unknown",

    No Placeholders: "Unknown", "TBD", and "Other" hurt your score. ## AVAILABLE DATA You have access to a news article database, which is updated daily through a search tool, that you can use to find evidence for your forecasts. You also have access to a read-only `market.csv` file in your workspace with <num_questions> questions (<num_active> active/unresolv...

  21. [26]

    Update `mem_df` for questions you researched or forecasted today using `mcp__forecast__mem_add` / `mcp__forecast__mem_update`.

  22. [27]

    If today's work revealed a reusable pattern, lesson, or calibration rule, promote it into a meta-insight

  23. [28]

    Unknown",

    If a prior meta-insight is stale or contradicted, revise or delete it. Do not use meta-insights as a daily activity log. If you learned nothing reusable today, it is fine to skip meta-insight writes. ## SUBMISSION RULES - qid must be from an active (`is_resolved=False`) question you identified from market.csv. - Each `mcp__forecast__submit_forecasts` call...

  24. [29]

    <existing_prediction_1>

  25. [30]

    1" or "3

    <existing_prediction_2> ... N. <existing_prediction_N> Does the new prediction match any of the existing predictions semantically? - Match if they mean the same thing or if new prediction is more specific - Do NOT match if new prediction is vaguer/more general If yes, respond with ONLY the number (e.g., "1" or "3"). If no match exists, respond with "None"...