EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts
Pith reviewed 2026-06-28 05:38 UTC · model grok-4.3
The pith
A self-evolving agent adapts a fixed LLM forecaster to shifting pandemic regimes by storing past outcomes and reflecting on delayed labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EpiEvolve wraps an LLM forecaster with fixed weights and adapts it in a streaming setting by storing forecast outcomes in hierarchical episodic memory, reflecting on delayed labels, retrieving regime-relevant cases, and distilling recurring errors into strategic rules, all while following a chronological protocol that prevents future leakage.
What carries the argument
Hierarchical episodic memory with a reflection step and regime-aware retrieval that distills errors into reusable rules.
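The loop implied by this claim can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `forecast` stub, the two-week `LABEL_DELAY`, and the `Episode` fields are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional

LABEL_DELAY = 2  # assumed: the label for week t arrives at week t + 2


@dataclass
class Episode:
    week: int
    prediction: str
    label: Optional[str] = None       # filled in only once the label arrives
    reflection: Optional[str] = None  # written only after the label arrives


def forecast(week: int, context: list) -> str:
    """Stand-in for the fixed-weight LLM forecaster (hypothetical stub)."""
    # Reuse the most recent resolved outcome if any, else default to "stable".
    return context[-1].label if context else "stable"


def run_stream(true_labels: list) -> list:
    memory = []
    for t in range(len(true_labels)):
        # 1. Attach labels that have arrived by week t and reflect on them.
        for ep in memory:
            if ep.label is None and ep.week + LABEL_DELAY <= t:
                ep.label = true_labels[ep.week]
                outcome = "hit" if ep.prediction == ep.label else "miss"
                ep.reflection = f"week {ep.week}: {outcome}"
        # 2. Only resolved episodes are visible to the forecaster.
        context = [ep for ep in memory if ep.label is not None]
        # 3. Predict week t, then store the still-unlabeled episode.
        memory.append(Episode(week=t, prediction=forecast(t, context)))
    return memory
```

The point of the ordering is that the forecaster at week t never sees a label that would not yet have arrived, while its weights stay untouched throughout.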
If this is right
- Fixed-weight LLM forecasters can handle regime shifts in streaming data without retraining.
- Reflection on delayed labels and retrieval of similar past cases each measurably improve adaptation speed.
- Ablation results confirm that memory, reflection, and regime-aware retrieval each add to the observed gains.
- The chronological protocol allows reuse of the model's own past predictions while maintaining temporal integrity.
Where Pith is reading between the lines
- The same memory-and-reflection structure could extend to other streaming tasks that face delayed feedback and pattern shifts.
- Keeping model weights fixed while updating context only may lower the cost of maintaining operational forecasting systems over long periods.
- The approach implies that explicit storage of past errors can substitute for frequent model updates when regimes change.
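One way to picture stored errors standing in for weight updates is a counter over resolved misses that promotes a pattern to a rule once it recurs. The sketch below is illustrative only: the `(regime, predicted, actual)` key, the `min_count` threshold, and the rule wording are assumptions and do not come from the paper.

```python
from collections import Counter


def distill_rules(resolved, min_count=3):
    """Turn recurring misses into short textual rules.

    `resolved` is an iterable of (regime, predicted, actual) tuples for
    episodes whose labels have already arrived.
    """
    misses = Counter(
        (regime, predicted, actual)
        for regime, predicted, actual in resolved
        if predicted != actual
    )
    rules = []
    for (regime, predicted, actual), n in sorted(misses.items()):
        if n >= min_count:
            rules.append(
                f"In regime {regime}, '{predicted}' was wrong {n} times; "
                f"the outcome was '{actual}'."
            )
    return rules
```

Rules produced this way would be prepended to the forecaster's context, so the cost of adapting is a longer prompt rather than a training run.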
Load-bearing premise
The memory storage, reflection, and retrieval steps can be executed under a strict time-ordered protocol that blocks all future information yet still generalizes across the five regimes without any post-hoc adjustments that change the reported numbers.
What would settle it
If disabling the reflection step or regime-aware retrieval in an otherwise identical run leaves recovery lag at two weeks and accuracy near 0.629, the claimed contribution of those components would be falsified. If instead recovery lag returns to five weeks and accuracy falls back to 0.561, the contribution would be confirmed.
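Recovery lag is not defined in the material shown here. The sketch below assumes one plausible definition: the number of weeks after a regime shift before the forecaster is correct for `k` consecutive weeks. The function name, the `k` parameter, and the toy hit/miss sequences are hypothetical, and the sequences are not data from the paper.

```python
def recovery_lag(correct, shift_week, k=2):
    """Weeks from `shift_week` until the first run of `k` consecutive hits.

    `correct` is a list of booleans, one per week. Returns None if the
    forecaster never strings together `k` hits after the shift.
    """
    run = 0
    for week in range(shift_week, len(correct)):
        run = run + 1 if correct[week] else 0
        if run == k:
            # The run started at week - k + 1; lag counts the weeks before it.
            return (week - k + 1) - shift_week
    return None


# Toy sequences with a shift at week 2 (illustrative only).
adaptive = [True, True, False, False, True, True, True]
static = [True, True, False, False, False, False, False, True, True]
print(recovery_lag(adaptive, shift_week=2))  # 2
print(recovery_lag(static, shift_week=2))    # 5
```

Running the same function on an ablated run and on the full system would give the comparison this test calls for, provided both runs use identical shift dates.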
original abstract
Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches $0.629$ average accuracy, compared with $0.561$ for the static backbone and $0.325$ for the external CDC ensemble, and reduces recovery lag after regime shifts from $5$ to $2$ weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EpiEvolve, a self-evolving LLM agent for weekly COVID-19 hospitalization trend forecasting in a streaming setting across five variant regimes. The LLM backbone is trained only on a warm-start period and kept fixed; adaptation occurs via storage of outcomes in hierarchical episodic memory, reflection on delayed labels, regime-aware retrieval, and distillation of recurring errors into rules, all under a claimed chronological protocol that prevents future leakage. On the streaming dataset the method reports 0.629 average accuracy (vs. 0.561 static backbone and 0.325 CDC ensemble) and reduces post-regime-shift recovery lag from 5 to 2 weeks; ablations attribute gains to reflection, strategic memory, and regime-aware retrieval.
Significance. If the no-leakage protocol can be shown to be strictly chronological and the empirical gains survive proper statistical controls, the framework would supply a concrete, reusable template for self-evolving agents in non-stationary streaming prediction tasks. The explicit ablation of the three adaptation components is a methodological strength that isolates their individual contributions.
major comments (3)
- [Abstract] Abstract: the headline figures (0.629 accuracy, 5-to-2-week lag reduction) are supplied without error bars, statistical significance tests, dataset sizes, number of weeks per regime, or any description of how the five regimes were delineated, rendering it impossible to judge whether the reported improvement over the static backbone is robust.
- [Abstract] Abstract (description of hierarchical episodic memory and regime-aware retrieval): the paper asserts that regime indexing, past-episode querying, and distillation from delayed labels are performed without future leakage, yet supplies no pseudocode, equations, or concrete implementation steps showing how regimes are identified from past data only, how retrieval is restricted to timesteps t' < t, or how labels arriving weeks later are incorporated without conditioning on post-t outcomes; this mechanism is load-bearing for the central streaming-adaptation claim.
- [Ablation study] Ablation study (mentioned in abstract): the claim that each of reflection, strategic memory, and regime-aware retrieval contributes to the gains cannot be evaluated because the ablation protocol itself is not described (e.g., whether ablations preserve the chronological constraint or whether they inadvertently allow information from later regimes).
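For the significance question raised in the first comment, one standard option for paired per-week hit/miss outcomes is the exact McNemar test. The paper does not name a test, so this choice is an assumption made for illustration, and the function name is hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-week hit/miss outcomes."""
    # Discordant weeks: exactly one of the two forecasters was right.
    a_only = sum(a and not b for a, b in zip(correct_a, correct_b))
    b_only = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    # Under H0 each discordant week favours either model with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Weekly forecasts are serially correlated, so the independence assumption behind this test is questionable here; a block bootstrap over weeks may be a better fit, and this sketch shows only the simplest paired comparison.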
minor comments (1)
- [Abstract] The term 'strategic memory' appears in the ablation sentence but is not defined or distinguished from 'hierarchical episodic memory' in the main description; a short clarifying sentence would remove ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments emphasizing the need for greater detail in the abstract and ablation descriptions. We agree that these elements require expansion to allow proper assessment of the reported results and the no-leakage claims, and we will revise the manuscript accordingly.
point-by-point responses
- Referee: [Abstract] Abstract: the headline figures (0.629 accuracy, 5-to-2-week lag reduction) are supplied without error bars, statistical significance tests, dataset sizes, number of weeks per regime, or any description of how the five regimes were delineated, rendering it impossible to judge whether the reported improvement over the static backbone is robust.
  Authors: We agree that the abstract should convey more context for the metrics. In revision we will add error bars to the headline figures, report statistical significance (paired tests against the static baseline), note the total number of weeks (~150 across regimes), and briefly describe regime delineation by documented variant emergence dates, with fuller details moved to the main text.
  Revision: yes
- Referee: [Abstract] Abstract (description of hierarchical episodic memory and regime-aware retrieval): the paper asserts that regime indexing, past-episode querying, and distillation from delayed labels are performed without future leakage, yet supplies no pseudocode, equations, or concrete implementation steps showing how regimes are identified from past data only, how retrieval is restricted to timesteps t' < t, or how labels arriving weeks later are incorporated without conditioning on post-t outcomes; this mechanism is load-bearing for the central streaming-adaptation claim.
  Authors: We will add a dedicated subsection in Methods containing pseudocode and equations that formalize the chronological constraints: regime detection uses only data available at t, retrieval queries are masked to t' < t, and delayed labels are stored and reflected upon only after their arrival without reference to any post-t information. The abstract will be updated to reference this subsection.
  Revision: yes
- Referee: [Ablation study] Ablation study (mentioned in abstract): the claim that each of reflection, strategic memory, and regime-aware retrieval contributes to the gains cannot be evaluated because the ablation protocol itself is not described (e.g., whether ablations preserve the chronological constraint or whether they inadvertently allow information from later regimes).
  Authors: The ablations were performed under exactly the same chronological protocol as the main experiments. We will expand the ablation subsection to state this explicitly and to describe, for each removed component, how regime indexing, retrieval, and label incorporation remain restricted to past data only.
  Revision: yes
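The chronological constraints promised in the second response (retrieval masked to t' < t, labels usable only after arrival) can be illustrated with a small filter. This is a sketch under assumptions, not the pseudocode the authors say they will add: the `Case` fields, the `retrieve` signature, and the ranking rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    week: int        # week the forecast was made for
    label_week: int  # week the delayed label became available
    regime: str      # regime tag assigned using data up to `week` only
    note: str


def retrieve(memory, t, regime, k=3):
    """Return up to k cases usable at week t, same-regime cases first."""
    # Chronological mask: the case is in the past AND its label has arrived.
    visible = [c for c in memory if c.week < t and c.label_week <= t]
    # Prefer the current regime, then more recent weeks.
    visible.sort(key=lambda c: (c.regime != regime, -c.week))
    return visible[:k]
```

An ablation that removes regime-aware ranking would drop the first sort key while keeping the mask, which is one way to check that the ablated variant still respects the same time ordering.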
Circularity Check
No circularity; results are direct empirical comparisons on streaming data.
full rationale
The paper describes an agent architecture (hierarchical episodic memory, reflection, regime-aware retrieval) and reports measured accuracies (0.629 vs. 0.561 static backbone) and lag reductions under a claimed chronological protocol. No equations, fitted parameters, or self-citations are invoked to derive these figures; the metrics are external performance numbers obtained by running the system on held-out streaming data. The leakage-prevention claim is an implementation assumption whose validity affects replicability but does not create a definitional or self-referential reduction in the reported results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A chronological protocol can be enforced that prevents any future leakage while still allowing delayed labels to inform reflection and retrieval.
invented entities (1)
- hierarchical episodic memory: no independent evidence