The Epi-LLM Framework: probing LLM behavioral priors through epidemiological agent-based models
Pith reviewed 2026-06-28 11:26 UTC · model grok-4.3
The pith
LLM agents in epidemic simulations reduce peak infections with quarantine rates matching human patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the Epi-LLM framework, agents powered by large language models reason and adapt over an outbreak contact network. Measured against a no-intervention SEIR baseline and human data from the AUIB epigame, these agents reduced peak active infections, with quarantine compliance peaking at 58-65% on day six. Perceived health severity was the strongest predictor of quarantine behaviour: a binomial generalised linear model yielded β = 0.33, p = 0.002 and a pseudo-R² of 0.055, close to the human trial value of 0.072. LLM architecture is a key determinant of epidemic dynamics: low-variance architectures offer greater internal validity for testing behavioural rules, while high-variance models may better represent real-world decision-making.
What carries the argument
The Epi-LLM framework, which places LLM agents into an agent-based epidemiological model so they can reason dynamically about quarantine on a simulated contact network and adapt over the course of an outbreak.
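To make the mechanism concrete, here is a minimal sketch of an SEIR process on a contact network with a pluggable per-agent quarantine decision. It is not the authors' implementation: the ring network, the parameter values, and every function name (`make_ring_network`, `simulate`, `threshold_policy`, `never_quarantine`) are assumptions made for illustration, and `threshold_policy` is a rule-based stand-in for the point where the framework would call an LLM.

```python
import random

S, E, I, R = "S", "E", "I", "R"


def never_quarantine(agent_id, day, state, local_prevalence):
    """No-intervention baseline: nobody ever quarantines."""
    return False


def threshold_policy(agent_id, day, state, local_prevalence):
    """Hypothetical stand-in for an LLM decision: quarantine when at
    least a quarter of the agent's contacts are infectious."""
    return local_prevalence >= 0.25


def make_ring_network(n, k):
    """Assumed topology: each node is linked to its k nearest neighbours
    on each side of a ring."""
    return {i: sorted({(i + d) % n for d in range(-k, k + 1) if d != 0})
            for i in range(n)}


def simulate(network, decide, days=15, beta=0.3, incubation=2,
             infectious=5, seeds=(0,), rng=None):
    """Run a discrete-time SEIR outbreak; return one record per day."""
    rng = rng or random.Random(0)
    state = {i: S for i in network}
    timer = {i: 0 for i in network}
    for s in seeds:
        state[s], timer[s] = I, infectious
    history = []
    for day in range(1, days + 1):
        # 1. Every agent decides whether to quarantine today.
        quarantined = set()
        for i, nbrs in network.items():
            prevalence = sum(state[j] == I for j in nbrs) / len(nbrs)
            if decide(i, day, state[i], prevalence):
                quarantined.add(i)
        # 2. Transmission only along edges with no quarantined endpoint.
        newly_exposed = set()
        for i, nbrs in network.items():
            if state[i] != I or i in quarantined:
                continue
            for j in nbrs:
                if state[j] == S and j not in quarantined and rng.random() < beta:
                    newly_exposed.add(j)
        # 3. Advance disease timers: E -> I -> R.
        for i in network:
            if state[i] in (E, I):
                timer[i] -= 1
                if timer[i] == 0:
                    if state[i] == E:
                        state[i], timer[i] = I, infectious
                    else:
                        state[i] = R
        for j in newly_exposed:
            state[j], timer[j] = E, incubation
        history.append({
            "day": day,
            "active": sum(v == I for v in state.values()),
            "compliance": len(quarantined) / len(network),
        })
    return history
```

Calling `simulate(net, never_quarantine)` and `simulate(net, threshold_policy)` on the same network gives the two trajectories whose peak `active` counts and daily `compliance` values correspond to the quantities the paper compares. Swapping the decision function is the only change between runs, which is the design point the framework relies on.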
If this is right
- LLM agents from four architectures lower peak active infections relative to a no-intervention SEIR baseline.
- Quarantine compliance among the agents reaches 58-65 percent by day six and is driven most strongly by perceived health severity.
- The statistical relationship between perceived severity and quarantine decisions yields a pseudo-R² value comparable to that observed in the human trial.
- Low-variance LLM architectures supply greater internal validity when the goal is to test specific behavioural rules.
- High-variance models may better approximate the variability of real-world human decision-making during epidemics.
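The regression behind the severity bullet can be sketched as a one-predictor binomial GLM with a logit link. The example below is only an illustration under stated assumptions: the data are synthetic placeholders rather than values from the paper or the AUIB trial, the function names are hypothetical, and McFadden's pseudo-R² is assumed because the source does not say which variant was used.

```python
import math

# Synthetic placeholder data (NOT from the paper): perceived severity on
# a 1-5 scale and a binary quarantine decision per agent-day.
severity = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5]
quarantined = [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0]


def _prob(b0, b1, x):
    """Logistic link: P(y = 1 | x)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def fit_logistic(x, y, iters=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson; return (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = _prob(b0, b1, xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * xi     # gradient w.r.t. slope
            h00 += w                # entries of X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def log_likelihood(x, y, b0, b1):
    """Bernoulli log-likelihood of the fitted model."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = _prob(b0, b1, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def mcfadden_r2(x, y):
    """McFadden pseudo-R^2: 1 - lnL(full) / lnL(intercept-only)."""
    b0, b1 = fit_logistic(x, y)
    ll_full = log_likelihood(x, y, b0, b1)
    p_bar = sum(y) / len(y)
    ll_null = sum(y) * math.log(p_bar) + (len(y) - sum(y)) * math.log(1.0 - p_bar)
    return 1.0 - ll_full / ll_null
```

Here `fit_logistic(severity, quarantined)` returns the intercept and the slope that plays the role of β, and `mcfadden_r2(severity, quarantined)` returns the fit statistic. A real analysis would include the other predictors from the human trial and report standard errors, which this sketch omits.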
Where Pith is reading between the lines
- The framework could support rapid testing of many intervention scenarios at low cost before any real-world deployment.
- Extending the same agent setup to additional protective actions such as masking or vaccination uptake would test whether the observed behavioral patterns generalize.
- If the alignment with human data holds under further checks, the method might help identify which public messages most effectively raise perceived severity.
- Architecture-specific differences suggest researchers should select models according to whether the priority is consistency or population-level diversity.
Load-bearing premise
The AUIB epigame human participant data forms a valid external benchmark for direct comparison with LLM agent quarantine decisions, and the simulated contact network plus agent reasoning capture the essential features of real human behavioral responses during epidemics.
What would settle it
A replication run in which LLM agents produce quarantine compliance rates or a generalised linear model coefficient for perceived severity that differs substantially from the reported human trial values would undermine the claim of comparable behavioral priors.
Original abstract
Human behaviour during epidemics affects infectious disease dynamics, but quantifying this remains deeply challenging. Here we introduce the Epi-LLM framework: a novel integration of agent-based modelling, real-life epigames, and large language models (LLMs) in which a synthetic society of agents reasons and adapts dynamically over an outbreak contact network. Comparing synthetic agent behaviour against a no-intervention SEIR baseline and human participant data from the AUIB epigame study, we find that LLM agents across four different architectures reduced peak active infections, with quarantine compliance peaking at 58-65% on day six of the 15-day simulation. A binomial generalised linear model showed that perceived health severity was the strongest predictor of quarantine behaviour ($\beta = 0.33, p = 0.002$), yielding a pseudo-$R^2$ of 0.055, comparable to the 0.072 observed in the human trial. LLM architecture is a key determinant of epidemic dynamics: low-variance architectures offer greater internal validity for testing behavioural rules, while high-variance models may better represent real-world decision-making. Geographic labels alone do not induce culturally differentiated behaviour; explicit attitudinal parameterisation is required. This proof-of-principle work lays the groundwork for deploying the Epi-LLM framework as a scalable, risk-free simulation environment for pandemic preparedness research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Epi-LLM framework integrating agent-based epidemiological models with LLMs to create synthetic agents that reason and adapt over an outbreak contact network. It compares behavior of agents from four LLM architectures against a no-intervention SEIR baseline and human data from the AUIB epigame, reporting that LLM agents reduce peak active infections, achieve quarantine compliance of 58-65% peaking on day six, and yield a binomial GLM in which perceived health severity is the strongest predictor of quarantine (β = 0.33, p = 0.002) with pseudo-R² = 0.055, comparable to the human trial value of 0.072. Additional claims are that LLM architecture determines epidemic dynamics (low-variance models for internal validity, high-variance for realism) and that geographic labels alone do not produce culturally differentiated behavior without explicit attitudinal parameterization.
Significance. If the central empirical comparisons hold after methods clarification, the framework would provide a scalable, risk-free platform for testing behavioral rules and interventions in pandemic scenarios, extending agent-based modeling by leveraging LLM priors. The direct quantitative match to an external human epigame dataset on GLM predictors and the architecture-specific findings would be notable contributions to multi-agent systems and behavioral epidemiology.
major comments (3)
- [Abstract] Abstract and Methods: the headline GLM result (perceived health severity β=0.33, pseudo-R²=0.055 vs. human 0.072) and the claim of comparable quarantine behavior rest on unverified equivalence of the synthetic contact network, daily information available to agents, and operationalization of 'perceived health severity' to the AUIB epigame protocol; no description of network generation, SEIR parameters, or prompt templates is supplied, so differences in peak infections or compliance rates cannot be attributed to the tested behavioral rules rather than mismatched mechanics.
- [Results] Results section on architecture effects: the assertion that 'LLM architecture is a key determinant of epidemic dynamics' with low-variance models offering greater internal validity requires explicit reporting of per-architecture variance in compliance rates, peak infections, and GLM coefficients across the four models tested; without these quantities or statistical controls for multiple comparisons, the distinction between low- and high-variance architectures remains unsupported.
- [Discussion] Comparison to external benchmark: the central claim that LLM agents exhibit human-like behavioral priors is load-bearing on the AUIB epigame data constituting a valid benchmark, yet the manuscript provides no verification that the simulated daily information, contact network structure, or severity judgment elicitation match the human trial protocol closely enough for the pseudo-R² values to be meaningfully compared.
minor comments (2)
- [Abstract] The abstract states 'quarantine compliance peaking at 58-65% on day six' but does not define the exact operationalization of compliance (e.g., binary decision per agent per day or aggregate) or report confidence intervals or sample sizes for the four architectures.
- [Results] Notation: 'pseudo-R²' is reported without specifying the exact variant (e.g., McFadden, Cox-Snell) or whether the GLM includes the full set of predictors used in the human trial analysis.
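For reference, the two variants named in the last comment have the standard definitions below, where $\hat{L}_{\text{full}}$ is the maximised likelihood of the fitted model, $\hat{L}_{\text{null}}$ that of the intercept-only model, and $n$ the number of observations. These are textbook formulas, not taken from the manuscript, which does not state the variant it reports.

```latex
R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{full}}}{\ln \hat{L}_{\text{null}}},
\qquad
R^2_{\text{Cox-Snell}} = 1 - \left( \frac{\hat{L}_{\text{null}}}{\hat{L}_{\text{full}}} \right)^{2/n}
```

Because the two can differ noticeably on the same fit, comparing 0.055 with 0.072 is only meaningful if both analyses used the same definition.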
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important areas for improving transparency and rigor in the Epi-LLM manuscript. We address each major comment point-by-point below, committing to revisions that add the requested methodological details and statistical reporting while maintaining the integrity of our claims.
- Referee: [Abstract] Abstract and Methods: the headline GLM result (perceived health severity β=0.33, pseudo-R²=0.055 vs. human 0.072) and the claim of comparable quarantine behavior rest on unverified equivalence of the synthetic contact network, daily information available to agents, and operationalization of 'perceived health severity' to the AUIB epigame protocol; no description of network generation, SEIR parameters, or prompt templates is supplied, so differences in peak infections or compliance rates cannot be attributed to the tested behavioral rules rather than mismatched mechanics.
Authors: We agree that the manuscript requires expanded methodological detail to support reproducibility and attribution of results to LLM behavioral rules. The current version summarizes the framework but omits explicit descriptions of network generation, SEIR parameters, and prompt templates. In the revised manuscript we will add a comprehensive Methods section specifying: the contact network generation procedure (including degree distribution and geographic labeling), all SEIR parameters (transmission probability, incubation and infectious periods, recovery rates), and the complete prompt templates with the exact operationalization of perceived health severity. We will also document how daily information provided to agents aligns with the AUIB epigame protocol. These additions will enable readers to evaluate whether observed differences arise from behavioral priors rather than setup mismatches. revision: yes
- Referee: [Results] Results section on architecture effects: the assertion that 'LLM architecture is a key determinant of epidemic dynamics' with low-variance models offering greater internal validity requires explicit reporting of per-architecture variance in compliance rates, peak infections, and GLM coefficients across the four models tested; without these quantities or statistical controls for multiple comparisons, the distinction between low- and high-variance architectures remains unsupported.
Authors: We concur that the architecture-effects claim needs quantitative backing through per-model variance statistics. The manuscript currently reports aggregate findings across architectures without disaggregated variance measures or multiple-comparison adjustments. In revision we will insert a new table (or supplementary table) reporting, for each of the four LLMs: variance in quarantine compliance rates, peak active infections, and GLM coefficients (including β for perceived severity). We will also apply and report appropriate statistical controls (e.g., Bonferroni or FDR correction) when comparing architectures. This will substantiate the low- versus high-variance distinction with the requested evidence. revision: yes
- Referee: [Discussion] Comparison to external benchmark: the central claim that LLM agents exhibit human-like behavioral priors is load-bearing on the AUIB epigame data constituting a valid benchmark, yet the manuscript provides no verification that the simulated daily information, contact network structure, or severity judgment elicitation match the human trial protocol closely enough for the pseudo-R² values to be meaningfully compared.
Authors: The AUIB comparison is presented as an initial external benchmark rather than a strict replication study. Nevertheless, greater transparency on protocol alignment is warranted. In the revised Discussion we will add an explicit subsection enumerating the correspondences and divergences between our simulation (daily information flow, contact network topology, severity judgment prompts) and the published AUIB epigame protocol. We will qualify the pseudo-R² comparison accordingly, noting that while the values are numerically close, they reflect an approximation rather than identical conditions. This will allow readers to judge the strength of the human-like prior claim without overstating equivalence. revision: partial
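The multiple-comparison controls named in the second response (Bonferroni and FDR) can be sketched as follows. This is a generic illustration, not the authors' analysis: the function names are hypothetical, Benjamini-Hochberg is assumed as the FDR procedure, and any p-values passed in would come from the pairwise architecture comparisons the revision promises to report.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of
    tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (an FDR procedure).

    Walks from the largest p-value down, scaling each by m / rank and
    enforcing monotonicity with a running minimum.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With four architectures there are six pairwise comparisons, so each function would receive a list of six p-values. Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the expected share of false discoveries and retains more power.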
Circularity Check
No significant circularity; results benchmarked to external human dataset and SEIR baseline
The paper's central claims rest on direct comparisons of LLM agent quarantine compliance and GLM-derived predictors (β = 0.33, pseudo-R² = 0.055) against the independent AUIB epigame human trial (pseudo-R² = 0.072) and a no-intervention SEIR baseline. No equations or quantities are defined in terms of the outputs they are used to predict, no fitted parameters are relabeled as predictions, and no self-citations are invoked to justify uniqueness or load-bearing premises. The derivation chain therefore remains self-contained against external benchmarks rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- attitudinal parameters
axioms (1)
- domain assumption: LLM agents can reason and adapt dynamically over an outbreak contact network in a manner comparable to human participants
invented entities (1)
- Epi-LLM synthetic agents: no independent evidence