
arxiv: 2606.02867 · v1 · pith:TUCX5Z3Tnew · submitted 2026-06-01 · 💻 cs.MA · cs.AI · q-bio.PE

The Epi-LLM Framework: probing LLM behavioral priors through epidemiological agent-based models

Pith reviewed 2026-06-28 11:26 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · q-bio.PE
keywords LLM agents · agent-based modeling · epidemic simulation · quarantine behavior · behavioral priors · SEIR model · pandemic preparedness · contact network

The pith

LLM agents in epidemic simulations reduce peak infections with quarantine rates matching human patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Epi-LLM framework to embed large language model agents in an agent-based model of disease spread on a contact network. These agents make dynamic decisions about quarantine that lower the maximum number of active cases compared to a baseline model with no behavior changes. Compliance peaks at 58-65 percent on day six of a 15-day run, and statistical analysis shows that perceived illness severity is the main driver of these choices, producing results close to those from human participants in a similar game. The authors conclude that the choice of LLM architecture shapes the overall epidemic outcome and that explicit attitude settings are needed to produce varied cultural responses.

Core claim

In the Epi-LLM framework, agents powered by large language models reason and adapt over an outbreak contact network. Compared with a no-intervention SEIR baseline and human data from the AUIB epigame, these agents reduced peak active infections, with quarantine compliance peaking at 58-65% on day six. Perceived health severity was the strongest predictor of quarantine behaviour, with a binomial generalised linear model yielding β = 0.33, p = 0.002 and a pseudo-R² of 0.055, close to the human trial value of 0.072. LLM architecture is a key determinant of epidemic dynamics, with low-variance architectures offering greater internal validity for testing behavioural rules and high-variance models possibly better representing real-world decision-making.

What carries the argument

The Epi-LLM framework, which places LLM agents into an agent-based epidemiological model so they can reason dynamically about quarantine on a simulated contact network and adapt over the course of an outbreak.
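
The coupling described here can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the ring-shaped contact network, the SEIR parameters (`p_transmit`, `incubation`, `infectious`), and the `decide` callback are all assumptions made for the example, and the callback stands in for the place where an LLM call would return a quarantine decision.

```python
import random

S, E, I, R = "S", "E", "I", "R"

def ring_network(n, k=2):
    """Hypothetical contact network: each agent is linked to k neighbours on each side of a ring."""
    return {i: [(i + d) % n for d in range(-k, k + 1) if d != 0] for i in range(n)}

def never_quarantine(agent, day, state, n_infectious):
    """No-intervention baseline: nobody changes behaviour."""
    return False

def quarantine_if_infectious(agent, day, state, n_infectious):
    """Toy rule standing in for an LLM decision: isolate whenever the agent is infectious."""
    return state == I

def simulate(network, decide, days=15, p_transmit=0.3, incubation=2,
             infectious=5, seeds=(0,), rng=None):
    """Run a discrete-time SEIR process; return (active infections per day, final states)."""
    rng = rng or random.Random(0)
    state = {a: S for a in network}
    timer = {a: 0 for a in network}
    for a in seeds:
        state[a], timer[a] = I, infectious
    active_per_day = []
    for day in range(days):
        n_inf = sum(1 for s in state.values() if s == I)
        # Every agent decides for the day; an LLM call would replace `decide` here.
        quarantined = {a for a in network if decide(a, day, state[a], n_inf)}
        newly_exposed = []
        for a in network:
            if state[a] != I or a in quarantined:
                continue
            for b in network[a]:
                if state[b] == S and b not in quarantined and rng.random() < p_transmit:
                    newly_exposed.append(b)
        # Advance disease clocks: E -> I after incubation, I -> R after the infectious period.
        for a in network:
            if state[a] in (E, I):
                timer[a] -= 1
                if timer[a] <= 0:
                    if state[a] == E:
                        state[a], timer[a] = I, infectious
                    else:
                        state[a] = R
        for b in newly_exposed:
            if state[b] == S:
                state[b], timer[b] = E, incubation
        active_per_day.append(sum(1 for s in state.values() if s == I))
    return active_per_day, state

net = ring_network(30)
baseline, _ = simulate(net, never_quarantine, p_transmit=1.0)
cautious, _ = simulate(net, quarantine_if_infectious, p_transmit=1.0)
print("peak active infections:", max(baseline), "vs", max(cautious))
```

Keeping the decision behind a plain callback means the same loop can run the no-intervention baseline, a hand-written rule, or an LLM-backed agent, so differences in the epidemic curve can be traced to the decision rule alone.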

If this is right

  • LLM agents from four architectures lower peak active infections relative to a no-intervention SEIR baseline.
  • Quarantine compliance among the agents reaches 58-65 percent by day six and is driven most strongly by perceived health severity.
  • The statistical relationship between perceived severity and quarantine decisions yields a pseudo-R² value comparable to that observed in the human trial.
  • Low-variance LLM architectures supply greater internal validity when the goal is to test specific behavioural rules.
  • High-variance models may better approximate the variability of real-world human decision-making during epidemics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could support rapid testing of many intervention scenarios at low cost before any real-world deployment.
  • Extending the same agent setup to additional protective actions such as masking or vaccination uptake would test whether the observed behavioral patterns generalize.
  • If the alignment with human data holds under further checks, the method might help identify which public messages most effectively raise perceived severity.
  • Architecture-specific differences suggest researchers should select models according to whether the priority is consistency or population-level diversity.

Load-bearing premise

The AUIB epigame human participant data forms a valid external benchmark for direct comparison with LLM agent quarantine decisions, and the simulated contact network plus agent reasoning capture the essential features of real human behavioral responses during epidemics.

What would settle it

A replication run in which LLM agents produce quarantine compliance rates or a generalised linear model coefficient for perceived severity that differs substantially from the reported human trial values would undermine the claim of comparable behavioral priors.

Original abstract

Human behaviour during epidemics affects infectious disease dynamics, but quantifying this remains deeply challenging. Here we introduce the Epi-LLM framework: a novel integration of agent-based modelling, real-life epigames, and large language models (LLMs) in which a synthetic society of agents reasons and adapts dynamically over an outbreak contact network. Comparing synthetic agent behaviour against a no-intervention SEIR baseline and human participant data from the AUIB epigame study, we find that LLM agents across four different architectures reduced peak active infections, with quarantine compliance peaking at 58-65% on day six of the 15-day simulation. A binomial generalised linear model showed that perceived health severity was the strongest predictor of quarantine behaviour ($\beta = 0.33, p = 0.002$), yielding a pseudo-$R^2$ of 0.055, comparable to the 0.072 observed in the human trial. LLM architecture is a key determinant of epidemic dynamics: low-variance architectures offer greater internal validity for testing behavioural rules, while high-variance models may better represent real-world decision-making. Geographic labels alone do not induce culturally differentiated behaviour; explicit attitudinal parameterisation is required. This proof-of-principle work lays the groundwork for deploying the Epi-LLM framework as a scalable, risk-free simulation environment for pandemic preparedness research.
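
For readers unfamiliar with the statistical model named in the abstract, the following is a minimal sketch of a binomial GLM with a logit link and a single predictor, fitted by Newton-Raphson. It is not the authors' analysis code. The 1-5 severity scale, the sample size, and the synthetic data are invented for illustration; the slope of 0.33 used to generate the data only echoes the reported coefficient and does not reproduce the paper's data.

```python
import math
import random

def fit_logistic(x, y, iters=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson.

    Returns (b0, b1, log_likelihood). Assumes the data are not perfectly separable.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * xi     # gradient w.r.t. slope
            h00 += w                # entries of X^T W X
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return b0, b1, ll

# Synthetic, invented data: severity on a hypothetical 1-5 scale, one row per agent-day.
rng = random.Random(1)
severity = [rng.randint(1, 5) for _ in range(2000)]
quarantined = [
    1 if rng.random() < 1.0 / (1.0 + math.exp(-(-1.0 + 0.33 * s))) else 0
    for s in severity
]
b0, b1, ll = fit_logistic(severity, quarantined)
print("estimated slope for perceived severity:", round(b1, 2))
```

In practice a statistics package would also report standard errors and p-values; the hand-rolled fit is shown only to make the meaning of β explicit as a change in the log-odds of quarantining per unit of perceived severity.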

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Epi-LLM framework integrating agent-based epidemiological models with LLMs to create synthetic agents that reason and adapt over an outbreak contact network. It compares behavior of agents from four LLM architectures against a no-intervention SEIR baseline and human data from the AUIB epigame, reporting that LLM agents reduce peak active infections, achieve quarantine compliance of 58-65% peaking on day six, and yield a binomial GLM in which perceived health severity is the strongest predictor of quarantine (β = 0.33, p = 0.002) with pseudo-R² = 0.055, comparable to the human trial value of 0.072. Additional claims are that LLM architecture determines epidemic dynamics (low-variance models for internal validity, high-variance for realism) and that geographic labels alone do not produce culturally differentiated behavior without explicit attitudinal parameterization.

Significance. If the central empirical comparisons hold after methods clarification, the framework would provide a scalable, risk-free platform for testing behavioral rules and interventions in pandemic scenarios, extending agent-based modeling by leveraging LLM priors. The direct quantitative match to an external human epigame dataset on GLM predictors and the architecture-specific findings would be notable contributions to multi-agent systems and behavioral epidemiology.

major comments (3)
  1. [Abstract] Abstract and Methods: the headline GLM result (perceived health severity β=0.33, pseudo-R²=0.055 vs. human 0.072) and the claim of comparable quarantine behavior rest on unverified equivalence of the synthetic contact network, daily information available to agents, and operationalization of 'perceived health severity' to the AUIB epigame protocol; no description of network generation, SEIR parameters, or prompt templates is supplied, so differences in peak infections or compliance rates cannot be attributed to the tested behavioral rules rather than mismatched mechanics.
  2. [Results] Results section on architecture effects: the assertion that 'LLM architecture is a key determinant of epidemic dynamics' with low-variance models offering greater internal validity requires explicit reporting of per-architecture variance in compliance rates, peak infections, and GLM coefficients across the four models tested; without these quantities or statistical controls for multiple comparisons, the distinction between low- and high-variance architectures remains unsupported.
  3. [Discussion] Comparison to external benchmark: the central claim that LLM agents exhibit human-like behavioral priors is load-bearing on the AUIB epigame data constituting a valid benchmark, yet the manuscript provides no verification that the simulated daily information, contact network structure, or severity judgment elicitation match the human trial protocol closely enough for the pseudo-R² values to be meaningfully compared.
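
The per-architecture reporting requested in major comment 2 could take roughly the following shape. This is a sketch under stated assumptions: the architecture labels and the per-run compliance values are invented placeholders, not results from the paper, and `summarise_by_architecture` is a hypothetical helper name.

```python
from statistics import mean, variance

def summarise_by_architecture(runs):
    """Map each architecture label to the run count, mean, and sample variance of its per-run values."""
    return {
        arch: {"n_runs": len(vals), "mean": mean(vals), "variance": variance(vals)}
        for arch, vals in runs.items()
    }

# Placeholder data: peak quarantine compliance per repeated run (invented for illustration).
peak_compliance = {
    "model_a": [0.60, 0.61, 0.59, 0.60],
    "model_b": [0.52, 0.66, 0.58, 0.71],
}
summary = summarise_by_architecture(peak_compliance)
# Order architectures from most to least consistent across runs.
for arch in sorted(summary, key=lambda a: summary[a]["variance"]):
    print(arch, summary[arch])
```

The same table layout would apply to peak active infections and to the fitted GLM coefficients, giving the low- versus high-variance distinction a number that readers can check.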
minor comments (2)
  1. [Abstract] The abstract states 'quarantine compliance peaking at 58-65% on day six' but does not define the exact operationalization of compliance (e.g., binary decision per agent per day or aggregate) or report confidence intervals or sample sizes for the four architectures.
  2. [Results] Notation: 'pseudo-R²' is reported without specifying the exact variant (e.g., McFadden, Cox-Snell) or whether the GLM includes the full set of predictors used in the human trial analysis.
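
The variant matters because the two pseudo-R² definitions named in minor comment 2 give different numbers for the same fit. The sketch below uses the standard textbook formulas computed from the fitted and intercept-only log-likelihoods; which variant the paper or the human trial used is not stated in the source, and the function names are chosen for this example.

```python
import math

def null_log_likelihood(y):
    """Log-likelihood of an intercept-only binomial model for 0/1 outcomes y (needs both classes)."""
    n, k = len(y), sum(y)
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def mcfadden_r2(ll_model, ll_null):
    """McFadden: 1 - lnL(model) / lnL(null)."""
    return 1.0 - ll_model / ll_null

def cox_snell_r2(ll_model, ll_null, n):
    """Cox-Snell: 1 - exp(2 * (lnL(null) - lnL(model)) / n)."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

# Toy check: 100 outcomes split evenly, and a model that halves the null deviance.
y = [0, 1] * 50
ll0 = null_log_likelihood(y)
ll1 = 0.5 * ll0
print("McFadden:", mcfadden_r2(ll1, ll0), "Cox-Snell:", cox_snell_r2(ll1, ll0, len(y)))
```

Because Cox-Snell depends on the sample size while McFadden does not, a comparison between 0.055 and 0.072 is only meaningful if both values use the same definition.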

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for improving transparency and rigor in the Epi-LLM manuscript. We address each major comment point-by-point below, committing to revisions that add the requested methodological details and statistical reporting while maintaining the integrity of our claims.

point-by-point responses
  1. Referee: [Abstract] Abstract and Methods: the headline GLM result (perceived health severity β=0.33, pseudo-R²=0.055 vs. human 0.072) and the claim of comparable quarantine behavior rest on unverified equivalence of the synthetic contact network, daily information available to agents, and operationalization of 'perceived health severity' to the AUIB epigame protocol; no description of network generation, SEIR parameters, or prompt templates is supplied, so differences in peak infections or compliance rates cannot be attributed to the tested behavioral rules rather than mismatched mechanics.

    Authors: We agree that the manuscript requires expanded methodological detail to support reproducibility and attribution of results to LLM behavioral rules. The current version summarizes the framework but omits explicit descriptions of network generation, SEIR parameters, and prompt templates. In the revised manuscript we will add a comprehensive Methods section specifying: the contact network generation procedure (including degree distribution and geographic labeling), all SEIR parameters (transmission probability, incubation and infectious periods, recovery rates), and the complete prompt templates with the exact operationalization of perceived health severity. We will also document how daily information provided to agents aligns with the AUIB epigame protocol. These additions will enable readers to evaluate whether observed differences arise from behavioral priors rather than setup mismatches. revision: yes

  2. Referee: [Results] Results section on architecture effects: the assertion that 'LLM architecture is a key determinant of epidemic dynamics' with low-variance models offering greater internal validity requires explicit reporting of per-architecture variance in compliance rates, peak infections, and GLM coefficients across the four models tested; without these quantities or statistical controls for multiple comparisons, the distinction between low- and high-variance architectures remains unsupported.

    Authors: We concur that the architecture-effects claim needs quantitative backing through per-model variance statistics. The manuscript currently reports aggregate findings across architectures without disaggregated variance measures or multiple-comparison adjustments. In revision we will insert a new table (or supplementary table) reporting, for each of the four LLMs: variance in quarantine compliance rates, peak active infections, and GLM coefficients (including β for perceived severity). We will also apply and report appropriate statistical controls (e.g., Bonferroni or FDR correction) when comparing architectures. This will substantiate the low- versus high-variance distinction with the requested evidence. revision: yes

  3. Referee: [Discussion] Comparison to external benchmark: the central claim that LLM agents exhibit human-like behavioral priors is load-bearing on the AUIB epigame data constituting a valid benchmark, yet the manuscript provides no verification that the simulated daily information, contact network structure, or severity judgment elicitation match the human trial protocol closely enough for the pseudo-R² values to be meaningfully compared.

    Authors: The AUIB comparison is presented as an initial external benchmark rather than a strict replication study. Nevertheless, greater transparency on protocol alignment is warranted. In the revised Discussion we will add an explicit subsection enumerating the correspondences and divergences between our simulation (daily information flow, contact network topology, severity judgment prompts) and the published AUIB epigame protocol. We will qualify the pseudo-R² comparison accordingly, noting that while the values are numerically close, they reflect an approximation rather than identical conditions. This will allow readers to judge the strength of the human-like prior claim without overstating equivalence. revision: partial
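
The multiple-comparison controls mentioned in the second response (Bonferroni or FDR) can be sketched as follows. With four architectures there are six pairwise comparisons; the p-values below are invented placeholders, and the functions implement the textbook Bonferroni and Benjamini-Hochberg adjustments, not any procedure confirmed by the authors.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Placeholder p-values for the six pairwise architecture comparisons (invented).
raw = [0.004, 0.030, 0.012, 0.200, 0.045, 0.650]
print("Bonferroni:", bonferroni(raw))
print("BH (FDR):  ", benjamini_hochberg(raw))
```

Bonferroni controls the chance of any false positive and is the stricter choice; Benjamini-Hochberg controls the expected share of false discoveries and keeps more power when several architecture differences are real.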

Circularity Check

0 steps flagged

No significant circularity; results benchmarked to external human dataset and SEIR baseline

full rationale

The paper's central claims rest on direct comparisons of LLM agent quarantine compliance and GLM-derived predictors (β = 0.33, pseudo-R² = 0.055) against the independent AUIB epigame human trial (pseudo-R² = 0.072) and a no-intervention SEIR baseline. No equations or quantities are defined in terms of the outputs they are used to predict, no fitted parameters are relabeled as predictions, and no self-citations are invoked to justify uniqueness or load-bearing premises. The derivation chain therefore remains self-contained against external benchmarks rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Based solely on the abstract, the framework rests on domain assumptions about LLM reasoning capabilities and the validity of the epigame benchmark. No explicit free parameters or invented physical entities are quantified; the attitudinal parameterization is described as necessary but not given numerical values here.

free parameters (1)
  • attitudinal parameters
    Stated as required for culturally differentiated behavior; no specific values or fitting procedure given in abstract.
axioms (1)
  • domain assumption: LLM agents can reason and adapt dynamically over an outbreak contact network in a manner comparable to human participants
    Invoked as the core premise enabling the Epi-LLM synthetic society.
invented entities (1)
  • Epi-LLM synthetic agents (no independent evidence)
    purpose: To probe LLM behavioral priors through epidemiological simulations
    Newly introduced composite entity combining LLMs with ABMs; no independent evidence supplied in abstract.


