arXiv: 2511.18000 · v2 · pith:CU6HF6FPnew · submitted 2025-11-22 · 💻 cs.LG · cs.AI · q-bio.PE

Reward Engineering for Spatial Epidemic Simulations: A Reinforcement Learning Platform for Individual Behavioral Learning

Pith reviewed 2026-05-25 07:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · q-bio.PE
keywords reinforcement learning, epidemic simulation, reward engineering, spatial modeling, agent behavior, non-pharmaceutical interventions, SIRS model, policy learning

The pith

A potential field reward in a new RL platform for spatial epidemics enables agents to learn maximal intervention adherence and spatial avoidance strategies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ContagionRL, a Gymnasium platform that couples reinforcement learning with a spatial SIRS+D epidemiological model to test how reward design shapes learned agent behaviors. Five reward functions are compared across PPO, SAC, and A2C algorithms under varying infection rates, grid sizes, visibility, and movement rules. The potential field reward, which supplies directional guidance plus adherence incentives, produces the strongest results by driving agents toward maximal compliance with non-pharmaceutical interventions and the emergence of sophisticated avoidance tactics. This approach matters because conventional agent-based models rely on fixed rules, whereas reward engineering here lets policies adapt to different epidemic conditions and information limits.

Core claim

ContagionRL integrates a spatial SIRS+D model with configurable parameters into a Gymnasium environment, permitting systematic evaluation of reward designs from sparse survival bonuses to a novel potential field formulation. Ablation studies across algorithms and scenarios show that directional guidance combined with explicit adherence incentives is essential for robust learning. Agents trained on the potential field reward achieve superior survival, maximal adherence to interventions, and complex spatial strategies, while other rewards yield weaker or less adaptive policies.
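
To make the SIRS+D compartments concrete, the following is a minimal sketch of a per-individual state transition (S to I, I to R or D, R back to S). The function name `next_state`, the constants, and all probability values are illustrative assumptions, not names or parameters taken from the paper or its code.

```python
import random

# Hypothetical per-step transition probabilities; the paper's actual
# parameter names and values are not given in this summary.
P_RECOVER = 0.10  # I -> R
P_DIE = 0.02      # I -> D
P_WANE = 0.05     # R -> S (loss of immunity, the second "S" in SIRS)


def next_state(state, p_infect, rng):
    """Advance one individual's compartment by one time step.

    `p_infect` is the probability that a susceptible individual becomes
    infected this step; in a spatial model it would depend on nearby
    infected individuals (assumed here to be computed elsewhere).
    """
    u = rng.random()
    if state == "S":
        return "I" if u < p_infect else "S"
    if state == "I":
        if u < P_DIE:
            return "D"
        if u < P_DIE + P_RECOVER:
            return "R"
        return "I"
    if state == "R":
        return "S" if u < P_WANE else "R"
    return "D"  # death is absorbing


rng = random.Random(0)
population = ["S"] * 8 + ["I"] * 2
for _ in range(50):
    population = [next_state(s, p_infect=0.05, rng=rng) for s in population]
print(population)
```

A fixed `p_infect` is used only to keep the sketch short; a spatial environment would recompute it from distances to infected individuals at every step.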

What carries the argument

The potential field reward function, which supplies directional guidance toward safer regions together with explicit incentives for intervention adherence inside the spatial SIRS+D Gymnasium environment.
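
One way to picture such a reward is as a sum of a health term, an adherence term, and a movement-alignment term driven by a repulsive force away from infected individuals. The sketch below is an illustration under stated assumptions: the exponential distance decay, the equal weighting of the three terms, the cosine-similarity alignment, and every function and parameter name are hypothetical and are not confirmed by the paper.

```python
import math


def torus_delta(a, b, size):
    """Shortest signed displacement from b to a along one axis of a torus."""
    d = (a - b) % size
    return d - size if d > size / 2 else d


def repulsive_force(agent, infected, size, decay=0.5):
    """Sum of unit vectors pointing away from each infected position,
    weighted by exp(-decay * distance). The exponential form is assumed."""
    fx = fy = 0.0
    for (x, y) in infected:
        dx = torus_delta(agent[0], x, size)
        dy = torus_delta(agent[1], y, size)
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            continue  # direction is undefined when positions coincide
        w = math.exp(-decay * dist)
        fx += w * dx / dist
        fy += w * dy / dist
    return fx, fy


def potential_field_reward(agent, move, adherence, susceptible, infected, size):
    """Health + adherence + movement alignment, equally weighted (assumed)."""
    r_health = 1.0 if susceptible else 0.0
    r_adherence = adherence  # expected to lie in [0, 1]
    fx, fy = repulsive_force(agent, infected, size)
    f_norm = math.hypot(fx, fy)
    m_norm = math.hypot(move[0], move[1])
    if f_norm == 0.0 or m_norm == 0.0:
        r_move = 0.0  # no guidance available, or the agent did not move
    else:
        # Cosine similarity between the chosen move and the force vector.
        r_move = (move[0] * fx + move[1] * fy) / (f_norm * m_norm)
    return r_health + r_adherence + r_move


# An infected individual sits one cell to the right of the agent.
away = potential_field_reward((5, 5), (-1.0, 0.0), 1.0, True, [(6, 5)], 10)
toward = potential_field_reward((5, 5), (1.0, 0.0), 1.0, True, [(6, 5)], 10)
print(away, toward)
```

In this toy setup, stepping away from the infected neighbor scores higher than stepping toward it, which is the directional guidance the reward is meant to supply.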

If this is right

  • Directional guidance and adherence incentives prove critical for robust policy learning across tested algorithms.
  • Reward function choice dramatically changes agent behavior and survival rates under varied infection rates and grid sizes.
  • The modular platform supports stress-testing of rewards under limited observability and heterogeneous population dynamics.
  • Agents develop sophisticated spatial avoidance strategies specifically when trained with the potential field reward.
  • Systematic ablation reveals that sparse survival bonuses alone are insufficient for effective learning in these settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The platform could be used to simulate how different public incentive structures affect real-world compliance levels.
  • Extending the single-agent setup to multiple interacting agents might expose emergent group-level epidemic control behaviors.
  • The emphasis on information structure and environmental predictability could transfer to reward design in other spatial simulation domains such as traffic flow or resource allocation.
  • Direct comparison of learned policies against empirical mobility data during past outbreaks would test whether the discovered strategies align with observed human responses.

Load-bearing premise

The spatial SIRS+D model and the chosen observation and action spaces are realistic enough that reward-driven policies will yield meaningful behavioral insights.

What would settle it

If agents using the potential field reward fail to outperform the other four designs when the environment is altered to include different movement patterns or stronger stochastic infection rules, the superiority result would be falsified.

Figures

Figures reproduced from arXiv: 2511.18000 by Daniel Coombs, Radman Rakhshandehroo.

Figure 1: ContagionRL System Architecture. Top: SIRS+D spatial epidemic environment with toroidal grid, configurable observability, and continuous agent control interface. Middle: Reward function design from sparse to potential field-based rewards. Bottom: Multi-dimensional experimental evaluation. framework, the environment simulates a single reinforcement learning (Sutton & Barto, 2018) agent interacting with a po…
Figure 2: Episode duration distributions across different agents, including learning-based (PPO, SAC, A2C) …
Figure 3: Comparison of PPO agent performance under five different reward functions. Each model was …
Figure 4: Ablation study of the Potential Field reward function. Each variant was evaluated over 100 episodes across 3 training seeds (300 episodes total). Violin plots show the distribution of episode durations, overlaid with boxplots and individual episode results (small black dots). Large black dots with white outlines represent per-seed means. One-sided Mann–Whitney U tests (Bonferroni-corrected) compare each ab…
Figure 5: Impact of visibility radius constraints on RL agent performance in epidemic control. The figure …
Figure 6: Performance comparison of RL agents across different human movement patterns in epidemic control.
Figure 7: Compartmental epidemic models visualized: SIR, SIRS, SEIR, and SIRS+D.
Figure 8: Sample render of the SIRS+D environment at step 30 of an episode
Figure 9: Mean episode durations and 95% bootstrapped confidence intervals (10,000 samples) for each agent.
Figure 10: Mean episode durations with 95% bootstrapped confidence intervals (10,000 samples) for each …
Figure 11: Mean episode durations and 95% bootstrapped confidence intervals (10,000 samples) for each …
Figure 12: Comparison of mean episode durations across agents (Trained, Stationary, Random, Greedy) for …
Figure 13: Mean episode durations for Trained, Stationary, Random, and Greedy agents evaluated across environments with varying grid sizes. Each bar represents the mean across 3 seeds (each with 100 evaluation episodes), and is accompanied by 95% bootstrapped confidence intervals (10,000 resamples). The red dashed line marks the maximum episode length. See …
Figure 14: Mean episode durations of Trained, Stationary, Random, and Greedy agents across varying levels of adherence effectiveness. Each bar represents the mean of 3 independent training runs, each evaluated over 100 episodes (300 evaluations per agent-type per setting), with 95% bootstrapped confidence intervals (10,000 resamples). A red dashed line indicates the maximum episode length imposed by the environment.…
Figure 15: Performance comparison of Trained and baseline agents across varying values of the distance decay parameter, which controls how repulsion from infected individuals diminishes with spatial separation. Bars show mean episode durations, with 95% bootstrapped confidence intervals (10,000 samples) calculated from per-seed means. The red dashed line indicates the maximum episode duration. For a statistical anal…
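
Several captions above report 95% bootstrapped confidence intervals from 10,000 resamples. A minimal sketch of a percentile bootstrap for a mean follows; whether the authors used the percentile method or another bootstrap variant is not stated here, so that choice is an assumption, as are the function name `bootstrap_ci` and the example values.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-seed mean episode durations, not values from the paper.
per_seed_means = [412.0, 455.5, 430.2]
low, high = bootstrap_ci(per_seed_means)
print(f"mean={statistics.fmean(per_seed_means):.1f}, 95% CI=({low:.1f}, {high:.1f})")
```

With only three per-seed means the interval can take few distinct values, so intervals computed this way are coarse and should be read with that in mind.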
Original abstract

We present ContagionRL, a Gymnasium-compatible reinforcement learning platform specifically designed for systematic reward engineering in spatial epidemic simulations. Unlike traditional agent-based models that rely on fixed behavioral rules, our platform enables rigorous evaluation of how reward function design affects learned survival strategies across diverse epidemic scenarios. ContagionRL integrates a spatial SIRS+D epidemiological model with configurable environmental parameters, allowing researchers to stress-test reward functions under varying conditions including limited observability, different movement patterns, and heterogeneous population dynamics. We evaluate five distinct reward designs, ranging from sparse survival bonuses to a novel potential field approach, across multiple RL algorithms (PPO, SAC, A2C). Through systematic ablation studies, we identify that directional guidance and explicit adherence incentives are critical components for robust policy learning. Our comprehensive evaluation across varying infection rates, grid sizes, visibility constraints, and movement patterns reveals that reward function choice dramatically impacts agent behavior and survival outcomes. Agents trained with our potential field reward consistently achieve superior performance, learning maximal adherence to non-pharmaceutical interventions while developing sophisticated spatial avoidance strategies. The platform's modular design enables systematic exploration of reward-behavior relationships, addressing a knowledge gap in models of this type where reward engineering has received limited attention. ContagionRL is an effective platform for studying adaptive behavioral responses in epidemic contexts and highlights the importance of reward design, information structure, and environmental predictability in learning. Our code is publicly available at https://github.com/redradman/ContagionRL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ContagionRL, a Gymnasium-compatible RL platform that couples a configurable spatial SIRS+D epidemiological model with five reward formulations (including a novel potential-field design) and evaluates them under PPO, SAC, and A2C across varied grid sizes, infection rates, observability constraints, and movement patterns. The central empirical claim is that the potential-field reward produces policies with maximal NPI adherence and sophisticated spatial avoidance, outperforming sparse survival, adherence-only, and other baselines.

Significance. If the reported performance ordering is reproducible, the work supplies a modular, open-source testbed that isolates the effect of reward engineering on learned epidemic behavior—an area that has received little systematic attention. The public GitHub release and the explicit ablation across algorithms and environmental parameters constitute concrete strengths that lower the barrier for follow-on studies.

major comments (2)
  1. [Evaluation / Results] The manuscript states that 'systematic ablation studies' demonstrate superiority of the potential-field reward, yet supplies neither the numerical performance tables, confidence intervals, nor the precise experimental protocol (number of seeds, episode lengths, hyper-parameter grids) that would allow independent verification of that ordering. This gap directly affects the load-bearing empirical claim.
  2. [Environment Definition] The observation and action spaces are described at a high level, but the precise mapping from the SIRS+D state variables to the agent’s observation vector (and the dimensionality of the action space) is not given; without these definitions it is impossible to assess whether the reported spatial-avoidance strategies are artifacts of the chosen interface rather than genuine behavioral learning.
minor comments (2)
  1. [Abstract] The abstract claims 'maximal adherence' without defining the quantitative metric used to measure adherence; a short paragraph or equation in §3 would clarify this.
  2. [Figures] Figure captions and axis labels for the ablation plots should explicitly state the number of independent runs and whether shaded regions represent standard error or min/max.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments below and will incorporate the requested clarifications into a revised manuscript.

Point-by-point responses
  1. Referee: [Evaluation / Results] The manuscript states that 'systematic ablation studies' demonstrate superiority of the potential-field reward, yet supplies neither the numerical performance tables, confidence intervals, nor the precise experimental protocol (number of seeds, episode lengths, hyper-parameter grids) that would allow independent verification of that ordering. This gap directly affects the load-bearing empirical claim.

    Authors: We agree that the current manuscript lacks the quantitative tables, confidence intervals, and full experimental protocol needed for independent verification. In the revision we will add (i) mean and standard-deviation performance tables for all reward–algorithm combinations, (ii) 95% confidence intervals computed over multiple random seeds, and (iii) an explicit protocol section stating the number of seeds, episode lengths, and the hyper-parameter grids searched for PPO, SAC, and A2C. revision: yes

  2. Referee: [Environment Definition] The observation and action spaces are described at a high level, but the precise mapping from the SIRS+D state variables to the agent’s observation vector (and the dimensionality of the action space) is not given; without these definitions it is impossible to assess whether the reported spatial-avoidance strategies are artifacts of the chosen interface rather than genuine behavioral learning.

    Authors: We acknowledge that the precise observation vector construction and action-space dimensionality were only sketched at a high level. The revised manuscript will include (a) the exact mapping from each SIRS+D compartment and spatial coordinate to the observation components, (b) the resulting observation dimensionality, and (c) the discrete or continuous action-space size together with the movement semantics. These additions will be accompanied by a short pseudocode block clarifying the Gymnasium interface. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an empirical RL platform (ContagionRL) and reports direct performance comparisons of five distinct reward designs across PPO/SAC/A2C on a spatial SIRS+D Gymnasium environment. No equations, fitted parameters, or derivations are shown that reduce reported outcomes to quantities defined by the paper's own inputs; results are scoped to simulation outcomes inside the defined environment with no load-bearing self-citations or self-definitional steps visible in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard RL assumptions (MDP formulation, policy optimization via PPO/SAC) and the validity of the chosen SIRS+D compartmental model; no new free parameters or invented entities are introduced beyond the reward-function definitions themselves.

axioms (2)
  • domain assumption The environment is a Markov decision process with the stated observation and action spaces.
    Implicit in the use of standard RL algorithms on the Gymnasium wrapper.
  • domain assumption The spatial SIRS+D dynamics adequately represent epidemic spread for the purpose of behavioral learning studies.
    Stated in the integration of the epidemiological model with the RL platform.

pith-pipeline@v0.9.0 · 5806 in / 1341 out tokens · 31223 ms · 2026-05-25T07:55:21.941220+00:00 · methodology

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

  1. [1]

    Health Reward ($r_{\text{health}}$): This is a binary reward for maintaining a susceptible state.

    $$r_{\text{health}} = \begin{cases} 1 & \text{if } S_a = \text{Susceptible} \\ 0 & \text{otherwise} \end{cases} \tag{26}$$

    (This component can be ablated to 0 via the no health variant.)

  2. [2]

    Adherence Reward ($r_{\text{adherence}}$): This reward is directly proportional to the agent's NPI adherence level. Let $\alpha_a$ be the agent's current adherence (a value in $[0, 1]$).

    $$r_{\text{adherence}} = \alpha_a \tag{27}$$

    (This component can be ablated to 0 via the no adherence variant.)

  3. [3]

    Movement Reward ($r_{\text{move}}$): This component rewards the agent for moving in alignment with a suggested force vector $F$ and optionally for matching its magnitude. Let the agent's current position be $p_a = (x_a, y_a)$ and human $j$'s position be $p_j = (x_j, y_j)$. The shortest displacement vector on the toroidal grid from human $j$ to the agent is $\Delta p_j = (\Delta x_j, \Delta y_j)$, where: $\Delta x_j = (x$...

  4. [4]

    Identify Nearest Threat: Let the agent's position at time $t$ be $p_a(t) = (x_a(t), y_a(t))$. Identify the set of currently infected humans $I(t)$. This step utilizes privileged knowledge of the true state $S_j \in \{S, I, R, D\}$ for all humans $j$. If $I(t)$ is empty, the agent defaults to a stationary action $m^* = (0, 0)$.

  5. [5]

    Target Selection: If $I(t)$ is not empty, calculate the Euclidean distance $d(p_a(t), p_j(t))$ for all $j \in I(t)$, where $p_j(t)$ is the position of infected human $j$ and $d(\cdot, \cdot)$ accounts for the toroidal grid geometry. Identify the single infected human $h_{\text{nearest}}(t)$ corresponding to the minimum distance:

    $$h_{\text{nearest}}(t) = \arg\min_{j \in I(t)} d(p_a(t), p_j(t))$$

    The exact position $p_{h_{\text{nearest}}}(t)$ is...

  6. [6]

    Evaluate Potential Moves: Define a discrete set of candidate movement vectors $A_{\text{move}}$. This set includes the zero vector $(0, 0)$ and scaled unit vectors representing the maximum possible step in the eight cardinal and diagonal directions, e.g., $\{(0, 0), (\pm s_M, 0), (0, \pm s_M), (\pm s_M/\sqrt{2}, \pm s_M/\sqrt{2})\}$, where $s_M$ is the maximum movement scale (typically 1.0).

  7. [7]

    Select Best Move: For each candidate movement $m_i = (\Delta x_i, \Delta y_i) \in A_{\text{move}}$, calculate the agent's potential next position $p'_{a,i}(t+1)$ by applying the movement to $p_a(t)$ and considering the grid's periodic boundaries. Evaluate the distance from this potential position to the initially identified nearest threat's current position $p_{h_{\text{nearest}}}(t)$. Select the movement vector...

  8. [8]

    “Winner” indicates the model with significantly longer duration after correction. “Mean Diff

    Set Adherence: The adherence component of the action is deterministically set to the maximum value, $\alpha = 1.0$. 6. Final Action: The resulting action for timestep $t$ is $a(t) = (m^*, \alpha = 1.0)$. This implementation defines a simple, reactive strategy that exploits complete and accurate environmental state information to maximize instantaneous separation from the nearest p...
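
The greedy baseline steps excerpted above (find the nearest infected human on the torus, score nine candidate moves, pick the one that maximizes separation, and set adherence to 1.0) can be sketched as follows. This is an illustrative reading of the truncated excerpts, not the authors' implementation: the function names, the `(position, state)` tuple layout, and the tie-breaking rule (first maximum wins) are assumptions.

```python
import math


def torus_distance(p, q, size):
    """Euclidean distance on a square toroidal grid of side `size`."""
    dx = abs(p[0] - q[0]) % size
    dy = abs(p[1] - q[1]) % size
    dx = min(dx, size - dx)
    dy = min(dy, size - dy)
    return math.hypot(dx, dy)


def candidate_moves(s_max=1.0):
    """Zero vector plus maximum-length steps in the eight compass directions."""
    d = s_max / math.sqrt(2)
    return [(0.0, 0.0),
            (s_max, 0.0), (-s_max, 0.0), (0.0, s_max), (0.0, -s_max),
            (d, d), (d, -d), (-d, d), (-d, -d)]


def greedy_action(agent, humans, size, s_max=1.0):
    """Return (movement, adherence); `humans` is a list of (position, state)."""
    infected = [pos for pos, state in humans if state == "I"]
    if not infected:
        return (0.0, 0.0), 1.0  # no threat: stay put, keep full adherence
    nearest = min(infected, key=lambda pos: torus_distance(agent, pos, size))

    def distance_after(move):
        # Apply the move with periodic boundaries, then measure the distance
        # to the nearest threat's current (not predicted) position.
        nxt = ((agent[0] + move[0]) % size, (agent[1] + move[1]) % size)
        return torus_distance(nxt, nearest, size)

    best = max(candidate_moves(s_max), key=distance_after)
    return best, 1.0


humans = [((6.0, 5.0), "I"), ((4.0, 5.0), "S")]
print(greedy_action((5.0, 5.0), humans, size=10))
```

Because the policy reacts only to the single nearest threat, it can step toward a second infected individual; that limitation is inherent to the one-step, nearest-threat rule described in the excerpts.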