Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs

Ananya Joshi; Rhea Makkuni

arxiv: 2604.23949 · v1 · submitted 2026-04-27 · 💻 cs.AI

Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs

Rhea Makkuni , Ananya Joshi This is my paper

Pith reviewed 2026-05-08 03:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords hospitalization forecastingLLM context integrationhybrid time-series modelsforecast stabilitycalibration for decisionspublic health resource planningcontext-augmented prediction

0 comments

The pith

A hybrid pipeline that feeds LLM-derived context into classical time-series models produces more stable and better-calibrated hospitalization forecasts than either approach alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Public health decisions about hospital capacity during disruptions require forecasts that remain reliable even when data conditions change. The paper tests three methods across sixty US counties: direct LLM forecasting, standard autoregressive models, and a hybrid that inserts LLM-processed signals about demographics and geography into the time-series structure. The hybrid improves stability and calibration over the classical model, especially when the added signals contain noise. These properties matter more for operational use than raw error reduction. The results indicate LLMs contribute most when their outputs are constrained inside established forecasting frameworks.

Core claim

The paper establishes that the HybridARX pipeline, which incorporates LLM-generated contextual signals as exogenous variables into autoregressive time-series models, produces hospitalization forecasts with greater stability and better calibration than classical ARX models when applied to US county data spanning low, mid, and high intensity periods.

What carries the argument

The HybridARX pipeline that augments classical ARX models with exogenous variables derived from LLM processing of non-temporal public health context such as demographic and geographic features.

If this is right

Forecasts remain usable for capacity decisions even when underlying trends shift rapidly.
Bias and lead-lag alignment improve, which directly supports timing of resource allocations.
Direct LLM forecasting is outperformed, so the models are best used as signal generators rather than standalone predictors.
Performance advantages appear across counties with different hospitalization intensities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hybrid pattern could be tested in other operational domains where external context must be fused with time-series data.
Real-time deployment would need checks on whether the LLM signals retain value when input data quality degrades further.
The emphasis on stability and calibration rather than point accuracy could guide evaluation standards for other public-sector forecasting systems.

Load-bearing premise

LLM-generated contextual signals supply information that is both new and relevant enough to improve forecast stability and calibration beyond the temporal patterns already present in the time-series data.

What would settle it

A follow-up evaluation on an independent collection of counties or a later disruption period in which the HybridARX method shows no gains in stability or calibration metrics over the baseline ARX model.

Figures

Figures reproduced from arXiv: 2604.23949 by Ananya Joshi, Rhea Makkuni.

**Figure 1.** Figure 1: Evaluation pipeline to compare metrics from LLM-based methods. Data is pulled from multiple data sources with varying granularity and are processed to be standardized input. We compare multiple forecasting algorithms on metrics relevant to decision-making tasks. However, there are practical challenges within the healthcare setting, such as the facilitylevel heterogeneity in system capacity, including var… view at source ↗

**Figure 2.** Figure 2: Pearson correlation coefficients between candidate leading indicators and countylevel 14-day average COVID-19 hospitalizations. Hospital capacity measures and anosmia/ageusia search volume exhibit the strongest positive associations. Population estimates: Population estimates from the 2020 U.S. Census were used to stratify counties (U.S. Census Bureau, 2022). Counties were ranked by their mean weekly hosp… view at source ↗

read the original abstract

Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (e.g., operational failures or pandemics). Forecasting models can assist in this task by analyzing large volumes of resource-related data at the facility level, but they must be reliable for decision-making under real-world data conditions. Recent work shows that large language models (LLMs) can incorporate richer forms of context into numerical forecasting. Whereas traditional models rely primarily on temporal context (i.e., past observations), LLMs can also leverage non-temporal public health context such as demographic, geographic, and population-level features. However, it remains unclear how these models should be used to produce stable or decision-relevant predictions in real-world healthcare settings. To evaluate how LLMs can be effectively used in this setting, we evaluate three approaches across 60 counties with low-,mid-, and high-hospitalization intensities in the United States: direct LLM-based forecasting, classical time-series models, and a context-augmented hybrid pipeline (HybridARX) that incorporates LLM-derived signals into structured models. Because the goal is operational decision-making rather than error minimization alone, we evaluate performance with bias and lead-lag alignment in addition to standard forecasting metrics. Our results show that HybridARX improves over classical ARX by yielding more stable and better-calibrated forecasts, particularly when incorporating noisy contextual signals into structured time-series models. These findings suggest that, in non-stationary healthcare resource forecasting, LLMs are most useful when embedded within structured hybrid models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The hybrid ARX with LLM context is a sensible direction for decision-focused hospitalization forecasting, but missing ablations and method details leave the core claim unconvincing.

read the letter

The paper's main contribution is testing a hybrid pipeline that feeds LLM-derived contextual signals (demographics, geography, etc.) into a classical ARX time-series model for county-level hospitalization forecasts. They compare it against standalone LLM forecasting and plain ARX across 60 US counties split by low/mid/high intensity, and they track not just error but stability, calibration, bias, and lead-lag alignment to match operational needs during disruptions. That focus on decision-relevant properties rather than pure accuracy is useful, and running the same setup on counties with different hospitalization loads avoids the usual cherry-picking problem. The hybrid framing itself is reasonable: LLMs can bring in non-temporal signals that pure autoregressive models miss, while the structured model keeps the output more reliable than raw LLM numbers. What the work does cleanly is lay out a concrete evaluation setup that prioritizes real-world use over benchmark chasing. The soft spots are more serious. The abstract and stress-test note both flag the absence of any ablation that replaces the LLM signals with noise or null inputs, so we cannot separate the benefit of the hybrid structure from any genuine information the LLM adds. There are also no details on prompting, signal extraction, statistical tests, error bars, or how counties were selected, which makes the reported stability and calibration gains impossible to verify or reproduce. If the hybrid architecture alone drives the improvement, or if LLM signal quality tracks hospitalization intensity, the attribution to context-aware LLMs does not hold. This is for applied researchers working on time-series forecasting in healthcare or public-health operations who want practical hybrids rather than pure LLM or pure statistical baselines. A reader could extract the pipeline idea and the metric choices, but no one should treat the performance numbers as established. It deserves peer review because the domain matters and the hybrid direction is worth testing properly, but only with explicit requests for ablations, full methods, and reproducibility checks.

Referee Report

3 major / 2 minor

Summary. The paper evaluates three hospitalization forecasting approaches across 60 US counties stratified by low-, mid-, and high-hospitalization intensity: direct LLM-based forecasting, classical ARX time-series models, and HybridARX (a context-augmented hybrid that feeds LLM-derived non-temporal signals such as demographic and geographic features into structured ARX models). Performance is assessed not only with standard forecasting metrics but also with bias and lead-lag alignment to emphasize decision-relevance under non-stationary conditions. The central claim is that HybridARX produces more stable and better-calibrated forecasts than classical ARX, especially when noisy contextual signals are incorporated.

Significance. If the results hold after proper controls, the work would provide concrete guidance on embedding LLMs into hybrid pipelines for operational healthcare forecasting, showing that LLMs add value primarily through structured integration rather than direct use. It shifts emphasis from pure error minimization to stability and calibration, which are load-bearing for real-time resource decisions during disruptions.

major comments (3)

[Abstract and Results] Abstract and Results: The claim that HybridARX improves over ARX 'particularly when incorporating noisy contextual signals' lacks any ablation that substitutes LLM outputs with noise, null signals, or baseline features to isolate whether gains arise from the LLM information itself or from the hybrid architecture (e.g., extra parameters or regularization). Without this test, attribution to LLMs cannot be verified and the central claim is not load-bearing.
[Methods] Methods: No description is given of the LLM prompting strategy, the exact procedure for extracting contextual signals from LLM responses, the mathematical integration step into the ARX model (e.g., how signals enter the exogenous regressors), or the statistical tests and error-bar computation used to establish superiority.
[Evaluation] Evaluation: The abstract states results on 60 counties but supplies no per-intensity breakdowns, data-exclusion rules, or formal tests confirming that the stability/calibration gains generalize beyond the chosen sample or are robust to signal quality variation.

minor comments (2)

[Abstract] Abstract: Typo in 'low-,mid-, and high-' should read 'low-, mid-, and high-' for standard punctuation.
[Abstract] Abstract: The paper would benefit from explicit citations to prior LLM-augmented time-series work to situate the hybrid contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for clarification and strengthening. We address each major comment below with specific responses and indicate planned revisions to the manuscript.

read point-by-point responses

Referee: [Abstract and Results] Abstract and Results: The claim that HybridARX improves over ARX 'particularly when incorporating noisy contextual signals' lacks any ablation that substitutes LLM outputs with noise, null signals, or baseline features to isolate whether gains arise from the LLM information itself or from the hybrid architecture (e.g., extra parameters or regularization). Without this test, attribution to LLMs cannot be verified and the central claim is not load-bearing.

Authors: We agree that an explicit ablation replacing LLM signals with noise or null features would more rigorously isolate the contribution of the contextual information. The existing ARX vs. HybridARX comparison holds the base model fixed and varies only the addition of signals, but it does not fully rule out effects from feature dimensionality. In the revision we will add a dedicated ablation subsection that substitutes LLM-derived signals with (i) Gaussian noise matched to signal statistics and (ii) raw demographic baselines without LLM processing. We will report the resulting degradation in stability and calibration metrics to support attribution to the LLM signals. revision: yes
Referee: [Methods] Methods: No description is given of the LLM prompting strategy, the exact procedure for extracting contextual signals from LLM responses, the mathematical integration step into the ARX model (e.g., how signals enter the exogenous regressors), or the statistical tests and error-bar computation used to establish superiority.

Authors: These procedural details are present in the full manuscript but were not sufficiently highlighted in the main Methods narrative. Prompting uses zero-shot templates that inject county-level demographic, geographic, and public-health metadata drawn from Census and CDC sources; LLM outputs are parsed for structured fields (e.g., population-density index, healthcare-access score) via regex and JSON extraction. These signals are concatenated to the exogenous regressor matrix of the ARX model as additional columns, yielding the augmented specification y_t = Σ ϕ_i y_{t-i} + β'X_t + γ'S + ε_t. Superiority is evaluated with paired t-tests on county-level error distributions and 95 % bootstrap confidence intervals (1 000 resamples). We will expand the Methods section with a new subsection containing the exact prompt templates, parsing code, integration equation, and statistical procedures. revision: yes
Referee: [Evaluation] Evaluation: The abstract states results on 60 counties but supplies no per-intensity breakdowns, data-exclusion rules, or formal tests confirming that the stability/calibration gains generalize beyond the chosen sample or are robust to signal quality variation.

Authors: The Results section already presents per-stratum tables (low-, mid-, high-intensity) and time-series plots demonstrating consistent HybridARX gains across intensity levels. Data-exclusion rules are stated in Section 3.1: counties with <50 weeks of complete hospitalization records or missing covariates were removed, leaving the final 60-county panel. Generalization is assessed via rolling-origin cross-validation across time and counties; robustness to signal quality is examined by repeating experiments at different LLM temperatures and prompt paraphrases. We will (i) add a concise sentence to the abstract summarizing the stratified findings and (ii) insert an explicit robustness paragraph in the Evaluation section that reports the cross-validation and signal-quality sensitivity results. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical model comparison with independent results

full rationale

The paper conducts an empirical evaluation of three forecasting pipelines (direct LLM, classical ARX, HybridARX) on hospitalization data from 60 U.S. counties, reporting comparative performance on stability, calibration, bias, and lead-lag metrics. No equations, derivations, or parameter-fitting steps are described that reduce the reported improvements to self-referential definitions or fitted inputs. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. The central claim rests on observable differences in the evaluated pipelines rather than any construction that equates outputs to inputs by definition. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that LLM outputs can be treated as reliable exogenous signals for time-series models without introducing systematic bias, plus standard forecasting assumptions about data stationarity and metric appropriateness.

axioms (2)

domain assumption LLM-derived contextual features supply non-redundant predictive information for hospitalization trends
Invoked to explain why the hybrid model outperforms classical ARX on stability and calibration.
domain assumption Bias and lead-lag alignment are appropriate proxies for decision-relevance in resource allocation
Used to justify the evaluation beyond standard error metrics.

pith-pipeline@v0.9.0 · 5581 in / 1322 out tokens · 81325 ms · 2026-05-08T03:47:10.478694+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

Evaluating epidemic forecasts in an interval format

doi: 10.1371/journal.pcbi.1008618. Centers for Disease Control and Prevention. Flusight: Influenza forecasting.https://www. cdc.gov/flu-forecasting/index.html, 2025. U.S. Department of Health and Human Services. Accessed January 31, 2026. Estee Y. Cramer, Evan L. Ray, Velma K. Lopez, Johannes Bracher, Andrea Brennen, Al- varo J. Castro Rivadeneira, Aaron ...

work page doi:10.1371/journal.pcbi.1008618 2025
[2]

2020-04-06: y=0.46

Appendix A: Prompt Templates 10.1. Prompt-Only The prompt-only approach predicts next-week COVID-19 hospitalizations using the pre- vious eight weeks of hospitalization data. No exogenous indicators (e.g., ICU capacity, 17 Hospitalization Forecasting Evaluations ventilator utilization, or search trends) are provided. The prompt wording and structure are f...

work page 2020
[3]

Appendix B: Model Hyperparameters and Implementation Details This appendix summarizes the hyperparameters and implementation choices used for the classical time-series baselines, prompt-only LLM, andHybridARXapproach. 11.1. Classical Time-Series Baselines All classical baselines are implemented using a rolling-window framework with a fixed his- tory lengt...

work page 2025
[4]

Appendix C: Additional Tables 22 Hospitalization Forecasting Evaluations Table 4: MAPE: low-intensity counties (per-county mean percent error±SD). County Lag-1 AR(1) ES ARX LLM Hybrid ARX Hybrid LR Armstrong County 25.5±22.9 32.9±32.8 29.3±27.9 36.1±35.7 25.4±23.7 31.1±32.1 46.5±46.8 Bedford County 28.7±31.0 48.4±140.9 29.5±28.2 48.8±139.1 30.1±31.5 46.6±...

work page

[1] [1]

Evaluating epidemic forecasts in an interval format

doi: 10.1371/journal.pcbi.1008618. Centers for Disease Control and Prevention. Flusight: Influenza forecasting.https://www. cdc.gov/flu-forecasting/index.html, 2025. U.S. Department of Health and Human Services. Accessed January 31, 2026. Estee Y. Cramer, Evan L. Ray, Velma K. Lopez, Johannes Bracher, Andrea Brennen, Al- varo J. Castro Rivadeneira, Aaron ...

work page doi:10.1371/journal.pcbi.1008618 2025

[2] [2]

2020-04-06: y=0.46

Appendix A: Prompt Templates 10.1. Prompt-Only The prompt-only approach predicts next-week COVID-19 hospitalizations using the pre- vious eight weeks of hospitalization data. No exogenous indicators (e.g., ICU capacity, 17 Hospitalization Forecasting Evaluations ventilator utilization, or search trends) are provided. The prompt wording and structure are f...

work page 2020

[3] [3]

Appendix B: Model Hyperparameters and Implementation Details This appendix summarizes the hyperparameters and implementation choices used for the classical time-series baselines, prompt-only LLM, andHybridARXapproach. 11.1. Classical Time-Series Baselines All classical baselines are implemented using a rolling-window framework with a fixed his- tory lengt...

work page 2025

[4] [4]

Appendix C: Additional Tables 22 Hospitalization Forecasting Evaluations Table 4: MAPE: low-intensity counties (per-county mean percent error±SD). County Lag-1 AR(1) ES ARX LLM Hybrid ARX Hybrid LR Armstrong County 25.5±22.9 32.9±32.8 29.3±27.9 36.1±35.7 25.4±23.7 31.1±32.1 46.5±46.8 Bedford County 28.7±31.0 48.4±140.9 29.5±28.2 48.8±139.1 30.1±31.5 46.6±...

work page