FlatASCEND: Autoregressive Clinical Sequence Generation with Continuous Time Prediction and Association-Based Pharmacological Testing
Pith reviewed 2026-05-10 15:07 UTC · model grok-4.3
The pith
Patient-specific prefixes in an autoregressive clinical model amplify mechanistic drug effects while leaving confounding associations unchanged.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlatASCEND generates patient-conditioned clinical sequences whose responses to intervention tokens preserve known pharmacological associations, with a prompt-shuffle ablation demonstrating that patient-specific prefixes amplify mechanistic effects 2.0-2.2 times for steroid-to-glucose and diuretic-to-potassium while leaving confounding-driven associations at 0.9 times for insulin-to-glucose. On MIMIC-IV incident-user comparisons the model recovers correct mechanistic directions in 4 of 10 cases, reproduces context associations in 2, and produces incorrect directions in 4, a pattern the authors interpret as learned observational associations without causal separation.
What carries the argument
The prompt-shuffle ablation on patient-conditioned autoregressive generation, which isolates how specific patient history strengthens preservation of mechanistic pharmacological associations in the output trajectories.
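One way to read the 2.0-2.2x and 0.9x figures is as a ratio of two effect sizes: the intervention effect when each rollout starts from the patient's own prefix, divided by the effect when prefixes are shuffled across patients. The abstract does not give the paper's estimator, so the mean-difference effect, the function names, and the toy numbers below are assumptions for illustration only.

```python
import random
from statistics import mean


def mean_effect(treated, control):
    """Mean difference in a generated outcome between rollouts with and
    without the intervention token (assumed effect measure)."""
    return mean(treated) - mean(control)


def shuffle_prefixes(prefixes, seed=0):
    """Return a permuted copy of the prefixes so that rollouts are conditioned
    on another patient's history. A plain shuffle can leave some prefixes in
    place; it is not a guaranteed derangement."""
    rng = random.Random(seed)
    shuffled = list(prefixes)
    rng.shuffle(shuffled)
    return shuffled


def amplification_ratio(effect_matched, effect_shuffled):
    """Effect under matched prefixes divided by effect under shuffled prefixes."""
    if effect_shuffled == 0:
        raise ValueError("shuffled-prefix effect is zero; ratio is undefined")
    return effect_matched / effect_shuffled


# Toy outcome values (hypothetical, not taken from the paper).
matched = mean_effect(treated=[150.0, 160.0, 170.0], control=[120.0, 130.0, 140.0])
shuffled = mean_effect(treated=[140.0, 145.0, 150.0], control=[125.0, 130.0, 135.0])
ratio = amplification_ratio(matched, shuffled)  # 30.0 / 15.0 = 2.0
```

Under this reading, a ratio near 1 (as reported for insulin-to-glucose) means the patient's own history adds little beyond what any prefix provides.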
If this is right
- Generative clinical models can be assessed by whether patient conditioning selectively boosts known mechanistic associations rather than by distributional overlap alone.
- Optimization methods that share outcome variables with the evaluation metric can erase correct pharmacological directions.
- Short-horizon predictions in intensive-care settings remain more reliable than longer outpatient sequences.
- Zero-shot application across hospitals degrades without further adaptation.
Where Pith is reading between the lines
- If the amplification effect holds on interventional data, the same conditioning approach could support in-silico simulation of personalized treatment responses.
- Adding explicit causal structure or external knowledge graphs might help the model move beyond observational associations.
- The finding that reward optimization destroys correct links suggests caution when aligning such generators with surrogate objectives that overlap evaluation domains.
Load-bearing premise
That an incident-user design plus prompt shuffling on observational hospital records can separate true mechanistic pharmacological responses from residual confounding and learned correlations.
What would settle it
Run the same model and ablation on data from randomized controlled trials for the tested drug-outcome pairs and check whether the 2.0-2.2x amplification for mechanistic effects disappears or persists.
Original abstract
Autoregressive models can predict clinical events, but generating patient-conditioned multi-step trajectories that respond to intervention tokens and testing whether those responses preserve known pharmacological associations has received limited attention. We present FlatASCEND, a 14.5M-parameter autoregressive clinical sequence model using flat composite tokens and a zero-inflated log-normal time head. Standard distributional metrics (Jaccard 0.889-0.954) do not distinguish FlatASCEND from trivial baselines; the model's value lies in conditional generation from patient-specific prefixes. A prompt-shuffle ablation shows patient-specific conditioning amplifies mechanistic pharmacological effects (2.0-2.2x for steroid to glucose, diuretic to potassium) while leaving confounding-driven associations unchanged (0.9x for insulin to glucose). An incident-user framework assesses directional consistency against prior pharmacological knowledge on MIMIC-IV (N=500 per comparison): 4/10 recover correct mechanistic directions, 2 reproduce treatment-context associations, 4 are incorrect (9/10 significant, Wilcoxon p<0.05). This pattern - partial recovery under residual confounding - is consistent with learned observational associations without causal distinction. Direct preference optimisation with surrogate reward destroys all correct associations (3/3 to 0/3), illustrating reward exploitation when reward and evaluation share an outcome domain. Generative evidence is strongest for short-horizon ICU data; outpatient temporal fidelity is weaker (median 10 vs 154 days on INSPECT), and zero-shot cross-site transfer degrades without adaptation.
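The abstract names a zero-inflated log-normal time head but does not spell out its parameterization. A minimal sketch of the standard form, assuming the head emits a probability `pi_zero` that the next event has zero elapsed time plus log-normal parameters `mu` and `sigma` for positive gaps (all three names are hypothetical):

```python
import math
import random


def ziln_nll(t, pi_zero, mu, sigma):
    """Negative log-likelihood of a time gap t under a zero-inflated
    log-normal: P(t = 0) = pi_zero, otherwise t ~ LogNormal(mu, sigma)."""
    if t < 0:
        raise ValueError("time gap must be non-negative")
    if t == 0:
        return -math.log(pi_zero)
    z = (math.log(t) - mu) / sigma
    log_pdf = -math.log(t * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z
    return -math.log(1.0 - pi_zero) - log_pdf


def ziln_sample(pi_zero, mu, sigma, rng):
    """Draw one time gap: an exact zero with probability pi_zero,
    otherwise a positive log-normal draw."""
    if rng.random() < pi_zero:
        return 0.0
    return rng.lognormvariate(mu, sigma)


rng = random.Random(0)
gaps = [ziln_sample(pi_zero=0.3, mu=0.0, sigma=1.0, rng=rng) for _ in range(1000)]
```

The point mass at zero lets the model represent events recorded at the same timestamp, which a plain log-normal cannot do because its density is defined only for positive values.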
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FlatASCEND, a 14.5M-parameter autoregressive model for clinical sequence generation using flat composite tokens and a zero-inflated log-normal time head. It claims that while standard distributional metrics (Jaccard 0.889-0.954) fail to beat trivial baselines, the model's value is in patient-specific conditional generation; a prompt-shuffle ablation purportedly shows selective amplification of mechanistic pharmacological effects (2.0-2.2x for steroid-glucose and diuretic-potassium) versus no change for confounding associations (0.9x for insulin-glucose), and an incident-user design on MIMIC-IV recovers correct directions for 4/10 tested associations (9/10 significant by Wilcoxon), consistent with learned observational associations without causal distinction. DPO is shown to destroy correct associations.
Significance. If the prompt-shuffle ablation and incident-user design validly separate mechanistic responses from residual confounding and learned correlations, the work would offer a useful probe for how autoregressive clinical models encode conditional dependencies, with potential applications in trajectory simulation. The explicit acknowledgment that results reflect observational associations, the use of ablations and statistical tests, and the negative DPO result are strengths that demonstrate careful evaluation. However, the low association recovery rate and failure of unconditional metrics limit broader impact.
major comments (3)
- Prompt-shuffle ablation results: the claim of selective 2.0-2.2x amplification for mechanistic pairs (steroid to glucose, diuretic to potassium) versus 0.9x for the confounding pair (insulin to glucose) assumes the global shuffle isolates patient-specific conditioning without altering marginal distributions over time-varying confounders. Because the model is trained exclusively on observational MIMIC-IV trajectories, token transitions already entangle patient features, treatments, and outcomes; the differential effect may simply reflect stronger representation of mechanistic pairs in patient-specific marginals rather than true isolation of pharmacological mechanism. The paper itself concludes the pattern is 'consistent with learned observational associations without causal distinction,' making the selective-amplification framing load-bearing yet under-supported.
- Incident-user pharmacological testing framework (N=500 per comparison): only 4/10 associations recover the correct mechanistic direction, 4 are incorrect, and 2 reproduce treatment-context associations, despite 9/10 reaching Wilcoxon p<0.05. This low recovery rate, under the paper's own residual-confounding caveat, provides only weak evidence that the generative model captures pharmacological associations beyond training correlations; the framework's directional checks against external knowledge therefore do not strongly validate the model's mechanistic fidelity.
- Direct preference optimisation experiment: the finding that DPO reduces correct associations from 3/3 to 0/3 is presented as illustrating reward exploitation when reward and evaluation share an outcome domain. However, no additional controls (e.g., alternative rewards or out-of-domain evaluation) are reported to distinguish exploitation from general degradation of the generative distribution, weakening the illustrative claim.
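The second major comment turns on the gap between statistical significance (9/10) and directional correctness (4/10). The sketch below separates the two checks. The abstract says only "Wilcoxon", so the rank-sum variant, the normal approximation without tie or continuity correction, and the function name are assumptions, not the paper's procedure.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.
    Returns (z, p); z > 0 means values in x tend to exceed values in y."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Toy generated outcomes (hypothetical): exposed vs unexposed rollouts.
exposed = [6.0, 7.0, 8.0, 9.0, 10.0]
unexposed = [1.0, 2.0, 3.0, 4.0, 5.0]
z, p = rank_sum_test(exposed, unexposed)

expected_sign = -1  # suppose prior knowledge says the drug lowers the outcome
significant = p < 0.05
direction_correct = (z > 0) == (expected_sign > 0)
```

In this toy case the shift is significant but points the wrong way, which is the pattern the referee flags: a small p-value says the groups differ, not that the model has the pharmacology right.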
minor comments (2)
- The abstract and results note that standard distributional metrics do not distinguish FlatASCEND from trivial baselines; explicit numerical comparisons to those baselines (e.g., unigram or Markov models) should be added to the main text or a table for clarity.
- Outpatient temporal fidelity is reported as weaker (median 10 vs 154 days on INSPECT) with degraded zero-shot cross-site transfer; a brief discussion of potential causes (e.g., data sparsity or tokenization) would help readers interpret the scope of the model's strengths.
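The first minor comment asks for explicit baseline numbers. A toy illustration of why a set-overlap metric such as Jaccard can fail to separate a model from a trivial baseline; how the paper computes its Jaccard scores is not stated in the abstract, and the token names here are hypothetical:

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of tokens."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(union)


real_codes = {"glucose_high", "potassium_low", "insulin", "furosemide", "prednisone"}
model_codes = {"glucose_high", "potassium_low", "insulin", "furosemide", "creatinine_high"}
unigram_codes = {"glucose_high", "potassium_low", "insulin", "furosemide", "sodium_low"}

model_score = jaccard(real_codes, model_codes)        # 4 shared of 6 distinct
baseline_score = jaccard(real_codes, unigram_codes)   # 4 shared of 6 distinct
```

Both scores are identical because the metric only checks which tokens appear, not whether they appear in the right order or in response to the right history.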
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight important limitations in our interpretation of the prompt-shuffle ablation, the strength of evidence from the incident-user tests, and the controls in the DPO experiment. We address each point below with clarifications drawn directly from the manuscript's own caveats and propose targeted revisions to improve precision without overstating the results.
Point-by-point responses
- Referee: Prompt-shuffle ablation results: the claim of selective 2.0-2.2x amplification for mechanistic pairs (steroid to glucose, diuretic to potassium) versus 0.9x for the confounding pair (insulin to glucose) assumes the global shuffle isolates patient-specific conditioning without altering marginal distributions over time-varying confounders. ... making the selective-amplification framing load-bearing yet under-supported.
Authors: We agree that the shuffle ablation cannot isolate causal pharmacological mechanisms, as the model is trained solely on observational MIMIC-IV data where token transitions already entangle patient features, treatments, and outcomes. The manuscript explicitly concludes that the observed pattern is 'consistent with learned observational associations without causal distinction.' The differential effect (amplification for the two mechanistic pairs, no change for the confounding pair) is presented only as evidence that patient-specific prefixes produce non-uniform shifts relative to shuffled prefixes, not as proof of mechanism isolation. To address the concern that the framing may be load-bearing, we will revise the relevant section and abstract to replace 'selective amplification of mechanistic pharmacological effects' with 'differential response to patient-specific conditioning for certain associations,' while retaining the quantitative results and the explicit non-causal caveat. This is a partial revision focused on language precision.
revision: partial
- Referee: Incident-user pharmacological testing framework (N=500 per comparison): only 4/10 associations recover the correct mechanistic direction, 4 are incorrect, and 2 reproduce treatment-context associations, despite 9/10 reaching Wilcoxon p<0.05. This low recovery rate, under the paper's own residual-confounding caveat, provides only weak evidence that the generative model captures pharmacological associations beyond training correlations.
Authors: We report the exact recovery statistics (4/10 correct mechanistic directions, 4 incorrect, 2 treatment-context) and the Wilcoxon significance (9/10) in the manuscript, and we frame the entire experiment as showing only 'partial recovery under residual confounding' that remains 'consistent with learned observational associations without causal distinction.' The test is not offered as strong validation of mechanistic fidelity but as a directional probe against external pharmacological knowledge on the same observational data source. The low correct-direction rate is therefore not an unacknowledged weakness but the central reported finding. We will add a sentence in the discussion explicitly noting that the mixed directional results limit claims about fidelity beyond correlations. This is a partial revision.
revision: partial
- Referee: Direct preference optimisation experiment: the finding that DPO reduces correct associations from 3/3 to 0/3 is presented as illustrating reward exploitation when reward and evaluation share an outcome domain. However, no additional controls (e.g., alternative rewards or out-of-domain evaluation) are reported to distinguish exploitation from general degradation of the generative distribution, weakening the illustrative claim.
Authors: The DPO result is included as a negative finding to illustrate the risk that a surrogate reward defined on the same outcome domain can eliminate previously observed correct associations. We acknowledge that, without controls such as alternative reward formulations or out-of-domain evaluation metrics, the experiment cannot rigorously separate targeted exploitation from broader degradation of the generative distribution. We will revise the text to describe the result as 'an illustrative case of association loss under in-domain reward optimization' and add a brief note that additional controls would be required to confirm exploitation as the mechanism. This is a partial revision.
revision: partial
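For readers unfamiliar with the objective at issue in the third exchange, the standard per-pair DPO loss from Rafailov et al. can be sketched as follows. How the paper builds preference pairs from its surrogate reward, and what value of `beta` it uses, are not stated in the abstract; the argument names and the default `beta=0.1` are assumptions.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair. Inputs are sequence log-probabilities
    under the policy being trained and under the frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)


# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Raising the chosen trajectory's log-probability relative to the reference lowers the loss.
loss_after_update = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

The loss rewards any shift that widens the margin for the preferred trajectory, so if "preferred" is defined by a surrogate on the same outcome that the evaluation measures, the optimizer can move that outcome directly without preserving the drug-outcome relationship.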
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper reports empirical results from training an autoregressive model on MIMIC-IV trajectories and measuring differential amplification in a prompt-shuffle ablation plus directional consistency against external pharmacological knowledge. No equations, fitted parameters, or self-citations are presented that reduce the central claims (e.g., the 2.0-2.2x vs 0.9x contrast) to the training inputs by construction. The ablation compares conditioned versus shuffled prefixes within the same trained model; the resulting ratios are observed statistics on generated sequences rather than identities or renamed fits. The manuscript explicitly qualifies its findings as consistent with learned observational associations, confirming the evaluation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Observational MIMIC-IV data with incident-user design permits assessment of directional consistency against prior pharmacological knowledge