TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

Changhun Kim; Eunho Yang; Gyouk Chu; Hangyul Yoon; Hyeongwon Jang; Joonhyung Park

arXiv: 2606.09030 · v1 · pith:RTSE373Knew · submitted 2026-06-08 · 💻 cs.LG · cs.AI · cs.CL

Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon, Eunho Yang

Pith reviewed 2026-06-27 17:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords: irregularly sampled medical time series, large language models, risk prediction, dialectical reasoning, explainable AI, clinical triage, calibration, early warning systems

The pith

Dialectical reasoning allows LLMs to produce continuous calibrated risk scores with explicit clinical rationales for medical time series.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs tend to polarize medical risk into overconfident binary predictions when applied to irregularly sampled time series from electronic health records. TRIAGE counters this by training the model to generate dialectical reasoning that weighs competing clinical outcomes through separate rationales. This setup lets one model output continuous risk scores backed by interpretable reasoning that clinicians can check. The result is better calibration and prediction performance on three benchmarks along with higher quality rationales. Clinicians gain a tool that supports triage decisions without forcing a hard yes-no call upfront.

Core claim

TRIAGE trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, it achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared to baselines, with rationales surpassing post-hoc explanations by 20% in clinical reasoning quality.
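One way a continuous score could be read out of a model that ends its answer with a single-digit decision is to compare the probabilities of the two candidate tokens at that position. The sketch below assumes that mechanism; the abstract does not say how TRIAGE derives its score, and `risk_from_logits` and the logit values are hypothetical.

```python
import math

def risk_from_logits(logit_neg: float, logit_pos: float) -> float:
    """Two-way softmax over the logits of the '0' and '1' decision tokens.

    Assumption: the score is the normalized probability of the positive token.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(logit_neg, logit_pos)
    e_neg = math.exp(logit_neg - m)
    e_pos = math.exp(logit_pos - m)
    return e_pos / (e_neg + e_pos)

# Hypothetical logits: a polarized model puts nearly all mass on one token,
# while a graded model leaves room for intermediate risk.
polarized = risk_from_logits(-6.0, 6.0)
graded = risk_from_logits(0.2, -0.6)
```

Under this reading, risk polarization would show up as scores piling up near 0 or 1 across patients.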

What carries the argument

The dialectical formulation that elicits outcome-specific rationales to prevent risk polarization.
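A minimal sketch of how a dialectical response could be split into its parts. The section headers mirror the answer format in the paper's in-hospital mortality prompts; the parser itself (`parse_dialectical`) and the example text are hypothetical and not taken from the released code.

```python
import re

# Headers taken from the paper's mortality answer format.
HEADERS = (
    "Rationale for survival",
    "Rationale for in-hospital death",
    "Final Decision",
)

def parse_dialectical(response: str) -> dict:
    """Split a response into its '## <header>' sections."""
    # Each section runs from its header line to the next '## ' line or end of text.
    pattern = re.compile(r"^## ([^\n]+)\n(.*?)(?=^## |\Z)", re.S | re.M)
    sections = {title.strip(): body.strip() for title, body in pattern.findall(response)}
    missing = [h for h in HEADERS if h not in sections]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return sections

# Made-up response text, for illustration only.
example = (
    "## Rationale for survival\nGCS stays at 15 and heart rate is stable.\n"
    "## Rationale for in-hospital death\nBUN is abnormal early in the stay.\n"
    "## Final Decision\n0"
)
parsed = parse_dialectical(example)
```

Requiring both rationale sections before accepting the decision is what makes the output checkable: a response that argues only one side fails to parse.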

If this is right

A single LLM can generate both continuous risk scores and explicit reasoning without separate components.
Average AUPRC improves by 3.3% on three ISMTS benchmarks.
Calibration error drops by 81% compared to competitive baselines.
Rationales receive 20% higher ratings in clinical reasoning quality by LLM-as-a-judge assessment.
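For readers who want to reproduce the two headline metrics, here is a small sketch. It assumes AUPRC is computed as average precision and that "calibration error" means expected calibration error with equal-width bins; the abstract does not name the exact estimator, so both choices and the function names are assumptions.

```python
def average_precision(y_true, y_score):
    """Mean of precision-at-rank over the positive examples (ties not handled specially)."""
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")

def expected_calibration_error(y_true, y_score, n_bins=10):
    """Weighted gap between mean predicted risk and observed event rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_score):
        # A score of exactly 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for _, p in b) / len(b)
            event_rate = sum(y for y, _ in b) / len(b)
            ece += len(b) / n * abs(mean_p - event_rate)
    return ece
```

A model that outputs only 0.0 or 1.0 fills just the two extreme bins, so any error there is counted at full weight; that is the mechanism by which polarization hurts calibration.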

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Hospitals could use this to reduce alarm fatigue from binary alerts by providing graded scores with reasoning.
The dialectical structure might apply to other LLM tasks where overconfident classification occurs, such as legal or financial forecasting.
Real-time deployment in EHR systems would test whether continuous scores improve actual clinician decision-making over binary outputs.
Extending to more than two competing outcomes could handle complex multi-class medical predictions.

Load-bearing premise

The LLM-as-a-judge evaluation of rationale quality accurately reflects clinical reasoning quality, and the dialectical formulation does not introduce new forms of bias or polarization.

What would settle it

A study where practicing clinicians rate the generated rationales as no more useful for triage decisions than post-hoc explanations from standard models would falsify the claim of improved reasoning quality.

Figures

Figures reproduced from arXiv: 2606.09030 by Changhun Kim, Eunho Yang, Gyouk Chu, Hangyul Yoon, Hyeongwon Jang, Joonhyung Park.

**Figure 2.** Performance comparison under the leave-variables-out setting on P12 and MIMIC-III.

**Figure 3.** Comparison of discrimination and calibration …

**Figure 4.** AUPRC of the outcome-based and verbalized …

**Figure 5.** Performance under low-resource training. We …

**Figure 6.** Example prompt format used by TRIAGE. Only some of the features are shown. The input is organized …

**Figure 7.** Two reasoning traces sampled from gpt-oss-120b on the same P12 patient. The mean implicit mortality …

**Figure 8.** Default prompt used in the preliminary study for in-hospital mortality prediction.

**Figure 9.** Prompt used for verbalized probability extraction. Only the …

**Figure 10.** Final prompt used for zero-shot LLM evaluation of in-hospital mortality. Only the …

**Figure 11.** Prompt used for LLM-as-judge verdict-closure detection.

**Figure 12.** Prompt used to elicit outcome-conditioned rationales from a strong LLM on P12 and MIMIC-III. The …

**Figure 13.** Prompt used to elicit outcome-conditioned rationales from a strong LLM on P19 (early sepsis prediction).

**Figure 14.** Main SFT prompt used for TRIAGE on P12 and MIMIC-III ( …

**Figure 15.** Main SFT prompt used for TRIAGE on P12 and MIMIC-III (p …

**Figure 16.** Main SFT prompt used for TRIAGE on P19 ( …

**Figure 17.** Main SFT prompt used for TRIAGE on P19 (p …

**Figure 18.** Prompt template used for the One-sided rationale SFT baseline in our reasoning-structure ablation ( …

**Figure 19.** Prompt for the interpretation of the IG result of STraTS.

**Figure 20.** Prompt used for the quantitative evaluation of generated reasoning trace (1/4). We adopt IDEA assessment …

**Figure 21.** Prompt used for the quantitative evaluation of generated reasoning trace (2/4). We adopt IDEA assessment …

**Figure 22.** Prompt used for the quantitative evaluation of generated reasoning trace (3/4). We adopt IDEA assessment …

**Figure 23.** Prompt used for the quantitative evaluation of generated reasoning trace (4/4). We adopt IDEA assessment …

**Figure 24.** Prompt used to detect severe hallucination in reasoning traces. The judge model decides whether the …

Original abstract

Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared to the competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass post-hoc explanations from the baseline by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE.
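To make "irregularly sampled" concrete: each variable is observed at its own times, so the input is a set of (time, variable, value) triples rather than a fixed grid. The paper's example prompt lists each variable as (time, value) pairs, e.g. `HR: (0.2h, 73), (0.7h, 77), (1.7h, 60)`. The sketch below reproduces that layout; the `textualize` helper and the tuple input format are assumptions for illustration, not the authors' code.

```python
from collections import defaultdict

def textualize(observations):
    """Render (hours, variable, value) triples as one '(time, value)' line per variable."""
    series = defaultdict(list)
    # Sorting by time keeps each variable's readings in chronological order.
    for hours, name, value in sorted(observations):
        series[name].append(f"({hours:g}h, {value:g})")
    return "\n".join(f"{name}: {', '.join(pairs)}" for name, pairs in series.items())

# Readings arrive out of order and at different rates per variable.
obs = [(0.7, "HR", 77), (0.2, "HR", 73), (0.2, "GCS", 15), (1.7, "HR", 60), (3.7, "GCS", 15)]
text = textualize(obs)
```

No imputation or resampling happens here: gaps between readings stay visible in the text, which is the property an LLM-based approach relies on.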

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TRIAGE uses dialectical reasoning to push LLMs toward continuous calibrated scores on medical time series, but the LLM-as-judge metric is the weakest link in the evidence.

The main thing to know is that TRIAGE trains an LLM to produce outcome-specific rationales in a dialectical setup so it can output continuous risk scores instead of snapping to overconfident binary calls on irregularly sampled medical time series. The authors report a 3.3% average AUPRC gain and an 81% drop in calibration error across three benchmarks, plus a 20% lift in rationale quality by their LLM judge.

What the paper actually does is identify the polarization problem clearly and offer a concrete prompting-plus-training recipe to address it. Releasing the code is useful for anyone who wants to try the same framing. The focus on calibration and cross-patient comparability is the right target for clinical early-warning systems.

The soft spots are in the evaluation. Everything rests on the LLM-as-a-judge score for reasoning quality, and nothing in the abstract shows that this judge was checked against clinicians or kept independent of the model family. If the judge shares the same biases or simply rewards the dialectical format, the claimed grounding in clinical reasoning is not supported. The reported numbers also come without visible error bars, significance tests, or dataset details, so it is difficult to judge whether the gains are stable or sensitive to the particular splits and baselines.
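For the missing error bars, a percentile bootstrap over test patients is one standard option. A minimal sketch, assuming per-patient labels and scores are available; `bootstrap_ci` and the Brier score used as the example metric are illustrative choices, not something the paper reports.

```python
import random

def brier(y_true, y_score):
    """Mean squared gap between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_score)) / len(y_true)

def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric`, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_score[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lo, hi

# Hypothetical labels and scores for six patients.
y = [0, 1, 0, 1, 1, 0]
p = [0.1, 0.8, 0.3, 0.6, 0.9, 0.2]
lo, hi = bootstrap_ci(y, p, brier)
```

If the intervals for TRIAGE and a baseline overlap heavily on a given benchmark, the reported average gain would deserve less weight.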

This is for researchers already working on LLM applications to electronic health records and irregular time series. A reader in that niche can extract the dialectical idea and test it themselves. It is not a broad methodological shift.

I would send it to peer review. The core problem is real and the proposed fix is specific enough that referees can check the implementation and the judge metric directly.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales for risk prediction on irregularly sampled medical time series (ISMTS). This is claimed to mitigate risk polarization, enabling a single LLM to produce continuous risk scores grounded in explicit clinical reasoning. On three ISMTS benchmarks, TRIAGE reports an average AUPRC improvement of 3.3% and an 81% reduction in calibration error versus competitive baselines, plus a 20% gain in clinical reasoning quality via LLM-as-a-judge evaluation. Source code is released at https://github.com/HyeongWon-Jang/TRIAGE.

Significance. If the empirical gains and rationale quality claims hold under rigorous validation, the work could meaningfully advance calibrated and interpretable LLM-based early warning systems for clinical time series data. The public code release is a clear strength that aids reproducibility.

major comments (1)

[Evaluation section] The central claim that dialectical rationales ground continuous scores in 'explicit clinical reasoning' rests on the LLM-as-a-judge metric showing a 20% quality improvement. The manuscript provides no details on judge-model independence, training-data overlap, or correlation with human clinician ratings (Evaluation section and abstract). This is load-bearing because the skeptic's concern about circularity directly undermines the grounding argument even if the AUPRC and calibration numbers improve.
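The agreement check requested here could be as simple as a rank correlation between judge scores and clinician ratings on the same rationales. The following is a sketch under stated assumptions: Spearman correlation is one reasonable choice rather than a prescribed one, and the score arrays are invented placeholders, since no such clinician ratings exist in the manuscript.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank-transformed inputs (undefined if either is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Placeholder data: integer judge scores and clinician ratings for five rationales.
judge_scores = [9, 12, 7, 14, 10]
clinician_ratings = [3, 4, 2, 4, 3]
rho = spearman(judge_scores, clinician_ratings)
```

Rank correlation is used because judge rubrics and clinician scales rarely share units; only the ordering of rationales needs to agree.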

minor comments (1)

[Abstract] The abstract states results on 'three ISMTS benchmarks' without naming the datasets; this should be stated explicitly in the abstract for immediate clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the evaluation methodology. We will revise the manuscript to provide the requested transparency on the LLM-as-a-judge setup, thereby strengthening the support for our claims regarding explicit clinical reasoning.

Referee: [Evaluation section] The central claim that dialectical rationales ground continuous scores in 'explicit clinical reasoning' rests on the LLM-as-a-judge metric showing a 20% quality improvement. The manuscript provides no details on judge-model independence, training-data overlap, or correlation with human clinician ratings (Evaluation section and abstract). This is load-bearing because the skeptic's concern about circularity directly undermines the grounding argument even if the AUPRC and calibration numbers improve.

Authors: We agree that the Evaluation section and abstract lack sufficient detail on the LLM-as-a-judge protocol, which leaves the grounding claim vulnerable to circularity concerns. In the revised manuscript we will expand the Evaluation section to state that (i) the judge model is a distinct, held-out instance from a different model family than those used for TRIAGE training or inference; (ii) the judge was not exposed to any of the three ISMTS benchmark datasets during its own training or fine-tuning; and (iii) no direct correlation study with human clinicians was performed, which we will list as a limitation. These additions will allow readers to assess the independence of the 20% quality improvement while preserving the primary AUPRC and calibration results, which do not rely on the judge metric.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on independent benchmarks

The abstract and provided text describe a training procedure for dialectical rationales followed by separate reporting of AUPRC, calibration error, and an LLM-as-a-judge score. No equations, parameter fits, or derivations are shown that reduce the claimed mitigation of risk polarization to the inputs by construction. The evaluation metrics are external to the method itself and do not rely on self-citation chains or renaming of known results. The LLM-as-a-judge is presented as an additional assessment rather than a load-bearing premise that loops back to the training data or model outputs in a definitional way.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities can be identified from the provided text.

[1] [1]

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer

Springer. Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks.Advances in neural information processing systems, 28. Danny Castro, Sachin Patil, Muhammad Zubair, and Michael Keenaghan. 2024. Arterial blood gas.Stat- Pearls. Zhengping Che, Sanjay Purushotham, Kyunghyu...

Pith/arXiv arXiv 2015

[2] [2]

InForty-second International Conference on Machine Learning

TIMING: Temporality-aware integrated gra- dients for time series explanation. InForty-second International Conference on Machine Learning. Pengcheng Jiang, Cao Danica Xiao, Minhao Jiang, Par- minder Bhatia, Taha Kass-Hout, Jimeng Sun, and Ji- awei Han. 2025. Reasoning-enhanced healthcare pre- dictions with knowledge graph community retrieval. InInternatio...

Pith/arXiv arXiv 2025

[3] [3]

InForty-second Interna- tional Conference on Machine Learning

Hi-patch: Hierarchical patch gnn for irregular multivariate time series. InForty-second Interna- tional Conference on Machine Learning. Victoria Mank, Waqas Azhar, and Kevin Brown. 2026. Leukocytosis. InStatPearls [Internet]. StatPearls Publishing. Sunil Munakomi, Konstantinos Margetis, and Lind- say M Iverson. 2026. Glasgow coma scale. InStat- Pearls [In...

2026

[4] [4]

InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10392–10407

Carer-clinical reasoning-enhanced representa- tion for temporal health risk prediction. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10392–10407. OpenAI. 2025. Gpt-5.1 instant and gpt-5.1 thinking system card addendum. System card, OpenAI. Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Z...

Pith/arXiv arXiv 2024

[5] [5]

Yulia Rubanova, Ricky TQ Chen, and David K Duve- naud

Early prediction of sepsis from clinical data: the physionet/computing in cardiology challenge 2019.Critical care medicine, 48(2):210–217. Yulia Rubanova, Ricky TQ Chen, and David K Duve- naud. 2019. Latent ordinary differential equations for irregularly-sampled time series.Advances in neural information processing systems, 32. Takaya Saito and Marc Rehms...

Pith/arXiv arXiv 2019

[6] [6]

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, and 1 others

PMLR. Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, and 1 others. 2025. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534. Sindhu Tipirneni and Chandan K Reddy. 2022. Self- supervised transformer for sparse and irregularly sam- pled multivariate clinical time-...

Pith/arXiv arXiv 2025

[7] [7]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, and 1 others

Qwen3 technical report.arXiv preprint arXiv:2505.09388. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, and 1 others. 2026. Dapo: An open-source llm reinforcement learning system at scale.Advances in Neural Information Processing Systems, 38:113222–113244. Zhuoning Yuan, Yan Yan, M...

Pith/arXiv arXiv 2026

[8] [8]

## Final Decision

using gpt-oss-120b (Agarwal et al., 2025), a sparse mixture-of-experts Transformer with 117B total parameters and 5.1B active parameters per token. We load the official release weights in na- tive MXFP4 and sample at temperature 0.7 with reasoning_effort=high. We provide the full prompt in Appendix G. Table 7: In-hospital mortality prediction on MIMIC-III...

2025

[9] [9]

Specifically, for P12 and MIMIC- 13https://github.com/easonLuo2001/KEDGN 14https://github.com/easonLuo2001/Hi-Patch Table 10: SFT Experimental setup details

and oversample the minority class during data collection. Specifically, for P12 and MIMIC- 13https://github.com/easonLuo2001/KEDGN 14https://github.com/easonLuo2001/Hi-Patch Table 10: SFT Experimental setup details. Parameter Value batch_size 64 (P12) 128 (P19/MIMIC-III) max_length 12,288 (P12/MIMIC-III) 11,264 (P19) num_train_epochs 3 optimizer AdamW lea...

2022

[10] [10]

in addition to our 4B default, and further to Llama 3.2 3B (Grattafiori et al., 2024) across archi- tecture families. As reported in Table 14, TRIAGE 17 retains a consistent advantage over the correspond- ing baselines on every backbone, suggesting that our reasoning supervision generalizes beyond a sin- gle model. E.3 Inference Direction Analysis Since T...

2024

[11] [11]

If the patient indeed survives, which features might be the cause?

[12] [12]

If the patient indeed experiences in-hospital death, which features might be the cause?

[13] [13]

Make a final decision: ‘0’ for survival, ‘1’ for in-hospital death

[14] [14]

0” or “1

Patient Features Textualized Static Features: A patient is 54 years old, female, stayed in surgical ICU. Textualized Temporal Observations: HR: (0.2h, 73), (0.7h, 77), (1.7h, 60), … GCS: (0.2h, 15), (3.7h, 15), … BUN: (10.7h, 13), (33.2h, 8), … Figure 6: Example prompt format used by TRIAGE. Only some of the features are shown. The input is organized into...

[15] [17]

Make a final decision: ‘0’ for survival, ‘1’ for in-hospital death. Your answer format must be as follows: ``` ## Rationale for survival [possible justification if patient survives] ## Rationale for in-hospital death [possible justification if patient experiences in-hospital death] ## Final Decision [0 (survival) or 1 (in-hospital death); respond by singl...

[16] [18]

If the patient indeed experiences in-hospital death, which of the patient’s given features might be the cause?

[17] [19]

If the patient indeed survives, which of the patient’s given features might be the cause?

[18] [20]

Make a final decision: ‘0’ for survival, ‘1’ for in-hospital death. Your answer format must be as follows: ``` ## Rationale for in-hospital death [possible justification if patient experiences in-hospital death] ## Rationale for survival [possible justification if patient survives] ## Final Decision [0 (survival) or 1 (in-hospital death); respond by singl...

[19] [23]

Make a final decision: ‘0’ for no sepsis onset within the next 6 hours, ‘1’ for sepsis onset within the next 6 hours. Your answer format must be as follows: ``` ## Rationale for no sepsis [possible justification if patient does not experience sepsis onset within the next 6 hours] ## Rationale for sepsis [possible justification if patient experiences sepsi...

[20] [24]

If the patient indeed experiences sepsis onset within the next 6 hours, which of the patient’s given features might be the cause?

[21] [25]

If the patient indeed does not experience sepsis onset within the next 6 hours, which of the patient’s given features might be the cause?

[22] [26]

Make a final decision: ‘0’ for no sepsis onset within the next 6 hours, ‘1’ for sepsis onset within the next 6 hours. Your answer format must be as follows: ``` ## Rationale for sepsis [possible justification if patient experiences sepsis onset within the next 6 hours] ## Rationale for no sepsis [possible justification if patient does not experience sepsi...
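The templates above differ only in which outcome's rationale is requested first. A small builder can produce both orderings from one definition; this is an illustrative sketch, the `Outcome` class and `answer_format` function are hypothetical names, and the wording of the final line follows the mortality template shown for Figure 18.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    label: int       # 0 or 1
    name: str        # e.g. "survival"
    condition: str   # e.g. "patient survives"

def answer_format(negative: Outcome, positive: Outcome, positive_first: bool = False) -> str:
    """Build the answer-format block with the rationales in the chosen order."""
    order = (positive, negative) if positive_first else (negative, positive)
    lines = []
    for outcome in order:
        lines.append(f"## Rationale for {outcome.name}")
        lines.append(f"[possible justification if {outcome.condition}]")
    lines.append("## Final Decision")
    lines.append(
        f"[{negative.label} ({negative.name}) or {positive.label} ({positive.name}); "
        "respond by single number only]"
    )
    return "\n".join(lines)

survive = Outcome(0, "survival", "patient survives")
death = Outcome(1, "in-hospital death", "patient experiences in-hospital death")
survival_first = answer_format(survive, death)
death_first = answer_format(survive, death, positive_first=True)
```

Only the rationale order changes between the two variants; the label mapping in the final line stays fixed so that ‘1’ always denotes the adverse outcome.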

[23] [27]

Describe the clinical evidence observed in the patient’s features

[24] [28]

""Will the patient experience in-hospital death during this ICU stay?

Make a final decision: ‘0’ for survival, ‘1’ for in-hospital death. Your answer format must be as follows: ``` ## Rationale [clinical evidence observed in the patient’s features] ## Final Decision [0 (survival) or 1 (in-hospital death); respond by single number only] ``` ## Feature of the patient {PATIENT_FEATURES} Figure 18: Prompt template used for theO...

[25] [29]

Key baseline or contextual mortality-risk factors: - Examples: age, major comorbidities, chronic organ disease, frailty-relevant context, admission context, or other baseline risks

[26] [30]

Main acute ICU problem or organ dysfunction: - Examples: respiratory failure, shock or hemodynamic instability, renal dysfunction, metabolic acidosis, neurologic depression, hepatic or coagulation dysfunction, infection signal, multi-organ dysfunction, or other acute problems explicitly supported by the data

[27] [31]

- Gives special attention to persistent instability or late deterioration near the end of the observation window

Illness time course or trajectory: - Correctly distinguishes worsening, improving, fluctuating, or stable course. - Gives special attention to persistent instability or late deterioration near the end of the observation window

[28] [32]

differential

Clinically meaningful prognostic abstractions: - Examples: persistent hypotension, escalating oxygen needs, worsening renal function, severe acidosis, multi-organ dysfunction, improving hemodynamics, stable oxygenation, transient isolated abnormality. - These abstractions must be grounded in the provided data. Scoring anchors: - 4: Accurate, patient-spec...

[29] [33]

What objective patient-specific evidence the reasoning used correctly

[30] [34]

Any unsupported claims, hallucinations, contradictions, or missing-data assumptions

[31] [35]

Whether the reasoning correctly handled the temporal trajectory

[32] [36]

High", "Medium

Whether the reasoning considered counterevidence or uncertainty. Do not output a long chain-of-thought. Output only the JSON object below. # Output Format Return valid JSON only. Do not include markdown, comments, or extra text. The score and all subscores must be integers. The final score must equal the sum of the four subscores after applying any cap or...
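The output constraints stated above (valid JSON only, integer score and subscores, final score equal to the sum of four subscores after any cap) can be checked mechanically. The sketch below assumes hypothetical key names `score` and `subscores` and an optional numeric cap, since the actual JSON schema and cap rule are not visible in this excerpt.

```python
import json
from typing import Optional

def validate_judge_output(raw: str, cap: Optional[int] = None) -> dict:
    """Parse a judge reply and enforce the integer and sum constraints."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON text
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    def is_int(x) -> bool:
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(x, int) and not isinstance(x, bool)

    score = data.get("score")            # hypothetical key name
    subscores = data.get("subscores")    # hypothetical key name
    if not is_int(score):
        raise ValueError("score must be an integer")
    if not isinstance(subscores, dict) or len(subscores) != 4:
        raise ValueError("expected exactly four subscores")
    if not all(is_int(v) for v in subscores.values()):
        raise ValueError("all subscores must be integers")

    expected = sum(subscores.values())
    if cap is not None:                  # assumed form of the cap: an upper bound
        expected = min(expected, cap)
    if score != expected:
        raise ValueError(f"score {score} != expected {expected}")
    return data

ok = '{"score": 10, "subscores": {"a": 3, "b": 2, "c": 4, "d": 1}}'
checked = validate_judge_output(ok)
```

Rejecting replies that fail these checks keeps malformed judge outputs from silently entering the reasoning-quality averages.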

[33] [37]

Observed patient information from the first 48 hours of ICU stay

[34] [38]

PaO2 around 98–125 mmHg indicating adequate oxygenation

A rationale explaining why the patient is predicted to die or survive. Your task is to judge whether the rationale is grounded in the observed patient information, or whether it contains serious hallucination: concrete clinical facts that are completely absent from the patient data. Return True if the rationale’s substantive clinical evidence is present i...