arxiv: 2604.10386 · v1 · submitted 2026-04-12 · 💻 cs.AI · cs.MA


TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection

Sihang Zeng, Young Won Kim, Wilson Lau, Ehsan Alipour, Ruth Etzioni, Meliha Yetisgen, Anand Oka


Pith reviewed 2026-05-10 16:38 UTC · model grok-4.3


keywords: multi-agent LLM · temporal reasoning · electronic health records · cancer risk prediction · longitudinal data · zero-shot learning · multi-cancer detection · interpretable AI


The pith

A multi-agent LLM framework reasons over patient health timelines to predict cancer risk one year ahead without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TrajOnco as a training-free system that splits the analysis of long patient records into coordinated steps handled by separate language-model agents sharing a memory store. These agents extract sequential clinical events, build summaries, link evidence to conclusions, and output risk scores for fifteen different cancers. A sympathetic reader would care because the method works directly on routine electronic records, matches the accuracy of some trained models on lung cancer, and supplies human-readable explanations that could support earlier intervention. The results also show the same framework remains effective when using smaller language models.

Core claim

TrajOnco applies a chain-of-agents architecture with long-term memory to perform temporal reasoning over sequential clinical events stored in longitudinal EHRs. It produces patient-level summaries, evidence-linked rationales, and predicted risk scores for diagnosis within one year across fifteen cancer types. On matched case-control cohorts from de-identified Truveta data the system reaches AUROCs of 0.64-0.80 in zero-shot use, performs comparably to supervised machine learning on a lung-cancer benchmark, and exhibits stronger temporal reasoning than single-agent LLMs. The multi-agent design also enables effective operation with smaller-capacity models, while aggregated outputs reveal risk-f

What carries the argument

The chain-of-agents architecture with long-term memory that coordinates specialized agents to process, recall, and reason over ordered clinical events for risk prediction.
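As a rough illustration of the idea, and not the authors' implementation, a chain of agents can be sketched as a loop in which each worker reads one time-ordered chunk of events together with a shared memory, and a final agent turns that memory into a score and a rationale. Every name below (`ClinicalEvent`, `Memory`, `worker_agent`, `manager_agent`) is hypothetical, and a keyword rule stands in for the language-model call so that the sketch runs on its own.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ClinicalEvent:
    when: date
    description: str


@dataclass
class Memory:
    """Shared long-term memory passed along the chain (hypothetical)."""
    notes: list[str] = field(default_factory=list)


def chunk_events(events, size):
    """Sort events by time and split them into consecutive chunks."""
    ordered = sorted(events, key=lambda e: e.when)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def worker_agent(chunk, memory):
    """Stand-in for an LLM call: record flagged events in memory.

    A real worker would prompt a language model with the chunk and the
    current memory; a keyword rule keeps this sketch self-contained.
    """
    flags = ("anemia", "weight loss", "nodule")
    for event in chunk:
        if any(flag in event.description.lower() for flag in flags):
            memory.notes.append(f"{event.when.isoformat()}: {event.description}")
    return memory


def manager_agent(memory):
    """Stand-in for the final agent: map evidence to a toy score in [0, 1]."""
    score = min(1.0, 0.1 + 0.3 * len(memory.notes))
    return {"risk_score": score, "rationale": list(memory.notes)}


def run_chain(events, chunk_size=2):
    """Pass the memory through one worker per chunk, then summarise."""
    memory = Memory()
    for chunk in chunk_events(events, chunk_size):
        memory = worker_agent(chunk, memory)
    return manager_agent(memory)


# Made-up timeline, deliberately out of order to show the sorting step.
timeline = [
    ClinicalEvent(date(2024, 6, 1), "Unintentional weight loss noted"),
    ClinicalEvent(date(2023, 1, 10), "Routine visit, no complaints"),
    ClinicalEvent(date(2024, 2, 3), "Anemia on blood count"),
]
result = run_chain(timeline)
```

The rationale in `result` lists the flagged events in chronological order, which is the property the temporal-reasoning claim depends on. The scoring rule is arbitrary and carries no clinical meaning.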

If this is right

Enables zero-shot risk prediction across fifteen cancers using only existing longitudinal records.
Generates interpretable, evidence-linked rationales that clinicians can inspect and that aggregate into population-level patterns consistent with known clinical knowledge.
Maintains performance when smaller-capacity language models replace larger ones, lowering compute requirements.
Outperforms single-agent LLM baselines in capturing the order and timing of clinical events.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent-chain structure could be applied to other longitudinal prediction tasks such as progression of cardiovascular disease or diabetes complications.
If the rationales prove consistently faithful, they could be collected to create synthetic training data that improves smaller models for the same task.
Deployment in clinical workflows might allow real-time flagging during routine visits without requiring site-specific model retraining.

Load-bearing premise

The multi-agent chain with memory produces reliable temporal reasoning and risk scores without systematic hallucinations or biases that would invalidate the AUROC results or human validation on the de-identified records.

What would settle it

A fresh cohort of several thousand matched patients where blinded clinicians review the rationales and find frequent contradictions with the source records, or where measured AUROC on that cohort falls below 0.60 for most cancer types.
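For readers unfamiliar with the metric behind that threshold, AUROC can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen control. The sketch below computes it directly from that pairwise definition. The labels and scores are made up for illustration and are not taken from the paper.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (case, control) pairs ranked correctly.

    Ties count as half a correct pair. The double loop is quadratic in
    cohort size, which is fine for illustration but slow for large cohorts.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Illustrative labels (1 = diagnosed within a year) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2]
value = auroc(labels, scores)
below_threshold = value < 0.60
```

In this toy cohort eight of the nine case-control pairs are ordered correctly, so `value` is 8/9 and the 0.60 test is not triggered.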

Original abstract

Accurate estimation of cancer risk from longitudinal electronic health records (EHRs) could support earlier detection and improved care, but modeling such complex patient trajectories remains challenging. We present TrajOnco, a training-free, multi-agent large language model (LLM) framework designed for scalable multi-cancer early detection. Using a chain-of-agents architecture with long-term memory, TrajOnco performs temporal reasoning over sequential clinical events to generate patient-level summaries, evidence-linked rationales, and predicted risk scores. We evaluated TrajOnco on de-identified Truveta EHR data across 15 cancer types using matched case-control cohorts, predicting risk of cancer diagnosis at 1 year. In zero-shot evaluation, TrajOnco achieved AUROCs of 0.64-0.80, performing comparably to supervised machine learning in a lung cancer benchmark while demonstrating better temporal reasoning than single-agent LLMs. The multi-agent design also enabled effective temporal reasoning with smaller-capacity models such as GPT-4.1-mini. The fidelity of TrajOnco's output was validated through human evaluation. Furthermore, TrajOnco's interpretable reasoning outputs can be aggregated to reveal population-level risk patterns that align with established clinical knowledge. These findings highlight the potential of multi-agent LLMs to execute interpretable temporal reasoning over longitudinal EHRs, advancing both scalable multi-cancer early detection and clinical insight generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TrajOnco shows a workable multi-agent LLM chain with long-term memory for zero-shot multi-cancer risk scoring from EHR timelines, but the AUROC numbers rest on thin validation that leaves the temporal reasoning claim under-supported.


TrajOnco is a training-free multi-agent LLM framework that chains agents and keeps long-term memory to reason over longitudinal EHR events, produce evidence-linked summaries, and output risk scores for 15 cancers at one-year horizon. In matched case-control cohorts from Truveta data it reaches AUROCs of 0.64-0.80 in zero-shot mode, matches supervised ML on a lung-cancer benchmark, and beats single-agent LLMs on temporal tasks while also running on smaller models like GPT-4.1-mini. The outputs are interpretable and can be rolled up to surface population patterns that match existing clinical knowledge, and the authors include a human check on output fidelity.

That combination of zero-shot scalability, multi-cancer scope, and built-in rationale is the concrete advance here. The architecture itself is a straightforward but non-trivial extension of chain-of-agents ideas to this clinical setting.

The soft spots sit in the evaluation. The abstract gives AUROC ranges without cohort sizes, exact matching rules, missing-data handling, confidence intervals, or significance tests. Human validation is mentioned but without reported agreement rates, blinding, or quantitative checks that the rationales actually track the raw timelines rather than LLM priors. Because the risk scores come directly from the same LLM outputs that generate the rationales, any systematic hallucination or recency bias would directly affect the ROC curves, and nothing in the reported results rules that out. This is the part that needs tightening before the performance numbers can be taken at face value.

The work is aimed at researchers who build or evaluate LLM systems for clinical time-series and early-detection tasks. It has enough novelty in the design and enough empirical signal to justify sending it out for peer review, even though the current version would likely come back with requests for fuller methods, statistical detail, and stronger hallucination controls.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces TrajOnco, a training-free multi-agent LLM framework using a chain-of-agents architecture with long-term memory to perform temporal reasoning over longitudinal EHR sequences. It generates patient summaries, evidence-linked rationales, and 1-year cancer risk scores for 15 cancer types. On de-identified Truveta matched case-control cohorts, zero-shot evaluation yields AUROCs of 0.64-0.80, comparable to supervised ML on a lung cancer benchmark and superior to single-agent LLMs; human evaluation validates output fidelity, and aggregated rationales align with clinical knowledge.

Significance. If the results hold, the work shows that multi-agent LLMs can deliver interpretable, training-free temporal reasoning over complex EHR trajectories for scalable multi-cancer early detection. Credit is due for the training-free design, effective performance with smaller models (e.g., GPT-4.1-mini), explicit human fidelity validation, and the ability to surface population-level patterns consistent with established clinical knowledge.

major comments (3)

[Evaluation section] The reported AUROCs of 0.64-0.80 are presented without cohort sizes, exact case-control matching criteria, missing-data handling protocols, or statistical significance/error bars; these omissions are load-bearing because they prevent assessment of whether the performance is robust or comparable to the supervised ML baseline.
[Human validation subsection] Fidelity is validated by human review, yet no inter-rater reliability, blinding procedures, or quantitative factual-consistency metrics (e.g., alignment of generated rationales with raw EHR timelines) are reported; this directly affects whether the AUROCs reflect genuine temporal reasoning or LLM-specific artifacts.
[Methods (multi-agent architecture)] The superiority over single-agent LLMs is attributed to the chain-of-agents plus long-term memory, but no ablation isolating these components is provided; without it, the central claim that the multi-agent design enables reliable risk prediction cannot be isolated from prompt or model effects.
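
The fidelity checks requested in the second major comment could be quantified along the following lines. This is a minimal sketch under stated assumptions: two reviewers assign one categorical label per rationale, and events are compared as exact strings. The function names, labels, and event strings are all hypothetical, and real EHR events would need normalisation before any set comparison.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and equal in length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1.0 - expected)


def event_precision_recall(rationale_events, record_events):
    """Precision and recall of events cited in a rationale against the record."""
    cited, present = set(rationale_events), set(record_events)
    supported = cited & present
    precision = len(supported) / len(cited) if cited else 0.0
    recall = len(supported) / len(present) if present else 0.0
    return precision, recall


# Made-up reviewer judgements on four rationales.
rater_a = ["faithful", "faithful", "unfaithful", "faithful"]
rater_b = ["faithful", "unfaithful", "unfaithful", "faithful"]
kappa = cohens_kappa(rater_a, rater_b)

# Made-up events: one cited event ("smoking history") is not in the record.
precision, recall = event_precision_recall(
    ["anemia 2024-02", "nodule 2024-05", "smoking history"],
    ["anemia 2024-02", "nodule 2024-05", "cough 2023-11", "weight loss 2024-06"],
)
```

Here precision below 1 signals a cited event with no support in the record, which is the hallucination case the comment is concerned with. Low recall is less alarming, since a rationale need not mention every recorded event.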

minor comments (3)

[Abstract] Consider adding one sentence on cohort scale or confidence intervals to strengthen the headline performance claim.
[Figure 1] Framework diagram: the long-term memory module and inter-agent handoff arrows would benefit from explicit labels to clarify data flow.
[Discussion] A brief comparison to prior LLM-EHR temporal reasoning work (e.g., on single-cancer tasks) would better situate the multi-cancer, multi-agent contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point-by-point below, providing the strongest honest defense of the manuscript while committing to revisions that strengthen the work without misrepresentation.


Referee: [Evaluation section] The reported AUROCs of 0.64-0.80 are presented without cohort sizes, exact case-control matching criteria, missing-data handling protocols, or statistical significance/error bars; these omissions are load-bearing because they prevent assessment of whether the performance is robust or comparable to the supervised ML baseline.

Authors: We agree that these details are essential for readers to assess robustness and comparability. The original manuscript reports results on de-identified Truveta matched case-control cohorts for 15 cancer types but does not include the requested specifics. In the revised manuscript, we will expand the Evaluation section to report exact cohort sizes (cases and controls per cancer type), the matching criteria (age, sex, visit frequency, and other variables), missing-data protocols (inclusion thresholds and any imputation), and statistical details including 95% confidence intervals via bootstrapping plus significance tests against the supervised ML baseline. These additions will directly address the concern.

Revision: yes

Referee: [Human validation subsection] Fidelity is validated by human review, yet no inter-rater reliability, blinding procedures, or quantitative factual-consistency metrics (e.g., alignment of generated rationales with raw EHR timelines) are reported; this directly affects whether the AUROCs reflect genuine temporal reasoning or LLM-specific artifacts.

Authors: We acknowledge that the human validation subsection reports expert review of output fidelity but omits the requested rigor. In the revision, we will add a detailed protocol description: number of reviewers, blinding procedures (reviewers blinded to model identity and source data), inter-rater reliability (e.g., Cohen's or Fleiss' kappa), and quantitative factual-consistency metrics. The latter will include alignment scores between generated rationales and raw EHR timelines, computed as precision/recall over extracted clinical events. This will provide stronger evidence that the AUROCs reflect temporal reasoning rather than artifacts.

Revision: yes

Referee: [Methods (multi-agent architecture)] The superiority over single-agent LLMs is attributed to the chain-of-agents plus long-term memory, but no ablation isolating these components is provided; without it, the central claim that the multi-agent design enables reliable risk prediction cannot be isolated from prompt or model effects.

Authors: We agree that a dedicated ablation would more cleanly isolate the contributions of the chain-of-agents architecture and long-term memory. The manuscript already shows superior performance versus single-agent LLM baselines, but does not include a full ablation study. In the revised manuscript, we will add ablation experiments evaluating performance with and without the multi-agent chaining and memory modules (while holding prompts and base models fixed). This will quantify their individual effects on AUROC and rationale quality, strengthening the central claim.

Revision: yes
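
The bootstrapped 95% confidence intervals promised in the first response could be computed roughly as follows. This is a sketch of a plain percentile bootstrap, not the authors' procedure. It resamples individual patients, whereas a matched case-control design might instead call for resampling matched sets. The function names and the synthetic scores are assumptions made for illustration.

```python
import random


def auroc(labels, scores):
    """Pairwise AUROC; ties count as half a correct pair."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample needs both cases and controls
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return stats[lo_idx], stats[hi_idx]


# Synthetic cohort: 20 cases scored slightly higher on average than 20 controls.
data_rng = random.Random(1)
labels = [1] * 20 + [0] * 20
scores = [data_rng.gauss(0.6, 0.2) for _ in range(20)] + [
    data_rng.gauss(0.4, 0.2) for _ in range(20)
]
point = auroc(labels, scores)
lo, hi = bootstrap_auroc_ci(labels, scores)
```

With cohorts this small the interval is wide, which is why the referee's request for cohort sizes alongside the AUROC range matters.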

Circularity Check

0 steps flagged

No circularity: empirical performance measured on external benchmarks


The paper describes a training-free multi-agent LLM framework and reports its zero-shot AUROC performance (0.64-0.80) on matched case-control EHR cohorts for 15 cancers, with direct comparisons to supervised ML baselines and single-agent LLMs plus human validation of outputs. No equations, parameter fitting, or derivation chain exists that reduces any claimed result to its own inputs by construction. All load-bearing claims are falsifiable against independent data and external models; no self-citation is invoked to justify uniqueness or force a result. This is the standard honest outcome for an applied empirical framework paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into exact assumptions; the framework implicitly assumes LLMs can execute faithful temporal clinical reasoning when structured as agents.

axioms (1)

Domain assumption: Multi-agent LLM orchestration with long-term memory enables reliable temporal reasoning over sequential clinical events without task-specific training.
This is the core premise enabling the zero-shot performance claims.

pith-pipeline@v0.9.0 · 5579 in / 1281 out tokens · 31598 ms · 2026-05-10T16:38:34.528187+00:00 · methodology


Reference graph

Works this paper leans on

64 extracted references · 34 canonical work pages · 6 internal anchors

[1]

The Lancet Oncology21(1), 6–8 (2020)

Whitaker, K.: Earlier diagnosis: the importance of cancer symptoms. The Lancet Oncology21(1), 6–8 (2020)

2020
[2]

Science375(6586), 9040 (2022) https://doi.org/10.1126/science.aay9040

Crosby, D., Bhatia, S., Brindle, K.M., Coussens, L.M., Dive, C., Emberton, M., Esener, S., Fitzgerald, R.C., Gambhir, S.S., Kuhn, P., Rebbeck, T.R., Bal- asubramanian, S.: Early detection of cancer. Science375(6586), 9040 (2022) https://doi.org/10.1126/science.aay9040

work page doi:10.1126/science.aay9040 2022
[3]

The Lancet Digital Health6(6), 396–406 (2024)

Jung, A.W., Holm, P.C., Gaurav, K., Hjaltelin, J.X., Placido, D., Mortensen, L.H., Birney, E., Gerstung, M.,et al.: Multi-cancer risk stratification based on national health data: a retrospective modelling and validation study. The Lancet Digital Health6(6), 396–406 (2024)

2024
[4]

Biomarker Research13(1), 101 (2025) 20

Zhao, R., Yuan, H., Jiang, Y., Liu, Z., Chen, R., Wang, S., Lu, L., Yuan, Z., Su, Z., He, Q.,et al.: Development and validation of an integrative 54 biomarker-based risk identification model for multi-cancer in 42,666 individu- als: a population-based prospective study to guide advanced screening strategies. Biomarker Research13(1), 101 (2025) 20

2025
[5]

Nature, 1–9 (2025) https://doi.org/10.1038/ s41586-025-09529-3

Shmatko, A., Jung, A.W., Gaurav, K., Brunak, S., Mortensen, L.H., Birney, E., Fitzgerald, T., Gerstung, M.: Learning the natural history of human dis- ease with generative transformers. Nature, 1–9 (2025) https://doi.org/10.1038/ s41586-025-09529-3 . Accessed 2025-09-23

2025
[6]

The British Journal of General Practice68(670), 301–310 (2018) https://doi.org/10.3399/bjgp18X695777

Holtedahl, K., Hjertholm, P., Borgquist, L., Donker, G.A., Buntinx, F., Weller, D., Braaten, T., M˚ ansson, J., Strandberg, E.L., Campbell, C., Korevaar, J.C., Parajuli, R.: Abdominal symptoms and cancer in the abdomen: prospective cohort study in European primary care. The British Journal of General Practice68(670), 301–310 (2018) https://doi.org/10.3399...

work page doi:10.3399/bjgp18x695777 2018
[7]

Cureus16(7), 65441 https://doi.org/10.7759/cureus

Mondoc, L.-M., Catana, A.-C., Prodan, L.-C., Valeanu, M., Mihaila, R.-G.: The Impact of Anemia on the Survival of Patients Diagnosed With Low-Grade Malig- nant B-cell Lymphomas. Cureus16(7), 65441 https://doi.org/10.7759/cureus. 65441 . Accessed 2026-03-10

work page doi:10.7759/cureus 2026
[8]

npj Digital Medicine (2026) https://doi.org/10.1038/s41746-026-02441-8

Li, M., Zhan, Z., Huang, J., Yeung, J., Ding, K., Blaes, A., Johnson, S., Liu, H., Xu, H., Zhang, R.: CancerLLM: a large language model in cancer domain. npj Digital Medicine (2026) https://doi.org/10.1038/s41746-026-02441-8 . Accessed 2026-03-10

work page doi:10.1038/s41746-026-02441-8 2026
[9]

ArXiv, 2511–112932 (2026)

Park, J., Pang, C., Lee, T.Y., Yang, J.Y., Berkowitz, J., Wei, A.Z., Tatonetti, N.: Toward Scalable Early Cancer Detection: Evaluating EHR-Based Predictive Models Against Traditional Screening Criteria. ArXiv, 2511–112932 (2026)

2026
[10]

Lundberg, S., Lee, S.-I.: A Unified Approach to Interpreting Model Predictions. arXiv. arXiv:1705.07874 [cs] (2017). https://doi.org/10.48550/arXiv.1705.07874 . http://arxiv.org/abs/1705.07874 Accessed 2026-03-10

work page Pith review doi:10.48550/arxiv.1705.07874 2017
[11]

npj Digital Medicine8(1), 397 (2025) https://doi.org/10.1038/s41746-025-01780-2

Zhu, M., Lin, H., Jiang, J., Jinia, A.J., Jee, J., Pichotta, K., Waters, M., Rose, D., Schultz, N., Chalise, S., Valleru, L., Morin, O., Moran, J., Deasy, J.O., Pilai, S., Nichols, C., Riely, G., Braunstein, L.Z., Li, A.: Large language model trained on clinical oncology data predicts cancer progression. npj Digital Medicine8(1), 397 (2025) https://doi.or...

work page doi:10.1038/s41746-025-01780-2 2025
[12]

Lau, W., Kim, Y., Parasa, S., Haque, M.E., Oka, A., Nanduri, J.: Pre- dicting Early-Onset Colorectal Cancer with Large Language Models. arXiv. arXiv:2506.11410 [cs] (2025). https://doi.org/10.48550/arXiv.2506.11410 . http: //arxiv.org/abs/2506.11410 Accessed 2025-10-10

work page doi:10.48550/arxiv.2506.11410 2025
[13]

Lost in the Middle: How Language Models Use Long Contexts

Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics12, 157–173 (2024) https://doi. org/10.1162/tacl a 00638

work page internal anchor Pith review doi:10.1162/tacl 2024
[14]

npj Digital Medicine , publisher =

Li, R., Wang, X., Berlowitz, D., Mez, J., Lin, H., Yu, H.: CARE-AD: a multi- agent large language model framework for Alzheimer’s disease prediction using 21 longitudinal clinical notes. npj Digital Medicine8(1), 541 (2025) https://doi.org/ 10.1038/s41746-025-01940-4 . Accessed 2025-08-26

work page doi:10.1038/s41746-025-01940-4 2025
[15]

Advances in Neural Information Processing Systems37, 132208–132237 (2024)

Zhang, Y., Sun, R., Chen, Y., Pfister, T., Zhang, R., Arik, S.: Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems37, 132208–132237 (2024)

2024
[16]

Haug, N.R., Wagner, A.K., McGlynn, K.A., Leonard, C.E., Nguyen, M.D., Major, J.M.: New-onset cancer cases in FDA’s Sentinel System: A large distributed system of US electronic healthcare data. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Onco...

work page doi:10.1158/1055-9965.epi-21-1451 2022
[17]

Cancer Causes & Control18(5), 561–569 (2007) https://doi.org/10.1007/s10552-007-0131-1

Setoguchi, S., Solomon, D.H., Glynn, R.J., Cook, E.F., Levin, R., Schneeweiss, S.: Agreement of diagnosis and its date for hematologic malignancies and solid tumors between medicare claims and cancer registry data. Cancer Causes & Control18(5), 561–569 (2007) https://doi.org/10.1007/s10552-007-0131-1

work page doi:10.1007/s10552-007-0131-1 2007
[18]

Proceedings of the 22nd

Chen, T., Guestrin, C.: XGBoost: A Scalable Tree Boosting System. In: Proceed- ings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery And Data Mining, pp. 785–794 (2016). https://doi.org/10.1145/2939672.2939785 . arXiv:1603.02754 [cs].http://arxiv.org/abs/1603.02754Accessed 2025-11-18

work page doi:10.1145/2939672.2939785 2016
[19]

BMC Medicine23, 551 (2025) https://doi.org/10

Li, X., Yuan, E.Y., Kuperberg, S.J., Bonzel, C.-L., Jeffway, M.I., Cai, T., Liao, K.P., Aguiar-Ib´ a˜ nez, R., Kao, Y.-H., Santorelli, M.L., Christiani, D.C., Cai, T., Duan, R.: Early detection of non-small cell lung cancer: An electronic health record data-driven approach. BMC Medicine23, 551 (2025) https://doi.org/10. 1186/s12916-025-04289-3

2025
[20]

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a- Judge with MT-Bench and Chatbot Arena. arXiv. arXiv:2306.05685 [cs] (2023). http://arxiv.org/abs/2306.05685 Accessed 2024-05-22

work page internal anchor Pith review arXiv 2023
[21]

In: Inui, K., Sakti, S., Wang, H., Wong, D.F., Bhattacharyya, P., Banerjee, B., Ekbal, A., Chakraborty, T., Singh, D.P

Shi, L., Ma, C., Liang, W., Diao, X., Ma, W., Vosoughi, S.: Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. In: Inui, K., Sakti, S., Wang, H., Wong, D.F., Bhattacharyya, P., Banerjee, B., Ekbal, A., Chakraborty, T., Singh, D.P. (eds.) Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4...

work page doi:10.18653/v1/2025.ijcnlp-long 2025
[22]

Chau Minh Pham, Alexander Hoyle, Simeng Sun, Philip Resnik, and Mohit Iyyer

Pham, C.M., Hoyle, A., Sun, S., Resnik, P., Iyyer, M.: TopicGPT: A Prompt- based Topic Modeling Framework. In: Duh, K., Gomez, H., Bethard, S. (eds.) 22 Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Vol- ume 1: Long Papers), pp. 2956–2984. Association for Com...

work page doi:10.18653/v1/2024.naacl-long.164 2024
[23]

BMC Medical Imaging25, 204 (2025) https://doi.org/10

Li, Y., Lin, C., Cui, L., Huang, C., Shi, L., Huang, S., Yu, Y., Zhou, X., Zhou, Q., Chen, K., Shi, L.: Association between age and lung cancer risk: evidence from lung lobar radiomics. BMC Medical Imaging25, 204 (2025) https://doi.org/10. 1186/s12880-025-01747-5 . Accessed 2026-03-13

2025
[24]

Proceedings of the American Tho- racic Society5(8), 811–815 (2008) https://doi.org/10.1513/pats.200809-100TH

Walser, T., Cui, X., Yanagawa, J., Lee, J.M., Heinrich, E., Lee, G., Sharma, S., Dubinett, S.M.: Smoking and Lung Cancer. Proceedings of the American Tho- racic Society5(8), 811–815 (2008) https://doi.org/10.1513/pats.200809-100TH . Accessed 2026-03-13

work page doi:10.1513/pats.200809-100th 2008
[25]

BMJ Open13(4), 068832 (2023) https://doi.org/10

Prado, M.G., Kessler, L.G., Au, M.A., Burkhardt, H.A., Zigman Suchsland, M., Kowalski, L., Stephens, K.A., Yetisgen, M., Walter, F.M., Neal, R.D., Lybarger, K., Thompson, C.A., Al Achkar, M., Sarma, E.A., Turner, G., Farjah, F., Thomp- son, M.J.: Symptoms and signs of lung cancer prior to diagnosis: case–control study using electronic health records from ...

2023
[26]

Annals of Thoracic Medicine14(4), 226–238 (2019) https://doi.org/10.4103/atm.ATM 110 19

Loverdos, K., Fotiadis, A., Kontogianni, C., Iliopoulou, M., Gaga, M.: Lung nod- ules: A comprehensive review on current approach and management. Annals of Thoracic Medicine14(4), 226–238 (2019) https://doi.org/10.4103/atm.ATM 110 19 . Accessed 2026-03-13

work page doi:10.4103/atm.atm 2019
[27]

Clinical Lung Cancer5(2), 90–97 (2003) https://doi.org/10.3816/CLC.2003.n.022

Pirker, R., Wiesenberger, K., Pohl, G., Minar, W.: Anemia in Lung Cancer: Clinical Impact and Management. Clinical Lung Cancer5(2), 90–97 (2003) https://doi.org/10.3816/CLC.2003.n.022 . Accessed 2026-03-13

work page doi:10.3816/clc.2003.n.022 2003
[28]

McInnes, L., Healy, J., Melville, J.: UMAP: Uniform Manifold Approxima- tion and Projection for Dimension Reduction. arXiv. arXiv:1802.03426 [stat] (2020). https://doi.org/10.48550/arXiv.1802.03426 . http://arxiv.org/abs/1802. 03426 Accessed 2026-03-13

work page internal anchor Pith review doi:10.48550/arxiv.1802.03426 2020
[29]

Journal of Diabetes Research2022, 1747326 (2022) https://doi.org/10.1155/2022/1747326

Yu, G.-H., Li, S.-F., Wei, R., Jiang, Z.: Diabetes and Colorectal Cancer Risk: Clin- ical and Therapeutic Implications. Journal of Diabetes Research2022, 1747326 (2022) https://doi.org/10.1155/2022/1747326 . Accessed 2026-03-13

work page doi:10.1155/2022/1747326 2022
[30]

Medic- ina60(8), 1218 (2024) https://doi.org/10.3390/medicina60081218

Miranda, B.C.J., Tustumi, F., Nakamura, E.T., Shimanoe, V.H., Kikawa, D., Waisberg, J.: Obesity and Colorectal Cancer: A Narrative Review. Medic- ina60(8), 1218 (2024) https://doi.org/10.3390/medicina60081218 . Accessed 2026-03-13 23

work page doi:10.3390/medicina60081218 2024
[31]

Can- cer Diagnosis & Prognosis3(2), 163 (2023) https://doi.org/10.21873/cdp.10196

Chardalias, L., Papaconstantinou, I., Gklavas, A., Politou, M., Theodosopou- los, T.: Iron Deficiency Anemia in Colorectal Cancer Patients: Is Preoperative Intravenous Iron Infusion Indicated? A Narrative Review of the Literature. Can- cer Diagnosis & Prognosis3(2), 163 (2023) https://doi.org/10.21873/cdp.10196 . Accessed 2026-03-13

work page doi:10.21873/cdp.10196 2023
[32]

Nutrients14(8), 1542 (2022) https://doi.org/10.3390/nu14081542

Cencioni, C., Trestini, I., Piro, G., Bria, E., Tortora, G., Carbone, C., Spal- lotta, F.: Gastrointestinal Cancer Patient Nutritional Management: From Specific Needs to Novel Epigenetic Dietary Approaches. Nutrients14(8), 1542 (2022) https://doi.org/10.3390/nu14081542 . Accessed 2026-03-13

work page doi:10.3390/nu14081542 2022
[33]

Disease-a-month: DM35(11), 721–768 (1989) https://doi.org/10.1016/ 0011-5029(89)90011-4

Johnson, R.A., Roodman, G.D.: Hematologic manifestations of malig- nancy. Disease-a-month: DM35(11), 721–768 (1989) https://doi.org/10.1016/ 0011-5029(89)90011-4

1989
[34]

The Israel Medical Association journal: IMAJ9(10), 732–735 (2007)

Vainrib, M., Leibovitch, I.: Urological implications of concurrent bladder and lung cancer. The Israel Medical Association journal: IMAJ9(10), 732–735 (2007)

2007
[35]

In: Agrawal, M., Deshpande, K., Engelhard, M., Joshi, S., Tang, S., Urteaga, I

Zeng, S., Liu, L.J., Wen, J., Yetisgen, M., Etzioni, R., Luo, G.: Trajsurv: Learn- ing continuous latent trajectories from electronic health records for trustworthy survival prediction. In: Agrawal, M., Deshpande, K., Engelhard, M., Joshi, S., Tang, S., Urteaga, I. (eds.) Proceedings of the 10th Machine Learning for Health- care Conference. Proceedings of...

2025
[36]

Zhang, S., Liu, Q., Usuyama, N., Wong, C., Naumann, T., Poon, H.: Explor- ing Scaling Laws for EHR Foundation Models. arXiv. arXiv:2505.22964 [cs] (2025). https://doi.org/10.48550/arXiv.2505.22964 . http://arxiv.org/abs/2505. 22964 Accessed 2025-08-16

work page doi:10.48550/arxiv.2505.22964 2025
[37]

and Was, Jaroslaw and Li, Quanzheng and Bates, David W

Renc, P., Jia, Y., Samir, A.E., Was, J., Li, Q., Bates, D.W., Sitek, A.: Zero shot health trajectory prediction using transformer. npj Digital Medicine7(1), 256 (2024) https://doi.org/10.1038/s41746-024-01235-0 . Accessed 2025-07-22

work page doi:10.1038/s41746-024-01235-0 2024
[38]

npj Digital Medicine8(1), 577 (2025) https://doi.org/10.1038/ s41746-025-01965-9

Cui, H., Unell, A., Chen, B., Fries, J.A., Alsentzer, E., Koyejo, S., Shah, N.H.: TIMER: Temporal instruction modeling and evaluation for longitudinal clin- ical records. npj Digital Medicine8(1), 577 (2025) https://doi.org/10.1038/ s41746-025-01965-9

2025
[39]

Kruse, M., Hu, S., Derby, N., Wu, Y., Stonbraker, S., Yao, B., Wang, D., Goldberg, E., Gao, Y.: Zero-shot Large Language Models for Long Clinical Text Summa- rization with Temporal Reasoning. arXiv. arXiv:2501.18724 [cs] (2025). https:// doi.org/10.48550/arXiv.2501.18724 . http://arxiv.org/abs/2501.18724 Accessed 2025-08-14

work page doi:10.48550/arxiv.2501.18724 2025
[40]

Context clues: Evaluating long context models for clinical prediction tasks on ehrs.arXiv preprint arXiv:2412.16178, 2024

Wornow, M., Bedi, S., Hernandez, M.A.F., Steinberg, E., Fries, J.A., R´ e, C., Koyejo, S., Shah, N.H.: Context clues: Evaluating long context models for clinical 24 prediction tasks on ehrs. arXiv preprint arXiv:2412.16178 (2024)

work page arXiv 2024
[41]

In: Morgado-Diaz, J.A

Duan, B., Zhao, Y., Bai, J., Wang, J., Duan, X., Luo, X., Zhang, R., Pu, Y., Kou, M., Lei, J., Yang, S.: Colorectal Cancer: An Overview. In: Morgado-Diaz, J.A. (ed.) Gastrointestinal Cancers. Exon Publications, Brisbane (AU) (2022). http://www.ncbi.nlm.nih.gov/books/NBK586003/Accessed 2026-03-26

2022
[42]

Xu, R., Yan, Y.: Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward. arXiv. arXiv:2602.12430 [cs] (2026). https://doi.org/10.48550/arXiv.2602.12430 . http://arxiv.org/abs/2602. 12430 Accessed 2026-03-13

work page internal anchor Pith review doi:10.48550/arxiv.2602.12430 2026
[43]

bioRxiv: The Preprint Server for Biology, 2025–0530656746 (2025) https://doi.org/10.1101/ 2025.05.30.656746

Huang, K., Zhang, S., Wang, H., Qu, Y., Lu, Y., Roohani, Y., Li, R., Qiu, L., Li, G., Zhang, J., Yin, D., Marwaha, S., Carter, J.N., Zhou, X., Wheeler, M., Bernstein, J.A., Wang, M., He, P., Zhou, J., Snyder, M., Cong, L., Regev, A., Leskovec, J.: Biomni: A General-Purpose Biomedical AI Agent. bioRxiv: The Preprint Server for Biology, 2025–0530656746 (202...

2025
[44]

An agentic system for rare disease diagnosis with traceable reasoning.Nature, 651:775–784, 2026

Zhao, W., Wu, C., Fan, Y., Qiu, P., Zhang, X., Sun, Y., Zhou, X., Zhang, S., Peng, Y., Wang, Y., Sun, X., Zhang, Y., Yu, Y., Sun, K., Xie, W.: An agentic system for rare disease diagnosis with traceable reasoning. Nature, 1–10 (2026) https://doi.org/10.1038/s41586-025-10097-9 . Accessed 2026-03-13

[45]

Brentnall, A.R., Atakpa, E.C., Hill, H., Santeramo, R., Damiani, C., Cuzick, J., Montana, G., Duffy, S.W.: An optimization framework to guide the choice of thresholds for risk-based cancer screening. npj Digital Medicine 6(1), 223 (2023) https://doi.org/10.1038/s41746-023-00967-9

[46]

Tanaka, S., Wilkens, L.R., Marchand, L.L., Yang, G., Yuan, J.-M., Koh, W.-P., Shrubsole, M.J., Luu, M., Pagano, I., Figueiredo, J.C., Furuya, H., Rosser, C.J.: Developing a prediction model in a large case-control study for the early detection of bladder cancer. Journal of Translational Medicine 24, 49 (2025) https://doi.org/10.1186/s12967-025-07511-1

[47]

Wu, X., Ritter, A., Xu, W.: Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges. arXiv (2025). https://doi.org/10.48550/arXiv.2508.00217

[48]

Anthropic: Use XML tags to structure your prompts. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-xml-tags. Accessed: 2025-08-16 (2025)

[49]

McDermott, M.B., Zhang, H., Hansen, L.H., Angelotti, G., Gallifant, J.: A Closer Look at AUROC and AUPRC under Class Imbalance. Advances in Neural Information Processing Systems 37, 44102–44163 (2024) https://doi.org/10.52202/079017-1400. Accessed 2026-03-15

[50]

Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, Y., Guo, J.: A Survey on LLM-as-a-Judge. arXiv. arXiv:2411.15594 [cs] version: 1 (2024). https://doi.org/10.48550/arXiv.2411.15594. http://arxiv.org/abs/2411.15594 Accessed 2026-03-15

[51]

Ehrmann, D.E., Joshi, S., Goodfellow, S.D., Mazwi, M.L., Eytan, D.: Making machine learning matter to clinicians: model actionability in medical decision-making. npj Digital Medicine 6(1), 7 (2023) https://doi.org/10.1038/s41746-023-00753-7. Accessed 2026-03-15

[52]

Qiu, P., Wu, C., Liu, S., Fan, Y., Zhao, W., Chen, Z., Gu, H., Peng, C., Zhang, Y., Wang, Y., Xie, W.: Quantifying the reasoning abilities of LLMs on clinical cases. Nature Communications 16(1), 9799 (2025) https://doi.org/10.1038/s41467-025-64769-1. Accessed 2026-03-15

[53]

Wu, X., Tu, H., Hu, Q., Tsai, S.P., Ta-Wei Chu, D., Wen, C.-P.: Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort. BMJ Oncology 3(1), 000087 (2024) https://doi.org/10.1136/bmjonc-2023-000087. Accessed 2026-03-17

[54]

Kroenke, K., Ruddy, K.J., Pachman, D.R., Grzegorczyk, V., Herrin, J., Rahman, P.A., Tobin, K.A., Griffin, J.M., Chlan, L.L., Austin, J.D., Ridgeway, J.L., Mitchell, S.A., Marsolo, K.A., Cheville, A.L.: Using Electronic Health Records to Classify Cancer Site and Metastasis. Applied Clinical Informatics 16(3), 556–568 (2025) https://doi.org/10.1055/a-2544-...

[55]

Cervical Cancer Screening - NCI. Archive Location: nciglobal,ncienterprise (2022). https://www.cancer.gov/types/cervical/screening Accessed 2026-03-17

[56]

Lung Cancer Screening - NCI. Archive Location: nciglobal,ncienterprise (2026). https://www.cancer.gov/types/lung/patient/lung-screening-pdq Accessed 2026-03-17

[57]

Prostate Cancer Screening - NCI. Archive Location: nciglobal,ncienterprise (2026). https://www.cancer.gov/types/prostate/patient/prostate-screening-pdq Accessed 2026-03-17

[58]

Screening for Breast Cancer - NCI. Archive Location: nciglobal,ncienterprise (2025). https://www.cancer.gov/types/breast/screening Accessed 2026-03-17

[59]

Colorectal Cancer Screening - NCI. Archive Location: nciglobal,ncienterprise (2026). https://www.cancer.gov/types/colorectal/patient/colorectal-screening-pdq Accessed 2026-03-17

Appendix A Data statistics

Table A1 summarizes baseline characteristics for the 15 cancer-specific case-control cohorts, with 500 cases and 500 matched controls per cancer typ...

[60]

Clinical Correctness and Plausibility: Assess whether the model’s risk assessment is clinically plausible and consistent with expert knowledge and the ground truth diagnosis
[61]

Completeness and Detail: Assess whether the model identified and utilized the full spectrum of relevant data from the longitudinal EHR, without omitting critical factors or including irrelevant ones. Assess the depth of the model’s analysis and the comprehensiveness of its reasoning
[62]

Clinical Reasoning and Justification: Evaluate the quality of the model’s explanation. This goes beyond what factors were listed (Completeness) and judges how they were used to build an argument
[63]

Longitudinal and Temporal Reasoning: Specifically assess the model’s ability to interpret changes over time in the longitudinal EHR data and connect them to the future risk timeframe ({years} years)
[64]

Clarity and Actionability: Assess how clear, actionable and understandable the output is. The output should facilitate clinical decision-making and communication for early detection and prevention of lung cancer. Output Format (JSON): After your evaluation, provide your response only in the following JSON format. Do not include any text before or after ...