TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection
Pith reviewed 2026-05-10 16:38 UTC · model grok-4.3
The pith
A multi-agent LLM framework reasons over patient health timelines to predict cancer risk one year ahead without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TrajOnco applies a chain-of-agents architecture with long-term memory to perform temporal reasoning over sequential clinical events stored in longitudinal EHRs. It produces patient-level summaries, evidence-linked rationales, and predicted risk scores for diagnosis within one year across fifteen cancer types. On matched case-control cohorts from de-identified Truveta data the system reaches AUROCs of 0.64-0.80 in zero-shot use, performs comparably to supervised machine learning on a lung-cancer benchmark, and exhibits stronger temporal reasoning than single-agent LLMs. The multi-agent design also enables effective operation with smaller-capacity models, while aggregated outputs reveal risk-f
What carries the argument
The chain-of-agents architecture with long-term memory that coordinates specialized agents to process, recall, and reason over ordered clinical events for risk prediction.
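As a rough illustration of the chain-of-agents idea, and not the authors' implementation, the sketch below passes a time-ordered event list through a sequence of worker steps that each append a note to a shared memory, then hands the accumulated notes to a manager step that emits a score and rationale. Every name here (`Memory`, `run_chain`, `stub_worker`, `stub_manager`, the chunk size, the keyword rule) is a hypothetical stand-in; in a real system the worker and manager would be LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, str]  # (ISO date, free-text description); format is an assumption


@dataclass
class Memory:
    """Long-term memory handed from one worker agent to the next."""
    notes: List[str] = field(default_factory=list)


def chunk(events: List[Event], size: int) -> List[List[Event]]:
    """Sort events by date and split them into consecutive segments."""
    ordered = sorted(events, key=lambda e: e[0])
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def run_chain(
    events: List[Event],
    worker: Callable[[List[Event], List[str]], str],
    manager: Callable[[List[str]], Dict],
    chunk_size: int = 2,
) -> Dict:
    """Run workers over segments in temporal order, then let the manager decide."""
    memory = Memory()
    for segment in chunk(events, chunk_size):
        # Each worker sees its own segment plus all notes written so far.
        memory.notes.append(worker(segment, memory.notes))
    return manager(memory.notes)


def stub_worker(segment: List[Event], prior_notes: List[str]) -> str:
    """Hypothetical stand-in for an LLM worker: summarizes one segment."""
    span = f"{segment[0][0]}..{segment[-1][0]}"
    findings = "; ".join(desc for _, desc in segment)
    return f"[{span}] {findings} (seen {len(prior_notes)} earlier notes)"


def stub_manager(notes: List[str]) -> Dict:
    """Hypothetical stand-in for an LLM manager: toy keyword-based score."""
    if not notes:
        return {"risk_score": 0.0, "rationale": []}
    flagged = sum(("anemia" in n) or ("weight loss" in n) for n in notes)
    return {"risk_score": flagged / len(notes), "rationale": notes}
```

The point of the structure is that no single call has to hold the whole record: each worker reads a bounded segment, and temporal context travels forward only through the notes in memory.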
If this is right
- Enables zero-shot risk prediction across fifteen cancers using only existing longitudinal records.
- Generates interpretable, evidence-linked rationales that clinicians can inspect and that aggregate into population-level patterns consistent with known clinical knowledge.
- Maintains performance when smaller-capacity language models replace larger ones, lowering compute requirements.
- Outperforms single-agent LLM baselines in capturing the order and timing of clinical events.
Where Pith is reading between the lines
- The same agent-chain structure could be applied to other longitudinal prediction tasks such as progression of cardiovascular disease or diabetes complications.
- If the rationales prove consistently faithful, they could be collected to create synthetic training data that improves smaller models for the same task.
- Deployment in clinical workflows might allow real-time flagging during routine visits without requiring site-specific model retraining.
Load-bearing premise
The multi-agent chain with memory produces reliable temporal reasoning and risk scores without systematic hallucinations or biases that would invalidate the AUROC results or human validation on the de-identified records.
What would settle it
A fresh cohort of several thousand matched patients where blinded clinicians review the rationales and find frequent contradictions with the source records, or where measured AUROC on that cohort falls below 0.60 for most cancer types.
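To make the AUROC criterion concrete, here is a minimal pure-Python sketch of how AUROC and a percentile bootstrap interval could be computed on a labeled cohort. It is an illustration under stated assumptions, not the paper's evaluation code: the function names, the 1,000-resample default, and the percentile method are all choices made for this example.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    # Pairwise comparison is O(cases * controls); fine for a sketch,
    # a rank-based formula would be preferable on large cohorts.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(
    labels: Sequence[int],
    scores: Sequence[float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUROC, resampling patients with replacement."""
    rng = random.Random(seed)
    idx = range(len(labels))
    stats: List[float] = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost all cases or all controls
        stats.append(auroc(ys, [scores[i] for i in sample]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

On a replication cohort, one would compute the point estimate and interval per cancer type and compare them against the 0.60 threshold named above; an interval that straddles the threshold would leave the question open rather than settle it.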
Original abstract
Accurate estimation of cancer risk from longitudinal electronic health records (EHRs) could support earlier detection and improved care, but modeling such complex patient trajectories remains challenging. We present TrajOnco, a training-free, multi-agent large language model (LLM) framework designed for scalable multi-cancer early detection. Using a chain-of-agents architecture with long-term memory, TrajOnco performs temporal reasoning over sequential clinical events to generate patient-level summaries, evidence-linked rationales, and predicted risk scores. We evaluated TrajOnco on de-identified Truveta EHR data across 15 cancer types using matched case-control cohorts, predicting risk of cancer diagnosis at 1 year. In zero-shot evaluation, TrajOnco achieved AUROCs of 0.64-0.80, performing comparably to supervised machine learning in a lung cancer benchmark while demonstrating better temporal reasoning than single-agent LLMs. The multi-agent design also enabled effective temporal reasoning with smaller-capacity models such as GPT-4.1-mini. The fidelity of TrajOnco's output was validated through human evaluation. Furthermore, TrajOnco's interpretable reasoning outputs can be aggregated to reveal population-level risk patterns that align with established clinical knowledge. These findings highlight the potential of multi-agent LLMs to execute interpretable temporal reasoning over longitudinal EHRs, advancing both scalable multi-cancer early detection and clinical insight generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TrajOnco, a training-free multi-agent LLM framework using a chain-of-agents architecture with long-term memory to perform temporal reasoning over longitudinal EHR sequences. It generates patient summaries, evidence-linked rationales, and 1-year cancer risk scores for 15 cancer types. On de-identified Truveta matched case-control cohorts, zero-shot evaluation yields AUROCs of 0.64-0.80, comparable to supervised ML on a lung cancer benchmark and superior to single-agent LLMs; human evaluation validates output fidelity, and aggregated rationales align with clinical knowledge.
Significance. If the results hold, the work shows that multi-agent LLMs can deliver interpretable, training-free temporal reasoning over complex EHR trajectories for scalable multi-cancer early detection. Credit is due for the training-free design, effective performance with smaller models (e.g., GPT-4.1-mini), explicit human fidelity validation, and the ability to surface population-level patterns consistent with established clinical knowledge.
major comments (3)
- [Evaluation section] The reported AUROCs of 0.64-0.80 are presented without cohort sizes, exact case-control matching criteria, missing-data handling protocols, or statistical significance/error bars; these omissions are load-bearing because they prevent assessment of whether the performance is robust or comparable to the supervised ML baseline.
- [Human validation subsection] Fidelity is validated by human review, yet no inter-rater reliability, blinding procedures, or quantitative factual-consistency metrics (e.g., alignment of generated rationales with raw EHR timelines) are reported; this directly affects whether the AUROCs reflect genuine temporal reasoning or LLM-specific artifacts.
- [Methods (multi-agent architecture)] The superiority over single-agent LLMs is attributed to the chain-of-agents plus long-term memory, but no ablation isolating these components is provided; without it, the central claim that the multi-agent design enables reliable risk prediction cannot be isolated from prompt or model effects.
minor comments (3)
- [Abstract] Consider adding one sentence on cohort scale or confidence intervals to strengthen the headline performance claim.
- [Figure 1, framework diagram] The long-term memory module and inter-agent handoff arrows would benefit from explicit labels to clarify data flow.
- [Discussion] A brief comparison to prior LLM-EHR temporal reasoning work (e.g., on single-cancer tasks) would better situate the multi-cancer, multi-agent contribution.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point-by-point below, providing the strongest honest defense of the manuscript while committing to revisions that strengthen the work without misrepresentation.
- Referee: [Evaluation section] The reported AUROCs of 0.64-0.80 are presented without cohort sizes, exact case-control matching criteria, missing-data handling protocols, or statistical significance/error bars; these omissions are load-bearing because they prevent assessment of whether the performance is robust or comparable to the supervised ML baseline.
  Authors: We agree that these details are essential for readers to assess robustness and comparability. The original manuscript reports results on de-identified Truveta matched case-control cohorts for 15 cancer types but does not include the requested specifics. In the revised manuscript, we will expand the Evaluation section to report exact cohort sizes (cases and controls per cancer type), the matching criteria (age, sex, visit frequency, and other variables), missing-data protocols (inclusion thresholds and any imputation), and statistical details including 95% confidence intervals via bootstrapping plus significance tests against the supervised ML baseline. These additions will directly address the concern.
  Revision: yes
- Referee: [Human validation subsection] Fidelity is validated by human review, yet no inter-rater reliability, blinding procedures, or quantitative factual-consistency metrics (e.g., alignment of generated rationales with raw EHR timelines) are reported; this directly affects whether the AUROCs reflect genuine temporal reasoning or LLM-specific artifacts.
  Authors: We acknowledge that the human validation subsection reports expert review of output fidelity but omits the requested rigor. In the revision, we will add a detailed protocol description: number of reviewers, blinding procedures (reviewers blinded to model identity and source data), inter-rater reliability (e.g., Cohen's or Fleiss' kappa), and quantitative factual-consistency metrics. The latter will include alignment scores between generated rationales and raw EHR timelines, computed as precision/recall over extracted clinical events. This will provide stronger evidence that the AUROCs reflect temporal reasoning rather than artifacts.
  Revision: yes
- Referee: [Methods (multi-agent architecture)] The superiority over single-agent LLMs is attributed to the chain-of-agents plus long-term memory, but no ablation isolating these components is provided; without it, the central claim that the multi-agent design enables reliable risk prediction cannot be isolated from prompt or model effects.
  Authors: We agree that a dedicated ablation would more cleanly isolate the contributions of the chain-of-agents architecture and long-term memory. The manuscript already shows superior performance versus single-agent LLM baselines, but does not include a full ablation study. In the revised manuscript, we will add ablation experiments evaluating performance with and without the multi-agent chaining and memory modules (while holding prompts and base models fixed). This will quantify their individual effects on AUROC and rationale quality, strengthening the central claim.
  Revision: yes
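The rebuttal names two quantities for the human-validation protocol: inter-rater reliability (Cohen's kappa) and precision/recall over extracted clinical events. The sketch below shows one way these could be computed. It is a hedged illustration only: the function names are hypothetical, events are assumed to be comparable `(date, description)` pairs, and what counts as the reference event set for recall is a protocol decision the source does not specify.

```python
from collections import Counter
from typing import Hashable, Sequence, Set, Tuple

Event = Tuple[str, str]  # (ISO date, description); representation is an assumption


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def event_precision_recall(
    cited: Set[Event], recorded: Set[Event]
) -> Tuple[float, float]:
    """Precision: share of cited events found in the record.
    Recall: share of reference events that the rationale cites."""
    hits = len(cited & recorded)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(recorded) if recorded else 0.0
    return precision, recall
```

In this framing, low precision signals events cited in a rationale that do not appear in the source record, which is the failure mode the referee is asking the authors to rule out. Exact set matching is a simplification; real rationales would likely need fuzzy matching on dates and clinical concepts.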
Circularity Check
No circularity: empirical performance measured on external benchmarks
The paper describes a training-free multi-agent LLM framework and reports its zero-shot AUROC performance (0.64-0.80) on matched case-control EHR cohorts for 15 cancers, with direct comparisons to supervised ML baselines and single-agent LLMs plus human validation of outputs. No equations, parameter fitting, or derivation chain exists that reduces any claimed result to its own inputs by construction. All load-bearing claims are falsifiable against independent data and external models; no self-citation is invoked to justify uniqueness or force a result. This is the standard honest outcome for an applied empirical framework paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multi-agent LLM orchestration with long-term memory enables reliable temporal reasoning over sequential clinical events without task-specific training.
Appendix A Data statistics
Table A1 summarizes baseline characteristics for the 15 cancer-specific case-control cohorts, with 500 cases and 500 matched controls per cancer typ...
- Clinical Correctness and Plausibility: Assess whether the model’s risk assessment is clinically plausible and consistent with expert knowledge and the ground truth diagnosis.
- Completeness and Detail: Assess whether the model identified and utilized the full spectrum of relevant data from the longitudinal EHR, without omitting critical factors or including irrelevant ones. Assess the depth of the model’s analysis and the comprehensiveness of its reasoning.
- Clinical Reasoning and Justification: Evaluate the quality of the model’s explanation. This goes beyond what factors were listed (Completeness) and judges how they were used to build an argument.
- Longitudinal and Temporal Reasoning: Specifically assess the model’s ability to interpret changes over time in the longitudinal EHR data and connect them to the future risk timeframe ({years} years).
- Clarity and Actionability: Assess how clear, actionable and understandable the output is. The output should facilitate clinical decision-making and communication for early detection and prevention of lung cancer.
Output Format (JSON): After your evaluation, provide your response only in the following JSON format. Do not include any text before or after ...