pith. sign in

arxiv: 2605.27082 · v1 · pith:PIEQGGDDnew · submitted 2026-05-26 · 💻 cs.AI

Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

Pith reviewed 2026-06-29 17:31 UTC · model grok-4.3

classification 💻 cs.AI
keywords knowledge contextualizationbi-level optimizationmulti-agent systemsbiomedical discoveryclinical trialsLINCS L1000hypothesis generationiterative search
0
0 comments X

The pith

SCENE is a bi-level multi-agent framework that converts broad biomedical knowledge into evidence-supported, scenario-grounded propositions through iterative search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether broad biomedical knowledge can be turned into propositions that are both mechanistically interpretable and directly supported by specific datasets. It introduces SCENE, whose upper level generates search directions from general knowledge and grounds them in a dataset's schema while the lower level optimizes for concrete propositions that balance evidential strength with data support. Feedback between the levels refines the search over successive iterations. In clinical-trial data the system identifies patient subgroups with heterogeneous treatment benefits; in LINCS L1000 data it finds perturbational contexts with strong target-response alignment. The resulting propositions are traceable and inspectable, allowing domain experts to replay and validate them.

Core claim

SCENE treats knowledge contextualization as iterative search. The upper level converts broad knowledge into search directions and grounds them in the dataset schema. The lower level executes these directions through multi-objective optimization to identify concrete propositions that balance evidential strength and data support. Feedback between the two levels progressively refines the search. In clinical trials SCENE discovers specific, well-supported subgroups and outperforms baselines; in L1000 studies it identifies perturbational contexts with strong target-response matching and high positive rates.

What carries the argument

SCENE, the bi-level multi-agent framework that performs iterative search with an upper-level search-direction generator and a lower-level multi-objective optimizer connected by feedback.

If this is right

  • In clinical-trial scenarios SCENE produces specific, well-supported patient subgroups with heterogeneous treatment benefits.
  • In LINCS L1000 studies SCENE identifies perturbational contexts that exhibit strong target-response matching and high positive rates.
  • The framework yields traceable, inspectable hypotheses that domain experts can replay and validate.
  • SCENE outperforms existing baselines on both the clinical-trial and L1000 evaluation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bi-level search structure could be tested on additional biomedical resources such as electronic health records or single-cell atlases.
  • If the feedback loop scales computationally, the method might reduce the manual effort required to generate mechanistic hypotheses from large omics datasets.
  • Similar separation of search-direction generation from multi-objective grounding could be examined in non-biomedical domains where general knowledge must be mapped onto specific tabular data.

Load-bearing premise

Iterative feedback between the upper-level search-direction generator and the lower-level multi-objective optimizer can progressively refine outputs to achieve a useful balance between evidential strength and data support.

What would settle it

If SCENE is run on an independent clinical-trial dataset and the discovered subgroups fail to show statistically significant heterogeneous treatment benefits upon validation in a separate cohort, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.27082 by Anglin Liu, Jintai Chen, Lang Qin, Pengxiang Cai, Qingyuan Zeng, Xinyao Lai, Zixin Guan, Ziyang Chen.

Figure 1
Figure 1. Figure 1: Examples of scenario-grounded propositions. SCENE contextualizes a broad biomedical [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: SCENE (Scenario-Contextualized Evidence and Knowledge Engine) artifact flow. Biomed [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Fraction of held-out Best￾1 clinical rules whose treatment￾benefit effect is at least 10 percent￾age points. Method Conn. ↑ Strong-Pos. ↑ AUPRC ↑ U-Supp. ↑ Gap ↓ Minimal 0.166 27.6 0.260 33.5 0.111 w/o Dynamic 0.171 28.5 0.273 34.2 0.068 w/o Pareto 0.184 18.8 0.273 30.0 0.081 w/o Upper 0.177 35.1 0.271 29.5 0.106 w/o Virtual 0.170 24.2 0.275 34.3 0.087 SCENE (ours) 0.188 41.5 0.268 31.4 0.063 [PITH_FULL_I… view at source ↗
Figure 4
Figure 4. Figure 4: Concrete clinical meaning of a SCENE subgroup rule. (Left) The displayed SCENE rule [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Discovery-side SCENE search trajectory showing median best-so-far and mean candidate ben￾efit across generations. Scenario Mode AUPRC MacroF1 ACC Clinical Trial C0 0.5877 0.4408 0.4898 C1 0.6140 0.4745 0.5510 C2 0.6747 0.5291 0.6939 L1000 C0 0.3582 0.5536 0.5600 C1 0.3725 0.5914 0.6100 C2 0.4113 0.6528 0.7200 [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Held-out L1000 rule replay on 3 target mechanisms. Gray hexbins summarize the held-out [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Benefit–runtime trade-off and runtime cost across LLM backends. (Left) Each point is one [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Clinical benefit–support landscape. Each panel shows one clinical frame. The [PITH_FULL_IMAGE:figures/full_fig_p038_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Clinical benefit–variability profiles. (Left) Mean benefit with task-level SD. (Right) Median [PITH_FULL_IMAGE:figures/full_fig_p039_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Robust clinical geometry and aggregate read-outs. (Left) Median benefit with task-level [PITH_FULL_IMAGE:figures/full_fig_p040_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Task-specific clinical-context audit of reported top rules. Each task block contains three [PITH_FULL_IMAGE:figures/full_fig_p041_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Matched L1000 replay audit. SCENE-selected signatures are compared with each [PITH_FULL_IMAGE:figures/full_fig_p043_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Union of Best-1 full-profile SCENE rules across ten target-seed runs. Colored points are [PITH_FULL_IMAGE:figures/full_fig_p043_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Reliability by held-out support bin. The left panel shows the median absolute train– [PITH_FULL_IMAGE:figures/full_fig_p044_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Feature-family assignments from clinical rules. The alluvial map links the six clinical [PITH_FULL_IMAGE:figures/full_fig_p045_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: L1000 holdout effect versus generalization gap. Each point is a run-aggregated ablation [PITH_FULL_IMAGE:figures/full_fig_p046_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Evidence-gated lifecycle of SCENE-generated features. Percentages are measured [PITH_FULL_IMAGE:figures/full_fig_p047_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Grounding trace from broad biomedical directions to executable scenario-grounded [PITH_FULL_IMAGE:figures/full_fig_p048_18.png] view at source ↗
read the original abstract

Biomedical discovery often requires connecting broad biomedical knowledge with specific experimental or clinical data. Background knowledge suggests relevant mechanisms but is usually too general to map directly onto dataset variables, while data-driven patterns can be dataset-specific and hard to interpret mechanistically. We study this missing link as knowledge contextualization: transforming broad biomedical knowledge into evidence-supported, scenario-grounded propositions that domain experts can inspect, replay, and validate. We propose SCENE, a bi-level multi-agent framework that treats knowledge contextualization as iterative search. The upper level converts broad knowledge into search directions and grounds them in the dataset schema. The lower level executes these directions through multi-objective optimization to identify concrete propositions that balance evidential strength and data support. Feedback between the two levels progressively refines the search. We evaluate SCENE in two settings: discovering patient subgroups with heterogeneous treatment benefits in clinical trial scenarios, and identifying context-specific biological responses in LINCS L1000 studies. In clinical trials, SCENE discovers specific, well-supported subgroups and outperforms existing baselines. In L1000 studies, SCENE identifies perturbational contexts with strong target-response matching and high positive rates. These results show that SCENE bridges broad knowledge and scenario-specific evidence, producing traceable, inspectable hypotheses for follow-up validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces SCENE, a bi-level multi-agent framework for contextualizing broad biomedical knowledge into scenario-grounded propositions. An upper level generates search directions from general knowledge and grounds them in the dataset schema; a lower level performs multi-objective optimization to produce concrete propositions balancing evidential strength and data support. Iterative feedback between levels refines the search. The framework is evaluated on discovering patient subgroups with heterogeneous treatment effects in clinical trials and identifying context-specific responses in LINCS L1000 data, where it is reported to outperform baselines and yield traceable, inspectable hypotheses.

Significance. If the reported empirical results hold under rigorous evaluation, SCENE would offer a practical mechanism for bridging general biomedical knowledge with dataset-specific evidence, enabling the generation of hypotheses that are both mechanistically grounded and directly testable against data. This could support more transparent and reproducible discovery workflows in areas such as precision medicine and pharmacogenomics.

major comments (1)
  1. [Abstract] Abstract: the claim that SCENE 'outperforms existing baselines' in the clinical-trial and L1000 settings is presented without any description of the baselines, metrics, statistical tests, data splits, or exclusion criteria. Because these details are load-bearing for the central empirical claim, the strength of the reported outperformance cannot be assessed from the manuscript as written.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the abstract regarding our empirical claims. We address the comment below and will revise the manuscript to improve clarity and assessability of the results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that SCENE 'outperforms existing baselines' in the clinical-trial and L1000 settings is presented without any description of the baselines, metrics, statistical tests, data splits, or exclusion criteria. Because these details are load-bearing for the central empirical claim, the strength of the reported outperformance cannot be assessed from the manuscript as written.

    Authors: We agree that the abstract is insufficiently detailed on these points. In the revised version we will expand the abstract to include: (1) a concise description of the baselines (random search, knowledge-only, and data-driven ablations); (2) the primary evaluation metrics (subgroup treatment-effect heterogeneity and positive-response rate); (3) mention of the statistical tests (e.g., permutation or bootstrap tests for significance); and (4) a brief note on data splits and exclusion criteria used in both the clinical-trial and LINCS L1000 experiments. These additions will make the outperformance claims directly evaluable from the abstract while remaining within length constraints. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical bi-level multi-agent framework (SCENE) for contextualizing biomedical knowledge into scenario-grounded propositions, with no equations, parameter fits, or formal derivations present in the abstract or described evaluations. Claims of outperformance rest on reported results from subgroup discovery in clinical trials and context identification in L1000 studies, measured against external baselines rather than by construction from the framework's own outputs. No self-citations, uniqueness theorems, or ansatzes are invoked to justify load-bearing steps; the iterative feedback is presented as a practical mechanism whose utility is assessed empirically. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5775 in / 983 out tokens · 56697 ms · 2026-06-29T17:31:00.967369+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references

  1. [1]

    Zarin, Tony Tse, Rebecca J

    Deborah A. Zarin, Tony Tse, Rebecca J. Williams, Robert M. Califf, and Nicholas C. Ide. The ClinicalTrials.gov results database–update and key issues.The New England Journal of Medicine, 364(9):852–860, 2011

  2. [2]

    Crawford, David Peck, Joshua W

    Justin Lamb, Emily D. Crawford, David Peck, Joshua W. Modell, Irene C. Blat, Matthew J. Wrobel, Jim Lerner, Jean-Philippe Brunet, Aravind Subramanian, Kenneth N. Ross, Michael Reich, Haley Hieronymus, Guo Wei, Scott A. Armstrong, Stephen J. Haggarty, Paul A. Clemons, Ru Wei, Steven A. Carr, Eric S. Lander, and Todd R. Golub. The connectivity map: Using ge...

  3. [3]

    Corsello, David D

    Aravind Subramanian, Rajiv Narayan, Steven M. Corsello, David D. Peck, Ted E. Natoli, Xiaodong Lu, Joshua Gould, John F. Davis, Andrew A. Tubelli, Jacob K. Asiedu, David L. Lahr, Jodi E. Hirschman, Zihan Liu, Melanie Donahue, Bina Julian, Mariya Khan, David Wadden, Ian C. Smith, Daniel Lam, Arthur Liberzon, Courtney Toder, Mukta Bagul, Marek Orzechowski, ...

  4. [4]

    FEDMEKI: A benchmark for scaling medical foundation models via federated knowledge injection

    Jiaqi Wang, Xiaochen Wang, Lingjuan Lyu, Jinghui Chen, and Fenglong Ma. FEDMEKI: A benchmark for scaling medical foundation models via federated knowledge injection. In Proceedings of the Conference on Neural Information Processing Systems, 2024. Datasets and Benchmarks Track

  5. [5]

    HypKG: Hypergraph-based knowledge graph contextualization for precision healthcare

    Yuzhang Xie, Xu Han, Ran Xu, Xiao Hu, Jiaying Lu, and Carl Yang. HypKG: Hypergraph-based knowledge graph contextualization for precision healthcare. InProceedings of the International Semantic Web Conference, 2025

  6. [6]

    Foster, Jeremy M

    Jared C. Foster, Jeremy M. G. Taylor, and Stephen J. Ruberg. Subgroup identification from randomized clinical trial data.Statistics in Medicine, 30(24):2867–2880, 2011

  7. [7]

    Ilya Lipkovich, Alex Dmitrienko, Jonathan Denne, and Gregory Enas. Subgroup identification based on differential effect search—a recursive partitioning method for establishing response to treatment in patient subpopulations.Statistics in Medicine, 30(21):2601–2621, 2011

  8. [8]

    Recursive partitioning for heterogeneous causal effects

    Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016

  9. [9]

    Generalized random forests.The Annals of Statistics, 47(2):1148–1178, 2019

    Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests.The Annals of Statistics, 47(2):1148–1178, 2019

  10. [10]

    CURLS: Causal rule learning for subgroups with significant treatment effect

    Jiehui Zhou, Linxiao Yang, Xingyu Liu, Xinyue Gu, Liang Sun, and Wei Chen. CURLS: Causal rule learning for subgroups with significant treatment effect. InProceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4619–4630, 2024

  11. [11]

    Learning subgroups with maximum treatment effects without causal heuristics

    Lincen Yang, Zhong Li, Matthijs van Leeuwen, and Saber Salehkaleybar. Learning subgroups with maximum treatment effects without causal heuristics. InProceedings of the AAAI Confer- ence on Artificial Intelligence, volume 40, pages 27565–27573, 2026

  12. [12]

    Building a knowledge graph to enable precision medicine.Scientific Data, 10(1):67, 2023

    Payal Chandak, Kexin Huang, and Marinka Zitnik. Building a knowledge graph to enable precision medicine.Scientific Data, 10(1):67, 2023

  13. [13]

    The PROV data model and abstract syntax notation

    Luc Moreau, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell, Yolanda Gil, Paul Groth, Graham Klyne, Timothy Lebo, James McCusker, Simon Miles, Jim Myers, Satya Sahoo, and Curt Tilmes. The PROV data model and abstract syntax notation. W3C recommendation, World Wide Web Consortium (W3C), 2013. 10

  14. [14]

    Modern approaches for evaluating treatment effect heterogeneity from clinical trials and observational data.Statistics in Medicine, 43(22):4388–4436, 2024

    Ilya Lipkovich, David Svensson, Bohdana Ratitch, and Alex Dmitrienko. Modern approaches for evaluating treatment effect heterogeneity from clinical trials and observational data.Statistics in Medicine, 43(22):4388–4436, 2024

  15. [15]

    MetaGPT: Meta programming for a multi- agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi- agent collaborative framework. InProceedings of the International Conference on Learning Representati...

  16. [16]

    White, Doug Burger, and Chi Wang

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. InProceedings of the Conference on Language Modeling, 2024

  17. [17]

    AFlow: Automating agentic workflow generation

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating agentic workflow generation. InProceedings of the International Conference on Learning Representations, 2025

  18. [18]

    Meyarivan

    Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan. A fast and elitist mul- tiobjective genetic algorithm: NSGA-II.IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002

  19. [19]

    Project Data Sphere: Data access

    Project Data Sphere. Project Data Sphere: Data access. https://data.projectdatasphe re.org/projectdatasphere/html/access. Accessed 2026-05-05

  20. [20]

    Green, Katherine E

    Angela K. Green, Katherine E. Reeder-Hayes, Robert W. Corty, Ethan Basch, Matthew I. Milowsky, Stacie B. Dusetzina, Antonia V . Bennett, and William A. Wood. The project data sphere initiative: Accelerating cancer research by sharing data.The Oncologist, 20(5):464–e20, 2015

  21. [21]

    Qidan Wen, Min Hu, Maohua Lai, Juan Li, Zhenxing Hu, Kewei Quan, Jia Liu, Hua Liu, Yanbing Meng, Suling Wang, Xiaohui Wen, Chuyi Yu, Shuna Li, Shiya Huang, Yanhua Zheng, Han Lin, Xingyan Liang, Lingjing Lu, Zhefen Mai, Chunren Zhang, Taixiang Wu, Ernest H. Y . Ng, Elisabet Stener-Victorin, and Hongxia Ma. Data: Effect of acupuncture and metformin on insul...

  22. [22]

    Qidan Wen, Min Hu, Maohua Lai, Juan Li, Zhenxing Hu, Kewei Quan, Jia Liu, Hua Liu, Yanbing Meng, Suling Wang, Xiaohui Wen, Chuyi Yu, Shuna Li, Shiya Huang, Yanhua Zheng, Han Lin, Xingyan Liang, Lingjing Lu, Zhefen Mai, Chunren Zhang, Taixiang Wu, Ernest H. Y . Ng, Elisabet Stener-Victorin, and Hongxia Ma. Effect of acupuncture and metformin on insulin sen...

  23. [23]

    Keenan, Sherry L

    Alexandra B. Keenan, Sherry L. Jenkins, Kathleen M. Jagodnik, Simon Koplev, Edward He, Denis Torre, Zichen Wang, Anders B. Dohlman, Moshe C. Silverstein, Alexander Lachmann, Maxim V . Kuleshov, Avi Ma’ayan, Vasileios Stathias, Raymond Terryn, Daniel Cooper, Michele Forlin, Amar Koleti, Dusica Vidovic, Caty Chung, Stephan C. Schürer, et al. The library of ...

  24. [24]

    Cooper, John P

    Amar Koleti, Raymond Terryn, Vasileios Stathias, Caty Chung, Daniel J. Cooper, John P. Turner, Dusica Vidovic, Michele Forlin, Tanya T. Kelley, Alessandro D’Urso, et al. Data portal for the library of integrated network-based cellular signatures (LINCS) program: Integrated access to diverse large-scale cellular perturbation response data.Nucleic Acids Res...

  25. [25]

    The relationship between precision-recall and ROC curves

    Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the International Conference on Machine Learning, pages 233–240, 2006. 11

  26. [26]

    The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets.PLOS ONE, 10(3):e0118432, 2015

    Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets.PLOS ONE, 10(3):e0118432, 2015

  27. [27]

    A systematic analysis of performance measures for classification tasks.Information Processing & Management, 45(4):427–437, 2009

    Marina Sokolova and Guy Lapalme. A systematic analysis of performance measures for classification tasks.Information Processing & Management, 45(4):427–437, 2009

  28. [28]

    Qwen/Qwen3.5-9B model card

    Qwen Team. Qwen/Qwen3.5-9B model card. https://huggingface.co/Qwen/Qwen3.5-9 B, 2026. Accessed 2026-05-07

  29. [29]

    GLM-4.5-Air model card

    Z.AI. GLM-4.5-Air model card. https://huggingface.co/zai-org/GLM-4.5-Air ,

  30. [30]

    Qwen3: Think deeper, act faster

    Qwen Team. Qwen3: Think deeper, act faster. https://qwenlm.github.io/blog/qwen3/,

  31. [31]

    Qwen/Qwen3.5-27B model card

    Qwen Team. Qwen/Qwen3.5-27B model card. https://huggingface.co/Qwen/Qwen3. 5-27B, 2026. Accessed 2026-05-07

  32. [32]

    DeepSeek-V3.2 release

    DeepSeek-AI. DeepSeek-V3.2 release. https://api-docs.deepseek.com/news/news25 1201, 2025. Accessed 2026-05-07

  33. [33]

    ClinicalTrials.gov record for NCT00174655

    ClinicalTrials.gov. ClinicalTrials.gov record for NCT00174655. https://clinicaltria ls.gov/study/NCT00174655, 2011. ClinicalTrials.gov identifier NCT00174655; accessed 2026-04-28

  34. [34]

    ClinicalTrials.gov record for NCT02491333

    ClinicalTrials.gov. ClinicalTrials.gov record for NCT02491333. https://clinicaltria ls.gov/study/NCT02491333, 2022. ClinicalTrials.gov identifier NCT02491333; accessed 2026-04-28

  35. [35]

    Rothwell

    Peter M. Rothwell. Subgroup analysis in randomised controlled trials: Importance, indications, and interpretation.The Lancet, 365(9454):176–186, 2005

  36. [36]

    Kent, Jessica K

    David M. Kent, Jessica K. Paulus, David van Klaveren, Ralph D’Agostino, Steven Goodman, Rodney Hayward, John P. A. Ioannidis, Bray Patrick-Lake, Sally Morton, Michael Pencina, Gowri Raman, Joseph S. Ross, Harry P. Selker, Ravi Varadhan, Andrew Vickers, John B. Wong, and Ewout W. Steyerberg. The predictive approaches to treatment effect heterogeneity (PATH...

  37. [37]

    Mesirov, and Pablo Tamayo

    Arthur Liberzon, Chet Birger, Helga Thorvaldsdóttir, Mahmoud Ghandi, Jill P. Mesirov, and Pablo Tamayo. The molecular signatures database hallmark gene set collection.Cell Systems, 1(6):417–425, 2015

  38. [38]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id=qw en3.5, 2026. Accessed 2026-05-03

  39. [39]

    An introduction to ROC analysis.Pattern Recognition Letters, 27(8):861–874, 2006

    Tom Fawcett. An introduction to ROC analysis.Pattern Recognition Letters, 27(8):861–874, 2006

  40. [40]

    scenario_context

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...