Eliciting associations between clinical variables from LLMs via comparison questions across populations
Pith reviewed 2026-05-08 12:56 UTC · model grok-4.3
The pith
LLMs can recover clinical variable associations through structured patient comparison questions rather than direct queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Indirect elicitation via triplet comparisons can recover meaningful association structure from LLMs and offer a cautious route from implicit correlations to causal statements that are congruent with LLM answering patterns.
What carries the argument
Patient comparison triplet questions fed to an LLM, combined with a statistical model that turns similarity decisions into correlation estimates and prompt-level environment shifts that enable invariant causal prediction across subpopulations.
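As a concrete illustration, a triplet comparison query of this kind might be constructed as follows. The prompt wording, variable names, and values are hypothetical placeholders, not the paper's actual templates:

```python
# Hypothetical triplet-prompt builder; the exact wording is an assumption,
# not taken from the paper's prompt templates.
def triplet_prompt(anchor, cand_b, cand_c, var1, var2=None, var2_value=None):
    # The optional second variable is shown only for patient B, so that a
    # shift in the similarity decision reveals the implicit var1-var2 link.
    extra = ""
    if var2 is not None:
        extra = f" Additionally, patient B has {var2} = {var2_value}."
    return (
        f"Patient A has {var1} = {anchor}. "
        f"Patient B has {var1} = {cand_b}; patient C has {var1} = {cand_c}."
        f"{extra} Considering {var1}, is patient B or patient C more similar "
        f"to patient A? Answer with a single letter, B or C."
    )

print(triplet_prompt(55, 48, 70, "FEV1 % predicted",
                     "smoking history", "30 pack-years"))
```

Comparing the LLM's answer rate with and without the `extra` clause is what lets the statistical model turn binary similarity decisions into a correlation estimate.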
If this is right
- Elicited correlations are smooth, stable, and clinically interpretable across prompted environments.
- Statistically significant variation across environments supports downstream invariance testing.
- Invariant causal prediction yields a small set of candidate invariant parent links.
- The method avoids direct elicitation while remaining congruent with the LLM's own answering patterns.
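The invariance step in these bullets can be sketched on synthetic data. Everything below is a hypothetical stand-in: two simulated "environments", made-up variables, and a crude coefficient-stability check in place of ICP's formal hypothesis test:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two simulated environments: X1 -> Y is the invariant mechanism, while X2
# is a *child* of Y whose link strength shifts across environments.
def make_env(n, gamma):
    x1 = rng.standard_normal(n)
    y = 1.5 * x1 + 0.5 * rng.standard_normal(n)    # invariant mechanism
    x2 = gamma * y + 0.5 * rng.standard_normal(n)  # non-invariant link
    return np.column_stack([x1, x2]), y

envs = [make_env(500, 0.5), make_env(500, 1.5)]

def coefs(X, y, cols):
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    return np.linalg.lstsq(A, y, rcond=None)[0][1:]

# Accept a candidate parent set if its regression coefficients barely move
# across environments (a crude stand-in for ICP's invariance test).
accepted = []
for cols in [(0,), (1,), (0, 1)]:
    b_env1, b_env2 = (coefs(X, y, cols) for X, y in envs)
    if np.max(np.abs(b_env1 - b_env2)) < 0.2:
        accepted.append(cols)

# ICP reports the intersection of all accepted sets: conservative parents.
parents = set.intersection(*map(set, accepted)) if accepted else set()
print(parents)  # only X1 (index 0) survives as a candidate parent
```

The conservatism comes from the intersection: a variable is reported only if every accepted candidate set contains it, which is why ICP typically yields a small set of parent links.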
Where Pith is reading between the lines
- The same triplet-comparison logic could be tried in non-medical domains where LLMs hold implicit structured knowledge but direct questions risk bias.
- If the correlations prove robust, they might serve as a low-cost way to surface hypotheses for later verification in real patient data.
- Extending the prompt environments beyond simple subpopulation shifts could test whether the invariance step still isolates plausible causal candidates.
Load-bearing premise
LLM similarity decisions on comparison triplets faithfully reflect correlations present in the model's training data, and prompt-level environment shifts correspond to distinct subpopulations in a way that enables valid invariance testing.
What would settle it
Whether the correlations obtained from triplet comparisons on COPD and MS variables match the direction and strength of associations documented in independent clinical literature, or whether the candidate invariant parent links remain stable under alternative prompt phrasings.
Original abstract
The training data of large language models (LLMs) comprises a wide range of biomedical literature, reflecting data from many different patient populations. We investigate how it might be possible to recover information on correlation and causal links between patient characteristics, as a key building block for medical decision making. To avoid the pitfalls of direct elicitation, we propose an approach based on structured comparison questions, specifically patient comparison triplet questions. This is combined with a statistical model for the LLM representation that provides estimates of correlations without access to activations or model internals. Intuitively, we consider how similarity decisions of LLMs based on a first variable are affected by providing information on a second variable for one of the patients being assessed. We then induce prompt-level environment shifts to obtain correlation estimates for different subpopulations, which enables an invariant causal prediction (ICP) approach to obtain conservative candidate parent links. We demonstrate the method in two clinical domains, chronic obstructive pulmonary disease (COPD) and multiple sclerosis (MS). Across prompted environments, the elicited correlations are smooth, stable, and clinically interpretable, yet vary in a statistically significant way that supports downstream invariance testing, such that ICP provides a small set of candidate invariant parent links. These results show that indirect elicitation via triplet comparisons can recover meaningful association structure from LLMs and offer a cautious route from implicit correlations to causal statements that are congruent with LLM answering patterns.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an indirect elicitation method using patient comparison triplet questions to recover correlations between clinical variables from LLMs, paired with a statistical model that estimates these associations from similarity decisions without access to model internals or activations. Prompt-level environment shifts are induced to represent distinct subpopulations, enabling application of invariant causal prediction (ICP) to obtain conservative candidate causal parent links. The method is demonstrated in the COPD and MS domains, where the elicited correlations are claimed to be smooth, stable, and clinically interpretable, with statistically significant variation across environments that supports ICP in yielding a small set of invariant parent candidates.
Significance. If the central claims hold, the work offers a cautious, indirect route to extract structured association and causal information from LLMs' implicit biomedical knowledge without direct prompting biases or internal access. This could provide a building block for medical decision support by leveraging the diverse patient populations reflected in LLM training data, while the ICP step adds conservatism to causal interpretations congruent with model answering patterns.
Major comments (2)
- [Methods (environment construction and ICP)] The application of ICP to prompt-induced environments (described in the methods for environment shifts and ICP) is load-bearing for the causal claims, yet the environments are generated solely via linguistic modifications to queries on a single fixed LLM. No demonstration is provided that these shifts produce independent changes in the implicit joint distributions of clinical variables (as required for ICP to isolate invariant mechanisms), rather than shared model artifacts or prompt-specific effects; the abstract's report of statistically significant variation does not address this.
- [Abstract and Results] The abstract and results sections claim 'smooth, stable, and clinically interpretable' correlations plus 'statistically significant variation' supporting ICP, but provide no quantitative metrics, error bars, baseline comparisons, p-values, effect sizes, or details on how the parameters of the statistical model for LLM representation are estimated. This absence undermines evaluation of the evidence strength for the pipeline's outputs and the downstream invariance testing.
Minor comments (1)
- [Methods] The description of the triplet comparison questions and how similarity decisions translate to correlation estimates could be clarified with an explicit equation or pseudocode for the statistical model, to improve reproducibility.
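In the spirit of the clarification requested above, here is a minimal sketch of one way such a model could look. The logistic link between a latent correlation and the triplet choice is our assumption, and simulated answers stand in for real LLM responses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: the LLM prefers patient B over C with probability
# sigmoid(rho * z), where z is B's standardized value on the second variable
# and rho is the latent correlation between the two clinical variables.
def choice_prob(rho, z):
    return 1.0 / (1.0 + np.exp(-rho * z))

true_rho = 0.6
z = rng.standard_normal(2000)                          # shown variable-2 values
choices = rng.random(2000) < choice_prob(true_rho, z)  # simulated LLM answers

# Maximum-likelihood estimate of rho via grid search over [-1, 1].
grid = np.linspace(-1.0, 1.0, 401)
p = choice_prob(grid[:, None], z[None, :])
loglik = np.where(choices, np.log(p), np.log1p(-p)).sum(axis=1)
rho_hat = grid[np.argmax(loglik)]
print(f"rho_hat = {rho_hat:.2f}")  # close to the true value of 0.6
```

An explicit likelihood of this form, fitted once per variable pair and per prompted environment, is the kind of pseudocode the comment asks the authors to supply.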
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications and indicating planned revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Methods (environment construction and ICP)] The application of ICP to prompt-induced environments (described in the methods for environment shifts and ICP) is load-bearing for the causal claims, yet the environments are generated solely via linguistic modifications to queries on a single fixed LLM. No demonstration is provided that these shifts produce independent changes in the implicit joint distributions of clinical variables (as required for ICP to isolate invariant mechanisms), rather than shared model artifacts or prompt-specific effects; the abstract's report of statistically significant variation does not address this.
Authors: We thank the referee for this important observation on the ICP application. The environments are constructed via targeted linguistic modifications to the triplet comparison prompts on a fixed LLM, as detailed in the methods, to probe subpopulation-specific associations reflected in the model's training data. The reported statistically significant variation across environments demonstrates that the shifts meaningfully affect the elicited correlations, providing a basis for invariance testing. We do not assert that these prompt shifts generate fully independent joint distributions equivalent to real-world data sources; instead, ICP is applied conservatively to identify a small set of candidate parent links that remain invariant to the induced variations, ensuring congruence with the LLM's answering behavior. In revision, we will expand the methods and discussion sections to explicitly address the assumptions of prompt-based environments, discuss potential artifacts, and add supplementary robustness checks (e.g., sensitivity to prompt phrasing and cross-LLM consistency where feasible). revision: partial
Referee: [Abstract and Results] The abstract and results sections claim 'smooth, stable, and clinically interpretable' correlations plus 'statistically significant variation' supporting ICP, but provide no quantitative metrics, error bars, baseline comparisons, p-values, effect sizes, or details on how the parameters of the statistical model for LLM representation are estimated. This absence undermines evaluation of the evidence strength for the pipeline's outputs and the downstream invariance testing.
Authors: We agree that greater quantitative detail would improve the evaluation of our claims. The full results section includes estimates from the statistical model (based on a likelihood over triplet similarity decisions), but we will revise the abstract to report specific metrics such as mean correlation strengths with standard errors, p-values for cross-environment variation, and effect sizes. The results will be augmented with error bars on all figures and tables, explicit baseline comparisons (including direct elicitation where applicable), and a clearer description of parameter estimation (including the model's formulation, optimization procedure, and any regularization). These changes will directly support the claims of smoothness, stability, interpretability, and the validity of the ICP step. revision: yes
Circularity Check
No significant circularity; method applies external ICP to elicited LLM data
Full rationale
The derivation proceeds by eliciting pairwise associations via triplet comparison questions posed to an LLM, fitting a statistical model to those responses to obtain correlation estimates, inducing prompt-based environment shifts, and then applying the external ICP procedure to identify invariant parent candidates. No equation or step reduces a claimed prediction to a fitted parameter by construction, no term is defined in terms of its output, and no load-bearing premise rests on a self-citation chain. The central result is an empirical procedure whose outputs are falsifiable against clinical knowledge and whose inputs (LLM answers) are generated independently of the ICP step. This is the normal non-circular case for a methodological proposal.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Parameters of the statistical model for the LLM representation.
Axioms (2)
- Domain assumption: LLM similarity decisions on comparison questions reflect correlations in the training data.
- Domain assumption: prompt-level environment shifts induce distinct subpopulation environments.