Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision
Pith reviewed 2026-05-10 17:09 UTC · model grok-4.3
The pith
A supervision construction method automatically creates support and non-support examples so that a trained verifier can succeed only by checking whether the evidence actually backs the claim for a given case.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The supervision construction procedure generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. Training a verifier on the resulting task produces models that substantially outperform case-only and evidence-only baselines, remain strong when evidence is correct, and collapse when evidence is removed or swapped, showing genuine dependence on whether the evidence supports the claim for the case.
What carries the argument
The supervision construction procedure that pairs case context with evidence and claims, then automatically produces support instances plus counterfactual wrong-state and topic-related negative instances to encode evidence dependence.
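The construction can be sketched concretely. The helper names and the radiology snippets below are illustrative assumptions, not the authors' actual pipeline: each case yields one support example plus the two controlled negative types the paper names.

```python
# Hypothetical sketch of the supervision construction: one support example
# plus a counterfactual wrong-state negative and a topic-related negative.
# All names and text snippets are illustrative, not the authors' API.

def make_wrong_state_evidence(evidence: str, wrong_state: str) -> str:
    """Counterfactual negative: same topic, but asserting the wrong state."""
    return evidence.replace("present", wrong_state)

def build_examples(case: str, evidence: str, claim: str,
                   wrong_state: str, distractor_evidence: str):
    """Return (case, evidence, claim, label) tuples: one support, two non-support."""
    return [
        (case, evidence, claim, 1),                                          # support
        (case, make_wrong_state_evidence(evidence, wrong_state), claim, 0),  # wrong-state
        (case, distractor_evidence, claim, 0),                               # topic-related
    ]

examples = build_examples(
    case="Chest X-ray: right-sided pleural effusion present.",
    evidence="Pleural effusion is present when blunting of the costophrenic angle is seen.",
    claim="The case shows pleural effusion.",
    wrong_state="absent",
    distractor_evidence="Cardiomegaly is assessed via the cardiothoracic ratio.",
)
```

The point of the two negative types is that both stay semantically close to the case, so label 0 cannot be predicted from surface topic alone.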
If this is right
- Verifiers can be trained to exhibit genuine evidence dependence for case-specific decisions without requiring manual support labels.
- The same construction procedure scales to produce large supervision sets for evidence-grounded reasoning in domains such as radiology.
- Trained models transfer their evidence-checking behavior to unseen evidence articles and external case distributions.
- Performance remains sensitive to shifts in the source of evidence and to the choice of underlying model backbone.
Where Pith is reading between the lines
- The same automatic negative generation approach could be applied to build supervision for evidence verification in legal or scientific reasoning tasks.
- Combining this verifier with retrieval systems might create end-to-end pipelines that both fetch and check evidence.
- The sharp drop under evidence swap offers a practical test for whether other models have truly learned to ground predictions in supplied evidence.
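The swap diagnostic in the last point is easy to replicate for any verifier. A minimal sketch, assuming a `verifier(case, evidence, claim) -> 0/1` callable (the interface is hypothetical):

```python
def evidence_sensitivity(verifier, examples):
    """Compare accuracy on original vs. evidence-swapped inputs.

    `examples` is a list of (case, evidence, claim, label) tuples. A large
    accuracy drop under swap suggests the model actually reads the evidence.
    Illustrative sketch, not the paper's evaluation code.
    """
    def accuracy(rows):
        return sum(verifier(c, e, q) == y for c, e, q, y in rows) / len(rows)

    orig = accuracy(examples)
    # Swap each example's evidence with the next example's (cyclic shift).
    evidences = [e for _, e, _, _ in examples]
    swapped = [(c, evidences[(i + 1) % len(examples)], q, y)
               for i, (c, _, q, y) in enumerate(examples)]
    return orig, accuracy(swapped)
```

A verifier that ignores the evidence will score the same on both lists; one that grounds its decision in the evidence should approach chance on the swapped list.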
Load-bearing premise
The automatically generated counterfactual wrong-state and topic-related negative examples accurately represent real cases of evidence non-support without introducing exploitable artifacts or semantic mismatches that a model could learn instead of true verification.
What would settle it
Run the trained verifier on a new test set of human-annotated radiology cases that include known unsupported evidence and measure whether accuracy falls to near-chance levels comparable to the synthetic negative conditions.
Original abstract
Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces case-grounded evidence verification, a framework where a model receives local case context, external evidence, and a structured claim, and decides if the evidence supports the claim. The key contribution is an automatic supervision construction procedure that generates explicit support examples along with semantically controlled non-support examples (counterfactual wrong-state and topic-related negatives) without manual annotation. Instantiated in radiology, the trained verifier outperforms case-only and evidence-only baselines, remains strong with correct evidence, and collapses under evidence removal or swap, which the authors interpret as genuine evidence dependence; this transfers across unseen articles but degrades under source shift and is backbone-sensitive.
Significance. If the synthetic non-support examples faithfully isolate the causal support relation without artifacts, the work offers a scalable route to evidence-sensitive supervision that directly addresses a core bottleneck in evidence-grounded reasoning. The explicit tests of dependence via evidence removal and swap constitute a falsifiable empirical strength that goes beyond standard accuracy metrics. The approach could improve reliability in high-stakes domains such as medical reasoning, provided the supervision quality holds.
Major comments (3)
- [§3] §3 (Supervision Construction): The operational details for defining and generating 'counterfactual wrong-state' negatives in the radiology domain—specifically how wrong-states are identified while preserving case context and avoiding semantic drift—are not provided. This is load-bearing for the central claim, as the observed outperformance and collapse behavior could reflect exploitation of generation artifacts rather than learned verification logic.
- [§4] §4 (Experiments): No dataset sizes, statistical significance tests for the reported performance gains over baselines, or exact implementation details for the case-only and evidence-only baselines are supplied. Without these, it is impossible to assess whether the gains are robust or potentially confounded by differences in training data volume or baseline construction.
- [§4.3] §4.3 (Evidence Sensitivity): The claim of genuine evidence dependence rests on the collapse when evidence is removed or swapped, yet no additional controls (e.g., human validation of negatives or detection of systematic patterns in the synthetic examples) are described to rule out the model learning spurious cues from the negative generation process instead of the support relation.
Minor comments (2)
- [§3] The notation for the verifier input tuple (case context, evidence, claim) could be formalized more clearly, perhaps with an explicit equation, to aid reproducibility.
- [§4] Figure captions and axis labels in the evidence-sensitivity plots would benefit from explicit mention of the exact swap and removal conditions tested.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important areas for improving clarity in the supervision construction and experimental reporting, which we will address in the revision. We respond to each major comment below.
Point-by-point responses
Referee: [§3] §3 (Supervision Construction): The operational details for defining and generating 'counterfactual wrong-state' negatives in the radiology domain—specifically how wrong-states are identified while preserving case context and avoiding semantic drift—are not provided. This is load-bearing for the central claim, as the observed outperformance and collapse behavior could reflect exploitation of generation artifacts rather than learned verification logic.
Authors: We agree that expanding the operational details in §3 is necessary for transparency. In the revised manuscript, we will add a dedicated subsection describing the counterfactual wrong-state generation: wrong-states are derived from a predefined set of radiology-specific state transitions (e.g., presence/absence of findings like effusion or consolidation, drawn from clinical literature on common diagnostic alternatives) applied only to the claim-relevant portions of the case. Case context is preserved by leaving all unrelated findings and history unchanged, and semantic drift is controlled via template-based edits followed by automatic consistency checks against the original case text. These additions will clarify that the negatives isolate the support relation without introducing exploitable artifacts. revision: yes
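The wrong-state generation the rebuttal outlines can be sketched as a transition table applied to the claim-relevant finding only, with a crude consistency check. The transition table, helper names, and check are assumptions for illustration, not the authors' implementation:

```python
# Minimal sketch of a counterfactual wrong-state edit: a predefined table of
# radiology-specific state transitions is applied to one finding, leaving the
# rest of the text untouched. Table contents are illustrative assumptions.

STATE_TRANSITIONS = {
    "effusion": ("present", "absent"),
    "consolidation": ("present", "absent"),
}

def make_wrong_state(claim, finding):
    """Flip the state of one finding in the claim; return None if the
    transition does not apply (finding or state not mentioned)."""
    orig_state, flipped = STATE_TRANSITIONS.get(finding, (None, None))
    if orig_state is None or finding not in claim or orig_state not in claim:
        return None
    return claim.replace(orig_state, flipped, 1)

def passes_consistency_check(edited, original):
    """Crude consistency check: the edit must exist and actually change the text."""
    return edited is not None and edited != original
```

Template-based edits of this form keep everything except the targeted state constant, which is what lets the negatives isolate the support relation.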
Referee: [§4] §4 (Experiments): No dataset sizes, statistical significance tests for the reported performance gains over baselines, or exact implementation details for the case-only and evidence-only baselines are supplied. Without these, it is impossible to assess whether the gains are robust or potentially confounded by differences in training data volume or baseline construction.
Authors: We accept that these details are required for reproducibility and robustness assessment. The revised version will report exact dataset sizes (e.g., counts of support examples and each negative type in train/dev/test splits), include statistical significance tests (paired bootstrap or McNemar's test with p-values for gains over baselines), and specify baseline implementations: case-only receives the structured claim plus full case context; evidence-only receives the claim plus evidence text; both use identical training regimes, data volumes, and hyperparameters as the full model. revision: yes
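Of the significance tests the response names, McNemar's test has a compact exact form on the discordant pairs. A stdlib-only sketch of that standard construction (not the authors' code):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on the discordant pairs.

    b = items only model A classified correctly,
    c = items only model B classified correctly.
    Returns the p-value under H0 that both models err equally often,
    i.e. a two-sided binomial tail with p = 0.5 on n = b + c trials.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    p = sum(comb(n, i) for i in range(0, k + 1)) * 2 / 2 ** n
    return min(1.0, p)
```

For example, 1 vs. 9 discordant pairs gives p ≈ 0.021, so a split that lopsided would count as significant at the usual 0.05 level.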
Referee: [§4.3] §4.3 (Evidence Sensitivity): The claim of genuine evidence dependence rests on the collapse when evidence is removed or swapped, yet no additional controls (e.g., human validation of negatives or detection of systematic patterns in the synthetic examples) are described to rule out the model learning spurious cues from the negative generation process instead of the support relation.
Authors: The referee correctly notes that the dependence tests would be strengthened by ruling out generation artifacts. We will add to §4.3 a human validation experiment on a random sample of 200 synthetic negatives (with inter-annotator agreement) confirming they do not support the claims, plus an analysis checking for systematic lexical or structural cues in the negatives. The existing removal/swap results already provide falsifiable evidence of dependence, but these controls will further address the concern. This constitutes a partial revision as the core experiments remain unchanged. revision: partial
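Inter-annotator agreement on the proposed 200-negative sample could be summarized with Cohen's kappa. A minimal sketch for binary labels (the statistic is standard; the function is not from the paper):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' binary (0/1) labels on the same items:
    observed agreement corrected for the agreement expected by chance."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    p1a = sum(a) / n                                # annotator A's rate of label 1
    p1b = sum(b) / n                                # annotator B's rate of label 1
    pe = p1a * p1b + (1 - p1a) * (1 - p1b)          # chance agreement
    return 1.0 if pe == 1.0 else (po - pe) / (1 - pe)
```

Kappa near 1 on the sampled negatives would support the claim that they genuinely fail to support the claims; kappa near 0 would mean the labels are no better than chance agreement.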
Circularity Check
No significant circularity in empirical supervision framework
Full rationale
The paper presents a data-construction procedure for generating support and non-support examples (including counterfactual wrong-state and topic-related negatives) for training an evidence verifier, followed by empirical training and evaluation against baselines plus ablation tests on evidence removal/swap. No equations, fitted parameters, or derivations are described that reduce any claimed prediction or result to the inputs by construction. The central claims rest on measurable performance differences in held-out testing and transfer settings, which are independent of any self-referential definitions or self-citation chains. The method is self-contained and externally falsifiable via the described generation rules and model behavior.
Axiom & Free-Parameter Ledger
Free parameters (1)
- verifier model hyperparameters
Axioms (1)
- Domain assumption: generated counterfactual and topic-related negatives faithfully represent evidence non-support without introducing model-exploitable shortcuts.