BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

Elias Hossain; Niloofar Yousefi; Sabera Akter Bushra; Sanjeda Sara Jennifer

arxiv: 2606.11208 · v1 · pith:SWSJJALBnew · submitted 2026-04-23 · 💻 cs.CL · cs.AI

Pith reviewed 2026-07-04 20:09 UTC · model glm-5.2

keywords: contextual contradiction · article-disjoint evaluation · biomedical NLP · benchmark leakage · divergence ontology · claim verification · silver annotation · natural language inference

The pith

Article-disjoint splits expose memorization in contradiction benchmarks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that when two biomedical findings appear to disagree, the disagreement is usually explained by hidden contextual differences — different cohorts, geographies, assay protocols, disease subtypes — rather than by genuine logical incompatibility. Existing natural-language-inference benchmarks collapse this structure into flat entail/contradict/neutral labels, which overstates disagreement in the literature and hides whether a model understands *why* two claims diverge. To address this, the authors introduce BioDivergence, a framework with a six-class conflict taxonomy and a 13-axis divergence ontology that requires models to recover, for each claim pair, the conflict type, the specific contextual axes driving the divergence, the dominant confounder, and a reconciliation explanation. The benchmark is built from 202,180 biomedical abstracts yielding 11,865 silver-labelled claim pairs across five domains. The central methodological move is the article-disjoint split: because a single abstract generates many claim pairs, conventional pair-level deduplication allows the same article to appear in both training and test sets, so a model can score well by memorizing article-level patterns rather than learning contextual reasoning. When the authors enforce zero article, claim, or pair overlap across splits, the fine-tuned reference model drops approximately 12 points of contextual-contradiction F1 (from 0.521 to 0.401 ± 0.017). This drop is invisible under conventional pair-level reporting and is the paper's load-bearing empirical finding: it demonstrates that a meaningful share of benchmark performance on prior contradiction tasks may reflect article-level memorization rather than genuine contextual-reasoning capability. The paper positions BioDivergence not as a single dataset but as an evaluation lens that separates these two phenomena.

Core claim

The paper's central discovery is that article-level overlap in contradiction benchmarks inflates model scores by a measurable amount — roughly 12 F1 points for contextual-contradiction detection — and that this inflation is invisible under the pair-level deduplication that most benchmarks use. By constructing an article-disjoint split with zero article, claim, or pair overlap across train, dev, and test, the authors show that a fine-tuned BiomedBERT-based reference model retains only 0.401 contextual-F1 (down from 0.521), while a zero-shot Mistral-7B model that was never fine-tuned reaches 0.389 on the same split, nearly matching the fine-tuned model. This convergence suggests that much of a

What carries the argument

The central mechanism is the article-disjoint split construction: connected components of articles (linked through any shared claim pair) are kept intact within a single split, and pair_id duplicates are resolved in favour of LLM-labelled copies. This guarantees zero article, claim, or pair overlap across train, dev, and test. The secondary mechanism is the structured multi-output task formulation: instead of a single contradiction label, each claim pair requires four outputs (conflict type, divergence axes, dominant confounder, reconciliation explanation), which makes the 'why' of disagreement directly measurable rather than collapsed into a binary.
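The connected-component construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `(pair_id, article_a, article_b)` schema, the 80/10/10 target, and the greedy fill are all assumptions, and the resolution of `pair_id` duplicates in favour of LLM-labelled copies is not modelled.

```python
from collections import defaultdict


def article_disjoint_split(pairs, fractions=(0.8, 0.1, 0.1)):
    """Split claim pairs so that no article appears in more than one split.

    `pairs` holds (pair_id, article_a, article_b) tuples; this schema and the
    80/10/10 target are assumptions made for illustration.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union the two articles of every pair: articles linked through any
    # shared claim pair end up in the same connected component.
    for _, a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    components = defaultdict(list)
    for pair in pairs:
        components[find(pair[1])].append(pair)

    names = ("train", "dev", "test")
    targets = {n: f * len(pairs) for n, f in zip(names, fractions)}
    splits = {n: [] for n in names}
    # Greedy fill: place each component (largest first) into the split that is
    # currently furthest below its target size. Components are never broken up.
    for group in sorted(components.values(), key=len, reverse=True):
        name = max(names, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(group)
    return splits
```

Because whole components are assigned, the realised split sizes only approximate the targets; one very large component can dominate a split, which is one plausible reason an article-disjoint test set ends up with a different size and label mix than a pair-level one.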

If this is right

Benchmark designers for scientific claim verification should audit article-level overlap in their splits, since pair-level deduplication alone can leave over 90% of test rows sharing an article with training data.
Models that perform well on coarse contradiction classification may still fail at recovering the contextual axes that explain divergence, meaning that flat F1 scores overstate a model's biomedical reasoning capability.
The finding that gold divergence axes lift dominant-confounder accuracy from 0.561 to 0.690 suggests that axis prediction quality is the current ceiling for structured contradiction reasoning, not the downstream classification heads.
Practitioners evaluating contextual-contradiction capability should report on article-disjoint splits and include a non-annotator-family zero-shot baseline to avoid annotator bias in scores.
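The overlap audit recommended in the first point above is cheap to run. The sketch below computes the share of test pairs that have at least one source article in the training set, the kind of statistic behind the 93.6% figure cited for the legacy split. The `(pair_id, article_a, article_b)` row schema is an assumption for illustration.

```python
def article_overlap_rate(train_pairs, test_pairs):
    """Fraction of test pairs with at least one article also seen in train.

    Each row is a (pair_id, article_a, article_b) tuple (hypothetical schema).
    """
    train_articles = {a for _, x, y in train_pairs for a in (x, y)}
    if not test_pairs:
        return 0.0
    leaked = sum(
        1 for _, x, y in test_pairs
        if x in train_articles or y in train_articles
    )
    return leaked / len(test_pairs)


train = [("p1", "A", "B")]
test = [("p2", "B", "C"), ("p3", "D", "E")]
print(article_overlap_rate(train, test))  # one of two test pairs shares article B
```

A rate of zero is what an article-disjoint split guarantees by construction; any positive value under pair-level deduplication is a leakage vector worth reporting next to the scores.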

Load-bearing premise

All benchmark labels were produced by a single LLM annotator (Qwen2.5-7B) with no human-adjudicated gold subset for validation. A cross-family sensitivity check with Llama-3.1-8B yielded only fair agreement (Cohen's κ = 0.20), and the authors explicitly state this does not validate the labels. If the annotator systematically misrepresents the structure of contextual contradiction, all downstream model scores measure agreement with that annotator's labeling behaviour rather

What would settle it

If a human-adjudicated gold subset showed that Qwen2.5-7B's labels for contextual contradiction diverge substantially from expert biomedical judgement — particularly on the divergence-axis assignments and dominant-confounder selections — then all model scores on the benchmark would measure agreement with the annotator rather than contextual reasoning, and the 12-point article-disjoint drop would be an artifact of annotator consistency patterns rather than a genuine memorization signal.

Original abstract

Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting can make both claims locally valid. Existing NLI and scientific claim-verification benchmarks reduce such cases to entailment, contradiction, or neutral, failing to capture the contextual structure behind divergence. To address this, we introduce BioDivergence, an evaluation framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair: conflict type, divergence axes, dominant confounder, and reconciliation explanation. We release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, alongside a legacy deduplicated variant for comparison. Results show notable ranking differences between the two variants, with the fine-tuned reference model dropping about 12 points under the article-disjoint setting, while Mistral-7B-Instruct-v0.3 achieves 0.5523 accuracy and 0.3894 contextual-F1 on the 842-example primary test set. BioDivergence offers a more faithful way to distinguish contextual divergence from direct contradiction and to separate article-level memorization from genuine task learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The stress-test concern about the 12-point F1 drop is correct and it undermines the paper's central empirical claim. But the benchmark framework and ontology are a genuine contribution worth taking seriously.

The stress-test note is right, and it lands hard. The paper's headline finding is that the fine-tuned reference model drops ~12 points of contextual-contradiction F1 when moving from legacy pair-level packaging to the article-disjoint split (0.521 → 0.401), which they attribute to article-level memorization. But Table 4 shows that zero-shot Mistral-7B, which is never fine-tuned and cannot memorize anything, drops by almost the same amount on the same metric (0.5019 → 0.3894, ~11.25 points). A model that cannot memorize should not show a memorization effect. The real explanation is that the two test sets differ in size (2500 vs 842), label distribution (3 of 6 conflict classes have zero examples in the primary test), and training set composition (8750 vs 10183 examples). The paper changes multiple variables simultaneously and attributes the entire delta to one. The actual memorization signal, if any, is at most ~0.75 points. The paper does not acknowledge this confound anywhere. That's a significant problem because this delta is the empirical load-bearing claim for positioning BioDivergence as 'an evaluation lens rather than a single dataset.'
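The arithmetic behind the "at most ~0.75 points" figure is a simple difference of differences, reproduced here from the four scores quoted in the letter. Treating the zero-shot drop as a pure composition effect that applies equally to the fine-tuned model is an assumption of this reading, not something the paper establishes.

```python
def drop_decomposition(ft_legacy, ft_primary, zs_legacy, zs_primary):
    """Return (fine-tuned drop, zero-shot drop, residual) in F1 points."""
    ft_drop = (ft_legacy - ft_primary) * 100
    zs_drop = (zs_legacy - zs_primary) * 100
    # The residual is the part of the fine-tuned drop that the zero-shot
    # model's drop does not already account for.
    return ft_drop, zs_drop, ft_drop - zs_drop


ft_drop, zs_drop, residual = drop_decomposition(0.521, 0.401, 0.5019, 0.3894)
print(f"fine-tuned drop: {ft_drop:.2f} points")
print(f"zero-shot drop:  {zs_drop:.2f} points")
print(f"residual:        {residual:.2f} points")
```

The residual is also smaller than the ± 0.017 (1.7-point) spread reported for the reference model on the article-disjoint split, so on these numbers alone it cannot be distinguished from seed noise.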

Referee Report

1 major / 7 minor

Summary. This paper introduces BioDivergence, a benchmark and evaluation framework for analyzing contextual contradictions in biomedical abstracts. The framework includes a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured prediction targets per claim pair (conflict type, divergence axes, dominant confounder, reconciliation explanation). The authors release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, with labels generated by Qwen2.5-7B. A legacy pair-level-deduplicated variant is retained for comparison. The central empirical claim is that article-disjoint evaluation separates article-level memorization from genuine task learning, evidenced by a ~12-point contextual-contradiction F1 drop for the fine-tuned reference model (0.521 → 0.401) when moving from the legacy to the article-disjoint split. The paper also reports a non-Qwen zero-shot baseline (Mistral-7B-v0.3) as an annotator-family-independent check.

Significance. The paper addresses a genuine gap: existing NLI and claim-verification benchmarks collapse context-conditioned divergence into flat contradiction labels, and none evaluate whether models can recover *why* two findings diverge. The six-class taxonomy and 13-axis ontology are well-motivated, and the article-disjoint split design is a meaningful methodological contribution. The authors deserve credit for releasing code, data, and provenance artifacts, for conducting a three-seed reference-model protocol, for including a non-Qwen annotator-family baseline, and for being transparent about silver-label limitations (only 2/6 classes reliably populated, κ=0.20 cross-annotator agreement). The construction pipeline is large-scale (202,180 abstracts, 527,907 claims) and reproducible. However, the load-bearing empirical claim about memorization separation has a confound that undermines the paper's central positioning, as detailed below.

major comments (1)

§7.2 and Appendix M.5: The paper attributes the reference model's ~12-point contextual-contradiction F1 drop (0.521 → 0.401) to article-level memorization, stating this is 'exactly what a leakage-aware benchmark should produce when article memorisation and contextual reasoning are separated.' However, Table 4 shows that the zero-shot Mistral-7B-v0.3 model, which is never fine-tuned and therefore cannot memorize training articles, experiences a nearly identical drop: Ctx F1 falls from 0.5019 (legacy, n=2500) to 0.3894 (primary, n=842), a drop of ~11.25 points. Since a zero-shot model's score difference between two test sets can only reflect test-set composition differences (different sizes and different label distributions; Appendix M.5 notes that 3 of 6 conflict classes have n_test=0 in the primary test set), the reference model's ~12-point drop is almost entirely explained by these same composition differences, not by memorization.

minor comments (7)

§5.2: The pilot comparison (Table 27) uses Qwen2.5-72B while the full release uses Qwen2.5-7B. Table 28 shows the 7B annotator assigns a dominant confounder on 100% of pairs vs. 35% for 72B, and produces 20 unique axis strings (some off-schema) vs. 12. This is a substantial quality gap that should be discussed more prominently in the main text rather than buried in Appendix I.1.
Table 2: The reference model's primary-split accuracy (0.693) is higher than its legacy-split accuracy (0.588), which seems counterintuitive given the memorization narrative. The paper explains this in Appendix M.5 (sparse classes removed), but this should be acknowledged in §7.2 to avoid confusion.
§4: The 'unknown_latent_factor' axis is essentially a catch-all for unexplained disagreement. Its inclusion in the ontology should be justified more carefully, as it risks becoming a dumping ground for annotation failures.
Table 18: The training split contains 1,839 'underspecified_apparent_contradiction' examples while dev and test contain only 6 and 11 respectively. This extreme asymmetry is noted but its impact on model training (the model sees many such examples in training but virtually none at test time) should be discussed.
§7.5: The confidence-filtered slice (591 examples) is described as using a 'provenance-based confidence proxy' but the proxy itself is not defined. Clarify what this proxy is.
The paper uses both 'BioDivergence' and 'ConflictTopology' in dataset names/URLs (e.g., HuggingFace: EliasHossain/ConflictTopology-Silver-v1.0). Consistent naming would help discoverability.
Several references are incomplete: 'Anonymous [2024]' for the factuality metrics paper, and some NeurIPS citations lack page numbers or DOIs.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for a careful and substantive reading of our work. The referee correctly identifies a confound in our central empirical claim about memorization separation. We agree this requires revision and explain below how we will address it.

Referee: §7.2 and Appendix M.5: The paper attributes the reference model's ~12-point contextual-contradiction F1 drop (0.521 → 0.401) to article-level memorization, but the zero-shot Mistral-7B-v0.3 model, which is never fine-tuned and cannot memorize training articles, experiences a nearly identical drop (0.5019 → 0.3894, ~11.25 points). Since a zero-shot model's score difference between two test sets can only reflect test-set composition differences (different sizes, different label distributions — 3 of 6 conflict classes have n_test=0 in the primary test set), the reference model's ~12-point drop is almost entirely explained by these same composition differences, not by memorization.

Authors: The referee is correct, and we acknowledge this confound without reservation. The core issue is straightforward: if a zero-shot model that cannot memorize training articles exhibits a drop of similar magnitude (~11.25 points) to the fine-tuned reference model (~12 points) when moving from the legacy test set to the primary test set, then the majority of the reference model's drop is attributable to test-set composition differences rather than to article-level memorization. We cannot honestly claim that the ~12-point delta 'cleanly measures how much of the legacy pair-level score came from article-level overlap rather than task learning,' as we currently state in Appendix M.5. The referee has identified a genuine logical flaw in our interpretation. We will revise the manuscript as follows: (1) We will remove the claim that the F1 drop is 'attributable to article-level overlap' and the statement that the pattern is 'exactly what a leakage-aware benchmark should produce when article memorisation and contextual reasoning are separated.' (2) We will add an explicit discussion of the composition confound, noting that the primary test set differs from the legacy test set in size (842 vs. 2,500), label distribution (3 of 6 classes have n_test=0 in the primary split), and class balance, and that these differences alone can account for most of the observed F1 change for both models. (3) We will reframe the contribution of the article-disjoint split: its value is that it eliminates a known leakage vector (documented in Table 1: 93.6% of legacy test rows have at least one abstract in train), not that the observed F1 delta isolates memorization from composition effects. The split is still methodologically necessary: a benchmark that allows 93.6% article overlap cannot support clean

revision: no

standing simulated objections (not resolved)

We cannot disentangle the memorization effect from the test-set composition effect using the current experimental design. To cleanly isolate memorization, one would need to evaluate the reference model trained on the legacy split (with article overlap) on both the legacy test set and a composition-matched article-disjoint test set, or alternatively construct a test set with the same label distribution as the legacy test set but with zero article overlap. Neither experiment was run, and we cannot retroactively attribute the F1 delta to memorization with the data currently in the paper.
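One way to approximate the composition-matched test set described above is to subsample the article-disjoint test set until its label mix follows the legacy test set. The sketch below is an assumption-laden illustration: the `(label, example)` row format is hypothetical, and classes absent from either side are simply dropped, which matters here because three of six classes have no examples in the primary test set.

```python
from collections import Counter, defaultdict
from fractions import Fraction
import math
import random


def composition_matched_sample(rows, reference_labels, seed=0):
    """Subsample `rows` so its label mix follows `reference_labels`.

    `rows` is a list of (label, example) tuples from the article-disjoint
    test set; `reference_labels` lists the labels of the legacy test set.
    Classes missing from either side are dropped (an assumption of this sketch).
    """
    by_label = defaultdict(list)
    for label, example in rows:
        by_label[label].append((label, example))
    ref_counts = Counter(reference_labels)
    shared = [c for c in ref_counts if c in by_label]
    if not shared:
        return []
    # Largest exact scale factor such that no class is over-drawn.
    scale = min(Fraction(len(by_label[c]), ref_counts[c]) for c in shared)
    rng = random.Random(seed)
    sample = []
    for c in sorted(shared):
        k = math.floor(scale * ref_counts[c])  # never exceeds the class size
        sample.extend(rng.sample(by_label[c], k))
    return sample
```

Scoring both the fine-tuned and the zero-shot model on such a matched sample, and on the legacy test set restricted to the same classes, would give a drop in which composition is held roughly fixed. The matched sample may be small, so confidence intervals would be needed before reading anything into the difference.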

Circularity Check

0 steps flagged

No significant circularity found; the paper's derivation chain is self-contained and transparent about construction artifacts.

The paper's central empirical claim—the ~12-point contextual-F1 drop under article-disjoint evaluation—is a comparison between two evaluation settings on the same labeled pool, not a quantity defined in terms of itself. The silver labels are produced by Qwen2.5-7B (§5.2), the reference model is trained on those labels, and evaluated on held-out labels from the same annotator—this is standard supervised benchmark methodology, not circularity. The paper explicitly acknowledges the Qwen3-8B/Qwen2.5 annotator-family overlap and provides Mistral-7B-v0.3 as an annotator-family-independent reference (Tables 2, 4, 5). Construction-verification artifacts (template baseline slot-overlap = 0.9956 in Table 7; lexical span heuristic dominance in Table 8) are transparently flagged as construction floors rather than capability measurements (§7.3, Appendix D). No self-citations appear in the reference list; the framework's premises rest on external benchmarks (FEVER, SciFact, VitaminC) and the paper's own construction pipeline. The skeptic's concern that the ~12-point drop is confounded by test-set differences (Mistral also drops ~11 points) is a correctness/interpretation risk, not a circularity issue—the paper is not defining a quantity in terms of its own inputs. The minor score of 1 reflects that Qwen3-8B results are still reported alongside the independent Mistral baseline despite the acknowledged family overlap, but this is not load-bearing for the central claim since the headline numbers anchor to the Mistral comparison and the article-disjoint primary split (Table 2).

Axiom & Free-Parameter Ledger

5 free parameters · 4 axioms · 1 invented entity

The benchmark introduces a hand-designed ontology (13 axes, 6 classes) and relies on LLM-produced silver labels without human validation. The free parameters are pipeline thresholds, not fitted to model performance, so they do not create circularity. The main epistemic risk is the unvalidated silver-label axiom.

free parameters (5)

λ_ct, λ_ax, λ_dc (task loss weights) = 1.0, 1.0, 0.3
Hand-set loss weights for the reference model (Table 32); not tuned via search but chosen by the authors.
Confidence threshold for heuristic top-up = 0.45
Train-split heuristic top-up threshold (§5.1) controlling which heuristic-labeled examples enter training.
Candidate-pair mining score weights = 0.35, 0.35, 0.15, 0.10, p_year
Weights in s_combined = 0.35*s_sim + 0.35*s_dis + 0.15*s_ent + 0.10*s_type - p_year (Appendix G.4); hand-set.
Semantic similarity threshold for pair retention = 0.40
Minimum semantic similarity for candidate pair retention (Appendix G.4).
Combined score threshold for pair retention = 0.25
Minimum combined score for candidate pair retention (Appendix G.4).
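The candidate-pair scoring and the two retention thresholds listed above fit together roughly as follows. The weights, the subtraction of `p_year`, and the 0.40 and 0.25 thresholds come from the ledger; applying both thresholds conjunctively with inclusive comparisons is an assumption of this sketch, and the component scores themselves are taken as given inputs because their definitions are not stated here.

```python
def combined_score(s_sim, s_dis, s_ent, s_type, p_year):
    """s_combined = 0.35*s_sim + 0.35*s_dis + 0.15*s_ent + 0.10*s_type - p_year."""
    return 0.35 * s_sim + 0.35 * s_dis + 0.15 * s_ent + 0.10 * s_type - p_year


def retain_pair(s_sim, s_dis, s_ent, s_type, p_year,
                min_sim=0.40, min_combined=0.25):
    """Keep a candidate pair only if it clears both thresholds.

    Requiring both conditions at once (and using >=) is an assumption.
    """
    score = combined_score(s_sim, s_dis, s_ent, s_type, p_year)
    return s_sim >= min_sim and score >= min_combined


print(retain_pair(0.5, 0.5, 0.2, 0.0, 0.0))  # similar enough, score 0.38
print(retain_pair(0.3, 1.0, 1.0, 1.0, 0.0))  # fails the similarity floor
```

The four positive weights sum to 0.95, so even a pair with every component at 1.0 and no year penalty scores below 1; the 0.25 threshold should be read against that ceiling.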

axioms (4)

ad hoc to paper · Qwen2.5-7B silver labels are sufficiently accurate to serve as evaluation targets for contextual contradiction analysis
§5.2: The full release is relabeled with qwen2.5:7b via Ollama. No human validation is performed. This axiom underpins all benchmark-level claims.
domain assumption · Rule-based claim extraction produces claim-bearing sentences suitable for contradiction analysis
§5.1, Appendix H: A deterministic rule-based extractor produces 527,907 claims. The quality of extracted claims determines the quality of mined pairs.
domain assumption · The 13-axis divergence ontology is complete for biomedical contextual divergence
§4, Appendix E.1: The ontology is designed by the authors. The unknown_latent_factor slot acknowledges incompleteness, but the axis inventory itself is an unvalidated design choice.
domain assumption · Article-disjoint splitting eliminates memorization as a confound for contextual reasoning evaluation
§5.1, Appendix M: Connected components of articles are kept within single splits. This assumes memorization operates at article level rather than claim or pair level.

invented entities (1)

unknown_latent_factor axis · no independent evidence
purpose: Catch-all slot for contextual disagreement not identifiable from the abstract
§4, Table 14: A schema slot for unexplained disagreement. No falsifiable handle; it absorbs any case where the annotator cannot identify a specific axis.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

[1] FEVER: a Large-scale Dataset for Fact Extraction and VERification. NAACL.
[2] Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence. NAACL.
[3] Fact or Fiction: Verifying Scientific Claims. EMNLP.
[4] SCIFACT-OPEN: Towards Open-domain Scientific Claim Verification. Findings of EMNLP.
[5] Evidence Inference 2.0: More Data, Better Models. BioNLP Workshop.
[6] Contexts and Contradictions: A Roadmap for Computational Drug Repurposing with Knowledge Inference. Briefings in Bioinformatics.
[7] HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models. npj Digital Medicine.
[8] ClashEval: Quantifying the Tug-of-War Between an LLM's Internal Prior and External Evidence. NeurIPS Datasets and Benchmarks Track.
[9] ConflictBank: A Benchmark for Evaluating Knowledge Conflicts in Large Language Models. NeurIPS Datasets and Benchmarks Track.
[10] WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia. NeurIPS Datasets and Benchmarks Track.
[11] MEDIQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning. NeurIPS.
[12] DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models. NeurIPS Datasets and Benchmarks Track.
[13] RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation. NeurIPS Datasets and Benchmarks Track.
[14] Measuring What Matters: Construct Validity in Large Language Model Benchmarks. NeurIPS.
[16] Anonymous. Do automatic factuality metrics measure factuality? A critical evaluation. arXiv preprint arXiv:2411.16638, 2024.
[17] Andrew M. Bean, Ryan O. Kearns, Angelika Romanou, et al. Measuring what matters: Construct validity in large language model benchmarks. In NeurIPS, 2025.
[18] Jay DeYoung, Eric Lehman, Ben Nye, Iain J. Marshall, and Byron C. Wallace. Evidence inference 2.0: More data, better models. In BioNLP Workshop, 2020.
[19] Yufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, and Prasanna Sattigeri. WikiContradict: A benchmark for evaluating LLMs on real-world knowledge conflicts from Wikipedia. In NeurIPS Datasets and Benchmarks Track, 2024.
[20] Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan S. Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov. MEDIQ: Question-asking LLMs and a benchmark for reliable interactive clinical reasoning. In NeurIPS, 2024.
[21] Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Cunxiang Wang, Shichao Sun, Pengfei Liu, and Yue Zhang. RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation. In NeurIPS Datasets and Benchmarks Track, 2024.
[22] Tal Schuster, Adam Fisch, and Regina Barzilay. Get your Vitamin C! Robust fact verification with contrastive evidence. In NAACL, 2021.
[23] Daniel N. Sosa and Russ B. Altman. Contexts and contradictions: A roadmap for computational drug repurposing with knowledge inference. Briefings in Bioinformatics, 2022.
[24] Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, and Yu Cheng. ConflictBank: A benchmark for evaluating knowledge conflicts in large language models. In NeurIPS Datasets and Benchmarks Track, 2024.
[25] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and verification. In NAACL, 2018.
[26] David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. Fact or fiction: Verifying scientific claims. In EMNLP, 2020.
[27] David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Iz Beltagy, Lucy Lu Wang, and Hannaneh Hajishirzi. SciFact-Open: Towards open-domain scientific claim verification. In Findings of EMNLP, 2022.
[28] Bowen Wang, Jiuyang Chang, Yiming Qian, Guoxin Chen, Junhao Chen, Zhouqiang Jiang, Jiahao Zhang, Yuta Nakashima, and Hajime Nagahara. DiReCT: Diagnostic reasoning for clinical notes via large language models. In NeurIPS Datasets and Benchmarks Track, 2024.
[29] Kevin Wu, Eric Wu, and James Zou. ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence. In NeurIPS Datasets and Benchmarks Track, 2024.
[30] Boya Zhang, Alban Bornet, Rui Yang, Nan Liu, and Douglas Teodoro. HealthContradict: Evaluating biomedical knowledge conflicts in language models. npj Digital Medicine, 2025.
