BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts
Pith reviewed 2026-07-04 20:09 UTC · model glm-5.2
The pith
Article-disjoint splits expose memorization in contradiction benchmarks
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper's central discovery is that article-level overlap in contradiction benchmarks inflates model scores by a measurable amount — roughly 12 F1 points for contextual-contradiction detection — and that this inflation is invisible under the pair-level deduplication that most benchmarks use. By constructing an article-disjoint split with zero article, claim, or pair overlap across train, dev, and test, the authors show that a fine-tuned BiomedBERT-based reference model retains only 0.401 contextual-F1 (down from 0.521), while a zero-shot Mistral-7B model that was never fine-tuned reaches 0.389 on the same split, nearly matching the fine-tuned model. This convergence suggests that much of a
What carries the argument
The central mechanism is the article-disjoint split construction: connected components of articles (linked through any shared claim pair) are kept intact within a single split, and pair_id duplicates are resolved in favour of LLM-labelled copies. This guarantees zero article, claim, or pair overlap across train, dev, and test. The secondary mechanism is the structured multi-output task formulation: instead of a single contradiction label, each claim pair requires four outputs (conflict type, divergence axes, dominant confounder, reconciliation explanation), which makes the 'why' of disagreement directly measurable rather than collapsed into a binary.
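The split construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the 80/10/10 target shares, and the greedy assignment rule are assumptions, and the pair_id duplicate resolution step is not covered. Articles are linked whenever a claim pair draws on both, and each connected component is assigned to a single split as a unit.

```python
from collections import defaultdict


def article_components(pairs):
    """Group articles into connected components.

    `pairs` is an iterable of (article_a, article_b) tuples, one per claim
    pair. Two articles end up in the same component if any chain of claim
    pairs links them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for article_a, article_b in pairs:
        root_a, root_b = find(article_a), find(article_b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for article in list(parent):
        groups[find(article)].add(article)
    return list(groups.values())


def assign_splits(components, targets=None):
    """Assign whole components to splits (hypothetical greedy rule).

    Largest components go first, each to the split that is currently
    furthest below its target share. The target shares are an assumption.
    """
    targets = targets or {"train": 0.8, "dev": 0.1, "test": 0.1}
    total = sum(len(c) for c in components)
    counts = {name: 0 for name in targets}
    assignment = {}
    for component in sorted(components, key=lambda c: (-len(c), sorted(c))):
        split = max(targets, key=lambda name: targets[name] * total - counts[name])
        counts[split] += len(component)
        for article in component:
            assignment[article] = split
    return assignment


pairs = [("A", "B"), ("B", "C"), ("D", "E"), ("F", "G")]
splits = assign_splits(article_components(pairs))
# Every claim pair has both of its articles in the same split.
print(all(splits[a] == splits[b] for a, b in pairs))
```

Because components are never divided, no article can appear in two splits. On very small inputs a split may end up empty, so a real pipeline would need a balancing step on top of this.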
If this is right
- Benchmark designers for scientific claim verification should audit article-level overlap in their splits, since pair-level deduplication alone can leave over 90% of test rows sharing an article with training data.
- Models that perform well on coarse contradiction classification may still fail at recovering the contextual axes that explain divergence, meaning that flat F1 scores overstate a model's biomedical reasoning capability.
- The finding that gold divergence axes lift dominant-confounder accuracy from 0.561 to 0.690 suggests that axis prediction quality is the current ceiling for structured contradiction reasoning, not the downstream classification heads.
- Practitioners evaluating contextual-contradiction capability should report on article-disjoint splits and include a non-annotator-family zero-shot baseline to avoid annotator bias in scores.
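The overlap audit recommended above can be approximated with a few lines of Python. This is a sketch under assumptions: the row fields `article_a` and `article_b` are hypothetical names for the two source articles of a claim pair, not the benchmark's actual schema.

```python
def article_overlap_rate(train_rows, test_rows):
    """Fraction of test rows sharing at least one article with training.

    Each row is a dict with hypothetical keys "article_a" and "article_b".
    """
    train_articles = set()
    for row in train_rows:
        train_articles.update((row["article_a"], row["article_b"]))
    if not test_rows:
        return 0.0
    leaked = sum(
        1
        for row in test_rows
        if row["article_a"] in train_articles or row["article_b"] in train_articles
    )
    return leaked / len(test_rows)


train = [{"article_a": "A1", "article_b": "A2"}]
test = [
    {"article_a": "A2", "article_b": "A9"},  # shares A2 with train
    {"article_a": "A7", "article_b": "A8"},  # no shared article
]
print(article_overlap_rate(train, test))
```

A pair-level deduplication check would pass on this toy data, since no pair repeats, while the article-level rate is one half.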
Load-bearing premise
All benchmark labels were produced by a single LLM annotator (Qwen2.5-7B) with no human-adjudicated gold subset for validation. A cross-family sensitivity check with Llama-3.1-8B yielded only fair agreement (Cohen's κ = 0.20), and the authors explicitly state this does not validate the labels. If the annotator systematically misrepresents the structure of contextual contradiction, all downstream model scores measure agreement with that annotator's labeling behaviour rather
What would settle it
If a human-adjudicated gold subset showed that Qwen2.5-7B's labels for contextual contradiction diverge substantially from expert biomedical judgement — particularly on the divergence-axis assignments and dominant-confounder selections — then all model scores on the benchmark would measure agreement with the annotator rather than contextual reasoning, and the 12-point article-disjoint drop would be an artifact of annotator consistency patterns rather than a genuine memorization signal.
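Comparing the annotator's labels with a human-adjudicated subset would typically use a chance-corrected agreement statistic such as Cohen's κ, the same statistic quoted for the cross-family check. A minimal sketch is given here; the label strings in the example are placeholders, not the paper's taxonomy.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; returning 1.0
        # is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


annotator = ["x", "x", "y", "y"]
expert = ["x", "y", "y", "y"]
print(cohens_kappa(annotator, expert))
```

In this toy case observed agreement is 0.75 and chance agreement is 0.5, which gives κ = 0.5.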
Original abstract
Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting can make both claims locally valid. Existing NLI and scientific claim-verification benchmarks reduce such cases to entailment, contradiction, or neutral, failing to capture the contextual structure behind divergence. To address this, we introduce BioDivergence, an evaluation framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair: conflict type, divergence axes, dominant confounder, and reconciliation explanation. We release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, alongside a legacy deduplicated variant for comparison. Results show notable ranking differences between the two variants, with the fine-tuned reference model dropping about 12 points under the article-disjoint setting, while Mistral-7B-Instruct-v0.3 achieves 0.5523 accuracy and 0.3894 contextual-F1 on the 842-example primary test set. BioDivergence offers a more faithful way to distinguish contextual divergence from direct contradiction and to separate article-level memorization from genuine task learning.
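The four structured outputs per claim pair can be pictured as a small record type. The sketch below is illustrative only: the class name, field names, and label strings are hypothetical, and the set-overlap F1 shown for divergence axes is a generic multi-label score, not necessarily the metric the paper reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairAnnotation:
    """One claim pair's structured outputs (field names are assumptions)."""

    conflict_type: str          # one of the six conflict classes
    divergence_axes: frozenset  # subset of the 13-axis ontology
    dominant_confounder: str    # the axis judged most responsible
    reconciliation: str         # free-text explanation


def axis_set_f1(gold, predicted):
    """Set-overlap F1 between gold and predicted divergence axes."""
    overlap = len(gold.divergence_axes & predicted.divergence_axes)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted.divergence_axes)
    recall = overlap / len(gold.divergence_axes)
    return 2 * precision * recall / (precision + recall)


gold = PairAnnotation(
    conflict_type="contextual_divergence",  # placeholder label
    divergence_axes=frozenset({"cohort", "geography"}),
    dominant_confounder="cohort",
    reconciliation="The two studies enrolled different populations.",
)
pred = PairAnnotation(
    conflict_type="contextual_divergence",
    divergence_axes=frozenset({"cohort"}),
    dominant_confounder="cohort",
    reconciliation="Cohorts differ.",
)
print(axis_set_f1(gold, pred))
```

Scoring the axes separately from the conflict type is what lets a benchmark of this shape distinguish a model that labels the conflict correctly from one that also recovers why the findings diverge.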
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper introduces BioDivergence, a benchmark and evaluation framework for analyzing contextual contradictions in biomedical abstracts. The framework includes a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured prediction targets per claim pair (conflict type, divergence axes, dominant confounder, reconciliation explanation). The authors release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, with labels generated by Qwen2.5-7B. A legacy pair-level-deduplicated variant is retained for comparison. The central empirical claim is that article-disjoint evaluation separates article-level memorization from genuine task learning, evidenced by a ~12-point contextual-contradiction F1 drop for the fine-tuned reference model (0.521 → 0.401) when moving from the legacy to the article-disjoint split. The paper also reports a non-Qwen zero-shot baseline (Mistral-7B-v0.3) as an annotator-family-independent check.
Significance. The paper addresses a genuine gap: existing NLI and claim-verification benchmarks collapse context-conditioned divergence into flat contradiction labels, and none evaluate whether models can recover *why* two findings diverge. The six-class taxonomy and 13-axis ontology are well-motivated, and the article-disjoint split design is a meaningful methodological contribution. The authors deserve credit for releasing code, data, and provenance artifacts, for conducting a three-seed reference-model protocol, for including a non-Qwen annotator-family baseline, and for being transparent about silver-label limitations (only 2/6 classes reliably populated, κ=0.20 cross-annotator agreement). The construction pipeline is large-scale (202,180 abstracts, 527,907 claims) and reproducible. However, the load-bearing empirical claim about memorization separation has a confound that undermines the paper's central positioning, as detailed below.
major comments (1)
- §7.2 and Appendix M.5: The paper attributes the reference model's ~12-point contextual-contradiction F1 drop (0.521 → 0.401) to article-level memorization, stating this is 'exactly what a leakage-aware benchmark should produce when article memorisation and contextual reasoning are separated.' However, Table 4 shows that the zero-shot Mistral-7B-v0.3 model, which is never fine-tuned and therefore cannot memorize training articles, experiences a nearly identical drop: Ctx F1 falls from 0.5019 (legacy, n=2500) to 0.3894 (primary, n=842), a drop of ~11.25 points. Since a zero-shot model's score difference between two test sets can only reflect test-set composition differences (different sizes, different label distributions; Appendix M.5 notes that 3 of 6 conflict classes have n_test=0 in the primary test set), the reference model's ~12-point drop is almost entirely explained by these same composition differences, not by memorization.
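The arithmetic behind this comment can be checked directly from the four reported scores. The sketch below treats the zero-shot model's drop as an estimate of the composition effect and subtracts it from the fine-tuned model's drop. That subtraction assumes composition changes affect both models equally, which is a simplification made here, not a claim from the paper.

```python
def f1_drop_points(legacy, primary):
    """Drop from the legacy to the primary test set, in F1 points."""
    return (legacy - primary) * 100.0


reference_drop = f1_drop_points(0.521, 0.401)    # fine-tuned reference model
zero_shot_drop = f1_drop_points(0.5019, 0.3894)  # zero-shot Mistral-7B-v0.3
residual = reference_drop - zero_shot_drop       # drop not shared by both models

print(round(reference_drop, 2), round(zero_shot_drop, 2), round(residual, 2))
```

Under that assumption, less than one F1 point of the reference model's drop is left over once the shared drop is removed. No significance test is implied.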
minor comments (7)
- §5.2: The pilot comparison (Table 27) uses Qwen2.5-72B while the full release uses Qwen2.5-7B. Table 28 shows the 7B annotator assigns a dominant confounder on 100% of pairs vs. 35% for 72B, and produces 20 unique axis strings (some off-schema) vs. 12. This is a substantial quality gap that should be discussed more prominently in the main text rather than buried in Appendix I.1.
- Table 2: The reference model's primary-split accuracy (0.693) is higher than its legacy-split accuracy (0.588), which seems counterintuitive given the memorization narrative. The paper explains this in Appendix M.5 (sparse classes removed), but this should be acknowledged in §7.2 to avoid confusion.
- §4: The 'unknown_latent_factor' axis is essentially a catch-all for unexplained disagreement. Its inclusion in the ontology should be justified more carefully, as it risks becoming a dumping ground for annotation failures.
- Table 18: The training split contains 1,839 'underspecified_apparent_contradiction' examples while dev and test contain only 6 and 11 respectively. This extreme asymmetry is noted but its impact on model training (the model sees many such examples in training but virtually none at test time) should be discussed.
- §7.5: The confidence-filtered slice (591 examples) is described as using a 'provenance-based confidence proxy' but the proxy itself is not defined. Clarify what this proxy is.
- The paper uses both 'BioDivergence' and 'ConflictTopology' in dataset names/URLs (e.g., HuggingFace: EliasHossain/ConflictTopology-Silver-v1.0). Consistent naming would help discoverability.
- Several references are incomplete: 'Anonymous [2024]' for the factuality metrics paper, and some NeurIPS citations lack page numbers or DOIs.
Simulated Author's Rebuttal
We thank the referee for a careful and substantive reading of our work. The referee correctly identifies a confound in our central empirical claim about memorization separation. We agree this requires revision and explain below how we will address it.
Point-by-point responses
- Referee: §7.2 and Appendix M.5: The paper attributes the reference model's ~12-point contextual-contradiction F1 drop (0.521 → 0.401) to article-level memorization, but the zero-shot Mistral-7B-v0.3 model, which is never fine-tuned and cannot memorize training articles, experiences a nearly identical drop (0.5019 → 0.3894, ~11.25 points). Since a zero-shot model's score difference between two test sets can only reflect test-set composition differences (different sizes, different label distributions; 3 of 6 conflict classes have n_test=0 in the primary test set), the reference model's ~12-point drop is almost entirely explained by these same composition differences, not by memorization.
Authors: The referee is correct, and we acknowledge this confound without reservation. The core issue is straightforward: if a zero-shot model that cannot memorize training articles exhibits a drop of similar magnitude (~11.25 points) to the fine-tuned reference model (~12 points) when moving from the legacy test set to the primary test set, then the majority of the reference model's drop is attributable to test-set composition differences rather than to article-level memorization. We cannot honestly claim that the ~12-point delta 'cleanly measures how much of the legacy pair-level score came from article-level overlap rather than task learning,' as we currently state in Appendix M.5. The referee has identified a genuine logical flaw in our interpretation. We will revise the manuscript as follows: (1) We will remove the claim that the F1 drop is 'attributable to article-level overlap' and the statement that the pattern is 'exactly what a leakage-aware benchmark should produce when article memorisation and contextual reasoning are separated.' (2) We will add an explicit discussion of the composition confound, noting that the primary test set differs from the legacy test set in size (842 vs. 2,500), label distribution (3 of 6 classes have n_test=0 in the primary split), and class balance, and that these differences alone can account for most of the observed F1 change for both models. (3) We will reframe the contribution of the article-disjoint split: its value is that it eliminates a known leakage vector (documented in Table 1: 93.6% of legacy test rows have at least one abstract in train), not that the observed F1 delta isolates memorization from composition effects. The split is still methodologically necessary — a benchmark that allows 93.6% article overlap cannot support clean revision: no
- We cannot disentangle the memorization effect from the test-set composition effect using the current experimental design. To cleanly isolate memorization, one would need to evaluate the reference model trained on the legacy split (with article overlap) on both the legacy test set and a composition-matched article-disjoint test set, or alternatively construct a test set with the same label distribution as the legacy test set but with zero article overlap. Neither experiment was run, and we cannot retroactively attribute the F1 delta to memorization with the data currently in the paper.
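One way to approach the composition-matched comparison described above is to subsample one test set so that its label mix follows the other's. The sketch below is hypothetical: the `label` field name, the rounding rule, and the choice to raise an error on an under-populated class are assumptions for illustration.

```python
import random
from collections import defaultdict


def composition_matched_sample(rows, target_proportions, size, seed=0):
    """Draw `size` rows whose label mix follows `target_proportions`.

    `rows` are dicts with a hypothetical "label" key; `target_proportions`
    maps each label to its desired share of the sample.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    rng = random.Random(seed)
    sample = []
    for label, share in target_proportions.items():
        wanted = round(share * size)
        pool = by_label.get(label, [])
        if wanted > len(pool):
            raise ValueError(
                f"not enough rows with label {label!r}: need {wanted}, have {len(pool)}"
            )
        sample.extend(rng.sample(pool, wanted))
    return sample


rows = [{"label": "a"}] * 6 + [{"label": "b"}] * 4
matched = composition_matched_sample(rows, {"a": 0.5, "b": 0.5}, size=4)
print(len(matched))
```

The error path matters for this benchmark: with 3 of 6 classes at n_test=0 in the primary test set, matching the legacy distribution on the primary side is not possible, so matching would have to go the other way, subsampling the legacy test set to the primary set's label mix.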
Circularity Check
No significant circularity found; the paper's derivation chain is self-contained and transparent about construction artifacts.
Full rationale
The paper's central empirical claim—the ~12-point contextual-F1 drop under article-disjoint evaluation—is a comparison between two evaluation settings on the same labeled pool, not a quantity defined in terms of itself. The silver labels are produced by Qwen2.5-7B (§5.2), the reference model is trained on those labels, and evaluated on held-out labels from the same annotator—this is standard supervised benchmark methodology, not circularity. The paper explicitly acknowledges the Qwen3-8B/Qwen2.5 annotator-family overlap and provides Mistral-7B-v0.3 as an annotator-family-independent reference (Tables 2, 4, 5). Construction-verification artifacts (template baseline slot-overlap = 0.9956 in Table 7; lexical span heuristic dominance in Table 8) are transparently flagged as construction floors rather than capability measurements (§7.3, Appendix D). No self-citations appear in the reference list; the framework's premises rest on external benchmarks (FEVER, SciFact, VitaminC) and the paper's own construction pipeline. The skeptic's concern that the ~12-point drop is confounded by test-set differences (Mistral also drops ~11 points) is a correctness/interpretation risk, not a circularity issue—the paper is not defining a quantity in terms of its own inputs. The minor score of 1 reflects that Qwen3-8B results are still reported alongside the independent Mistral baseline despite the acknowledged family overlap, but this is not load-bearing for the central claim since the headline numbers anchor to the Mistral comparison and the article-disjoint primary split (Table 2).
Axiom & Free-Parameter Ledger
free parameters (5)
- λ_ct, λ_ax, λ_dc (task loss weights) = 1.0, 1.0, 0.3
- Confidence threshold for heuristic top-up = 0.45
- Candidate-pair mining score weights = 0.35, 0.35, 0.15, 0.10, p_year
- Semantic similarity threshold for pair retention = 0.40
- Combined score threshold for pair retention = 0.25
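To make the role of these free parameters concrete, the sketch below wires three of them into code. Several things here are assumptions rather than facts from the paper: that the task loss is a plain weighted sum, that the subscripts ct, ax, and dc refer to the conflict-type, divergence-axes, and dominant-confounder heads, and that the two retention thresholds are inclusive and applied jointly.

```python
# Values taken from the ledger above; how they combine is assumed.
LAMBDA_CT, LAMBDA_AX, LAMBDA_DC = 1.0, 1.0, 0.3
MIN_SEMANTIC_SIMILARITY = 0.40
MIN_COMBINED_SCORE = 0.25


def total_loss(loss_ct, loss_ax, loss_dc):
    """Weighted sum of the three per-head losses (assumed form)."""
    return LAMBDA_CT * loss_ct + LAMBDA_AX * loss_ax + LAMBDA_DC * loss_dc


def keep_pair(semantic_similarity, combined_score):
    """Retain a candidate pair only if it clears both thresholds (assumed rule)."""
    return (
        semantic_similarity >= MIN_SEMANTIC_SIMILARITY
        and combined_score >= MIN_COMBINED_SCORE
    )


print(total_loss(1.0, 2.0, 3.0))
print(keep_pair(0.40, 0.25), keep_pair(0.39, 0.90))
```

With these weights the third loss term contributes at less than a third of the weight of the other two, so under the assumed weighted-sum form it has the least influence on training.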
axioms (4)
- ad hoc to paper: Qwen2.5-7B silver labels are sufficiently accurate to serve as evaluation targets for contextual contradiction analysis
- domain assumption: Rule-based claim extraction produces claim-bearing sentences suitable for contradiction analysis
- domain assumption: The 13-axis divergence ontology is complete for biomedical contextual divergence
- domain assumption: Article-disjoint splitting eliminates memorization as a confound for contextual reasoning evaluation
invented entities (1)
- unknown_latent_factor axis: no independent evidence
Reference graph
Works this paper leans on
- [1] FEVER: a Large-scale Dataset for Fact Extraction and VERification. NAACL.
- [2] Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence. NAACL.
- [3]
- [4] SCIFACT-OPEN: Towards Open-domain Scientific Claim Verification. Findings of EMNLP.
- [5] Evidence Inference 2.0: More Data, Better Models. BioNLP Workshop.
- [6] Contexts and Contradictions: A Roadmap for Computational Drug Repurposing with Knowledge Inference. Briefings in Bioinformatics.
- [7] HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models. npj Digital Medicine.
- [8] ClashEval: Quantifying the Tug-of-War Between an LLM's Internal Prior and External Evidence. NeurIPS Datasets and Benchmarks Track.
- [9] ConflictBank: A Benchmark for Evaluating Knowledge Conflicts in Large Language Models. NeurIPS Datasets and Benchmarks Track.
- [10] WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia. NeurIPS Datasets and Benchmarks Track.
- [11] MEDIQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning. NeurIPS.
- [12] DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models. NeurIPS Datasets and Benchmarks Track.
- [13] RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation. NeurIPS Datasets and Benchmarks Track.
- [14] Measuring What Matters: Construct Validity in Large Language Model Benchmarks. NeurIPS.
- [16] Anonymous. Do automatic factuality metrics measure factuality? a critical evaluation. arXiv preprint arXiv:2411.16638, 2024.
- [17] Andrew M. Bean, Ryan O. Kearns, Angelika Romanou, et al. Measuring what matters: Construct validity in large language model benchmarks. In NeurIPS, 2025.
- [18] Jay DeYoung, Eric Lehman, Ben Nye, Iain J. Marshall, and Byron C. Wallace. Evidence inference 2.0: More data, better models. In BioNLP Workshop, 2020.
- [19] Yufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, and Prasanna Sattigeri. Wikicontradict: A benchmark for evaluating llms on real-world knowledge conflicts from wikipedia. In NeurIPS Datasets and Benchmarks Track, 2024.
- [20] Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan S. Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov. Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning. In NeurIPS, 2024.
- [21] Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Cunxiang Wang, Shichao Sun, Pengfei Liu, and Yue Zhang. Ragchecker: A fine-grained framework for diagnosing retrieval-augmented generation. In NeurIPS Datasets and Benchmarks Track, 2024.
- [22] Tal Schuster, Adam Fisch, and Regina Barzilay. Get your vitamin c! robust fact verification with contrastive evidence. In NAACL, 2021.
- [23] Daniel N. Sosa and Russ B. Altman. Contexts and contradictions: A roadmap for computational drug repurposing with knowledge inference. Briefings in Bioinformatics, 2022.
- [24] Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, and Yu Cheng. Conflictbank: A benchmark for evaluating knowledge conflicts in large language models. In NeurIPS Datasets and Benchmarks Track, 2024.
- [25] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification. In NAACL, 2018.
- [26] David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. Fact or fiction: Verifying scientific claims. In EMNLP, 2020.
- [27] David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Iz Beltagy, Lucy Lu Wang, and Hannaneh Hajishirzi. Scifact-open: Towards open-domain scientific claim verification. In Findings of EMNLP, 2022.
- [28] Bowen Wang, Jiuyang Chang, Yiming Qian, Guoxin Chen, Junhao Chen, Zhouqiang Jiang, Jiahao Zhang, Yuta Nakashima, and Hajime Nagahara. Direct: Diagnostic reasoning for clinical notes via large language models. In NeurIPS Datasets and Benchmarks Track, 2024.
- [29] Kevin Wu, Eric Wu, and James Zou. Clasheval: Quantifying the tug-of-war between an llm's internal prior and external evidence. In NeurIPS Datasets and Benchmarks Track, 2024.
- [30] Boya Zhang, Alban Bornet, Rui Yang, Nan Liu, and Douglas Teodoro. Healthcontradict: Evaluating biomedical knowledge conflicts in language models. npj Digital Medicine, 2025.