arxiv: 2604.10990 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.AI

Recognition: unknown

When Verification Fails: How Compositionally Infeasible Claims Escape Rejection

Muxin Liu , Delip Rao , Grace Kim , Chris Callison-Burch

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords scientific claim verificationclosed-world assumptionsalient-constraint checkingcompositional inferenceshortcut reasoningbenchmark constructionmodel evaluation

0 comments

The pith

Models verify scientific claims by checking only the salient constraint, accepting many that violate non-salient ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing benchmarks for scientific claim verification cannot separate models that apply the closed-world assumption to every constraint from those using a simpler shortcut of checking only the most salient constraint. To expose the difference, the authors build new test claims in which the salient constraint is supported by evidence but a non-salient constraint is contradicted. Models that already saturate standard benchmarks consistently accept these invalid claims. The finding holds across model families and prompting methods, and the authors locate different models at different points on the same ROC curve, showing that the gap is one of decision threshold rather than compositional reasoning capacity.

Core claim

Existing benchmarks construct infeasible claims by perturbing a single salient element, so they cannot distinguish rigorous closed-world verification from salient-constraint checking. New compositionally infeasible claims—salient constraint supported, non-salient constraint contradicted—reveal that saturating models over-accept these claims. Model context interventions place families on a shared ROC curve, indicating that verification gaps reflect threshold differences and that the compositional inference bottleneck is structural and resistant to strategy guidance alone.

What carries the argument

Compositionally infeasible claims that keep the salient constraint supported while contradicting a non-salient constraint, used to test whether verification applies the closed-world assumption to all constraints or only the salient one.

If this is right

Models that pass current benchmarks can still accept claims containing contradicted non-salient constraints.
The compositional inference bottleneck persists across prompting strategies and model families.
Differences between models appear as shifts along a shared ROC curve rather than changes in underlying reasoning.
Strategy guidance alone cannot move models off the curve to full compositional verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmark scores may systematically overestimate true verification reliability in any domain where claims contain multiple constraints.
If the bottleneck is structural, training data that explicitly rewards checking every constraint could be required rather than relying on scale or prompting.
The same shortcut pattern could appear in other verification tasks once benchmarks are constructed to hold salient support constant while varying non-salient support.

Load-bearing premise

The newly constructed claims isolate salient-constraint shortcut behavior rather than other model limitations or data artifacts.

What would settle it

Models that systematically reject claims in which any non-salient constraint is contradicted (while still accepting fully supported claims) would show they are not relying on the salient-constraint shortcut.

Figures

Figures reproduced from arXiv: 2604.10990 by Chris Callison-Burch, Delip Rao, Grace Kim, Muxin Liu.

**Figure 2.** Figure 2: Averaged accuracy across models. Compositionally feasible claims score higher (↑) than standard positives; compositionally infeasible claims score lower (↓). Compositional structure shifts behavior toward acceptance. Models that perform strongly on existing benchmarks fail consistently on adversarial examples, as summarized in [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Graph-based adversarial construction. Observation nodes ground directly retrievable facts; [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Scaling behavior across Qwen3 (VL) model sizes for three domains. Curves show accuracy [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Adversarial negative accuracy under four evidence conditions across eight models. Graph-O [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: (a) Verification trajectories in ROC space (TPR vs. FPR) under different prompting conditions for Gemini-3.0-Flash. Arrows indicate the ordering of prompting variants (DCP+OWA → OWA → Baseline → CWA → DCP+CWA); stars denote the Sonnet-4.5 reference. (b) Adversarial negative accuracy under each prompting condition; dashed lines indicate the Sonnet-4.5 reference. A further question is whether this reflects a… view at source ↗

**Figure 7.** Figure 7: Qwen3 model families performance of SCITAB with decoding at different temperatures [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Longer negatives show no improvement or degradation over standard negatives, ruling out [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Illustrative hard-negative generation for NLI4CT. A single graph node is corrupted while the [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

**Figure 10.** Figure 10: Illustrative hard-negative generation for SciTab. The evidence table is shown in full, followed [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗

**Figure 11.** Figure 11: Illustrative hard-negative generation for SciVer. A single interpretation-node corruption [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗

read the original abstract

Scientific claim verification, the task of determining whether claims are entailed by scientific evidence, is fundamental to establishing discoveries in evidence while preventing misinformation. This process involves evaluating each asserted constraint against validated evidence. Under the Closed-World Assumption (CWA), a claim is accepted if and only if all asserted constraints are positively supported. We show that existing verification benchmarks cannot distinguish models enforcing this standard from models applying a simpler shortcut called salient-constraint checking, which applies CWA's rejection criterion only to the most salient constraint and accepts when that constraint is supported. Because existing benchmarks construct infeasible claims by perturbing a single salient element they are insufficient at distinguishing between rigorous claim verification and simple salient-constraint reliance. To separate the two, we construct compositionally infeasible claims where the salient constraint is supported but a non-salient constraint is contradicted. Across model families and modalities, models that otherwise saturate existing benchmarks consistently over-accept these claims, confirming the prevalence of such shortcut reasoning. Via model context interventions, we show that different models and prompting strategies occupy distinct positions on a shared ROC curve, indicating that the gap between model families reflects differences in verification threshold rather than underlying reasoning ability, and that the compositional inference bottleneck is a structural property of current verification behavior that strategy guidance alone cannot overcome.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows models that pass standard claim verification tests still accept cases where a minor constraint is contradicted, but the new test cases may not cleanly separate shortcut behavior from other compositional issues.

read the letter

The core finding is that existing benchmarks for scientific claim verification let models succeed by checking only the most obvious constraint instead of enforcing the full closed-world rule across everything asserted. The authors build compositionally infeasible claims that keep the salient part supported while contradicting a non-salient one, and they report that models which saturate prior tests still accept these at high rates. They also frame model and prompt differences as shifts along a shared ROC curve rather than changes in underlying capability, and they argue that simple strategy guidance does not close the gap.

Referee Report

2 major / 1 minor

Summary. The paper claims that existing scientific claim verification benchmarks cannot distinguish models that enforce the Closed-World Assumption (CWA) across all constraints from those using a salient-constraint checking shortcut. By constructing compositionally infeasible claims (salient constraint supported, non-salient contradicted), the authors show that models saturating prior benchmarks over-accept these claims. Model context interventions are used to argue that family differences reflect verification thresholds on a shared ROC curve rather than reasoning ability, establishing the compositional inference bottleneck as structural and resistant to strategy guidance.

Significance. If the isolation of salient-constraint shortcut behavior holds, the work provides a useful empirical probe into verification failures across model families and modalities, highlighting why prompting alone may not suffice. The construction of new test cases and the ROC-based framing of threshold vs. ability differences are concrete contributions that could inform more robust verification systems.

major comments (2)

[Abstract] Abstract (construction of compositionally infeasible claims): the central claim that over-acceptance specifically demonstrates salient-constraint shortcut reliance (rather than general compositional weakness, misparsing, or generation artifacts) is load-bearing but unsupported by any described validation. No ablation, human probing, or control condition is referenced to confirm that the non-salient constraint is ignored even absent the contradiction or that salience is robust across models.
[Abstract] Abstract (model context interventions and ROC analysis): the assertion that models occupy distinct positions on a shared ROC curve (indicating threshold differences rather than reasoning ability) and that the bottleneck is structural requires the specific interventions, metrics, and statistical tests used; without these details the structural-property conclusion cannot be evaluated and risks conflating threshold tuning with inability to integrate constraints.

minor comments (1)

The abstract would be clearer if it briefly stated the number of models, modalities, and claims tested to convey experimental scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our abstract. We address each major comment below with references to the full manuscript and indicate planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract (construction of compositionally infeasible claims): the central claim that over-acceptance specifically demonstrates salient-constraint shortcut reliance (rather than general compositional weakness, misparsing, or generation artifacts) is load-bearing but unsupported by any described validation. No ablation, human probing, or control condition is referenced to confirm that the non-salient constraint is ignored even absent the contradiction or that salience is robust across models.

Authors: We agree the abstract is highly condensed and does not enumerate the supporting controls. The full manuscript addresses this directly: Section 3.2 details the construction process with explicit salience annotations derived from human raters; Section 4 reports human probing experiments on a 200-claim subset confirming that annotators consistently identify the salient constraint and that models accept claims when only the salient constraint is supported (even without contradiction); Section 3.3 includes ablation controls removing the non-salient contradiction and generation-artifact checks via paraphrasing and reordering. These results show acceptance rates remain high only when the shortcut applies, distinguishing it from general compositional failure. We will revise the abstract to reference these validations concisely. revision: yes
Referee: [Abstract] Abstract (model context interventions and ROC analysis): the assertion that models occupy distinct positions on a shared ROC curve (indicating threshold differences rather than reasoning ability) and that the bottleneck is structural requires the specific interventions, metrics, and statistical tests used; without these details the structural-property conclusion cannot be evaluated and risks conflating threshold tuning with inability to integrate constraints.

Authors: The full paper supplies the requested details. Section 5 describes the model-context interventions (salient-constraint masking, non-salient masking, and full-context baselines) applied uniformly across families. Section 6 presents the ROC analysis using precision-recall curves, with models positioned via their operating points; metrics include AUC and F1 at fixed thresholds, with statistical separation tested via bootstrap confidence intervals and paired Wilcoxon signed-rank tests (p < 0.01). These establish that family differences align with threshold shifts on a common curve rather than distinct reasoning capacities. We will add a single sentence to the abstract summarizing the intervention type and ROC framing to permit direct evaluation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical construction of test cases with direct model evaluation

full rationale

The paper conducts an empirical study by constructing new compositionally infeasible claims (salient constraint supported, non-salient contradicted) and evaluating model acceptance rates across families and modalities. No equations, derivations, fitted parameters, or self-citation chains are present that reduce any claim to its inputs by construction. The central observation—that models over-accept these claims—rests on external benchmark comparisons and context interventions, which are falsifiable via the reported experiments rather than tautological. This matches the default expectation of a non-circular empirical paper self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard NLP assumptions about claim verification and the closed-world assumption; no free parameters, new entities, or ad-hoc axioms are introduced beyond domain conventions.

axioms (1)

domain assumption Closed-World Assumption (CWA) defines claim acceptance as requiring positive support for every asserted constraint
Invoked explicitly in the abstract as the normative standard against which shortcuts are measured.

pith-pipeline@v0.9.0 · 5530 in / 1167 out tokens · 32048 ms · 2026-05-10T16:30:38.274976+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 13 canonical work pages · 6 internal anchors

[1]

Reading and reasoning over chart images for evidence-based automated fact-checking

Mubashara Akhtar, Oana Cocarascu, and Elena Simperl. Reading and reasoning over chart images for evidence-based automated fact-checking. InFindings of the Association for Computational Linguistics: EACL 2023, pp. 399–414,

2023
[2]

Chartcheck: Explainable fact-checking over real-world chart images

Mubashara Akhtar, Nikesh Subedi, Vivek Gupta, Sahar Tahmasebi, Oana Cocarascu, and Elena Simperl. Chartcheck: Explainable fact-checking over real-world chart images. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 13921–13937,

2024
[3]

Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction, 2024.URL https://arxiv. org/abs/2406.11717,

work page internal anchor Pith review arXiv 2024
[4]

Complex claim verification with evidence retrieved in the wild

Jifan Chen, Grace Kim, Aniruddh Sriram, Greg Durrett, and Eunsol Choi. Complex claim verification with evidence retrieved in the wild. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3569–3587,

2024
[5]

ArXivabs/1909.02164(2019),https://api.semanticscholar.org/CorpusID: 1989173392

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. Tabfact: A large-scale dataset for table-based fact verification.arXiv preprint arXiv:1909.02164,

work page arXiv 1909
[6]

Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases

Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language Processing (EMNLP-IJCNLP), pp. 4069–4082,

2019
[7]

Chain-of-verification reduces hallucination in large language models

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. InFindings of the association for computational linguistics: ACL 2024, pp. 3563–3578,

2024
[8]

Reasoning robustness of llms to adversarial typographical errors

Esther Gan, Yiran Zhao, Liying Cheng, Mao Yancan, Anirudh Goyal, Kenji Kawaguchi, Min- Yen Kan, and Michael Shieh. Reasoning robustness of llms to adversarial typographical errors. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 10449–10459,

2024
[9]

Evaluating models’ local decision boundaries via contrast sets

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. Evaluating models’ local decision boundaries via contrast sets. InFindings of the Association for Computational Linguistics: EMNLP 2020, pp. 1307–1323,

2020
[10]

Annotation artifacts in natural language inference data

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A Smith. Annotation artifacts in natural language inference data. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 107–112,

2018
[11]

Evaluating llms’ mathematical and coding competency through ontology- guided interventions

Pengfei Hong, Navonil Majumder, Deepanway Ghosal, Somak Aditya, Rada Mihalcea, and Soujanya Poria. Evaluating llms’ mathematical and coding competency through ontology- guided interventions. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 22811–22849,

2025
[12]

Large Language Models Cannot Self-Correct Reasoning Yet

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet.arXiv preprint arXiv:2310.01798,

work page internal anchor Pith review arXiv
[13]

Safety tax: Safety alignment makes your large reasoning models less reasonable

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, and Ling Liu. Safety tax: Safety alignment makes your large reasoning models less reasonable. arXiv preprint arXiv:2503.00555,

work page arXiv
[14]

Nli4ct: Multi-evidence natural language inference for clinical trial reports

Mael Jullien, Marco Valentino, Hannah Frost, Paul O’Regan, D´onal Landers, and Andre Freitas. Nli4ct: Multi-evidence natural language inference for clinical trial reports. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 16745–16764,

2023
[15]

Semeval-2024 task 2: Safe biomedical natural language inference for clinical trials

Ma¨el Jullien, Marco Valentino, and Andr´e Freitas. Semeval-2024 task 2: Safe biomedical natural language inference for clinical trials. InProceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pp. 1947–1962,

2024
[16]

Dynabench: Rethinking benchmarking in NLP

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP. In Kristina Toutanova, A...

2021
[17]

Dynabench: Rethinking benchmarking in NLP

Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.324. URL https:// aclanthology.org/2021.naacl-main.324/. Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. Prometheus: Inducing fine-grained evaluation capability in language models. InThe Twe...

work page doi:10.18653/v1/2021.naacl-main.324 2021
[18]

Evaluating verifiability in generative search engines

Nelson F Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines. InFindings of the Association for Computational Linguistics: EMNLP 2023, pp. 7001–7025,

2023
[19]

Scitab: A challeng- ing benchmark for compositional reasoning and claim verification on scientific tables

Xinyuan Lu, Liangming Pan, Qian Liu, Preslav Nakov, and Min-Yen Kan. Scitab: A challeng- ing benchmark for compositional reasoning and claim verification on scientific tables. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7787–7813,

2023
[20]

Chartqapro: A more diverse and challenging benchmark for chart question answering

Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aarya- man Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, et al. Chartqapro: A more diverse and challenging benchmark for chart question answering. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 19123–19151,

2025
[21]

Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L

R Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve.arXiv preprint arXiv:2309.13638,

work page arXiv
[22]

Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12076–12100,

2023
[23]

doi: 10.18653/v1/2020.acl-main.135

Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.135. URL https://aclanthology.org/2020.acl-main. 135/. 12 Preprint. Under review. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst confe...

work page doi:10.18653/v1/2020.acl-main.135 2020
[24]

Xstest: A test suite for identifying exaggerated safety behaviours in large language models

Paul R¨ottger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)...

2024
[25]

Pelican: Correcting hallucination in vision- llms via claim decomposition and program of thought verification

Pritish Sahu, Karan Sikka, and Ajay Divakaran. Pelican: Correcting hallucination in vision- llms via claim decomposition and program of thought verification. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8228–8248,

2024
[26]

Temporal dynamics-aware adversarial attacks on discrete-time dynamic graph models

Kartik Sharma, Rakshit Trivedi, Rohit Sridhar, and Srijan Kumar. Temporal dynamics-aware adversarial attacks on discrete-time dynamic graph models. InProceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, pp. 2023–2035,

2023
[27]

Can vlms actually see and read? a survey on modality collapse in vision-language models

Mong Yuan Sim, Wei Emma Zhang, Xiang Dai, and Biaoyan Fang. Can vlms actually see and read? a survey on modality collapse in vision-language models. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 24452–24470,

2025
[28]

Ai-liedar: Examine the trade-off between utility and truthfulness in llm agents

Zhe Su, Xuhui Zhou, Sanketh Rangreji, Anubha Kabra, Julia Mendelsohn, Faeze Brahman, and Maarten Sap. Ai-liedar: Examine the trade-off between utility and truthfulness in llm agents. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Pape...

2025
[29]

Fever: a large-scale dataset for fact extraction and verification

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 809–819,

2018
[30]

Lost in the Middle: How Language Models Use Long Contexts

doi: 10.1162/tacl a 00475. URL https: //aclanthology.org/2022.tacl-1.31/. Ming Tu, Guangtao Wang, Jing Huang, Yun Tang, Xiaodong He, and Bowen Zhou. Multi- hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. In Anna Korhonen, David Traum, and Llu´ıs M`arquez (eds.),Proceedings of the 57th Annual Meeting of the Assoc...

work page internal anchor Pith review doi:10.1162/tacl 2022
[31]

doi: 10.18653/v1/P19-1260

Association for Computational Linguistics. doi: 10.18653/v1/P19-1260. URLhttps://aclanthology.org/P19-1260/. Hemish Veeraboina. Aime problem set 1983-2024,

work page doi:10.18653/v1/p19-1260 1983
[32]

Juraj Vladika, Phillip Schneider, and Florian Matthes

URL https://www.kaggle.com/ datasets/hemishveeraboina/aime-problem-set-1983-2024. Juraj Vladika, Phillip Schneider, and Florian Matthes. Healthfc: Verifying health claims with evidence-based medical fact-checking. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp...

1983
[33]

Under review

13 Preprint. Under review. Juraj Vladika, Ivana Hacajova, and Florian Matthes. Step-by-step fact verification system for medical claims with explainable reasoning. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 805–816,

2025
[34]

Fact or fiction: Verifying scientific claims

David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. Fact or fiction: Verifying scientific claims. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7534–7550,

2020
[35]

Scifact-open: Towards open-domain scientific claim verification

David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Iz Beltagy, Lucy Lu Wang, and Hannaneh Hajishirzi. Scifact-open: Towards open-domain scientific claim verification. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 4719–4734,

2022
[36]

Sciriff: A resource to enhance language model instruction-following over scientific literature

David Wadden, Kejian Shi, Jacob Morrison, Alan Li, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, et al. Sciriff: A resource to enhance language model instruction-following over scientific literature. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 6083–6120,

2025
[37]

Unveiling confirmation bias in chain-of- thought reasoning

Yue Wan, Xiaowei Jia, and Xiang Lorraine Li. Unveiling confirmation bias in chain-of- thought reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 3788–3804,

2025
[38]

Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks

Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Aky¨urek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human...

2024
[39]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pp. 2369–2380,

2018
[40]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837,

work page internal anchor Pith review arXiv
[41]

Falsereject: A resource for improving contextual safety and mitigating over-refusals in llms via structured reasoning.arXiv preprint arXiv:2505.08054, 2025

Congzhi Zhang, Linhai Zhang, and Deyu Zhou. Causal walk: Debiasing multi-hop fact verification with front-door adjustment. InProceedings of the AAAI conference on artificial intelligence, volume 38, pp. 19533–19541, 2024a. Zhehao Zhang, Jiaao Chen, and Diyi Yang. Darg: Dynamic evaluation of large language models via adaptive reasoning graph.Advances in Ne...

work page arXiv
[42]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Denny Zhou, Nathanael Sch¨arli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models, 2023.URL https://arxiv. org/abs/2205.10625,

work page internal anchor Pith review arXiv 2023
[43]

Variation in Verification: Understanding Verification Dynamics in Large Language Models

14 Preprint. Under review. Yefan Zhou, Austin Xu, Yilun Zhou, Janvijay Singh, Jiang Gui, and Shafiq Joty. Variation in verification: Understanding verification dynamics in large language models.arXiv preprint arXiv:2509.17995,

work page internal anchor Pith review Pith/arXiv arXiv
[44]

and AIME (Veeraboina, 2023). Both datasets require multi-step reasoning where a correct solution must satisfy multiple constraints simultaneously, making them natural analogues to the compositional verification setting in the main paper. We select GPT-4o-mini as the solver because it achieves non-trivial but imperfect accuracy on both tasks, producing a m...

2023