pith. machine review for the scientific record.

arxiv: 2604.16872 · v1 · submitted 2026-04-18 · 💻 cs.DL

Recognition: unknown

Do Large Language Models know Which Published Articles have been Retracted?


Pith reviewed 2026-05-10 07:01 UTC · model grok-4.3

classification 💻 cs.DL
keywords large language models · retracted articles · literature search · academic publishing · AI reliability · retraction detection

The pith

Large language models usually fail to recognize retracted articles when given only their titles and abstracts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests three offline large language models on 161 high-profile retracted articles plus a benchmark of 34,070 non-retracted articles. Models are asked whether each article has been retracted, using titles and abstracts as input. In more than 80 percent of retracted cases the models state that no retraction occurred, and even their correct answers often rest on inaccurate reasoning. For non-retracted articles the models make very few false retraction claims. The findings show that offline LLMs cannot reliably separate valid from retracted studies without web access, which matters because these models are already used for literature searches and summaries.

Core claim

Based on titles and abstracts, offline LLMs claim that retracted articles have not been retracted in over 80 percent of cases (GPT OSS 120B: 82 percent; Gemma 3 27B: 84 percent; DeepSeek R1 72B: 88 percent). Reasons given for correct retraction declarations are often wrong. On the benchmark of 34,070 non-retracted articles there are only 55 false retraction claims with full text and 28 with title and abstract alone. This indicates that LLMs have little ability to distinguish valid from retracted studies unless they check online.

What carries the argument

Direct prompting of offline LLMs with titles and abstracts of known retracted and non-retracted articles, followed by measurement of correct retraction detections and false retraction claims on valid papers.
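
The paper does not publish its exact prompts or its rule for classifying model answers (see the referee's first minor comment below), so the following Python sketch is only an illustration of this kind of protocol; the endpoint, prompt wording, and YES/NO decision rule are all assumptions, not the authors' method.

```python
import requests

# Assumed local OpenAI-compatible endpoint for offline open-weight models
# (the API shape served by llama.cpp, vLLM, or Ollama); not the paper's setup.
API_URL = "http://localhost:8000/v1/chat/completions"

# Illustrative prompt only; the paper's exact wording is not stated.
PROMPT = (
    "Has the following article been retracted? Answer YES or NO first, "
    "then give your reasons.\n\nTitle: {title}\n\nAbstract: {abstract}"
)

def ask_retraction_status(model: str, title: str, abstract: str) -> str:
    """Ask one offline model whether one article has been retracted."""
    resp = requests.post(API_URL, json={
        "model": model,
        "messages": [{"role": "user",
                      "content": PROMPT.format(title=title, abstract=abstract)}],
        "temperature": 0,  # deterministic output makes scoring repeatable
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def claims_retraction(answer: str) -> bool:
    """Toy decision rule: count an answer as a retraction claim if it
    leads with YES. The paper's actual classification rule is not given."""
    return answer.strip().upper().startswith("YES")
```

Scoring is then a comparison of claims_retraction(...) against the known retraction status of each article: misses on the 161 retracted articles give the failure rates in the core claim, and hits on the 34,070 non-retracted articles give the false retraction claims.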

If this is right

  • Users should verify retraction status independently when using LLM outputs on academic papers.
  • LLMs carry a low risk of wrongly dismissing valid studies as retracted.
  • Offline LLMs are unsuitable for literature reviews that might include retracted work.
  • New reasons exist to be cautious about any LLM statements concerning academic findings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that can query the web in real time could check retraction databases and improve detection rates.
  • Performance on high-profile cases may be better than on obscure retractions, so the problem could be larger in practice.
  • Specialized tools that automatically cross-reference LLM answers against retraction lists would reduce the risk (see the sketch after this list).
  • The same knowledge gap may exist for other time-sensitive facts about the scientific record.
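
As a concrete version of the third bullet, here is a minimal Python sketch that checks DOIs cited in an LLM answer against a locally downloaded retraction list. The file path and the OriginalPaperDOI column name assume the Retraction Watch CSV export distributed via Crossref; both are assumptions to adjust for whatever dataset is actually used.

```python
import csv

def load_retracted_dois(path: str) -> set[str]:
    """Load a local snapshot of a retraction dataset into a DOI lookup.
    Column name assumes the Retraction Watch CSV; verify against the file."""
    retracted = set()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            doi = (row.get("OriginalPaperDOI") or "").strip().lower()
            if doi:
                retracted.add(doi)
    return retracted

def flag_retracted(cited_dois: list[str], retracted: set[str]) -> list[str]:
    """Return every cited DOI that appears in the retraction list."""
    return [d for d in cited_dois if d.strip().lower() in retracted]

# Hypothetical usage: screen the references an LLM produced.
# retracted = load_retracted_dois("retraction_watch.csv")
# print(flag_retracted(["10.1016/s0140-6736(97)11096-0"], retracted))
```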

Load-bearing premise

Results from these 161 high-profile retracted articles and the 34,070-article benchmark generalize to all retracted papers, and the tested offline models represent typical LLM use without web access.

What would settle it

A new test set of retracted articles where the same offline LLMs correctly identify a majority as retracted when given only titles and abstracts.

Original abstract

Large Language Models (LLMs) can be helpful for literature search and summarisation, but retracted articles can confuse them. This article asks three open weights (offline) LLMs whether 161 high profile retracted articles had been retracted, performing a similar check for a benchmark multidisciplinary set of 34,070 non-retracted articles. Based on titles and abstracts, in over 80% of cases the LLMs claimed that a retracted article had not been retracted (GPT OSS 120B: 82%; Gemma 3 27B: 84%; DeepSeek R1 72B: 88%). The reasons given for a correct retraction declaration were often wrong, even if detailed. This confirms that LLMs have little ability to distinguish between valid and retracted studies, unless they are allowed to, and do, check online. For the benchmark test, there were only 55 false retraction claims from 34,070 non-retracted full text articles, and 28 false claims when only the title and abstract were entered, suggesting that there is only a small chance that LLMs discount valid studies. When retractions are erroneously claimed, this does not seem to be due to mistakes in the article. Overall, the results give new reasons to be cautious about LLM claims about academic findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript evaluates three offline open-weight LLMs (GPT OSS 120B, Gemma 3 27B, DeepSeek R1 72B) on their ability to detect retractions. Given titles and abstracts, the models are tested on 161 high-profile retracted articles and a benchmark of 34,070 non-retracted articles. The LLMs incorrectly state that retracted articles have not been retracted in 82–88% of cases; even when they correctly flag a retraction, the stated reasons are often erroneous. False-positive retraction claims on non-retracted articles are rare (55/34,070 for full text; 28 for title+abstract only). The authors conclude that LLMs lack reliable internal knowledge of retractions and require online verification.

Significance. If the core empirical pattern holds, the work supplies direct evidence that current LLMs cannot be trusted to filter retracted literature during summarization or search tasks, thereby supporting stronger cautionary guidance for LLM use in academic workflows. The large non-retracted benchmark set yields a precise, low false-positive rate that usefully bounds one side of the error profile; this quantitative grounding is a clear strength of the study.
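
To make that bound concrete: 55 false claims out of 34,070 is a rate of about 0.16 percent. A minimal sketch of the interval arithmetic (a standard Wilson score interval, not a calculation from the paper):

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# False retraction claims on the 34,070-article non-retracted benchmark.
for label, k in [("full text", 55), ("title+abstract", 28)]:
    lo, hi = wilson_ci(k, 34_070)
    print(f"{label}: {k}/34,070 = {k/34_070:.4%}, 95% CI [{lo:.4%}, {hi:.4%}]")
```

Even the upper ends of these intervals stay below about 0.21 percent, which is what makes the false-positive side of the error profile well bounded.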

major comments (1)
  1. [Methods and Results sections describing the retracted-article corpus] The central claim that LLMs have 'little ability to distinguish between valid and retracted studies' rests on results from a curated set of 161 high-profile retracted articles. High-profile cases are the subset most likely to appear in training data or public discussion, yet the models still fail >80% of the time. Without a random or stratified sample drawn from the full population of retractions (e.g., the Retraction Watch database), it is impossible to determine whether the observed failure rate is representative or whether performance differs for lower-visibility retractions. This limits the scope of the general conclusion.
minor comments (2)
  1. [Methods] The exact prompts used to elicit retraction judgments and the precise decision rule for classifying a model output as 'claiming retraction' are not stated; providing these would improve reproducibility.
  2. [Abstract and Methods] Model names such as 'GPT OSS 120B' should be accompanied by version identifiers or repository links to avoid ambiguity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and recommendation. We respond to the single major comment below and have incorporated a partial revision to address the scope limitation.

Point-by-point responses
  1. Referee: The central claim that LLMs have 'little ability to distinguish between valid and retracted studies' rests on results from a curated set of 161 high-profile retracted articles. High-profile cases are the subset most likely to appear in training data or public discussion, yet the models still fail >80% of the time. Without a random or stratified sample drawn from the full population of retractions (e.g., the Retraction Watch database), it is impossible to determine whether the observed failure rate is representative or whether performance differs for lower-visibility retractions. This limits the scope of the general conclusion.

    Authors: We deliberately focused on 161 high-profile retracted articles because these are the cases most likely to have received public attention and thus to appear in LLM training data or discussion; the >80% failure rate even under these favorable conditions supplies direct evidence that current models lack reliable internalized knowledge of retractions. We agree that the sample is not random or stratified from the full Retraction Watch population and that the results cannot be extrapolated to lower-visibility retractions without additional data. We will revise the Discussion to qualify the conclusions as applying specifically to high-profile retractions and to recommend future studies that draw random or stratified samples from the complete retraction database to evaluate broader generalizability.

    revision: partial

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation against ground truth

full rationale

The paper performs a straightforward empirical test: it queries three offline LLMs with titles/abstracts (and full texts for the benchmark) of 161 known retracted articles and 34,070 non-retracted articles, then compares the model outputs to the known retraction status. No equations, parameter fitting, model derivations, or self-citations appear in the load-bearing steps. Claims rest on observed error rates (e.g., >80% failure to detect retractions) versus the external ground truth of retraction records, with no reduction of any result to its own inputs by construction. Generalization concerns exist but are separate from circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities, as this is an empirical benchmarking study without theoretical modeling.

pith-pipeline@v0.9.0 · 5521 in / 1078 out tokens · 45592 ms · 2026-05-10T07:01:50.338640+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 6 canonical work pages

  1. [1]

    Böschen, I. (2021). Software review: The JATSdecoder package—extract metadata, abstract and sectioned text from NISO-JATS coded XML documents; Insights to PubMed Central's open access database. Scientometrics, 126(12), 9585-9601

  2. [2]

    Dai, Y. (2026). Metacognitive Strategy Use in GenAI-Supported Academic Reading: A Qualitative Study of Postgraduate Students in UK Higher Education. Frontiers in Psychology, 17, 1787647

  3. [3]

    Giray, L., Sevnarayan, K., Maphoto, K. B., & Wider, W. (2026). AI Slop in Academic Publishing: History, Characteristics, Manifestations, Causes, and Mitigation Strategies. Internet Reference Services Quarterly, 1-24

  4. [4]

    He, Y., & Bu, Y. (2026). Academic journals' AI policies fail to curb the surge in AI-assisted academic writing. Proceedings of the National Academy of Sciences, 123(9), e2526734123

  5. [5]

    Liao, Z., Antoniak, M., Cheong, I., Cheng, E. Y. Y., Lee, A. H., Lo, K., & Zhang, A. X. (2024). LLMs as research tools: A large scale survey of researchers' usage and perceptions. arXiv preprint arXiv:2411.05025

  6. [6]

    Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35, 17359-17372

  7. [7]

    Mohammadi, E., Thelwall, M., Cai, Y., Collier, T., Tahamtan, I., & Eftekhar, A. (2026). Is generative AI reshaping academic practices worldwide? A survey of adoption, benefits, and concerns. Information Processing & Management, 63(1), 104350

  8. [8]

    Roberts, R. J. (2001). PubMed Central: The GenBank of the published literature. Proceedings of the National Academy of Sciences, 98(2), 381-382

  9. [9]

    Shi, H., Yu, Y., Romero, D. M., & Horvát, E. Á. (2025). The Persistence of Retracted Papers on Wikipedia. arXiv preprint arXiv:2509.18403

  10. [10]

    Thelwall, M., Lehtisaari, M., Katsirea, I., Holmberg, K., & Zheng, E.-T. (2025). Does ChatGPT ignore article retractions and other reliability concerns? Learned Publishing, 38(4), e2018. https://doi.org/10.1002/leap.2018

  11. [11]

    Thelwall, M., & Mohammadi, E. (2026). Can small and reasoning Large Language Models score journal articles for research quality and do averaging and few-shot help? Scientometrics. https://doi.org/10.1007/s11192-026-05585-2

  12. [12]

    Thomas, S. P. (2025). Concerns About Use of Artificial Intelligence (AI) in Literature Searches, Scholarly Writing, and Manuscript Reviews. Issues in Mental Health Nursing, 46(12), 1175-1177

  13. [13]

    Visani Scozzi, M., Makri, S., & Madhyastha, P. (2026, March). "Although Powerful, it's not Infallible": Investigating Academic Researchers' Verification Challenges with LLMs. In Proceedings of the 2026 Conference on Human Information Interaction and Retrieval

  14. [14]

    Vu, C. B., Cummings, J. J., & Park, D. Y. (2026). Student Engagement with ChatGPT for Educational Tasks: Effects of Inoculation Training on Verification Intentions and Behavior. Computers and Education Open, 100335

  15. [15]

    Yang, Y., & Jia, R. (2025). When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction. arXiv preprint arXiv:2505.16170

  16. [16]

    Zheng, E.-T., Fu, H.-Z., Thelwall, M., & Fan, Z. (2026). Can social media provide early warning of retraction? Evidence from critical tweets identified by human annotation and large language models. Journal of the Association for Information Science and Technology, 77(4), 624-639. https://doi.org/10.1002/asi.70028