Can Humans Tell? A Dual-Axis Study of Human Perception of LLM-Generated News
Pith reviewed 2026-05-13 17:07 UTC · model grok-4.3
The pith
Humans cannot reliably distinguish LLM-generated news from human-written articles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On continuous rating scales, participants show no statistically significant ability to attribute articles correctly to human or machine sources, and the same pattern holds across every LLM tested. Domain expertise correlates modestly with better performance, political orientation does not, and accuracy falls after repeated judgments.
What carries the argument
JudgeGPT platform that separately measures source attribution (human-machine scale) and authenticity judgment (legitimate-fake scale) on continuous sliders.
If this is right
- End users cannot reliably detect AI-generated news content.
- Domain expertise provides only modest improvement in judgment accuracy.
- Participants exhibit distinct response patterns such as skepticism or belief.
- Judgment accuracy declines after about 30 sequential evaluations due to fatigue.
- Defenses against AI news must rely on system-level methods such as cryptographic provenance rather than user detection.
Where Pith is reading between the lines
- Content platforms may need automated verification tools instead of depending on reader vigilance.
- The fatigue effect from sequential judgments implies real-world detection would be even poorer when people read many articles in one session.
- Indistinguishability even for small open models suggests that widely accessible AI writing tools could make manual checks impractical without additional safeguards.
Load-bearing premise
The news topics, LLM prompts, and online participant sample yield perception judgments that would hold under different real-world conditions, and the rating scales capture genuine perceptual differences rather than interface bias.
What would settle it
A replication using new news topics or a different participant pool that finds statistically significant differences between human and machine attribution scores would falsify the claim.
Original abstract
Can humans tell whether a news article was written by a person or a large language model (LLM)? We investigate this question using JudgeGPT, a study platform that independently measures source attribution (human vs. machine) and authenticity judgment (legitimate vs. fake) on continuous scales. From 2,318 judgments collected from 1,054 participants across content generated by six LLMs, we report five findings: (1) participants cannot reliably distinguish machine-generated from human-written text (p > .05, Welch's t-test); (2) this inability holds across all tested models, including open-weight models with as few as 7B parameters; (3) self-reported domain expertise predicts judgment accuracy (r = .35, p < .001) whereas political orientation does not (r = -.10, n.s.); (4) clustering reveals distinct response strategies ("Skeptics" vs. "Believers"); and (5) accuracy degrades after approximately 30 sequential evaluations due to cognitive fatigue. The answer, in short, is no: humans cannot reliably tell. These results indicate that user-side detection is not a viable defense and motivate system-level countermeasures such as cryptographic content provenance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports results from the JudgeGPT platform, collecting 2,318 judgments from 1,054 participants on continuous scales measuring source attribution (human vs. machine) and authenticity (legitimate vs. fake) for news articles generated by six LLMs. The central findings are that participants cannot reliably distinguish LLM-generated from human-written text (p > .05, Welch's t-test), this holds across models including 7B-parameter open-weight ones, self-reported expertise correlates with accuracy (r = .35, p < .001) while political orientation does not, response strategies cluster into 'Skeptics' vs. 'Believers', and accuracy degrades after ~30 sequential trials due to fatigue. The authors conclude that user-side detection is not viable and motivate system-level countermeasures such as cryptographic provenance.
Significance. If the empirical results hold after addressing measurement validation, the work supplies concrete data on human perception limits for AI-generated news, separating attribution from authenticity judgments and documenting expertise and fatigue effects. This strengthens arguments for technical provenance solutions over reliance on human detection and provides a replicable dual-axis protocol for future studies in the cs.CY and HCI domains.
major comments (3)
- [Findings (1) / Results] Findings (1): The claim that participants 'cannot reliably distinguish' rests on a non-significant Welch's t-test (p > .05). No effect sizes, power analysis, or validation of the continuous scales (e.g., test-retest reliability or forced-choice comparison) are reported, leaving open whether the null arises from genuine indistinguishability or from interface artifacts such as slider anchoring and demand characteristics in JudgeGPT. This directly affects the load-bearing inference that user-side detection is not viable.
- [Methods] Methods: Full details on article selection criteria, exact LLM prompts and generation parameters, and participant exclusion rules are missing. These are required to assess confounds, reproducibility, and whether the null result generalizes beyond the specific news topics and participant pool.
- [Findings (5)] Findings (5): The fatigue claim after ~30 sequential evaluations is reported with a total of 2,318 judgments from 1,054 participants (average ~2.2 per person). Clarification is needed on how many participants completed long sequences, how fatigue was operationalized and statistically controlled, and whether the degradation affects the primary t-test results.
minor comments (2)
- [Abstract] The abstract states results hold 'across all tested models' but does not name the six LLMs or their parameter counts in the summary; this should be added for immediate clarity.
- [Findings (4)] The clustering analysis into 'Skeptics' vs. 'Believers' is mentioned but lacks details on the clustering algorithm, number of clusters selected, and validation metrics; move to a dedicated subsection or supplement.
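For concreteness, the kind of clustering the referee asks about can be sketched as a two-cluster, one-dimensional k-means over per-participant mean ratings. The algorithm, the feature, and the data below are assumptions for illustration only; the paper does not document its clustering method, which is precisely the referee's point.

```python
import random

def kmeans_1d(values, k=2, iters=50, seed=0):
    """Minimal 1-D k-means: assign each value to its nearest centroid,
    recompute centroids as group means, repeat until stable."""
    rng = random.Random(seed)
    centroids = sorted(rng.sample(values, k))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda i: abs(v - centroids[i]))].append(v)
        new = sorted(sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups))
        if new == centroids:
            break
        centroids = new
    return centroids, groups

# Hypothetical per-participant mean "machine-ness" ratings on a 0-1 slider:
# low raters call most text human ("Believers"), high raters call it
# machine ("Skeptics").
scores = [0.12, 0.18, 0.22, 0.25, 0.71, 0.78, 0.81, 0.88]
centroids, groups = kmeans_1d(scores)
# centroids -> one low and one high cluster center, matching the two strategies
```

Reporting the chosen k, a selection criterion (e.g. silhouette score), and cluster stability, as the referee requests, would make such an analysis auditable.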
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify key aspects of our work. We address each major comment point by point below, indicating where we will revise the manuscript.
Point-by-point responses
Referee: [Findings (1) / Results] Findings (1): The claim that participants 'cannot reliably distinguish' rests on a non-significant Welch's t-test (p > .05). No effect sizes, power analysis, or validation of the continuous scales (e.g., test-retest reliability or forced-choice comparison) are reported, leaving open whether the null arises from genuine indistinguishability or from interface artifacts such as slider anchoring and demand characteristics in JudgeGPT. This directly affects the load-bearing inference that user-side detection is not viable.
Authors: We agree that effect sizes and a power analysis would strengthen interpretation of the null result. In the revised manuscript we will report Cohen's d for the Welch's t-test and include a post-hoc power calculation. Regarding scale validation, we will add a limitations discussion addressing potential slider anchoring and demand characteristics, while noting that the dual-axis design (separate attribution and authenticity judgments) was intended to reduce some response biases. We will qualify the conclusion to emphasize that the null supports limited user-side detection viability but does not rule out all possible detection methods. These additions will be made without altering the core empirical claim. revision: partial
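The effect-size reporting the authors promise can be sketched as follows, on hypothetical slider ratings. Cohen's d is computed here with the pooled-SD convention (the paper's convention is not stated), and the p-value itself needs a t-distribution CDF (e.g. `scipy.stats.ttest_ind` with `equal_var=False`), omitted to keep the sketch dependency-free.

```python
from math import sqrt
from statistics import mean, stdev

def welch_t_and_cohens_d(a, b):
    """Welch's t statistic with Welch-Satterthwaite df, plus Cohen's d (pooled SD)."""
    ma, mb = mean(a), mean(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2          # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (ma - mb) / sqrt(se2)                      # Welch's t
    # Welch-Satterthwaite degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    sd_pooled = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    d = (ma - mb) / sd_pooled                      # Cohen's d
    return t, df, d

# Hypothetical 0-1 "machine-ness" ratings for human- vs. LLM-written articles
human = [0.42, 0.55, 0.48, 0.51, 0.60, 0.45]
llm = [0.50, 0.47, 0.58, 0.44, 0.53, 0.51]
t, df, d = welch_t_and_cohens_d(human, llm)
# a |d| near zero alongside a non-significant t is consistent with the null claim
```

A post-hoc power calculation would additionally require the smallest effect size the study was designed to detect, which the revision should state explicitly.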
Referee: [Methods] Methods: Full details on article selection criteria, exact LLM prompts and generation parameters, and participant exclusion rules are missing. These are required to assess confounds, reproducibility, and whether the null result generalizes beyond the specific news topics and participant pool.
Authors: We acknowledge these details are necessary for reproducibility and evaluation of generalizability. In the revised Methods section we will specify article selection criteria (recent news articles balanced across politics, science, and entertainment domains, matched for length and style), the exact prompts and generation parameters for each of the six LLMs (including temperature, top-p, and max tokens), and full participant exclusion rules (failed attention checks, incomplete responses, and response time outliers). These additions will allow readers to assess potential confounds and the scope of the findings. revision: yes
Referee: [Findings (5)] Findings (5): The fatigue claim after ~30 sequential evaluations is reported with a total of 2,318 judgments from 1,054 participants (average ~2.2 per person). Clarification is needed on how many participants completed long sequences, how fatigue was operationalized and statistically controlled, and whether the degradation affects the primary t-test results.
Authors: We will expand the reporting in the revised manuscript to include the distribution of judgments per participant (noting the subset who completed 30+ trials), the operationalization of fatigue via regression of accuracy on trial order, and confirmation that the primary t-test null result holds when fatigue is statistically controlled or when analysis is restricted to participants with short sequences. These clarifications will be added to the Results and Discussion sections. revision: yes
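The operationalization the authors describe, regressing accuracy on trial order, can be sketched with a minimal ordinary-least-squares slope. The accuracy series below is fabricated for illustration, and splitting at trial 30 is a simplification of whatever changepoint analysis the revised manuscript will actually report.

```python
from statistics import mean

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x: cov(x, y) / var(x)."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

# Hypothetical per-trial accuracy for one long-sequence participant:
# flat over the first 30 trials, then a steady decline.
trials = list(range(1, 41))
accuracy = [0.55] * 30 + [0.55 - 0.02 * k for k in range(1, 11)]

early = ols_slope(trials[:30], accuracy[:30])  # ~0.0: no drift before trial 30
late = ols_slope(trials[30:], accuracy[30:])   # ~-0.02 per trial: post-30 decline
```

With an average of ~2.2 judgments per participant, the key disclosure is how many participants contribute to the `late` segment at all; a mixed-effects model with participant intercepts would be the natural way to control for that imbalance.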
Circularity Check
No circularity: direct empirical study with standard statistical reporting
full rationale
The paper reports findings from a human-subjects experiment collecting 2,318 judgments and applying off-the-shelf tests (Welch's t-test, Pearson correlation). There is no derivation chain, no equations, no fitted parameters renamed as predictions, and no load-bearing premises resting on self-citation. All claims reduce directly to the collected data and conventional statistical outputs rather than to any self-referential construction or prior result by the authors.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Participant judgments on continuous scales accurately reflect source attribution and authenticity perception without systematic bias from the study interface.
- domain assumption The selected articles and LLM outputs are representative of typical news content for generalizing the inability to distinguish.