pith · machine review for the scientific record

arXiv: 2604.03755 · v1 · submitted 2026-04-04 · 💻 cs.CY · cs.AI · cs.CL · cs.HC


Can Humans Tell? A Dual-Axis Study of Human Perception of LLM-Generated News


Pith reviewed 2026-05-13 17:07 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CL · cs.HC
keywords AI-generated news · human perception · LLM detection · source attribution · authenticity judgment · cognitive fatigue · misinformation defense

The pith

Humans cannot reliably distinguish LLM-generated news from human-written articles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether people can spot news articles created by large language models rather than human writers. It introduces a platform that collects separate continuous ratings for source (human or machine) and authenticity (legitimate or fake), gathering 2,318 judgments from 1,054 participants. The central result is that accuracy stays at chance levels for all six tested models, including small open-weight ones, with only modest help from self-reported topic expertise. This matters because it removes the option of treating ordinary readers as a practical check against AI misinformation in news.

Core claim

Participants show no statistically significant ability to attribute articles correctly to human or machine sources on continuous scales, with the same pattern holding across every LLM tested; domain expertise correlates modestly with better performance while political orientation does not, and accuracy falls after repeated judgments.

What carries the argument

The JudgeGPT platform, which separately measures source attribution (a human-machine scale) and authenticity judgment (a legitimate-fake scale) on continuous sliders; a sketch of what one such record might look like follows.
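A minimal sketch of the dual-axis record and a chance-level accuracy check, in Python; the field names and the 0-to-1 slider convention are illustrative assumptions, not JudgeGPT's actual schema.

```python
# Hypothetical dual-axis judgment record; names are assumptions, not
# JudgeGPT's real schema.
from dataclasses import dataclass

@dataclass
class Judgment:
    article_id: str
    is_machine: bool            # ground truth: article was LLM-generated
    source_score: float         # 0.0 = surely human .. 1.0 = surely machine
    authenticity_score: float   # 0.0 = legitimate .. 1.0 = fake

def attribution_accuracy(judgments: list[Judgment]) -> float:
    """Fraction of judgments whose thresholded source score matches truth.

    Chance level is 0.5 on a balanced set; the paper reports accuracy
    statistically indistinguishable from that level.
    """
    hits = sum((j.source_score > 0.5) == j.is_machine for j in judgments)
    return hits / len(judgments)
```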

If this is right

  • End users cannot reliably detect AI-generated news content.
  • Domain expertise provides only modest improvement in judgment accuracy.
  • Participants exhibit distinct response patterns such as skepticism or belief.
  • Judgment accuracy declines after about 30 sequential evaluations due to fatigue.
  • Defenses against AI news must rely on system-level methods such as cryptographic provenance rather than user detection.
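On the last bullet, a minimal sketch of what system-level provenance could mean in practice: a publisher signs the article bytes and any client verifies the signature. This uses an Ed25519 key via Python's `cryptography` package purely as an illustration; it is not the paper's proposal, and deployed schemes such as C2PA manifests carry far more metadata.

```python
# Minimal provenance sketch: publisher-side signing, reader-side verification.
# Illustrative only; not the paper's method.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

publisher_key = Ed25519PrivateKey.generate()   # held by the newsroom
article = "Full text of the published article.".encode()
signature = publisher_key.sign(article)        # attached at publication time

# The reader's client checks the signature instead of asking a human to
# guess whether the text "reads like a machine wrote it".
try:
    publisher_key.public_key().verify(signature, article)
    print("provenance verified")
except InvalidSignature:
    print("no valid provenance")
```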

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Content platforms may need automated verification tools instead of depending on reader vigilance.
  • The fatigue effect from sequential judgments implies real-world detection would be even poorer when people read many articles in one session.
  • Indistinguishability even for small open models suggests that widely accessible AI writing tools could make manual checks impractical without additional safeguards.

Load-bearing premise

That the news topics, LLM prompts, and online participant sample yield perception judgments that would hold under different real-world conditions, and that the rating scales capture genuine differences without interface bias.

What would settle it

A replication using new news topics or a different participant pool that finds statistically significant differences between human and machine attribution scores would falsify the claim.
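A minimal sketch of that falsifying analysis, assuming two hypothetical arrays of 0-to-1 source-attribution ratings; Welch's t-test is the paper's reported primary test, so a significant difference on fresh data would overturn the claim.

```python
# Falsification check: Welch's t-test on source-attribution ratings for
# human-written vs. LLM-generated articles. Data here are synthetic stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores_human = rng.uniform(0, 1, 500)     # ratings of human-written articles
scores_machine = rng.uniform(0, 1, 500)   # ratings of LLM-generated articles

t, p = stats.ttest_ind(scores_machine, scores_human, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")  # p < .05 on real data would falsify
```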

Figures

Figures reproduced from arXiv: 2604.03755 by Alexander Loth, Marc-Oliver Pahl, Martin Kappes.

Figure 1. The JudgeGPT interface. Participants rate each frag…
Figure 2. Source score (left) and authenticity score (right) dis…
Figure 3. Mean source and authenticity scores per LLM (…
Figure 4. Relationships between participant covariates and judgment scores. Top: political view (weak slope). Bottom: fake…
Figure 5. Pairplot of participant-level mean scores colored by cluster. The two groups show clearly separated distributions on…
Figure 6. Rolling mean of judgment scores across sequential evaluations. Scores improve initially before declining after…
Original abstract

Can humans tell whether a news article was written by a person or a large language model (LLM)? We investigate this question using JudgeGPT, a study platform that independently measures source attribution (human vs. machine) and authenticity judgment (legitimate vs. fake) on continuous scales. From 2,318 judgments collected from 1,054 participants across content generated by six LLMs, we report five findings: (1) participants cannot reliably distinguish machine-generated from human-written text (p > .05, Welch's t-test); (2) this inability holds across all tested models, including open-weight models with as few as 7B parameters; (3) self-reported domain expertise predicts judgment accuracy (r = .35, p < .001) whereas political orientation does not (r = -.10, n.s.); (4) clustering reveals distinct response strategies ("Skeptics" vs. "Believers"); and (5) accuracy degrades after approximately 30 sequential evaluations due to cognitive fatigue. The answer, in short, is no: humans cannot reliably tell. These results indicate that user-side detection is not a viable defense and motivate system-level countermeasures such as cryptographic content provenance.
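To make the abstract's covariate numbers concrete, a sketch of the Pearson correlation between self-reported expertise and judgment accuracy on synthetic data tuned to roughly echo the reported r = .35; the variable names and data are assumptions, not the study's.

```python
# Covariate sketch: expertise-accuracy correlation on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
expertise = rng.normal(size=300)                    # self-reported expertise
accuracy = 0.35 * expertise + rng.normal(size=300)  # toy accuracy signal

r, p = stats.pearsonr(expertise, accuracy)
print(f"r = {r:.2f}, p = {p:.4f}")  # the paper reports r = .35, p < .001
```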

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports results from the JudgeGPT platform, collecting 2,318 judgments from 1,054 participants on continuous scales measuring source attribution (human vs. machine) and authenticity (legitimate vs. fake) for news articles generated by six LLMs. The central findings are that participants cannot reliably distinguish LLM-generated from human-written text (p > .05, Welch's t-test), this holds across models including 7B-parameter open-weight ones, self-reported expertise correlates with accuracy (r = .35, p < .001) while political orientation does not, response strategies cluster into 'Skeptics' vs. 'Believers', and accuracy degrades after ~30 sequential trials due to fatigue. The authors conclude that user-side detection is not viable and motivate system-level countermeasures such as cryptographic provenance.

Significance. If the empirical results hold after addressing measurement validation, the work supplies concrete data on human perception limits for AI-generated news, separating attribution from authenticity judgments and documenting expertise and fatigue effects. This strengthens arguments for technical provenance solutions over reliance on human detection and provides a replicable dual-axis protocol for future studies in the cs.CY and HCI domains.

major comments (3)
  1. [Findings (1) / Results] The claim that participants 'cannot reliably distinguish' rests on a non-significant Welch's t-test (p > .05). No effect sizes, power analysis, or validation of the continuous scales (e.g., test-retest reliability or forced-choice comparison) are reported, leaving open whether the null arises from genuine indistinguishability or from interface artifacts such as slider anchoring and demand characteristics in JudgeGPT. This directly affects the load-bearing inference that user-side detection is not viable (a worked effect-size sketch follows the minor comments).
  2. [Methods] Full details on article selection criteria, exact LLM prompts and generation parameters, and participant exclusion rules are missing. These are required to assess confounds, reproducibility, and whether the null result generalizes beyond the specific news topics and participant pool.
  3. [Findings (5)] The fatigue claim after ~30 sequential evaluations is reported against a total of 2,318 judgments from 1,054 participants (an average of ~2.2 per person). Clarification is needed on how many participants completed long sequences, how fatigue was operationalized and statistically controlled, and whether the degradation affects the primary t-test results.
minor comments (2)
  1. [Abstract] The abstract states results hold 'across all tested models' but does not name the six LLMs or their parameter counts in the summary; this should be added for immediate clarity.
  2. [Findings (4)] The clustering analysis into 'Skeptics' vs. 'Believers' is mentioned but lacks details on the clustering algorithm, number of clusters selected, and validation metrics; move to a dedicated subsection or supplement.
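The effect-size and power reporting requested in major comment 1 could look like the sketch below, assuming hypothetical rating arrays and the `statsmodels` power module; this illustrates the requested analysis, not the authors' code.

```python
# Cohen's d plus post-hoc power for the primary comparison; synthetic data.
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(2)
scores_human = rng.uniform(0, 1, 500)
scores_machine = rng.uniform(0, 1, 500)

# Cohen's d with a pooled standard deviation.
pooled_sd = np.sqrt((scores_human.var(ddof=1) + scores_machine.var(ddof=1)) / 2)
d = (scores_machine.mean() - scores_human.mean()) / pooled_sd

# Achieved power to detect |d| at alpha = .05 with these group sizes;
# a near-zero d yields power near alpha, quantifying how weak the null is.
power = TTestIndPower().power(effect_size=abs(d), nobs1=500, ratio=1.0, alpha=0.05)
print(f"d = {d:.3f}, power = {power:.2f}")
```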

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify key aspects of our work. We address each major comment point by point below, indicating where we will revise the manuscript.

Point-by-point responses
  1. Referee: [Findings (1) / Results] The claim that participants 'cannot reliably distinguish' rests on a non-significant Welch's t-test (p > .05). No effect sizes, power analysis, or validation of the continuous scales (e.g., test-retest reliability or forced-choice comparison) are reported, leaving open whether the null arises from genuine indistinguishability or from interface artifacts such as slider anchoring and demand characteristics in JudgeGPT. This directly affects the load-bearing inference that user-side detection is not viable.

    Authors: We agree that effect sizes and a power analysis would strengthen interpretation of the null result. In the revised manuscript we will report Cohen's d for the Welch's t-test and include a post-hoc power calculation. Regarding scale validation, we will add a limitations discussion addressing potential slider anchoring and demand characteristics, while noting that the dual-axis design (separate attribution and authenticity judgments) was intended to reduce some response biases. We will qualify the conclusion to emphasize that the null supports limited user-side detection viability but does not rule out all possible detection methods. These additions will be made without altering the core empirical claim. revision: partial

  2. Referee: [Methods] Full details on article selection criteria, exact LLM prompts and generation parameters, and participant exclusion rules are missing. These are required to assess confounds, reproducibility, and whether the null result generalizes beyond the specific news topics and participant pool.

    Authors: We acknowledge these details are necessary for reproducibility and evaluation of generalizability. In the revised Methods section we will specify article selection criteria (recent news articles balanced across politics, science, and entertainment domains, matched for length and style), the exact prompts and generation parameters for each of the six LLMs (including temperature, top-p, and max tokens), and full participant exclusion rules (failed attention checks, incomplete responses, and response time outliers). These additions will allow readers to assess potential confounds and the scope of the findings. revision: yes

  3. Referee: [Findings (5)] The fatigue claim after ~30 sequential evaluations is reported against a total of 2,318 judgments from 1,054 participants (average ~2.2 per person). Clarification is needed on how many participants completed long sequences, how fatigue was operationalized and statistically controlled, and whether the degradation affects the primary t-test results.

    Authors: We will expand the reporting in the revised manuscript to include the distribution of judgments per participant (noting the subset who completed 30+ trials), the operationalization of fatigue via regression of accuracy on trial order, and confirmation that the primary t-test null result holds when fatigue is statistically controlled or when analysis is restricted to participants with short sequences. These clarifications will be added to the Results and Discussion sections. revision: yes
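Response 3's operationalization of fatigue, a regression of accuracy on trial order, might look like the following sketch on toy data shaped like Figure 6 (an early rise, then decline after roughly trial 30); all numbers are synthetic assumptions.

```python
# Fatigue sketch: regress per-trial accuracy on trial order past trial 30.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
trial_order = np.arange(1, 61)
# Toy shape echoing Figure 6: accuracy peaks near trial 30, then declines.
accuracy = 0.55 - 0.00015 * (trial_order - 30) ** 2 + rng.normal(0, 0.02, 60)

slope, intercept, r, p, se = stats.linregress(trial_order[30:], accuracy[30:])
print(f"slope after trial 30: {slope:.4f} per trial (p = {p:.3f})")
```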

Circularity Check

0 steps flagged

No circularity: direct empirical study with standard statistical reporting

Full rationale

The paper reports findings from a human-subjects experiment collecting 2,318 judgments and applying off-the-shelf tests (Welch's t-test, Pearson correlation). There is no derivation chain, no fitted parameters renamed as predictions, and no load-bearing reliance on the authors' prior results. All claims reduce directly to the collected data and conventional statistical outputs rather than to any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard human-subjects assumptions about measurement validity and sample representativeness rather than fitted parameters or new entities.

axioms (2)
  • domain assumption: Participant judgments on continuous scales accurately reflect source attribution and authenticity perception without systematic bias from the study interface.
    Invoked when interpreting the two judgment axes as valid measures of human perception.
  • domain assumption: The selected articles and LLM outputs are representative of typical news content, so the demonstrated inability to distinguish generalizes.
    Required for the claim that the result applies beyond the tested items.

pith-pipeline@v0.9.0 · 5526 in / 1213 out tokens · 60805 ms · 2026-05-13T17:07:56.223137+00:00 · methodology

