pith · machine review for the scientific record

arXiv: 2604.03755 · v1 · submitted 2026-04-04 · 💻 cs.CY · cs.AI · cs.CL · cs.HC


Can Humans Tell? A Dual-Axis Study of Human Perception of LLM-Generated News


Pith reviewed 2026-05-13 17:07 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CL · cs.HC
keywords AI-generated news · human perception · LLM detection · source attribution · authenticity judgment · cognitive fatigue · misinformation defense

The pith

Humans cannot reliably distinguish LLM-generated news from human-written articles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether people can spot news articles created by large language models rather than human writers. It introduces a platform that collects separate continuous ratings for source (human or machine) and authenticity (legitimate or fake), gathering 2,318 judgments from 1,054 participants. The central result is that accuracy stays at chance levels for all six tested models, including small open-weight ones, with only modest help from self-reported topic expertise. This matters because it removes the option of treating ordinary readers as a practical check against AI misinformation in news.

Core claim

Participants show no statistically significant ability to attribute articles correctly to human or machine sources on continuous scales, with the same pattern holding across every LLM tested; domain expertise correlates modestly with better performance while political orientation does not, and accuracy falls after repeated judgments.

What carries the argument

The JudgeGPT platform, which separately measures source attribution (a human-machine scale) and authenticity judgment (a legitimate-fake scale) on continuous sliders; a sketch of what one such record might look like follows.
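A minimal sketch of the dual-axis record and a chance-level accuracy check, in Python; the field names and the 0-to-1 slider convention are illustrative assumptions, not JudgeGPT's actual schema.

```python
# Hypothetical dual-axis judgment record; names are assumptions, not
# JudgeGPT's real schema.
from dataclasses import dataclass

@dataclass
class Judgment:
    article_id: str
    is_machine: bool            # ground truth: article was LLM-generated
    source_score: float         # 0.0 = surely human .. 1.0 = surely machine
    authenticity_score: float   # 0.0 = legitimate .. 1.0 = fake

def attribution_accuracy(judgments: list[Judgment]) -> float:
    """Fraction of judgments whose thresholded source score matches truth.

    Chance level is 0.5 on a balanced set; the paper reports accuracy
    statistically indistinguishable from that level.
    """
    hits = sum((j.source_score > 0.5) == j.is_machine for j in judgments)
    return hits / len(judgments)
```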

If this is right

  • End users cannot reliably detect AI-generated news content.
  • Domain expertise provides only modest improvement in judgment accuracy.
  • Participants exhibit distinct response patterns such as skepticism or belief.
  • Judgment accuracy declines after about 30 sequential evaluations due to fatigue.
  • Defenses against AI news must rely on system-level methods such as cryptographic provenance rather than user detection.
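On the last bullet, a minimal sketch of what system-level provenance could mean in practice: a publisher signs the article bytes and any client verifies the signature. This uses an Ed25519 key via Python's `cryptography` package purely as an illustration; it is not the paper's proposal, and deployed schemes such as C2PA manifests carry far more metadata.

```python
# Minimal provenance sketch: publisher-side signing, reader-side verification.
# Illustrative only; not the paper's method.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

publisher_key = Ed25519PrivateKey.generate()   # held by the newsroom
article = "Full text of the published article.".encode()
signature = publisher_key.sign(article)        # attached at publication time

# The reader's client checks the signature instead of asking a human to
# guess whether the text "reads like a machine wrote it".
try:
    publisher_key.public_key().verify(signature, article)
    print("provenance verified")
except InvalidSignature:
    print("no valid provenance")
```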

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Content platforms may need automated verification tools instead of depending on reader vigilance.
  • The fatigue effect from sequential judgments implies real-world detection would be even poorer when people read many articles in one session.
  • Indistinguishability even for small open models suggests that widely accessible AI writing tools could make manual checks impractical without additional safeguards.

Load-bearing premise

That the news topics, LLM prompts, and online participant sample yield perception judgments that would hold under different real-world conditions, and that the rating scales capture genuine differences without interface bias.

What would settle it

A replication using new news topics or a different participant pool that finds statistically significant differences between human and machine attribution scores would falsify the claim.
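A minimal sketch of that falsifying analysis, assuming two hypothetical arrays of 0-to-1 source-attribution ratings; Welch's t-test is the paper's reported primary test, so a significant difference on fresh data would overturn the claim.

```python
# Falsification check: Welch's t-test on source-attribution ratings for
# human-written vs. LLM-generated articles. Data here are synthetic stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores_human = rng.uniform(0, 1, 500)     # ratings of human-written articles
scores_machine = rng.uniform(0, 1, 500)   # ratings of LLM-generated articles

t, p = stats.ttest_ind(scores_machine, scores_human, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")  # p < .05 on real data would falsify
```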

Figures

Figures reproduced from arXiv: 2604.03755 by Alexander Loth, Marc-Oliver Pahl, Martin Kappes.

Figure 1. The JudgeGPT interface. Participants rate each frag…
Figure 2. Source score (left) and authenticity score (right) dis…
Figure 3. Mean source and authenticity scores per LLM (…
Figure 4. Relationships between participant covariates and judgment scores. Top: political view (weak slope). Bottom: fake…
Figure 5. Pairplot of participant-level mean scores colored by cluster. The two groups show clearly separated distributions on…
Figure 6. Rolling mean of judgment scores across sequential evaluations. Scores improve initially before declining after…
Original abstract

Can humans tell whether a news article was written by a person or a large language model (LLM)? We investigate this question using JudgeGPT, a study platform that independently measures source attribution (human vs. machine) and authenticity judgment (legitimate vs. fake) on continuous scales. From 2,318 judgments collected from 1,054 participants across content generated by six LLMs, we report five findings: (1) participants cannot reliably distinguish machine-generated from human-written text (p > .05, Welch's t-test); (2) this inability holds across all tested models, including open-weight models with as few as 7B parameters; (3) self-reported domain expertise predicts judgment accuracy (r = .35, p < .001) whereas political orientation does not (r = -.10, n.s.); (4) clustering reveals distinct response strategies ("Skeptics" vs. "Believers"); and (5) accuracy degrades after approximately 30 sequential evaluations due to cognitive fatigue. The answer, in short, is no: humans cannot reliably tell. These results indicate that user-side detection is not a viable defense and motivate system-level countermeasures such as cryptographic content provenance.
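To make the abstract's covariate numbers concrete, a sketch of the Pearson correlation between self-reported expertise and judgment accuracy on synthetic data tuned to roughly echo the reported r = .35; the variable names and data are assumptions, not the study's.

```python
# Covariate sketch: expertise-accuracy correlation on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
expertise = rng.normal(size=300)                    # self-reported expertise
accuracy = 0.35 * expertise + rng.normal(size=300)  # toy accuracy signal

r, p = stats.pearsonr(expertise, accuracy)
print(f"r = {r:.2f}, p = {p:.4f}")  # the paper reports r = .35, p < .001
```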

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports results from the JudgeGPT platform, collecting 2,318 judgments from 1,054 participants on continuous scales measuring source attribution (human vs. machine) and authenticity (legitimate vs. fake) for news articles generated by six LLMs. The central findings are that participants cannot reliably distinguish LLM-generated from human-written text (p > .05, Welch's t-test), this holds across models including 7B-parameter open-weight ones, self-reported expertise correlates with accuracy (r = .35, p < .001) while political orientation does not, response strategies cluster into 'Skeptics' vs. 'Believers', and accuracy degrades after ~30 sequential trials due to fatigue. The authors conclude that user-side detection is not viable and motivate system-level countermeasures such as cryptographic provenance.

Significance. If the empirical results hold after addressing measurement validation, the work supplies concrete data on human perception limits for AI-generated news, separating attribution from authenticity judgments and documenting expertise and fatigue effects. This strengthens arguments for technical provenance solutions over reliance on human detection and provides a replicable dual-axis protocol for future studies in the cs.CY and HCI domains.

major comments (3)
  1. [Findings (1) / Results] The claim that participants 'cannot reliably distinguish' rests on a non-significant Welch's t-test (p > .05). No effect sizes, power analysis, or validation of the continuous scales (e.g., test-retest reliability or forced-choice comparison) are reported, leaving open whether the null arises from genuine indistinguishability or from interface artifacts such as slider anchoring and demand characteristics in JudgeGPT. This directly affects the load-bearing inference that user-side detection is not viable (a worked effect-size sketch follows the minor comments).
  2. [Methods] Full details on article selection criteria, exact LLM prompts and generation parameters, and participant exclusion rules are missing. These are required to assess confounds, reproducibility, and whether the null result generalizes beyond the specific news topics and participant pool.
  3. [Findings (5)] The fatigue claim after ~30 sequential evaluations is reported against a total of 2,318 judgments from 1,054 participants (an average of ~2.2 per person). Clarification is needed on how many participants completed long sequences, how fatigue was operationalized and statistically controlled, and whether the degradation affects the primary t-test results.
minor comments (2)
  1. [Abstract] The abstract states results hold 'across all tested models' but does not name the six LLMs or their parameter counts in the summary; this should be added for immediate clarity.
  2. [Findings (4)] The clustering analysis into 'Skeptics' vs. 'Believers' is mentioned but lacks details on the clustering algorithm, number of clusters selected, and validation metrics; move to a dedicated subsection or supplement.
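The effect-size and power reporting requested in major comment 1 could look like the sketch below, assuming hypothetical rating arrays and the `statsmodels` power module; this illustrates the requested analysis, not the authors' code.

```python
# Cohen's d plus post-hoc power for the primary comparison; synthetic data.
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(2)
scores_human = rng.uniform(0, 1, 500)
scores_machine = rng.uniform(0, 1, 500)

# Cohen's d with a pooled standard deviation.
pooled_sd = np.sqrt((scores_human.var(ddof=1) + scores_machine.var(ddof=1)) / 2)
d = (scores_machine.mean() - scores_human.mean()) / pooled_sd

# Achieved power to detect |d| at alpha = .05 with these group sizes;
# a near-zero d yields power near alpha, quantifying how weak the null is.
power = TTestIndPower().power(effect_size=abs(d), nobs1=500, ratio=1.0, alpha=0.05)
print(f"d = {d:.3f}, power = {power:.2f}")
```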

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify key aspects of our work. We address each major comment point by point below, indicating where we will revise the manuscript.

Point-by-point responses
  1. Referee: [Findings (1) / Results] The claim that participants 'cannot reliably distinguish' rests on a non-significant Welch's t-test (p > .05). No effect sizes, power analysis, or validation of the continuous scales (e.g., test-retest reliability or forced-choice comparison) are reported, leaving open whether the null arises from genuine indistinguishability or from interface artifacts such as slider anchoring and demand characteristics in JudgeGPT. This directly affects the load-bearing inference that user-side detection is not viable.

    Authors: We agree that effect sizes and a power analysis would strengthen interpretation of the null result. In the revised manuscript we will report Cohen's d for the Welch's t-test and include a post-hoc power calculation. Regarding scale validation, we will add a limitations discussion addressing potential slider anchoring and demand characteristics, while noting that the dual-axis design (separate attribution and authenticity judgments) was intended to reduce some response biases. We will qualify the conclusion to emphasize that the null supports limited user-side detection viability but does not rule out all possible detection methods. These additions will be made without altering the core empirical claim. revision: partial

  2. Referee: [Methods] Full details on article selection criteria, exact LLM prompts and generation parameters, and participant exclusion rules are missing. These are required to assess confounds, reproducibility, and whether the null result generalizes beyond the specific news topics and participant pool.

    Authors: We acknowledge these details are necessary for reproducibility and evaluation of generalizability. In the revised Methods section we will specify article selection criteria (recent news articles balanced across politics, science, and entertainment domains, matched for length and style), the exact prompts and generation parameters for each of the six LLMs (including temperature, top-p, and max tokens), and full participant exclusion rules (failed attention checks, incomplete responses, and response time outliers). These additions will allow readers to assess potential confounds and the scope of the findings. revision: yes

  3. Referee: [Findings (5)] The fatigue claim after ~30 sequential evaluations is reported against a total of 2,318 judgments from 1,054 participants (average ~2.2 per person). Clarification is needed on how many participants completed long sequences, how fatigue was operationalized and statistically controlled, and whether the degradation affects the primary t-test results.

    Authors: We will expand the reporting in the revised manuscript to include the distribution of judgments per participant (noting the subset who completed 30+ trials), the operationalization of fatigue via regression of accuracy on trial order, and confirmation that the primary t-test null result holds when fatigue is statistically controlled or when analysis is restricted to participants with short sequences. These clarifications will be added to the Results and Discussion sections. revision: yes
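Response 3's operationalization of fatigue, a regression of accuracy on trial order, might look like the following sketch on toy data shaped like Figure 6 (an early rise, then decline after roughly trial 30); all numbers are synthetic assumptions.

```python
# Fatigue sketch: regress per-trial accuracy on trial order past trial 30.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
trial_order = np.arange(1, 61)
# Toy shape echoing Figure 6: accuracy peaks near trial 30, then declines.
accuracy = 0.55 - 0.00015 * (trial_order - 30) ** 2 + rng.normal(0, 0.02, 60)

slope, intercept, r, p, se = stats.linregress(trial_order[30:], accuracy[30:])
print(f"slope after trial 30: {slope:.4f} per trial (p = {p:.3f})")
```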

Circularity Check

0 steps flagged

No circularity: direct empirical study with standard statistical reporting

Full rationale

The paper reports findings from a human-subjects experiment collecting 2,318 judgments and applying off-the-shelf tests (Welch's t-test, Pearson correlation). There is no derivation chain, no fitted parameters renamed as predictions, and no load-bearing reliance on the authors' prior results. All claims reduce directly to the collected data and conventional statistical outputs rather than to any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard human-subjects assumptions about measurement validity and sample representativeness rather than fitted parameters or new entities.

axioms (2)
  • domain assumption: Participant judgments on continuous scales accurately reflect source attribution and authenticity perception without systematic bias from the study interface.
    Invoked when interpreting the two judgment axes as valid measures of human perception.
  • domain assumption: The selected articles and LLM outputs are representative of typical news content, so the demonstrated inability to distinguish generalizes.
    Required for the claim that the result applies beyond the tested items.

pith-pipeline@v0.9.0 · 5526 in / 1213 out tokens · 60805 ms · 2026-05-13T17:07:56.223137+00:00 · methodology

