pith. sign in

arxiv: 2502.11614 · v3 · submitted 2025-02-17 · 💻 cs.CL · cs.AI

Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI

Pith reviewed 2026-05-23 03:11 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords human detectionAI-generated textmultilingual evaluationLLM texttext preferenceconcretenesscultural nuances
0
0 comments X

The pith

Humans detect AI-generated text at 87.6 percent accuracy across 16 datasets in nine languages, far above prior random-guessing estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether humans can reliably distinguish large-language-model text from human-written text when the test spans many languages and subject areas. Earlier studies concluded that detection is usually no better than chance. Here, 19 annotators working on 16 datasets achieved 87.6 percent average accuracy. The clearest differences between the two kinds of text appear in concreteness, cultural detail, and expressive variety. Prompting that names these differences improves human performance in more than half the cases, yet people do not reliably favor the human-written version once they cannot name its source.

Core claim

Across 16 datasets covering nine languages and nine domains, 19 annotators reached an average detection accuracy of 87.6 percent when separating machine-generated text from human-written text. This finding directly challenges earlier conclusions that such distinction is highly challenging and often equivalent to random guessing. The study locates the main gaps in concreteness, cultural nuances, and diversity. Prompting that explicitly explains these gaps narrows the performance difference in over 50 percent of cases. Humans show no consistent preference for human-written text when they cannot identify its origin.

What carries the argument

The multilingual, multidomain human annotation experiment that measures both detection accuracy and source preference.

If this is right

  • Earlier claims that humans cannot detect AI text better than chance do not generalize across languages and domains.
  • Explicit prompts that name concreteness, cultural nuance, and diversity gaps can raise human detection rates in more than half the tested cases.
  • Human preference does not automatically favor human-written text when the source remains unidentified.
  • The released dataset of human labels and annotator metadata can be used to test further distinctions between the two text types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the high detection accuracy holds for typical readers, automated detectors may be less critical in some everyday settings.
  • The preference result suggests that AI text could gain acceptance in practice even when it remains distinguishable in principle.
  • Future work could test whether brief training on the identified gaps raises accuracy further or whether accuracy drops for non-expert readers.

Load-bearing premise

The 19 annotators and the 16 chosen datasets are representative of typical human detection performance and text distributions across the covered languages and domains.

What would settle it

A new study that recruits a substantially larger and more varied group of annotators on comparable datasets and obtains average accuracy near 50 percent would falsify the central accuracy claim.

Figures

Figures reproduced from arXiv: 2502.11614 by Akim Tsvigun, Alexander Aziz, Alham Fikri Aji, Artem Shelmanov, Ekaterina Artemova, Giovanni Puccetti, Iryna Gurevych, Jiahui Geng, Jinyan Su, Jonibek Mansurov, Kareem Elozeiri, Maiya Goloburda, Masahiro Kaneko, Mervat Abassy, Minh Ngoc Ta, Nizar Habash, Nurkhan Laiyk, Preslav Nakov, Raj Vardhan Tomar, Rui Xing, Ryuto Koike, Saad El Dine Ahmed, Tarek Mahmoud, Vladislav Mikhailov, Yuxia Wang, Zhuohan Xie.

Figure 1
Figure 1. Figure 1: Evaluating whether the new generations fill in [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Human preferences for three Chinese datasets [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Human preferences for two Russian (three [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Three annotator agreement on Chinese essays regarding whether the improved prompts mitigate the gap [PITH_FULL_IMAGE:figures/full_fig_p032_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Detection accuracy differences of 26 automatic machine-generated text detection approaches on original vs. improved generations. emotional issues and provide practical suggestions. For example, I have 87 days left until the college entrance exam as a student who attended this exam the second time, and my girlfriend, who is in uni- [PITH_FULL_IMAGE:figures/full_fig_p032_6.png] view at source ↗
read the original abstract

Prior studies have shown that distinguishing text generated by Large Language Models (LLMs) from human-written one is highly challenging for humans, and often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, thus challenging previous conclusions. We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Prompting by explicitly explaining the distinctions in the prompts can partially bridge the gaps in over 50% of the cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source. We release our dataset, the human labels, and the annotator metadata at https://github.com/xnlp-lab/HumanEval-MGT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that prior findings of humans performing near random guessing in distinguishing LLM-generated from human-written text do not generalize: across 16 datasets spanning 9 languages and 9 domains, 19 annotators achieved 87.6% average detection accuracy. It identifies gaps in concreteness, cultural nuances, and diversity as major discriminators, reports that explicit prompting bridges the gap in over 50% of cases, and finds that humans do not consistently prefer human text when the source cannot be identified. The authors release the dataset, human labels, and annotator metadata.

Significance. If the 87.6% figure and its generalizability hold, the result would materially revise the empirical baseline for human detection of machine-generated text in multilingual settings and would have downstream implications for detection tools and preference studies. The public release of the annotated data and metadata is a clear strength that supports reproducibility.

major comments (2)
  1. [Abstract] Abstract: the headline 87.6% average detection accuracy is presented without inter-annotator agreement statistics, per-language or per-domain variance, or error analysis; these omissions make it impossible to evaluate whether the result reliably overturns the random-guessing literature.
  2. [Abstract] Abstract: the claim that the 19 annotators and 16 datasets verify generalizability across languages and domains rests on an unstated assumption of representativeness; no information is given on annotator recruitment criteria, language proficiency screening, or dataset curation procedures that would rule out selection toward high-expertise annotators or easily distinguishable texts.
minor comments (1)
  1. [Abstract] The abstract states that prompting 'can partially bridge the gaps in over 50% of the cases' but does not define the exact success metric or the baseline prompting condition used for this comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting ways to strengthen the abstract's self-contained presentation of our results. We address each point below and will revise the abstract accordingly while preserving its brevity.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline 87.6% average detection accuracy is presented without inter-annotator agreement statistics, per-language or per-domain variance, or error analysis; these omissions make it impossible to evaluate whether the result reliably overturns the random-guessing literature.

    Authors: The full manuscript reports inter-annotator agreement (Fleiss' kappa = 0.82 overall), per-language accuracies ranging from 78.4% to 94.1% and per-domain from 81.2% to 93.7% (Tables 2-3), plus error analysis attributing errors to concreteness, cultural nuance, and diversity gaps (Section 4.2). We agree the abstract should briefly reference these to allow immediate evaluation and will add one sentence summarizing agreement and variance ranges. revision: yes

  2. Referee: [Abstract] Abstract: the claim that the 19 annotators and 16 datasets verify generalizability across languages and domains rests on an unstated assumption of representativeness; no information is given on annotator recruitment criteria, language proficiency screening, or dataset curation procedures that would rule out selection toward high-expertise annotators or easily distinguishable texts.

    Authors: The manuscript details annotator recruitment (via Prolific with native-speaker screening and language-proficiency self-reports plus qualification tests) and dataset curation (standard public corpora balanced across languages/domains, with no post-hoc filtering for distinguishability). We acknowledge the abstract omits these and will add a concise clause on recruitment and curation criteria. We note that while our scale improves on prior work, full population-level generalizability remains a limitation discussed in Section 6. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical annotation study

full rationale

The paper reports new human annotations by 19 annotators on 16 datasets, yielding an observed 87.6% average detection accuracy. No equations, fitted parameters, derivations, or self-citation chains are present in the abstract or described methodology. The central claim is a direct empirical measurement rather than any reduction of a prediction to its inputs by construction. This is the normal case of a self-contained data-collection study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the assumption that annotator judgments constitute a valid and generalizable measure of detection ability and that the selected datasets adequately sample the target languages and domains; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The judgments of the 19 annotators accurately reflect human detection capabilities across languages and domains.
    The accuracy figure is derived directly from these annotations.

pith-pipeline@v0.9.0 · 5831 in / 1184 out tokens · 47456 ms · 2026-05-23T03:11:53.047143+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM Output Detectability and Task Performance Can be Jointly Optimized

    cs.CL 2026-05 unverdicted novelty 6.0

    PUPPET jointly optimizes LLM outputs for high detectability and task performance via RL rewards from a detector and a task evaluator, outperforming watermarking on tasks while matching detectability.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    URL: " 'urlintro :=

    ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Anthropic. 2024. https://api.semanticscholar.org/CorpusID:268232499 The claude 3 model family: Opus, sonnet, haiku

  4. [4]

    Giovanni Bonisoli, Maria Pia di Buono, Laura Po, and Federica Rollo. 2023. https://doi.org/10.1145/3539618.3591904 Dice: A dataset of italian crime event news . In SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, July 23 - 27, 2023 . ACM

  5. [5]

    JM Chein, SA Martinez, and AR Barone. 2024. Human intelligence can safeguard against artificial intelligence: individual differences in the discernment of human from ai texts. Scientific Reports, 14(1):25989

  6. [6]

    Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. 2021. https://doi.org/10.18653/v1/2021.acl-long.565 All that`s human' is not gold: Evaluating human evaluation of generated text . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conf...

  7. [7]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 Herd of Models . arXiv preprint arXiv:2407.21783

  8. [8]

    Liam Dugan, Daphne Ippolito, Arun Kirubarajan, Sherry Shi, and Chris Callison - Burch. 2023. https://doi.org/10.1609/AAAI.V37I11.26501 Real or fake text?: Investigating human ability to detect boundaries between human-written and machine-generated text . In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on In...

  9. [9]

    Omar Einea, Ashraf Elnagar, and Ridhwan Al Debsi . 2019. https://doi.org/https://doi.org/10.1016/j.dib.2019.104076 Sanad: Single-label arabic news articles dataset for automatic text categorization . Data in Brief, 25:104076

  10. [10]

    Cristina Garbacea, Samuel Carton, Shiyan Yan, and Qiaozhu Mei. 2019. https://doi.org/10.18653/v1/D19-1409 Judge the judges: A large-scale evaluation study of neural language models for online review generation . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural L...

  11. [11]

    Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. How close is chatgpt to human experts? comparison corpus, evaluation, and detection. arXiv preprint arxiv:2301.07597

  12. [12]

    Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M

    Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. 2021. https://doi.org/10.18653/v1/2021.findings-acl.413 XL -sum: Large-scale multilingual abstractive summarization for 44 languages . In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4...

  13. [13]

    Jimpei Hitsuwari, Yoshiyuki Ueda, Woojin Yun, and Michio Nomura. 2023. https://doi.org/10.1016/J.CHB.2022.107502 Does human-ai collaboration lead to more creative art? aesthetic evaluation of human-made and ai-generated haiku poetry . Comput. Hum. Behav., 139:107502

  14. [14]

    Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2020. https://doi.org/10.18653/v1/2020.acl-main.164 Automatic detection of generated text is easiest when humans are fooled . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1808--1822, Online. Association for Computational Linguistics

  15. [15]

    Miao Li, Eduard Hovy, and Jey Han Lau. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.472 Summarizing multiple documents with conversational structure for meta-review generation . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7089--7112, Singapore. Association for Computational Linguistics

  16. [16]

    Zaijing Li, Gongwei Chen, Rui Shao, Dongmei Jiang, and Liqiang Nie. 2024. https://doi.org/10.48550/ARXIV.2401.06836 Enhancing the emotional generation capability of large language models via emotional chain-of-thought . CoRR, abs/2401.06836

  17. [17]

    David M Markowitz, Jeffrey T Hancock, and Jeremy N Bailenson. 2024. Linguistic markers of inherently false ai communication and intentionally false human communication: Evidence from hotel reviews. Journal of Language and Social Psychology, 43(1):63--82

  18. [18]

    Hannah Mieczkowski, Jeffrey T Hancock, Mor Naaman, Malte Jung, and Jess Hohenstein. 2021. Ai-mediated communication: Language use and interpersonal effects in a referential communication task. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1):1--14

  19. [19]

    MOP-LIWU Community and MNBVC Team . 2023. Mnbvc: Massive never-ending bt vast chinese corpus. https://github.com/esbatmop/MNBVC

  20. [20]

    OpenAI. 2023. https://api.semanticscholar.org/CorpusID:257532815 Gpt-4 technical report . ArXiv, abs/2303.08774

  21. [21]

    Marco Polignano, Pierpaolo Basile, and Giovanni Semeraro. 2024. http://arxiv.org/abs/2405.07101 Advanced natural-based interaction for the italian language: Llamantino-3-anita

  22. [22]

    Tatiana Shamardina, Vladislav Mikhailov, Daniil Chernianskii, Alena Fenogenova, Marat Saidov, Anastasiya Valeeva, Tatiana Shavrina, Ivan Smurov, Elena Tutubalina, and Ekaterina Artemova. 2022. Findings of the ruatd shared task 2022 on artificial text detection in russian . arXiv preprint arXiv:2206.01583

  23. [23]

    Tatiana Shamardina, Marat Saidov, Alena Fenogenova, Aleksandr Tumanov, Alina Zemlyakova, Anna Lebedeva, Ekaterina Gryaznova, Tatiana Shavrina, Vladislav Mikhailov, and Ekaterina Artemova. 2025. Coat: Corpus of artificial texts. Natural Language Processing, 31(1):150--175

  24. [24]

    Wei Song, Kai Zhang, Ruiji Fu, Lizhen Liu, Ting Liu, and Miaomiao Cheng. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.546 Multi-stage pre-training for automated chinese essay scoring . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6723--6733, Online. Association for Computational Linguistics

  25. [25]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

  26. [26]

    Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. https://doi.org/10.18653/v1/W19-8643 Best practices for the human evaluation of automatically generated text . In Proceedings of the 12th International Conference on Natural Language Generation, pages 355--368, Tokyo, Japan. Association for Computational Linguistics

  27. [27]

    Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Osama Mohanned Afzal, Tarek Mahmoud, Giovanni Puccetti, Thomas Arnold, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, and Preslav Nakov. 2024 a . https://doi.org/10.48550/ARXIV.2402.11175 M4gt-bench: Evaluation benchmark for black-box machine-generated text detection . C...

  28. [28]

    Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Toru Sasaki, Thomas Arnold, Alham Aji, Nizar Habash, Iryna Gurevych, and Preslav Nakov. 2024 b . https://aclanthology.org/2024.eacl-long.83 M4: Multi-generator, multi-domain, and multi-lingual black-box machine-gene...

  29. [29]

    Yuxia Wang, Artem Shelmanov, Jonibek Mansurov, Akim Tsvigun, Vladislav Mikhailov, Rui Xing, Zhuohan Xie, Jiahui Geng, Giovanni Puccetti, Ekaterina Artemova, Jinyan Su, Minh Ngoc Ta, Mervat Abassy, Kareem Ashraf Elozeiri, Saad El Dine Ahmed El Etter, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Nurkhan Laiyk, Osama Mohammed Afzal, Ryuto Koike, Masahi...

  30. [30]

    Rustem Yeshpanov, Pavel Efimov, Leonid Boytsov, Ardak Shalkarbayuli, and Pavel Braslavski. 2024. KazQAD : Kazakh open-domain question answering dataset. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9645--9656