Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy

David Lillis; Luis Miralles-Pechu\'an; Marisa Llorens Salvador; Shushanta Pudasaini

arxiv: 2603.23146 · v2 · submitted 2026-03-24 · 💻 cs.CL · cs.AI

Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy

Shushanta Pudasaini , Luis Miralles-Pechu\'an , David Lillis , Marisa Llorens Salvador This is my paper

Pith reviewed 2026-05-15 00:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords AI-generated texttext detectiongeneralizationexplainable AISHAPlinguistic featuresdistribution shiftbenchmark performance

0 comments

The pith

AI text detectors achieve high benchmark scores but fail to generalize across domains and generators

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether AI-generated text detectors truly identify machine authorship or simply exploit quirks in their training data. A model using 30 linguistic features reaches an F1 score of 0.9734 on two benchmark datasets. Systematic testing reveals sharp performance drops when the data comes from different domains or text generators. Explanations based on SHAP values show that the features with the biggest impact vary widely between datasets. This indicates that detectors latch onto dataset-specific stylistic patterns instead of consistent signs of AI origin, which limits their usefulness in varied real-world scenarios.

Core claim

Contemporary AI text detectors do not identify stable signals of machine authorship. Instead, they rely on dataset-specific stylistic cues. This is evidenced by high in-domain performance on the PAN CLEF 2025 and COLING 2025 corpora that collapses under cross-domain and cross-generator evaluation. SHAP explanations confirm that the most influential features change markedly across datasets, and error analysis shows that the features most useful in training are also the most vulnerable to shifts in domain, formatting, and text length.

What carries the argument

Linguistic features combined with SHAP-based explanations to reveal varying feature importance across different datasets and generators

If this is right

Classifiers with high in-domain accuracy will degrade significantly under distribution shift to new domains or generators.
The most influential features identified by SHAP differ between datasets, pointing to reliance on stylistic artifacts.
The features most discriminative in training data are the ones most susceptible to domain shift and other variations.
Robust detectors require signals that remain stable across settings rather than in-domain discriminators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Detectors could be improved by selecting or engineering features that are invariant to domain and length variations.
Current benchmark evaluations may overestimate real-world reliability of AI text detectors.
Instance-level explanations could be used to debug and refine detectors for better generalization.
Similar generalization failures may occur in other NLP tasks that rely on stylistic or surface-level features.

Load-bearing premise

That the 30 linguistic features and the two benchmark corpora capture the key distribution shifts and that SHAP values correctly identify reliance on artifacts rather than true authorship signals.

What would settle it

A detector that maintains high accuracy on texts from new, unseen domains and generators while showing consistent feature importance rankings across all test sets would contradict the generalization failure claim.

Figures

Figures reproduced from arXiv: 2603.23146 by David Lillis, Luis Miralles-Pechu\'an, Marisa Llorens Salvador, Shushanta Pudasaini.

**Figure 1.** Figure 1: t-SNE projections of linguistic feature representations for the COLING dataset (left) and PAN CLEF dataset [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗

**Figure 2.** Figure 2: Overview of the experimental pipeline, illustrating dataset usage, linguistic feature extraction, model training [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Confusion matrices for the four classifiers evaluated on the PAN CLEF in-domain test set. Rows indicate [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Feature importance obtained while training an XGBoost model on different datasets. [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: SHAP summary plots showing feature importance distributions for different datasets. [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: SHAP waterfall plots illustrating feature contributions for human-written and AI-generated samples. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: SHAP bar plots showing average feature impact for human-written and AI-generated samples. [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

read the original abstract

The widespread adoption of Large Language Models (LLMs) has made the detection of AI-Generated text a pressing and complex challenge. Although many detection systems report high benchmark accuracy, their reliability in real-world settings remains uncertain, and their interpretability is often unexplored. In this work, we investigate whether contemporary detectors genuinely identify machine authorship or merely exploit dataset-specific artefacts. We propose an interpretable detection framework that integrates linguistic feature engineering, machine learning, and explainable AI techniques. When evaluated on two prominent benchmark corpora, namely PAN CLEF 2025 and COLING 2025, our model trained on 30 linguistic features achieves leaderboard-competitive performance, attaining an F1 score of 0.9734. However, systematic cross-domain and cross-generator evaluation reveals substantial generalisation failure: classifiers that excel in-domain degrade significantly under distribution shift. Using SHAP- based explanations, we show that the most influential features differ markedly between datasets, indicating that detectors often rely on dataset-specific stylistic cues rather than stable signals of machine authorship. Further investigation with in-depth error analysis exposes a fundamental tension in linguistic-feature-based AI text detection: the features that are most discriminative on in-domain data are also the features most susceptible to domain shift, formatting variation, and text-length effects. We believe that this knowledge helps build AI detectors that are robust across different settings. To support replication and practical use, we release an open-source Python package that returns both predictions and instance-level explanations for individual texts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A linguistic-feature detector hits high in-domain F1 but drops under shift, with SHAP showing unstable top features, yet the setup stays limited to shallow models.

read the letter

The main thing to know is that this paper trains a 30-feature linguistic ML model to 0.9734 F1 on PAN CLEF 2025, then documents clear degradation on COLING 2025 and across generators, with SHAP values showing the most discriminative features change sharply between datasets. That pattern supports the claim that the model leans on dataset-specific cues like length or style rather than stable authorship signals, and the released package makes the explanations easy to inspect on new text.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that AI-generated text detectors achieving high in-domain accuracy (F1=0.9734 using a 30-feature linguistic model on PAN CLEF 2025 and COLING 2025) fail to generalize, as shown by substantial degradation under cross-domain and cross-generator shifts; SHAP attributions indicate reliance on dataset-specific stylistic cues rather than stable authorship signals, with an accompanying open-source package for predictions and explanations.

Significance. If the core results hold, the work provides concrete evidence via cross-domain tests and post-hoc explanations that benchmark success does not imply robustness, highlighting a practical barrier to reliable deployment. The combination of linguistic feature analysis, SHAP, and error analysis offers actionable diagnostics for improving detectors, and the released package supports replication.

major comments (3)

[Abstract and Experiments] Abstract and Experiments section: The central claim addresses 'contemporary detectors' broadly, yet all reported results (in-domain F1, cross-domain degradation, and SHAP feature rankings) are obtained exclusively from a 30-feature hand-crafted linguistic model. No evaluation is provided on prevalent transformer-based detectors (e.g., fine-tuned RoBERTa or DeBERTa), leaving open whether the observed artifact reliance and generalization failure are properties of shallow feature sets or of machine-authorship detection in general.
[Cross-domain evaluation] Cross-domain evaluation (implied §4–5): The reported performance drops under distribution shift are load-bearing for the generalization-failure claim, but the manuscript does not detail the precise cross-generator protocols (which generators appear in each corpus), length-normalization steps, or statistical tests confirming that the degradation exceeds what would be expected from topic or formatting shifts alone.
[SHAP analysis] SHAP analysis (implied §5): While SHAP values are used to show differing top features across the two corpora, the paper does not report controls for feature correlations with text length or formatting artifacts; without these, it remains possible that the 'dataset-specific' attributions partly reflect sensitivity of the chosen linguistic features rather than a fundamental limitation of authorship signals.

minor comments (2)

[References] Ensure the two benchmark corpora are fully cited with exact versions and access dates in the references section.
[Results] The abstract states an F1 of 0.9734; report the corresponding precision, recall, and baseline comparisons in the main results table for completeness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for improving the clarity and robustness of our claims. We address each major comment point by point below and will revise the manuscript accordingly to incorporate the suggested clarifications and additional analyses.

read point-by-point responses

Referee: [Abstract and Experiments] Abstract and Experiments section: The central claim addresses 'contemporary detectors' broadly, yet all reported results (in-domain F1, cross-domain degradation, and SHAP feature rankings) are obtained exclusively from a 30-feature hand-crafted linguistic model. No evaluation is provided on prevalent transformer-based detectors (e.g., fine-tuned RoBERTa or DeBERTa), leaving open whether the observed artifact reliance and generalization failure are properties of shallow feature sets or of machine-authorship detection in general.

Authors: We agree that the scope of our experiments is limited to the interpretable 30-feature linguistic model. This design choice was intentional to enable direct SHAP-based explanations of feature reliance, which forms the core of our contribution on why detection fails. Transformer-based detectors are indeed more prevalent in contemporary practice, but our results provide evidence that even simple, transparent models succeed in-domain by exploiting unstable artifacts. In the revision, we will update the abstract, introduction, and add a dedicated limitations section to explicitly narrow the claims to linguistic feature-based detectors while discussing the potential extension to neural models and calling for future comparative studies. revision: yes
Referee: [Cross-domain evaluation] Cross-domain evaluation (implied §4–5): The reported performance drops under distribution shift are load-bearing for the generalization-failure claim, but the manuscript does not detail the precise cross-generator protocols (which generators appear in each corpus), length-normalization steps, or statistical tests confirming that the degradation exceeds what would be expected from topic or formatting shifts alone.

Authors: This is a valid request for greater methodological transparency. The revised manuscript will expand the experimental setup and evaluation sections to specify: the exact generators used in the PAN CLEF 2025 and COLING 2025 corpora, the length-normalization procedure (including whether length was treated as a feature or texts were truncated/padded), and the application of statistical tests such as paired t-tests or McNemar's test to establish that cross-domain degradation is significant beyond topic or formatting variations. revision: yes
Referee: [SHAP analysis] SHAP analysis (implied §5): While SHAP values are used to show differing top features across the two corpora, the paper does not report controls for feature correlations with text length or formatting artifacts; without these, it remains possible that the 'dataset-specific' attributions partly reflect sensitivity of the chosen linguistic features rather than a fundamental limitation of authorship signals.

Authors: We appreciate this concern regarding potential confounds. In the revision, we will augment the SHAP analysis section with explicit controls: reporting correlations between the highest-attribution features and text length, and recomputing SHAP explanations on length-matched data subsets. These additions will help demonstrate whether the dataset-specific feature rankings persist independently of length or formatting effects, thereby strengthening the evidence that the attributions reflect genuine reliance on unstable stylistic cues. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on independent cross-domain experiments and post-hoc attributions

full rationale

The paper trains standard ML classifiers on 30 hand-crafted linguistic features, reports actual in-domain F1 on held-out test splits, measures degradation on separate cross-domain and cross-generator corpora, and applies SHAP post-hoc to the trained models. None of these steps define a quantity in terms of itself, rename a fitted parameter as a prediction, or rely on self-citation chains for the central result. The derivation chain is therefore self-contained empirical evaluation rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard supervised machine learning assumptions and pre-existing linguistic feature sets; no new free parameters, axioms, or invented entities are introduced beyond conventional model training.

pith-pipeline@v0.9.0 · 5583 in / 1039 out tokens · 39706 ms · 2026-05-15T00:45:57.353151+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Using SHAP-based explanations, we show that the most influential features differ markedly between datasets

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1]

Understanding the false positive rate for sentences of our ai writing detection capability, June

Annie Chechitelli. Understanding the false positive rate for sentences of our ai writing detection capability, June

work page
[2]

Accessed: 2026-04-21

work page 2026
[3]

Student Generative AI Survey 2025

Josh Freeman. Student Generative AI Survey 2025. Policy Note 61, Higher Education Policy Institute (HEPI), February 2025. Accessed: 2026-01-08

work page 2025
[4]

Towards possibilities & impossibilities of ai-generated text detection: A survey.arXiv preprint arXiv:2310.15264,

Soumya Suvra Ghosal, Souradip Chakraborty, Jonas Geiping, Furong Huang, Dinesh Manocha, and Am- rit Singh Bedi. Towards possibilities & impossibilities of ai-generated text detection: A survey.arXiv preprint arXiv:2310.15264, 2023

work page arXiv 2023
[5]

SemEval-2024 task 8: Multidomain, multimodel and multilingual machine-generated text detection

Wang Yuxia et al. SemEval-2024 task 8: Multidomain, multimodel and multilingual machine-generated text detection. In Atul Kr. Ojha, A. Seza Do ˘gruöz, Harish Tayyar Madabushi, Giovanni Da San Martino, Sara Rosenthal, and Aiala Rosá, editors,Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pages 2057–2079, Mexico City, ...

work page 2024
[6]

Overview of pan 2025: V oight-kampff 19 generative ai detection, multilingual text detoxification, multi-author writing style analysis, and generative plagiarism detection

Janek Bevendorff, Daryna Dementieva, Maik Fröbe, Bela Gipp, André Greiner-Petter, Jussi Karlgren, Maximilian Mayerl, Preslav Nakov, Alexander Panchenko, Martin Potthast, et al. Overview of pan 2025: V oight-kampff 19 generative ai detection, multilingual text detoxification, multi-author writing style analysis, and generative plagiarism detection. InInter...

work page 2025
[7]

International Conference on Computational Linguistics

Firoj Alam, Preslav Nakov, Nizar Habash, Iryna Gurevych, Shammur Chowdhury, Artem Shelmanov, Yuxia Wang, Ekaterina Artemova, Mucahid Kutlu, and George Mikros, editors.Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect), Abu Dhabi, UAE, January 2025. International Conference on Computational Linguistics

work page 2025
[8]

Can AI-Generated Text be Reliably Detected?

Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can ai-generated text be reliably detected?arXiv preprint arXiv:2303.11156, 2023

work page Pith review arXiv 2023
[9]

Watermarks in the sand: impossibility of strong watermarking for language models

Hanlin Zhang, Benjamin L Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak. Watermarks in the sand: impossibility of strong watermarking for language models. InForty-first International Conference on Machine Learning, 2024

work page 2024
[10]

ai vs humans

Jiazhou Ji, Ruizhe Li, Shujun Li, Jie Guo, Weidong Qiu, Zheng Huang, Chiyu Chen, Xiaoyu Jiang, and Xinru Lu. Detecting machine-generated texts: Not just "ai vs humans" and explainability is complicated, 2025

work page 2025
[11]

Styloai: Distinguishing ai-generated content with stylometric analysis

Chidimma Opara. Styloai: Distinguishing ai-generated content with stylometric analysis. InInternational conference on artificial intelligence in education, pages 105–114. Springer, 2024

work page 2024
[12]

Linguistic characteristics of ai-generated text: A survey.arXiv preprint arXiv:2510.05136, 2025

Luka Terˇcon and Kaja Dobrovoljc. Linguistic characteristics of ai-generated text: A survey.arXiv preprint arXiv:2510.05136, 2025

work page arXiv 2025
[13]

The imitation game revisited: A comprehensive survey on recent advances in ai-generated text detection.Expert Systems with Applications, 272:126694, 2025

Zhiwei Yang, Zhengjie Feng, Rongxin Huo, Huiru Lin, Hanghan Zheng, Ruichi Nie, and Hongrui Chen. The imitation game revisited: A comprehensive survey on recent advances in ai-generated text detection.Expert Systems with Applications, 272:126694, 2025

work page 2025
[14]

Survey on ai-generated plagiarism detection: The impact of large language models on academic integrity.Journal of Academic Ethics, pages 1–34, 2024

Shushanta Pudasaini, Luis Miralles-Pechuán, David Lillis, and Marisa Llorens Salvador. Survey on ai-generated plagiarism detection: The impact of large language models on academic integrity.Journal of Academic Ethics, pages 1–34, 2024

work page 2024
[15]

Radar: Robust ai-text detection via adversarial learning.Advances in neural information processing systems, 36:15077–15095, 2023

Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho. Radar: Robust ai-text detection via adversarial learning.Advances in neural information processing systems, 36:15077–15095, 2023

work page 2023
[16]

RAID: A shared benchmark for robust evaluation of machine-generated text detectors

Liam Dugan, Alyssa Hwang, Filip Trhlík, Andrew Zhu, Josh Magnus Ludan, Hainiu Xu, Daphne Ippolito, and Chris Callison-Burch. RAID: A shared benchmark for robust evaluation of machine-generated text detectors. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12463–12492, Bangkok, Thail...

work page 2024
[18]

Benchmarking AI text detection: Assessing detectors against new datasets, evasion tactics, and enhanced LLMs

Shushanta Pudasaini, Luis Miralles, David Lillis, and Marisa Llorens Salvador. Benchmarking AI text detection: Assessing detectors against new datasets, evasion tactics, and enhanced LLMs. In Firoj Alam, Preslav Nakov, Nizar Habash, Iryna Gurevych, Shammur Chowdhury, Artem Shelmanov, Yuxia Wang, Ekaterina Artemova, Mucahid Kutlu, and George Mikros, editor...

work page 2025
[19]

GLTR: Statistical detection and visualization of generated text

Sebastian Gehrmann, Hendrik Strobelt, and Alexander Rush. GLTR: Statistical detection and visualization of generated text. In Marta R. Costa-jussà and Enrique Alfonseca, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 111–116, Florence, Italy, July

work page
[20]

Association for Computational Linguistics

work page
[21]

GenAI content detection task 1: English and multilingual machine-generated text detection: AI vs

Yuxia Wang, Artem Shelmanov, Jonibek Mansurov, Akim Tsvigun, Vladislav Mikhailov, Rui Xing, Zhuohan Xie, Jiahui Geng, Giovanni Puccetti, Ekaterina Artemova, Jinyan Su, Minh Ngoc Ta, Mervat Abassy, Kareem Ashraf Elozeiri, Saad El Dine Ahmed El Etter, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Nurkhan Laiyk, Osama Mohammed Afzal, Ryuto Koike, Masahi...

work page 2025
[22]

Kamrujjaman Mobin and Md Saiful Islam

MD. Kamrujjaman Mobin and Md Saiful Islam. LuxVeri at GenAI detection task 1: Inverse perplexity weighted ensemble for robust detection of AI-generated text across English and multilingual contexts. In Firoj Alam, Preslav Nakov, Nizar Habash, Iryna Gurevych, Shammur Chowdhury, Artem Shelmanov, Yuxia Wang, Ekaterina 20 Artemova, Mucahid Kutlu, and George M...

work page 2025
[23]

Enhancing AI Text Detection with Frozen Pretrained Encoders and Ensemble Learning

Shushanta Pudasaini, Luis Miralles-Pechúán, David Lillis, and Marisa Llorens Salvador. Enhancing AI Text Detection with Frozen Pretrained Encoders and Ensemble Learning. InWorking Notes of CLEF 2025 – Conference and Labs of the Evaluation Forum, volume 4038 ofCEUR Workshop Proceedings, Madrid, Spain, 2025. CEUR- WS.org

work page 2025
[24]

A theoretically grounded hybrid ensemble for reliable detection of llm-generated text.arXiv preprint arXiv:2511.22153, 2025

Sepyan Purnama Kristanto and Lutfi Hakim. A theoretically grounded hybrid ensemble for reliable detection of llm-generated text.arXiv preprint arXiv:2511.22153, 2025

work page arXiv 2025
[25]

Classifying human vs

Adven Masih, Bushra Afzal, Shamyla Firdoos, Jabar Mahmood, Aitizaz Ali, Mohamed Shabbir Abdulnabi, and Daniel Musafiri Balungu. Classifying human vs. ai text with machine learning and explainable transformer models. Scientific Reports, 15(1):43310, 2025

work page 2025
[26]

Najjar, Huthaifa Ashqar, Omar Darwish, and Eman A

Ayat A. Najjar, Huthaifa Ashqar, Omar Darwish, and Eman A. Hammad. Leveraging explainable ai for llm text attribution: Differentiating human-written and multiple llm-generated text.Information (Switzerland), 16(9):767, sep 2025

work page 2025
[27]

O’Reilly Media, 2009

Steven Bird, Ewan Klein, and Edward Loper.Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, 2009

work page 2009
[28]

spacy: Industrial-strength natural language processing in python, 2020

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spacy: Industrial-strength natural language processing in python, 2020

work page 2020
[29]

A rule-based style and grammar checker

Daniel Naber. A rule-based style and grammar checker. Diploma thesis, Bielefeld University, 2003

work page 2003
[30]

Textblob: Simplified text processing

Steven Loria. Textblob: Simplified text processing. https://textblob.readthedocs.io/, 2018. Accessed: 2026-02-12

work page 2018
[31]

Scikit-learn: Machine learning in python.Journal of Machine Learning Research, 12:2825–2830, 2011

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in python.Journal of Machine Learning Research...

work page 2011
[32]

Ghostbuster: Detecting text ghostwritten by large language models

Vivek Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein. Ghostbuster: Detecting text ghostwritten by large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 17...

work page 2024
[33]

Exploring the limitations of detecting machine-generated text

Jad Doughman, Osama Mohammed Afzal, Hawau Olamide Toyin, Shady Shehata, Preslav Nakov, and Zeerak Talat. Exploring the limitations of detecting machine-generated text. InProceedings of the 31st International Conference on Computational Linguistics, pages 4274–4281, 2025

work page 2025
[34]

Gene selection for cancer classification using support vector machines.Machine Learning, 46(1–3):389–422, 2002

Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines.Machine Learning, 46(1–3):389–422, 2002

work page 2002
[35]

Analytics of machine learning-based algorithms for text classification.Sustainable operations and computers, 3:238–248, 2022

Sayar Ul Hassan, Jameel Ahamed, and Khaleel Ahmad. Analytics of machine learning-based algorithms for text classification.Sustainable operations and computers, 3:238–248, 2022

work page 2022
[36]

Support-vector networks.Machine learning, 20(3):273–297, 1995

Corinna Cortes and Vladimir Vapnik. Support-vector networks.Machine learning, 20(3):273–297, 1995

work page 1995
[37]

Haowei Hua and Co-Jiayu Yao. Investigating generative ai models and detection techniques: impacts of tok- enization and dataset size on identification of ai-generated text.Frontiers in Artificial Intelligence, 7:1469197, 2024

work page 2024
[38]

An explainable framework for assisting the detection of ai-generated textual content.Decision Support Systems, page 114498, 2025

Sen Yan, Zhiyi Wang, and David Dobolyi. An explainable framework for assisting the detection of ai-generated textual content.Decision Support Systems, page 114498, 2025

work page 2025
[39]

mdok of kinit: Robustly fine-tuned llm for binary and multiclass ai-generated text detection, 2025

Dominik Macko. mdok of kinit: Robustly fine-tuned llm for binary and multiclass ai-generated text detection, 2025

work page 2025
[40]

Shapiro and M.B

S.S. Shapiro and M.B. Wilk. An analysis of variance test for normality.Biometrika, 52(3):591–611, 1965

work page 1965
[41]

One-class learning for ai-generated essay detection.Applied Sciences, 13(13):7901, 2023

Roberto Corizzo and Sebastian Leal-Arenas. One-class learning for ai-generated essay detection.Applied Sciences, 13(13):7901, 2023. 21

work page 2023

[1] [1]

Understanding the false positive rate for sentences of our ai writing detection capability, June

Annie Chechitelli. Understanding the false positive rate for sentences of our ai writing detection capability, June

work page

[2] [2]

Accessed: 2026-04-21

work page 2026

[3] [3]

Student Generative AI Survey 2025

Josh Freeman. Student Generative AI Survey 2025. Policy Note 61, Higher Education Policy Institute (HEPI), February 2025. Accessed: 2026-01-08

work page 2025

[4] [4]

Towards possibilities & impossibilities of ai-generated text detection: A survey.arXiv preprint arXiv:2310.15264,

Soumya Suvra Ghosal, Souradip Chakraborty, Jonas Geiping, Furong Huang, Dinesh Manocha, and Am- rit Singh Bedi. Towards possibilities & impossibilities of ai-generated text detection: A survey.arXiv preprint arXiv:2310.15264, 2023

work page arXiv 2023

[5] [5]

SemEval-2024 task 8: Multidomain, multimodel and multilingual machine-generated text detection

Wang Yuxia et al. SemEval-2024 task 8: Multidomain, multimodel and multilingual machine-generated text detection. In Atul Kr. Ojha, A. Seza Do ˘gruöz, Harish Tayyar Madabushi, Giovanni Da San Martino, Sara Rosenthal, and Aiala Rosá, editors,Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pages 2057–2079, Mexico City, ...

work page 2024

[6] [6]

Overview of pan 2025: V oight-kampff 19 generative ai detection, multilingual text detoxification, multi-author writing style analysis, and generative plagiarism detection

Janek Bevendorff, Daryna Dementieva, Maik Fröbe, Bela Gipp, André Greiner-Petter, Jussi Karlgren, Maximilian Mayerl, Preslav Nakov, Alexander Panchenko, Martin Potthast, et al. Overview of pan 2025: V oight-kampff 19 generative ai detection, multilingual text detoxification, multi-author writing style analysis, and generative plagiarism detection. InInter...

work page 2025

[7] [7]

International Conference on Computational Linguistics

Firoj Alam, Preslav Nakov, Nizar Habash, Iryna Gurevych, Shammur Chowdhury, Artem Shelmanov, Yuxia Wang, Ekaterina Artemova, Mucahid Kutlu, and George Mikros, editors.Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect), Abu Dhabi, UAE, January 2025. International Conference on Computational Linguistics

work page 2025

[8] [8]

Can AI-Generated Text be Reliably Detected?

Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can ai-generated text be reliably detected?arXiv preprint arXiv:2303.11156, 2023

work page Pith review arXiv 2023

[9] [9]

Watermarks in the sand: impossibility of strong watermarking for language models

Hanlin Zhang, Benjamin L Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak. Watermarks in the sand: impossibility of strong watermarking for language models. InForty-first International Conference on Machine Learning, 2024

work page 2024

[10] [10]

ai vs humans

Jiazhou Ji, Ruizhe Li, Shujun Li, Jie Guo, Weidong Qiu, Zheng Huang, Chiyu Chen, Xiaoyu Jiang, and Xinru Lu. Detecting machine-generated texts: Not just "ai vs humans" and explainability is complicated, 2025

work page 2025

[11] [11]

Styloai: Distinguishing ai-generated content with stylometric analysis

Chidimma Opara. Styloai: Distinguishing ai-generated content with stylometric analysis. InInternational conference on artificial intelligence in education, pages 105–114. Springer, 2024

work page 2024

[12] [12]

Linguistic characteristics of ai-generated text: A survey.arXiv preprint arXiv:2510.05136, 2025

Luka Terˇcon and Kaja Dobrovoljc. Linguistic characteristics of ai-generated text: A survey.arXiv preprint arXiv:2510.05136, 2025

work page arXiv 2025

[13] [13]

The imitation game revisited: A comprehensive survey on recent advances in ai-generated text detection.Expert Systems with Applications, 272:126694, 2025

Zhiwei Yang, Zhengjie Feng, Rongxin Huo, Huiru Lin, Hanghan Zheng, Ruichi Nie, and Hongrui Chen. The imitation game revisited: A comprehensive survey on recent advances in ai-generated text detection.Expert Systems with Applications, 272:126694, 2025

work page 2025

[14] [14]

Survey on ai-generated plagiarism detection: The impact of large language models on academic integrity.Journal of Academic Ethics, pages 1–34, 2024

Shushanta Pudasaini, Luis Miralles-Pechuán, David Lillis, and Marisa Llorens Salvador. Survey on ai-generated plagiarism detection: The impact of large language models on academic integrity.Journal of Academic Ethics, pages 1–34, 2024

work page 2024

[15] [15]

Radar: Robust ai-text detection via adversarial learning.Advances in neural information processing systems, 36:15077–15095, 2023

Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho. Radar: Robust ai-text detection via adversarial learning.Advances in neural information processing systems, 36:15077–15095, 2023

work page 2023

[16] [16]

RAID: A shared benchmark for robust evaluation of machine-generated text detectors

Liam Dugan, Alyssa Hwang, Filip Trhlík, Andrew Zhu, Josh Magnus Ludan, Hainiu Xu, Daphne Ippolito, and Chris Callison-Burch. RAID: A shared benchmark for robust evaluation of machine-generated text detectors. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12463–12492, Bangkok, Thail...

work page 2024

[17] [18]

Benchmarking AI text detection: Assessing detectors against new datasets, evasion tactics, and enhanced LLMs

Shushanta Pudasaini, Luis Miralles, David Lillis, and Marisa Llorens Salvador. Benchmarking AI text detection: Assessing detectors against new datasets, evasion tactics, and enhanced LLMs. In Firoj Alam, Preslav Nakov, Nizar Habash, Iryna Gurevych, Shammur Chowdhury, Artem Shelmanov, Yuxia Wang, Ekaterina Artemova, Mucahid Kutlu, and George Mikros, editor...

work page 2025

[18] [19]

GLTR: Statistical detection and visualization of generated text

Sebastian Gehrmann, Hendrik Strobelt, and Alexander Rush. GLTR: Statistical detection and visualization of generated text. In Marta R. Costa-jussà and Enrique Alfonseca, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 111–116, Florence, Italy, July

work page

[19] [20]

Association for Computational Linguistics

work page

[20] [21]

GenAI content detection task 1: English and multilingual machine-generated text detection: AI vs

Yuxia Wang, Artem Shelmanov, Jonibek Mansurov, Akim Tsvigun, Vladislav Mikhailov, Rui Xing, Zhuohan Xie, Jiahui Geng, Giovanni Puccetti, Ekaterina Artemova, Jinyan Su, Minh Ngoc Ta, Mervat Abassy, Kareem Ashraf Elozeiri, Saad El Dine Ahmed El Etter, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Nurkhan Laiyk, Osama Mohammed Afzal, Ryuto Koike, Masahi...

work page 2025

[21] [22]

Kamrujjaman Mobin and Md Saiful Islam

MD. Kamrujjaman Mobin and Md Saiful Islam. LuxVeri at GenAI detection task 1: Inverse perplexity weighted ensemble for robust detection of AI-generated text across English and multilingual contexts. In Firoj Alam, Preslav Nakov, Nizar Habash, Iryna Gurevych, Shammur Chowdhury, Artem Shelmanov, Yuxia Wang, Ekaterina 20 Artemova, Mucahid Kutlu, and George M...

work page 2025

[22] [23]

Enhancing AI Text Detection with Frozen Pretrained Encoders and Ensemble Learning

Shushanta Pudasaini, Luis Miralles-Pechúán, David Lillis, and Marisa Llorens Salvador. Enhancing AI Text Detection with Frozen Pretrained Encoders and Ensemble Learning. InWorking Notes of CLEF 2025 – Conference and Labs of the Evaluation Forum, volume 4038 ofCEUR Workshop Proceedings, Madrid, Spain, 2025. CEUR- WS.org

work page 2025

[23] [24]

A theoretically grounded hybrid ensemble for reliable detection of llm-generated text.arXiv preprint arXiv:2511.22153, 2025

Sepyan Purnama Kristanto and Lutfi Hakim. A theoretically grounded hybrid ensemble for reliable detection of llm-generated text.arXiv preprint arXiv:2511.22153, 2025

work page arXiv 2025

[24] [25]

Classifying human vs

Adven Masih, Bushra Afzal, Shamyla Firdoos, Jabar Mahmood, Aitizaz Ali, Mohamed Shabbir Abdulnabi, and Daniel Musafiri Balungu. Classifying human vs. ai text with machine learning and explainable transformer models. Scientific Reports, 15(1):43310, 2025

work page 2025

[25] [26]

Najjar, Huthaifa Ashqar, Omar Darwish, and Eman A

Ayat A. Najjar, Huthaifa Ashqar, Omar Darwish, and Eman A. Hammad. Leveraging explainable ai for llm text attribution: Differentiating human-written and multiple llm-generated text.Information (Switzerland), 16(9):767, sep 2025

work page 2025

[26] [27]

O’Reilly Media, 2009

Steven Bird, Ewan Klein, and Edward Loper.Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, 2009

work page 2009

[27] [28]

spacy: Industrial-strength natural language processing in python, 2020

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spacy: Industrial-strength natural language processing in python, 2020

work page 2020

[28] [29]

A rule-based style and grammar checker

Daniel Naber. A rule-based style and grammar checker. Diploma thesis, Bielefeld University, 2003

work page 2003

[29] [30]

Textblob: Simplified text processing

Steven Loria. Textblob: Simplified text processing. https://textblob.readthedocs.io/, 2018. Accessed: 2026-02-12

work page 2018

[30] [31]

Scikit-learn: Machine learning in python.Journal of Machine Learning Research, 12:2825–2830, 2011

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in python.Journal of Machine Learning Research...

work page 2011

[31] [32]

Ghostbuster: Detecting text ghostwritten by large language models

Vivek Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein. Ghostbuster: Detecting text ghostwritten by large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 17...

work page 2024

[32] [33]

Exploring the limitations of detecting machine-generated text

Jad Doughman, Osama Mohammed Afzal, Hawau Olamide Toyin, Shady Shehata, Preslav Nakov, and Zeerak Talat. Exploring the limitations of detecting machine-generated text. InProceedings of the 31st International Conference on Computational Linguistics, pages 4274–4281, 2025

work page 2025

[33] [34]

Gene selection for cancer classification using support vector machines.Machine Learning, 46(1–3):389–422, 2002

Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines.Machine Learning, 46(1–3):389–422, 2002

work page 2002

[34] [35]

Analytics of machine learning-based algorithms for text classification.Sustainable operations and computers, 3:238–248, 2022

Sayar Ul Hassan, Jameel Ahamed, and Khaleel Ahmad. Analytics of machine learning-based algorithms for text classification.Sustainable operations and computers, 3:238–248, 2022

work page 2022

[35] [36]

Support-vector networks.Machine learning, 20(3):273–297, 1995

Corinna Cortes and Vladimir Vapnik. Support-vector networks.Machine learning, 20(3):273–297, 1995

work page 1995

[36] [37]

Haowei Hua and Co-Jiayu Yao. Investigating generative ai models and detection techniques: impacts of tok- enization and dataset size on identification of ai-generated text.Frontiers in Artificial Intelligence, 7:1469197, 2024

work page 2024

[37] [38]

An explainable framework for assisting the detection of ai-generated textual content.Decision Support Systems, page 114498, 2025

Sen Yan, Zhiyi Wang, and David Dobolyi. An explainable framework for assisting the detection of ai-generated textual content.Decision Support Systems, page 114498, 2025

work page 2025

[38] [39]

mdok of kinit: Robustly fine-tuned llm for binary and multiclass ai-generated text detection, 2025

Dominik Macko. mdok of kinit: Robustly fine-tuned llm for binary and multiclass ai-generated text detection, 2025

work page 2025

[39] [40]

Shapiro and M.B

S.S. Shapiro and M.B. Wilk. An analysis of variance test for normality.Biometrika, 52(3):591–611, 1965

work page 1965

[40] [41]

One-class learning for ai-generated essay detection.Applied Sciences, 13(13):7901, 2023

Roberto Corizzo and Sebastian Leal-Arenas. One-class learning for ai-generated essay detection.Applied Sciences, 13(13):7901, 2023. 21

work page 2023