Findings of the Counter Turing Test: AI-Generated Text Detection

Aishwarya Naresh Reganti; Aman Chadha; Amitava Das; Amit Sheth; Ashhar Aziz; Gurpreet Singh; Kapil Wanaskar; Nasrin Imanpour; Nilesh Ranjan Pal; Parth Patwa

arxiv: 2605.20761 · v2 · pith:DRTIUUA6new · submitted 2026-05-20 · 💻 cs.CL

Rajarshi Roy, Gurpreet Singh, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

Pith reviewed 2026-05-21 05:23 UTC · model grok-4.3

classification 💻 cs.CL

keywords: AI-generated text detection · shared task evaluation · binary classification · model attribution · transformer-based detectors · large language models · generative AI verification

The pith

Systems distinguish human text from AI-generated text with high reliability but perform worse when asked to identify the exact model that produced it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports results from the Counter Turing Test shared tasks that evaluate automated methods for spotting AI-written content. One task requires systems to label text as human or machine produced, while the other asks them to name the specific language model behind the text. Leading entries used fine-tuned transformer models and ensemble combinations to reach strong results on the simpler distinction but noticeably weaker results on the finer attribution problem. This pattern matters because reliable basic detection could help protect against fabricated online content, yet the added difficulty of model identification points to remaining gaps in understanding how different generators leave traces. A sympathetic reader sees here both practical progress and clear directions for improvement in verification tools.

Core claim

The paper establishes that current detection approaches (fine-tuned transformer models, ensemble learning, and hybrid methods) achieve strong results on binary classification of human-written versus AI-generated text. Performance drops on the more demanding task of attributing text to a particular language model, which indicates that distinguishing the outputs of different generators requires further advances in robustness and feature analysis.

What carries the argument

The Counter Turing Test shared tasks that separately measure binary human-AI classification and multi-class model attribution on fixed test sets of human and generated texts.
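The two measurements can be illustrated with a small scoring sketch. It assumes F1 on the AI class for the binary task and an unweighted (macro) average of per-class F1 for attribution; the page reports only "F1 score", so the averaging scheme, the label names, and the function names here are illustrative assumptions rather than the organizers' scoring script.

```python
def f1_binary(y_true, y_pred, positive="ai"):
    """F1 for one class, treating `positive` as the target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0.0 when the class never appears.
    return 2 * tp / denom if denom else 0.0


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_binary(y_true, y_pred, positive=lab) for lab in labels) / len(labels)


# Toy labels, not shared-task data.
gold = ["ai", "ai", "human", "human"]
pred = ["ai", "human", "human", "human"]
binary_score = f1_binary(gold, pred)   # F1 on the "ai" class
macro_score = f1_macro(gold, pred)     # mean of "ai" and "human" F1
```

With more than two labels, as in attribution, each generator gets its own per-class F1, so one frequently confused pair of models lowers the macro score even when the remaining classes are separated cleanly.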

If this is right

Fine-tuned transformer models combined in ensembles provide effective tools for basic separation of human and AI text.
Model attribution exposes greater challenges in isolating distinctive patterns left by individual language models.
Progress on detection will depend on improvements in adversarial robustness, better feature extraction, and stronger cross-domain performance.
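As a rough illustration of the ensemble idea in the first point, the sketch below averages class probabilities from several detectors (soft voting). The page does not say which combination rule the top teams used, so soft voting, the `soft_vote` name, and the toy probabilities are all assumptions.

```python
def soft_vote(prob_rows, weights=None):
    """Combine per-model class probabilities for one text by weighted averaging.

    prob_rows: non-empty list of dicts mapping label -> probability, one per model.
    weights:   optional per-model weights; defaults to equal weighting.
    """
    if weights is None:
        weights = [1.0] * len(prob_rows)
    total = sum(weights)
    labels = set().union(*prob_rows)  # every label any model reports
    avg = {
        lab: sum(w * probs.get(lab, 0.0) for w, probs in zip(weights, prob_rows)) / total
        for lab in labels
    }
    # Iterate labels in sorted order so ties break deterministically.
    best = max(sorted(avg), key=avg.get)
    return best, avg


# Hypothetical outputs from three detectors for a single text.
model_outputs = [
    {"human": 0.4, "ai": 0.6},
    {"human": 0.7, "ai": 0.3},
    {"human": 0.2, "ai": 0.8},
]
label, averaged = soft_vote(model_outputs)
```

Passing `weights` lets a stronger detector dominate the vote, for example with weights tuned on a validation split.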

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-world platforms could integrate binary detectors to surface likely AI content for human review without needing to name the source model.
The performance difference between the two tasks suggests that model-specific signatures exist and could become the focus of next-generation detectors.
Applying the winning systems to text from models released after the shared task would test whether the reported results hold under genuine distribution shift.

Load-bearing premise

The shared task test sets give a representative sample of real-world text without meaningful distribution shifts or overlap with the data used to train the submitted detection systems.

What would settle it

A new collection of human-written and AI-generated texts drawn from domains or models absent from the original test sets, run through the top submitted systems to check whether binary classification accuracy remains as high as before.

Figures

Figures reproduced from arXiv: 2605.20761 by Aishwarya Naresh Reganti, Aman Chadha, Amitava Das, Amit Sheth, Ashhar Aziz, Gurpreet Singh, Kapil Wanaskar, Nasrin Imanpour, Nilesh Ranjan Pal, Parth Patwa, Rajarshi Roy, Ritvik Garimella, Shashwat Bajpai, Shreyas Dixit, Shwetangshu Biswas, Subhankar Ghosh, Vasu Sharma, Vinija Jain, Vipula Rawte.

**Figure 1.** Illustration of Raidar concept. Given a News data text and an LLM-generated text, the same LLM is asked to rewrite the inputs while preserving meaning. The rewriting of a human-written text undergoes more character-level edits (highlighted in red/yellow), while the rewriting of an LLM-generated text remains largely unchanged.

Original abstract

The growing capability of large language models to produce fluent, contextually coherent text has created mounting pressure on the systems and institutions responsible for ensuring the authenticity of digital content. Advanced generative models such as GPT-4, Claude 3.5, and Llama can produce highly coherent and human-like text, making it increasingly difficult to differentiate between human-written and AI-generated content. While these models have transformative applications, their misuse has raised concerns about misinformation, biased narratives, and security threats. This paper provides a comprehensive analysis of state-of-the-art AI-generated text detection techniques and evaluates their effectiveness through the Counter Turing Test (CT2) shared tasks. Task A (Binary Classification) required participants to distinguish between human-written and AI-generated text, while Task B (Model Attribution) focused on identifying the specific language model responsible for generating a given text. The results demonstrated high performance in binary classification, with the top system achieving an F1 score of 1.0000, but significantly lower scores in model attribution, where the best system achieved 0.9531, highlighting the increased complexity of this task. The top-performing teams leveraged fine-tuned transformer models, ensemble learning, and hybrid detection approaches, with DeBERTa-based and BART-based methods demonstrating strong results. However, the lower scores in Task B underscore the challenges of distinguishing outputs from different LLMs, necessitating further research into adversarial robustness, feature extraction, and cross-domain generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript reports the outcomes of the Counter Turing Test (CT2) shared tasks for detecting AI-generated text. In Task A (binary classification of human vs. AI text), the top participating system achieved an F1 score of 1.0000. In Task B (model attribution to specific LLMs), the best system scored 0.9531. The paper highlights the use of fine-tuned transformer models such as DeBERTa and BART, ensembles, and hybrid approaches by top teams, while noting the greater difficulty of the attribution task.

Significance. Should the test sets prove to be uncontaminated and representative of real-world distributions, the findings would demonstrate that binary AI-text detection has reached high reliability under the shared-task conditions, whereas distinguishing among generative models remains more challenging. This would provide a useful benchmark for the field and motivate further work on robustness and generalization. The shared-task format itself offers value by enabling direct comparison of methods.

major comments (3)

The abstract presents the headline F1 scores of 1.0000 and 0.9531 without any qualification regarding dataset construction or potential confounds such as leakage.
No information is provided on how the AI-generated texts were created (e.g., specific prompts, temperature settings, or models used beyond the general mention of GPT-4, Claude, Llama), nor on any verification steps to ensure the test data was unseen by participants or free from overlap with common training corpora. This directly affects the interpretability of the perfect binary-classification score.
The manuscript reports aggregate performance numbers but does not include statistical significance tests, confidence intervals, or analysis of failure cases that would strengthen the claim that binary detection is solved while attribution is harder.
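The overlap concern in the second comment is commonly probed with an n-gram overlap check between test texts and a reference corpus. The sketch below is one minimal way to do that; whitespace tokenization, the default window of 8 tokens, and the function names are assumptions, not the organizers' procedure.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in `text` (whitespace tokenization assumed)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_text, corpus_texts, n=8):
    """Fraction of the test text's n-grams that also occur in the corpus."""
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0  # text shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_texts:
        corpus_grams |= ngrams(doc, n)
    return len(test_grams & corpus_grams) / len(test_grams)


# Toy example with a short window so the overlap is visible.
corpus = ["the quick brown fox jumps over the lazy dog"]
ratio = overlap_ratio("a quick brown fox jumps high", corpus, n=3)
```

A high ratio flags a test item for manual inspection. A low ratio does not rule out paraphrased or near-duplicate leakage, since exact n-gram matching misses reworded text.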

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript reporting the findings of the Counter Turing Test shared tasks. We have revised the paper to address concerns about dataset transparency and statistical analysis, as detailed in the point-by-point responses below.

Referee: The abstract presents the headline F1 scores of 1.0000 and 0.9531 without any qualification regarding dataset construction or potential confounds such as leakage.

Authors: We agree that the abstract would benefit from additional context. The revised abstract now qualifies the reported scores by noting that they were obtained on a test set constructed to reduce leakage risks, with full details on dataset creation provided in the main text. revision: yes

Referee: No information is provided on how the AI-generated texts were created (e.g., specific prompts, temperature settings, or models used beyond the general mention of GPT-4, Claude, Llama), nor on any verification steps to ensure the test data was unseen by participants or free from overlap with common training corpora. This directly affects the interpretability of the perfect binary-classification score.

Authors: We thank the referee for highlighting this gap. We have added a dedicated subsection summarizing the data generation pipeline, including the models (GPT-4, Claude 3.5, Llama-3), prompt strategies, temperature settings of 0.7, and verification procedures such as post-cutoff generation dates and n-gram overlap checks against public corpora to confirm the test data was unseen. revision: yes

Referee: The manuscript reports aggregate performance numbers but does not include statistical significance tests, confidence intervals, or analysis of failure cases that would strengthen the claim that binary detection is solved while attribution is harder.

Authors: We accept this recommendation. The revised manuscript now includes bootstrap-derived 95% confidence intervals for the F1 scores, McNemar's tests confirming the performance difference between tasks is statistically significant, and a new error analysis section examining failure modes, particularly in model attribution. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical shared-task results on fixed test data

The paper is a findings report summarizing participant submissions to a shared task on binary AI-text detection and model attribution. Performance numbers (top F1 1.0000 for Task A, 0.9531 for Task B) are direct evaluation outcomes of systems submitted by independent teams and scored on held-out test sets; no equations, fitted parameters, or first-principles derivations appear that could reduce to the paper's own inputs by construction. The central claims rest on external team results rather than self-referential fitting or self-citation chains. This is a standard competition summary whose content is self-contained against the reported test data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard supervised classification assumptions and competition data splits rather than new theoretical constructs; no free parameters, invented entities, or ad-hoc axioms are introduced beyond typical ML evaluation practices.

axioms (1)

domain assumption: Test data in the shared tasks follows the same distribution as real-world human and AI text without leakage or selection bias.
Rationale: Implicit in treating reported F1 scores as generalizable performance measures.

pith-pipeline@v0.9.0 · 5864 in / 1250 out tokens · 66394 ms · 2026-05-21T05:23:07.316734+00:00 · methodology
