pith. machine review for the scientific record.

arxiv: 2605.10851 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.CL · cs.LG

Recognition: no theorem link

The Generalized Turing Test: A Foundation for Comparing Intelligence

Daniel Mitropolsky, Emanuele Rimoldi, Riccardo Neumarker, Susan S. Hong, Tomaso Poggio

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:16 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.LG
keywords Generalized Turing Test · indistinguishability · intelligence comparison · agent evaluation · imitation · relative capability · AI benchmarks

The pith

The Generalized Turing Test defines relative intelligence by whether one agent can indistinguishably imitate another.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a formal comparator where agent A is at least as capable as agent B if B cannot reliably tell its own interactions apart from those with A when A is told to imitate B. This creates a relative intelligence measure that avoids any dependence on particular datasets, tasks, or benchmarks. The work examines when the comparator is transitive enough to induce orderings on equivalence classes and defines restricted versions using queries or bounded exchanges. Experiments apply the test to current models across thousands of trials and recover stratified rankings that align with existing performance hierarchies.
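The trial protocol described above is straightforward to simulate. A minimal sketch, assuming db(A, B) is the distinguisher's accuracy advantage over chance guessing (the paper's exact definition may differ) and treating one full interaction as a black-box callable; `run_trial` and both function names are illustrative, not the authors':

```python
import random

def estimate_db(run_trial, n_trials=1000, seed=0):
    """Estimate db(A, B) as distinguisher B's advantage over chance.

    `run_trial` is a hypothetical callable wrapping one full interaction:
    with is_imposter=True, B converses with A instructed to imitate B;
    with False, with a fresh instance of B. It returns B's guess
    (True = "that was the imposter").
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        is_imposter = rng.random() < 0.5  # coin-flip which partner B faces
        correct += run_trial(is_imposter) == is_imposter
    accuracy = correct / n_trials
    return max(0.0, 2 * accuracy - 1)  # 0 = chance, 1 = perfect detection

def comparator_holds(db_value, eps):
    """A >= B at tolerance eps when B's advantage is at most eps."""
    return db_value <= eps
```

Under this reading, a perfect distinguisher yields db = 1 while a coin-flipping one yields roughly 0, so a threshold like ϵ = 0.005 demands near-total indistinguishability.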

Core claim

We define the Turing comparator A ≥ B to hold if B, acting as distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator's structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. Empirical pairwise tests on modern models produce stratified orderings consistent with known rankings.

What carries the argument

The Turing comparator, which declares A at least as capable as B when B cannot distinguish its own behavior from an imitator of B in instructed interactions.

If this is right

  • When the comparator is transitive it partitions agents into equivalence classes that can be ordered by relative intelligence.
  • Restricted variants with querying or bounded interactions still support the same core comparisons while limiting computational cost.
  • Empirical consistency with existing model rankings indicates the framework can recover known capability hierarchies without task-specific data.
  • The indistinguishability lens supplies a foundation for evaluation protocols and training objectives that remain independent of fixed datasets or benchmarks.
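The first bullet can be made concrete. A small sketch, assuming the empirical comparator is given as an adjacency structure (the representation and names here are illustrative, not the paper's):

```python
from itertools import permutations

def transitivity_violations(ge):
    """Count ordered triples (A, B, C) with A >= B and B >= C but not A >= C.

    `ge` maps each agent to the set of agents it dominates: ge[A] = {B : A >= B}.
    """
    return sum(
        1
        for a, b, c in permutations(ge, 3)  # all ordered triples of agents
        if b in ge[a] and c in ge[b] and c not in ge[a]
    )

def equivalence_classes(ge):
    """Group agents by mutual indistinguishability: A ~ B iff A >= B and B >= A.

    Yields a true partition only when the comparator is transitive.
    """
    classes = []
    for a in ge:
        for cls in classes:
            rep = cls[0]
            if a in ge[rep] and rep in ge[a]:  # mutually dominate
                cls.append(a)
                break
        else:
            classes.append([a])
    return classes
```

When the violation count is zero, quotienting by these classes gives the ordering described above; Figure 16 in the paper tracks exactly this violation count as ϵ varies.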

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same protocol could be used to place human experts and AI systems on a single scale by treating humans as the target agents to be imitated.
  • Iterative training that directly optimizes for passing the comparator against stronger models might produce capability gains without curated datasets.
  • The framework could be extended to non-language domains such as robotic control by defining imitation through action sequences rather than text.

Load-bearing premise

That successful indistinguishability in instructed imitation interactions reliably indicates comparable or greater underlying capability rather than depending on the interaction protocol or model selection.

What would settle it

If new pairwise indistinguishability trials on a fresh collection of agents produce orderings that contradict independent capability measures obtained through standard benchmarks, the claim that the comparator captures meaningful relative intelligence would be falsified.
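One way to operationalize this check, assuming both orderings are available as best-to-worst lists (a sketch; the paper's aggregation may differ), is a rank correlation between the comparator-induced ordering and an independent benchmark ordering:

```python
def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two orderings of the same models,
    each a list from best to worst. 1 = identical, -1 = reversed.
    Uses the no-ties formula."""
    assert set(order_a) == set(order_b), "orderings must cover the same models"
    rank_a = {m: i for i, m in enumerate(order_a)}
    rank_b = {m: i for i, m in enumerate(order_b)}
    n = len(order_a)
    d2 = sum((rank_a[m] - rank_b[m]) ** 2 for m in order_a)
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A strongly negative correlation against several independent capability measures would be the contradiction the falsification test describes.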

Figures

Figures reproduced from arXiv: 2605.10851 by Daniel Mitropolsky, Emanuele Rimoldi, Riccardo Neumarker, Susan S. Hong, Tomaso Poggio.

Figure 1: An example transcript from a GTT between Gemini 3.1 Pro as actor and Claude Opus 4.6.
Figure 2: Empirical GTT relation at ϵ = 0.005. Left: the pairwise matrix of db(A, B), with rows as actors and columns as target/distinguishers. Right: the graph; an edge A → B reads db(A, B) ≤ ϵ. This study is intended as a first empirical instantiation of the framework rather than a high-precision measurement campaign, and because each additional pairwise multi-turn trial incurs nontrivial API cost, the resulting p…
Figure 3: Controlled-resource experiments, where A → B denotes A imitating B. (a): actor fooling probability as a function of the distinguishing-turn budget. (b): actor fooling probability as a function of query rounds in the GTTQ. For interpretation, see discussion in Section 6.1. …contain at least one such probe. These results suggest that present GTT play is still substantially shaped by behavioral signatures rather than prima…
Figure 4: Left: graphs of ≥D at different values of ϵ, with Claude and Gemini as D; both create meaningful hierarchies (e.g. SOTA models on top) and, while both collapse for high enough ϵ, Opus stratifies up to ϵ = 0.3. Right: fixed-distinguisher Turing scores for all tested fixed distinguishers and actors. …the best fixed distinguisher. Canonically weaker models, DeepSeek and Qwen, fail to establish a hierarchy of model…
Figure 5: Experimental pipeline used to produce the empirical results. For each protocol and ordered…
Figure 6: An example transcript from a GTT between Gemini 3.1 Pro as actor and Claude Opus 4.6.
Figure 7: An example transcript from a GTT in which a Gemini 3.1 Pro distinguisher detects Claude…
Figure 8: An example transcript from a GTT between Gemini 3.1 Pro as actor and Claude Opus 4.6.
Figure 9: An example transcript from a GTT in which Gemini 3.1 Pro imitates Grok, showing…
Figure 10: An example transcript from a GTT in which DeepSeek imitates Grok, illustrating how a…
Figure 11: GTTQ analogue of Figure 2 at…
Figure 12: Circular redrawings of the thresholded GTT and GTTQ relations at…
Figure 13: Verticalized versions of the thresholded GTT and GTTQ graphs from Figure 12. Nodes…
Figure 14: Strict versions of the empirical relation at…
Figure 15: Ranking of models by the three Turing scores in both settings: Fooling Score…
Figure 16: As ϵ increases, it becomes easier for A to satisfy A ≥ϵ B. Left: the number of pairs s.t. A ≥ B as a function of ϵ. Right: transitivity violations (A, B, C s.t. A ≥ B ≥ C but not A ≥ C) vs. ϵ; this quantity need not decrease monotonically because adding edges can create new two-step chains before it closes them.
Figure 17: Rank heatmap comparing our aggregate Turing-score ordering against AAII, LiveBench,…
Figure 18: Turing-score-based ranking of models separated by fixed distinguisher.
Figure 19: Probability of each fixed distinguisher (x-axis) correctly recognizing each actor model…
Figure 20: Directed graphs where A → B if A ⪰ϵ,D B ⟺ db(A, B) ≤ ϵ. Models without incoming or outgoing edges are omitted.
read the original abstract

We introduce the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. For agents A and B, we define the Turing comparator A ≥ B to hold if B, acting as a distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator's structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. To complement the theory, we instantiate the framework on a collection of modern models, empirically evaluating pairwise indistinguishability across thousands of trials. The resulting comparisons exhibit a stratified structure consistent with existing rankings, hinting that the proposed framework yields meaningful empirical orderings. Our results position indistinguishability as a unifying lens for reasoning about intelligence, suggesting a foundation for evaluation and, potentially, training objectives that are inherently independent of fixed datasets or benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Generalized Turing Test (GTT), a formal framework for comparing capabilities of arbitrary agents via indistinguishability. For agents A and B, it defines A ≥ B if B (as distinguisher) cannot reliably distinguish interactions with A instructed to imitate B from interactions with another instance of B. The work analyzes the comparator's mathematical structure (including transitivity conditions for inducing orderings over equivalence classes), defines variants (querying, bounded interaction, fixed distinguishers), and reports empirical pairwise evaluations on modern models across thousands of trials, yielding stratified comparisons consistent with existing rankings.

Significance. If the central claims hold, the GTT would provide a genuinely dataset- and task-agnostic lens for relative intelligence, potentially unifying evaluation methods and enabling training objectives independent of fixed benchmarks. The theoretical analysis of transitivity and the empirical stratification consistent with known orderings are strengths that could position indistinguishability as a foundational comparator, provided the framework's assumptions about what indistinguishability measures are validated.

major comments (2)
  1. [Definition of the Turing comparator (abstract and §2)] A ≥ B holds when B cannot reliably distinguish A (instructed to imitate B) from B. This can produce false A ≥ B results when B has limited discriminative power, allowing a strictly weaker A to generate indistinguishable outputs. The manuscript studies transitivity but does not test whether the observed empirical orderings survive replacing the self-distinguisher with an independent, stronger oracle. This assumption is load-bearing for the claim that the comparator yields a genuine, capability-based ordering of relative intelligence rather than an artifact of correlated model limitations.
  2. [Empirical evaluation] The reported 'stratified structure consistent with existing rankings' across thousands of trials lacks detail on how indistinguishability was measured: quantification of 'reliably', trial protocols, and statistical controls for biases in interaction setup or model selection. Without these, it is unclear whether the results validate the framework or reflect protocol artifacts, undermining support for the dataset-agnostic claim.
minor comments (1)
  1. [Theoretical variants] Notation for the variants (querying, bounded interaction, fixed distinguishers) could be made more uniform and explicitly cross-referenced to the base definition to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of the Generalized Turing Test's definition and empirical support. We address each major comment below and outline revisions to improve clarity and rigor.

read point-by-point responses
  1. Referee: Core definition of the Turing comparator (abstract and §2): A ≥ B holds when B cannot reliably distinguish A (instructed to imitate B) from B. This can produce false A ≥ B results when B has limited discriminative power, allowing a strictly weaker A to generate indistinguishable outputs. The manuscript studies transitivity but does not test whether the observed empirical orderings survive replacement of the self-distinguisher with an independent, stronger oracle. This assumption is load-bearing for the claim that the comparator yields a true, capability-based relative intelligence ordering rather than an artifact of correlated model limitations.

    Authors: The self-referential use of B as distinguisher is deliberate to ensure the comparator remains dataset- and task-agnostic without introducing external oracles that could reintroduce benchmark dependence. This can indeed yield A ≥ B even for weaker A when B's discrimination is limited, but the paper's transitivity analysis precisely characterizes when the relation induces consistent orderings over equivalence classes. We did not empirically substitute stronger independent distinguishers, as that would alter the core self-contained framework. We will add a discussion subsection acknowledging this limitation, referencing the fixed-distinguisher variants already analyzed, and noting that current empirical results align with known rankings while warranting cautious interpretation. revision: partial

  2. Referee: Empirical evaluation section: The report of a 'stratified structure consistent with existing rankings' across thousands of trials lacks detail on measurement of indistinguishability (quantification of 'reliably', trial protocols, statistical controls for biases in interaction setup or model selection). Without these, it is unclear whether the results validate the framework or reflect protocol artifacts, undermining support for the dataset-agnostic claim.

    Authors: We agree that greater detail is required for full reproducibility and to substantiate the claims. The revised manuscript will expand the empirical section with: explicit quantification of 'reliably' (including the statistical threshold, such as indistinguishability defined as accuracy not significantly above chance with p < 0.05 and confidence intervals over trials); complete trial protocols (interaction formats, bounded lengths, imitation prompting, randomization of order and model selection); and statistical controls (bias mitigation via randomization, multiple-testing corrections, and checks for setup artifacts). These additions will clarify that the observed stratification supports the framework rather than arising from protocol specifics. revision: yes
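The statistical threshold the rebuttal proposes can be sketched with an exact binomial test (illustrative; the paper may use a different test, threshold, or multiple-testing correction):

```python
from math import comb

def binom_p_value(correct, trials):
    """Exact one-sided p-value: probability of at least `correct` successes
    in `trials` guesses under the chance-accuracy (p = 0.5) null."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def reliably_distinguishes(correct, trials, alpha=0.05):
    """One hypothetical operationalization of 'reliably': distinguisher
    accuracy is significantly above chance at level alpha."""
    return binom_p_value(correct, trials) < alpha
```

Under this reading, 60/100 correct guesses counts as reliable distinguishing (p ≈ 0.028) while 55/100 does not (p ≈ 0.184); the per-pair trial count thus sets the smallest advantage the protocol can detect.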

Circularity Check

0 steps flagged

New formal definition of relative intelligence via indistinguishability is self-contained with no reduction to fitted inputs or self-citations

full rationale

The paper defines the Generalized Turing Test comparator A ≥ B directly as a primitive: B cannot distinguish A (instructed to imitate B) from another B. This is presented as the foundational framework yielding a dataset-agnostic ordering, not derived from prior equations, parameters, or data. Properties such as transitivity conditions and variants are analyzed as consequences of this definition. The empirical instantiation on models is explicitly complementary validation showing consistency with existing rankings, rather than the source of the claims. No load-bearing self-citations, ansatzes smuggled via prior work, or fitted inputs renamed as predictions appear in the derivation chain. The framework is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central contribution is the new formal definition of the Turing comparator and its analysis; the framework introduces no numerical free parameters and no new physical or computational entities beyond the abstract notion of instructed imitation.

axioms (2)
  • domain assumption Agents can be instructed to imitate the behavior of another specified agent during interactions
    The comparator definition in the abstract relies on this premise that imitation instructions are meaningful and executable.
  • domain assumption Indistinguishability by a distinguisher agent measures relative capability
    This is the load-bearing definition that turns interaction outcomes into an intelligence ordering.

pith-pipeline@v0.9.0 · 5500 in / 1359 out tokens · 80298 ms · 2026-05-12T03:16:07.945765+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 3 internal anchors
