The Generalized Turing Test: A Foundation for Comparing Intelligence
Pith reviewed 2026-05-12 03:16 UTC · model grok-4.3
The pith
The Generalized Turing Test defines relative intelligence by whether one agent can indistinguishably imitate another.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We define the Turing comparator A ≥ B to hold if B, acting as distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator's structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. Empirical pairwise tests on modern models produce stratified orderings consistent with known rankings.
What carries the argument
The Turing comparator, which declares A at least as capable as B when B cannot distinguish its own behavior from an imitator of B in instructed interactions.
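Read as a protocol, the comparator reduces to a binary classification game played by B. The sketch below is a minimal Python rendering of one such pairwise test, not the paper's implementation; `gen_A_imitating_B`, `gen_B`, and `distinguish_B` are hypothetical callables standing in for the instructed-imitation transcripts and B acting as its own distinguisher.

```python
import random

def comparator_trials(gen_A_imitating_B, gen_B, distinguish_B,
                      n_trials=1000, seed=0):
    """Sketch of one pairwise indistinguishability test.

    gen_* produce interaction transcripts; distinguish_B guesses whether a
    transcript came from B itself ("B") or from A imitating B ("A").
    All three callables are hypothetical stand-ins, not the paper's API.
    Returns the distinguisher's accuracy; A >= B is declared when this
    accuracy is not reliably above chance (0.5).
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        # Flip a fair coin: show B either its own transcript or A's imitation.
        if rng.random() < 0.5:
            transcript, label = gen_B(), "B"
        else:
            transcript, label = gen_A_imitating_B(), "A"
        if distinguish_B(transcript) == label:
            correct += 1
    return correct / n_trials
```

A distinguisher with no discriminative power scores near 0.5 regardless of A's actual capability, which is exactly the false-positive concern the referee report raises below.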
If this is right
- When the comparator is transitive it partitions agents into equivalence classes that can be ordered by relative intelligence.
- Restricted variants with querying or bounded interactions still support the same core comparisons while limiting computational cost.
- Empirical consistency with existing model rankings indicates the framework can recover known capability hierarchies without task-specific data.
- The indistinguishability lens supplies a foundation for evaluation protocols and training objectives that remain independent of fixed datasets or benchmarks.
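The first consequence above can be sketched directly: under the paper's transitivity conditions, mutual indistinguishability (A ≥ B and B ≥ A) groups agents into equivalence classes, which the comparator then orders. A minimal sketch, assuming a transitive `geq` oracle (a hypothetical callable, not the paper's API):

```python
def order_classes(agents, geq):
    """Partition agents into equivalence classes under a transitive
    comparator, then order the classes strongest-first.

    geq(a, b) is a hypothetical oracle returning True when a >= b holds;
    transitivity is assumed, as the paper's conditions require. Agents a, b
    share a class when geq(a, b) and geq(b, a) both hold.
    """
    classes = []
    for a in agents:
        for cls in classes:
            rep = cls[0]  # any member represents its class under transitivity
            if geq(a, rep) and geq(rep, a):
                cls.append(a)
                break
        else:
            classes.append([a])
    # Order classes by how many classes their representative dominates.
    reps = [cls[0] for cls in classes]
    classes.sort(key=lambda cls: sum(geq(cls[0], r) for r in reps),
                 reverse=True)
    return classes
```

Without transitivity the class-by-representative shortcut breaks down, which is why the conditions the paper analyzes are load-bearing for any global ordering.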
Where Pith is reading between the lines
- The same protocol could be used to place human experts and AI systems on a single scale by treating humans as the target agents to be imitated.
- Iterative training that directly optimizes for passing the comparator against stronger models might produce capability gains without curated datasets.
- The framework could be extended to non-language domains such as robotic control by defining imitation through action sequences rather than text.
Load-bearing premise
That successful indistinguishability in instructed imitation interactions reliably indicates comparable or greater underlying capability, rather than being an artifact of the interaction protocol or model selection.
What would settle it
If new pairwise indistinguishability trials on a fresh collection of agents produce orderings that contradict independent capability measures obtained through standard benchmarks, the claim that the comparator captures meaningful relative intelligence would be falsified.
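One way to operationalize this falsification test is a rank correlation between the comparator-derived ordering and an independent benchmark-based ordering: a correlation near zero or negative on fresh agents would contradict the claim. A minimal Kendall-tau sketch (agent names and rankings below are illustrative, not from the paper):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two orderings of the same agents.

    rank_a and rank_b map each agent to its rank (0 = best). Returns a value
    in [-1, 1]: 1 for identical orderings, -1 for fully reversed ones.
    """
    agents = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(agents, 2):
        s = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(agents) * (len(agents) - 1) // 2
    return (concordant - discordant) / n_pairs
```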
Original abstract
We introduce the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. For agents A and B, we define the Turing comparator A $\geq$ B to hold if B, acting as a distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator's structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. To complement the theory, we instantiate the framework on a collection of modern models, empirically evaluating pairwise indistinguishability across thousands of trials. The resulting comparisons exhibit a stratified structure consistent with existing rankings, hinting that the proposed framework yields meaningful empirical orderings. Our results position indistinguishability as a unifying lens for reasoning about intelligence, suggesting a foundation for evaluation and, potentially, training objectives that are inherently independent of fixed datasets or benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Generalized Turing Test (GTT), a formal framework for comparing capabilities of arbitrary agents via indistinguishability. For agents A and B, it defines A ≥ B if B (as distinguisher) cannot reliably distinguish interactions with A instructed to imitate B from interactions with another instance of B. The work analyzes the comparator's mathematical structure (including transitivity conditions for inducing orderings over equivalence classes), defines variants (querying, bounded interaction, fixed distinguishers), and reports empirical pairwise evaluations on modern models across thousands of trials, yielding stratified comparisons consistent with existing rankings.
Significance. If the central claims hold, the GTT would provide a genuinely dataset- and task-agnostic lens for relative intelligence, potentially unifying evaluation methods and enabling training objectives independent of fixed benchmarks. The theoretical analysis of transitivity and the empirical stratification consistent with known orderings are strengths that could position indistinguishability as a foundational comparator, provided the framework's assumptions about what indistinguishability measures are validated.
major comments (2)
- [Definition of the Turing comparator (abstract and §2)] A ≥ B holds when B cannot reliably distinguish A (instructed to imitate B) from B. This can produce false A ≥ B results when B has limited discriminative power, allowing a strictly weaker A to generate indistinguishable outputs. The manuscript studies transitivity but does not test whether the observed empirical orderings survive replacement of the self-distinguisher with an independent, stronger oracle. This assumption is load-bearing for the claim that the comparator yields a true, capability-based relative intelligence ordering rather than an artifact of correlated model limitations.
- [Empirical evaluation] The report of a 'stratified structure consistent with existing rankings' across thousands of trials lacks detail on the measurement of indistinguishability (quantification of 'reliably', trial protocols, statistical controls for biases in interaction setup or model selection). Without these, it is unclear whether the results validate the framework or reflect protocol artifacts, undermining support for the dataset-agnostic claim.
minor comments (1)
- [Theoretical variants] Notation for the variants (querying, bounded interaction, fixed distinguishers) could be made more uniform and explicitly cross-referenced to the base definition to improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of the Generalized Turing Test's definition and empirical support. We address each major comment below and outline revisions to improve clarity and rigor.
Point-by-point responses
- Referee: Core definition of the Turing comparator (abstract and §2): A ≥ B holds when B cannot reliably distinguish A (instructed to imitate B) from B. This can produce false A ≥ B results when B has limited discriminative power, allowing a strictly weaker A to generate indistinguishable outputs. The manuscript studies transitivity but does not test whether the observed empirical orderings survive replacement of the self-distinguisher with an independent, stronger oracle. This assumption is load-bearing for the claim that the comparator yields a true, capability-based relative intelligence ordering rather than an artifact of correlated model limitations.
Authors: The self-referential use of B as distinguisher is deliberate to ensure the comparator remains dataset- and task-agnostic without introducing external oracles that could reintroduce benchmark dependence. This can indeed yield A ≥ B even for weaker A when B's discrimination is limited, but the paper's transitivity analysis precisely characterizes when the relation induces consistent orderings over equivalence classes. We did not empirically substitute stronger independent distinguishers, as that would alter the core self-contained framework. We will add a discussion subsection acknowledging this limitation, referencing the fixed-distinguisher variants already analyzed, and noting that current empirical results align with known rankings while warranting cautious interpretation. revision: partial
- Referee: Empirical evaluation section: The report of a 'stratified structure consistent with existing rankings' across thousands of trials lacks detail on the measurement of indistinguishability (quantification of 'reliably', trial protocols, statistical controls for biases in interaction setup or model selection). Without these, it is unclear whether the results validate the framework or reflect protocol artifacts, undermining support for the dataset-agnostic claim.
Authors: We agree that greater detail is required for full reproducibility and to substantiate the claims. The revised manuscript will expand the empirical section with: explicit quantification of 'reliably' (including the statistical threshold, such as indistinguishability defined as accuracy not significantly above chance with p < 0.05 and confidence intervals over trials); complete trial protocols (interaction formats, bounded lengths, imitation prompting, randomization of order and model selection); and statistical controls (bias mitigation via randomization, multiple-testing corrections, and checks for setup artifacts). These additions will clarify that the observed stratification supports the framework rather than arising from protocol specifics. revision: yes
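The proposed quantification of 'reliably' can be made concrete with an exact one-sided binomial test against the chance null, as one reading of the rebuttal's p < 0.05 threshold. A sketch, not the authors' analysis code; the trial counts below are illustrative.

```python
from math import comb

def distinguisher_p_value(correct, trials):
    """Exact one-sided binomial test against the chance (p = 0.5) null.

    Returns P(X >= correct) for X ~ Binomial(trials, 0.5): the probability
    of the distinguisher doing at least this well by guessing. Under the
    rebuttal's proposal, indistinguishability holds when accuracy is not
    significantly above chance (here, p >= 0.05).
    """
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Illustrative: 515/1000 correct is within chance fluctuation at this threshold.
p = distinguisher_p_value(515, 1000)
indistinguishable = p >= 0.05  # fail to reject the chance null
```

In practice this per-pair test would still need the multiple-testing corrections the rebuttal promises, since many pairwise comparisons are run.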
Circularity Check
The new formal definition of relative intelligence via indistinguishability is self-contained, with no reduction to fitted inputs or self-citations.
Full rationale
The paper defines the Generalized Turing Test comparator A ≥ B directly as a primitive: B cannot distinguish A (instructed to imitate B) from another B. This is presented as the foundational framework yielding a dataset-agnostic ordering, not derived from prior equations, parameters, or data. Properties such as transitivity conditions and variants are analyzed as consequences of this definition. The empirical instantiation on models is explicitly complementary validation showing consistency with existing rankings, rather than the source of the claims. No load-bearing self-citations, ansatzes smuggled via prior work, or fitted inputs renamed as predictions appear in the derivation chain. The framework is therefore independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Agents can be instructed to imitate the behavior of another specified agent during interactions.
- Domain assumption: Indistinguishability by a distinguisher agent measures relative capability.