pith. machine review for the scientific record.

arxiv: 1911.01547 · v2 · submitted 2019-11-05 · 💻 cs.AI

Recognition: 3 theorem links

· Lean Theorem

On the Measure of Intelligence

Pith reviewed 2026-05-12 12:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords intelligence measurement · skill acquisition efficiency · algorithmic information theory · generalization · abstraction and reasoning corpus · priors · AI benchmarks · fluid intelligence

The pith

Intelligence is the efficiency of acquiring skills from limited experience, not performance on fixed tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing ways of assessing intelligence in AI focus on how well systems do at specific tasks such as games, yet this measure can be artificially raised by supplying unlimited prior knowledge or training data. A more accurate approach defines intelligence as skill-acquisition efficiency, which accounts for the range of tasks a system can address, the difficulty of generalizing to new ones, and the starting priors plus experience required. This definition draws from algorithmic information theory to separate the system's own generalization power from external advantages. If correct, it would allow direct, fair comparisons of intelligence between AI systems and humans without masking differences through data volume. The paper supplies concrete guidelines for such benchmarks and introduces the Abstraction and Reasoning Corpus built on priors intended to match human innate knowledge.

Core claim

Intelligence is formalized as skill-acquisition efficiency: the rate at which a system develops new capabilities given a defined scope of tasks, a level of generalization difficulty, and a quantity of experience, while incorporating its priors. This formulation, rooted in algorithmic information theory, treats skill at any single task as an insufficient proxy because priors and experience heavily modulate observed performance. The definition therefore directs evaluation toward how economically a system converts limited experience into broad competence.
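
To make the contrast between raw skill and acquisition efficiency concrete, here is a minimal toy sketch. It is not the paper's formula: the `Run` record, its unit-free fields, and the simple ratio in `toy_efficiency` are all assumptions introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One evaluated system on a fixed task scope (hypothetical, unit-free fields)."""
    name: str
    skill: float       # score reached on held-out tasks, in [0, 1]
    priors: float      # amount of built-in knowledge supplied (assumed > 0)
    experience: float  # amount of training data or practice consumed


def toy_efficiency(run: Run) -> float:
    """Skill obtained per unit of priors plus experience.

    A toy ratio chosen for illustration, not the paper's definition.
    """
    return run.skill / (run.priors + run.experience)


# Two systems reach the same skill, but one consumes ten times more experience.
runs = [
    Run("data-heavy", skill=0.9, priors=1.0, experience=99.0),
    Run("data-light", skill=0.9, priors=1.0, experience=9.0),
]

# Ranking by skill alone cannot separate them; the toy efficiency ratio can.
ranked = sorted(runs, key=toy_efficiency, reverse=True)
for run in ranked:
    print(f"{run.name}: skill={run.skill}, efficiency={toy_efficiency(run):.3f}")
```

Under these assumed numbers the two systems tie on skill, while the one that used less experience ranks first on the toy ratio, which is the kind of distinction the definition is meant to expose.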

What carries the argument

The formal definition of intelligence as skill-acquisition efficiency from algorithmic information theory, which isolates generalization power by controlling for priors and experience across tasks of varying difficulty.

If this is right

  • Benchmarks must limit the priors and experience supplied to systems so that measured performance reflects acquisition efficiency rather than external resources.
  • Comparisons between AI and humans become possible once both operate under comparable innate priors on the same task scope.
  • AI progress can be tracked by improvements in skill-acquisition efficiency rather than by gains on any single fixed task.
  • The Abstraction and Reasoning Corpus provides one concrete realization of these guidelines for measuring fluid, human-like intelligence.
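
As a rough illustration of the benchmark style named in the last bullet, the sketch below scores a candidate solver on a single grid task by exact match. This page does not specify a task format, so the layout (a few demonstration pairs plus held-out test pairs of small integer grids), the toy task, and the helper names `mirror_rows`, `fits_demonstrations`, and `score` are assumptions loosely modeled on the publicly released ARC task files.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]
Task = Dict[str, List[Dict[str, Grid]]]

# Hypothetical toy task: the hidden rule is "mirror each row left to right".
task: Task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [
        {"input": [[0, 5], [6, 0]], "output": [[5, 0], [0, 6]]},
    ],
}


def mirror_rows(grid: Grid) -> Grid:
    """Candidate solver: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def fits_demonstrations(solver: Callable[[Grid], Grid], task: Task) -> bool:
    """True if the solver reproduces every demonstration output exactly."""
    return all(solver(pair["input"]) == pair["output"] for pair in task["train"])


def score(solver: Callable[[Grid], Grid], task: Task) -> float:
    """Fraction of held-out test pairs solved by exact grid match."""
    tests = task["test"]
    solved = sum(solver(pair["input"]) == pair["output"] for pair in tests)
    return solved / len(tests)


print(fits_demonstrations(mirror_rows, task), score(mirror_rows, task))
```

The solver sees only a handful of demonstrations per task, so a high score has to come from generalizing the rule rather than from accumulated training data on that task.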

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems optimized under this measure may generalize more readily to open-ended real-world problems than those trained on narrow, data-heavy tasks.
  • The definition could be applied to non-benchmark settings by defining new task scopes and measuring acquisition rates under controlled priors.
  • If the approach holds, large-scale pretraining on fixed datasets would be revealed as a limited path to general intelligence.

Load-bearing premise

The explicit priors chosen for the Abstraction and Reasoning Corpus are sufficiently close to innate human priors that performance differences on the benchmark reflect genuine differences in generalization power.

What would settle it

An AI system that reaches high scores on the Abstraction and Reasoning Corpus yet fails to acquire skills efficiently when tested on a fresh set of tasks with matched priors and experience would show that the benchmark does not isolate the intended form of intelligence.

read the original abstract

To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skills for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper summarizes and critically assesses historical definitions of intelligence from psychology and AI and identifies two implicit conceptions guiding them. It argues that task-specific skill benchmarks fail to measure intelligence because skill depends on priors and experience. It then articulates a new formal definition of intelligence, grounded in Algorithmic Information Theory, as skill-acquisition efficiency (incorporating scope, generalization difficulty, priors, and experience), and proposes guidelines for general AI benchmarks. Finally, it introduces the Abstraction and Reasoning Corpus (ARC), built on an explicit set of priors designed to approximate innate human priors, for measuring human-like fluid intelligence and enabling fair AI-human comparisons.

Significance. If the definition is sound and the ARC priors sufficiently match human innate priors, the work could meaningfully shift AI evaluation toward measuring generalization efficiency rather than acquired skill, providing a principled alternative to current task-specific benchmarks and influencing the design of future intelligence tests.

major comments (2)
  1. [Section on the new formal definition] The section articulating the new formal definition: intelligence is defined as skill-acquisition efficiency drawing on AIT, but the manuscript provides only a conceptual description without a precise mathematical formulation, derivation from core AIT quantities (such as Kolmogorov complexity), or operationalization that would allow quantitative computation or direct falsification of the definition.
  2. [ARC benchmark description] The ARC benchmark description and guidelines section: the central claim that ARC enables fair human-AI comparisons rests on the assumption that its enumerated priors (objectness, basic geometry, counting, etc.) are close enough to innate human priors that performance differences isolate generalization efficiency; however, no derivation, empirical calibration against human data, or sensitivity analysis is supplied to show that modifying any listed prior would not materially alter relative scores.
minor comments (2)
  1. [Abstract and introduction] The abstract and introduction reference 'two historical conceptions of intelligence' without naming or briefly characterizing them, which reduces clarity for readers unfamiliar with the cited psychology and AI literature.
  2. [Conclusions] The manuscript would benefit from an explicit statement of the scope of the proposed definition (e.g., whether it applies only to fluid intelligence or extends to other forms) to avoid overgeneralization in the conclusions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on the manuscript. We address each major comment below, indicating where revisions will be incorporated.

read point-by-point responses
  1. Referee: [Section on the new formal definition] The section articulating the new formal definition: intelligence is defined as skill-acquisition efficiency drawing on AIT, but the manuscript provides only a conceptual description without a precise mathematical formulation, derivation from core AIT quantities (such as Kolmogorov complexity), or operationalization that would allow quantitative computation or direct falsification of the definition.

    Authors: We acknowledge that the definition is presented conceptually, drawing on AIT to frame intelligence as skill-acquisition efficiency without supplying a closed-form mathematical expression or explicit derivation from Kolmogorov complexity. This choice was made to emphasize the definition's implications for evaluation and to keep it accessible across psychology and AI. In revision, we will expand the section with a more explicit mapping to AIT notions, such as relating efficiency to the incremental reduction in description length for novel tasks, and add discussion of possible operationalizations along with their limitations for direct falsification. revision: partial

  2. Referee: [ARC benchmark description] The ARC benchmark description and guidelines section: the central claim that ARC enables fair human-AI comparisons rests on the assumption that its enumerated priors (objectness, basic geometry, counting, etc.) are close enough to innate human priors that performance differences isolate generalization efficiency; however, no derivation, empirical calibration against human data, or sensitivity analysis is supplied to show that modifying any listed prior would not materially alter relative scores.

    Authors: The referee is correct that the fairness claim for human-AI comparisons depends on the priors approximating innate human ones, and that the manuscript supplies neither empirical calibration nor sensitivity analysis. We will revise the relevant section to elaborate the rationale for each prior with additional citations from cognitive science literature on core knowledge systems. We will also add an explicit limitations paragraph acknowledging the absence of sensitivity analysis and noting that full empirical calibration is an important avenue for subsequent work. revision: partial
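
The description-length idea mentioned in the first response can be made tangible with a rough sketch. Kolmogorov complexity is uncomputable, so the code below substitutes compressed size from Python's `zlib` as a crude upper-bound proxy. That substitution, the helper names, and the byte strings are assumptions for illustration, not anything proposed in the paper or the rebuttal.

```python
import zlib


def description_length(data: bytes) -> int:
    """Compressed size in bytes: a crude, computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def incremental_length(known: bytes, novel: bytes) -> int:
    """Extra compressed bytes needed to describe `novel` once `known` is already described."""
    return description_length(known + novel) - description_length(known)


# Hypothetical "experience" and two hypothetical "novel tasks".
known = b"ABCD" * 200          # what the system has already seen
related = b"ABCD" * 50         # shares structure with prior experience
unrelated = bytes(range(256))  # shares no structure with prior experience

print("related task costs:  ", incremental_length(known, related))
print("unrelated task costs:", incremental_length(known, unrelated))
```

A task that shares structure with what is already known adds very little to the total description, while an unrelated one must be described almost from scratch. Any real operationalization would need a far better complexity estimate than a general-purpose compressor.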

Circularity Check

0 steps flagged

No circularity: formal AIT-based definition and benchmark guidelines are independent of fitted inputs or self-referential loops.

full rationale

The paper articulates a definition of intelligence as skill-acquisition efficiency drawing directly from established Algorithmic Information Theory concepts (scope, generalization difficulty, priors, experience) without any equations or derivations that reduce back to the paper's own data or assumptions by construction. Guidelines for benchmarks follow from this definition. ARC is presented as one implementation using an explicitly enumerated prior set chosen by the authors; while the claim of closeness to human priors is an unvalidated assumption rather than a derived result, it does not create a self-definitional loop, fitted-parameter prediction, or load-bearing self-citation chain. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the applicability of AIT to intelligence and the design of human-like priors, with no free parameters or invented entities explicitly fitted or postulated in the abstract.

axioms (2)
  • domain assumption: Intelligence can be formalized as skill-acquisition efficiency using concepts from Algorithmic Information Theory
    The paper bases its new definition directly on AIT without deriving it from more fundamental principles in the abstract.
  • domain assumption: The priors in ARC are close to innate human priors
    This assumption is required for the claim that ARC enables fair human-AI comparisons.

pith-pipeline@v0.9.0 · 5589 in / 1455 out tokens · 63432 ms · 2026-05-12T12:59:33.747900+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/LawOfExistence law_of_existence echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience.

  • Foundation/HierarchyEmergence hierarchy_emergence_forces_phi echoes

    built upon an explicit set of priors designed to be as close as possible to innate human priors

  • Foundation/DiscretenessForcing discreteness_forcing_principle echoes

    skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to buy arbitrary levels of skills

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Gradient-Based Program Synthesis with Neurally Interpreted Languages

    cs.LG 2026-04 unverdicted novelty 8.0

    NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...

  2. Are Flat Minima an Illusion?

    cs.LG 2026-03 unverdicted novelty 8.0

    Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.

  3. Test-Time Learning with an Evolving Library

    cs.LG 2026-05 unverdicted novelty 7.0

    EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...

  4. Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

    cs.AI 2026-05 conditional novelty 7.0

    The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.

  5. Prospective Compression in Human Abstraction Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    Humans exhibit abstraction learning consistent with prospective compression of future tasks in non-stationary domains, unlike retrospective compression algorithms or LLM-based approaches.

  6. When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

    cs.AI 2026-05 conditional novelty 7.0

    State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...

  7. Lattice Deduction Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    An 800K-parameter Lattice Deduction Transformer reaches 100% accuracy on Sudoku-Extreme and Snowflake Sudoku and 99.9% on Maze-Hard by using lattice projections and abstract-interpretation supervision, while frontier ...

  8. Intervention Complexity as a Canonical Reward and a Measure of Intelligence

    cs.AI 2026-05 unverdicted novelty 7.0

    Intervention complexity provides a family of canonical rewards indexed by resource bias that completes the Legg-Hutter framework and enables a two-dimensional view of intelligence as competence plus learning efficiency.

  9. AI scientists produce results without reasoning scientifically

    cs.AI 2026-04 conditional novelty 7.0

    LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

  10. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  11. Yanasse: Finding New Proofs from Deep Vision's Analogies, Part 1

    cs.AI 2026-04 unverdicted novelty 7.0

    A domain-independent analogy engine transfers Lean tactic patterns from probability to representation theory, producing four new machine-verified proofs.

  12. Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs

    cs.AI 2026-04 accept novelty 7.0

    The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.

  13. Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

    cs.LO 2026-04 unverdicted novelty 7.0

    ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

  14. Factorization Regret mediates compositional generalization in latent space

    cs.LG 2026-03 unverdicted novelty 7.0

    Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.

  15. Less is More: Recursive Reasoning with Tiny Networks

    cs.LG 2025-10 unverdicted novelty 7.0

    TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.

  16. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  17. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    cs.LG 2024-10 accept novelty 7.0

    LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.

  18. The Evaluation Trap: Benchmark Design as Theoretical Commitment

    cs.AI 2026-05 unverdicted novelty 6.0

    AI benchmarks trap progress by operationalizing assumptions that redefine capabilities around the benchmarks themselves, and Epistematics provides an audit procedure to detect when evaluations cannot discriminate clai...

  19. The Generalized Turing Test: A Foundation for Comparing Intelligence

    cs.AI 2026-05 unverdicted novelty 6.0

    The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.

  20. Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

  21. Continuous Latent Diffusion Language Model

    cs.CL 2026-05 unverdicted novelty 6.0

    Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...

  22. Intervention Complexity as a Canonical Reward and a Measure of Intelligence

    cs.AI 2026-05 unverdicted novelty 6.0

    Intervention complexity provides a family of environment-derived universal rewards indexed by resource bias that completes the Legg-Hutter framework without external normative input.

  23. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  24. Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.

  25. Representation-Guided Parameter-Efficient LLM Unlearning

    cs.CL 2026-04 unverdicted novelty 6.0

    REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

  26. C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions

    cs.LG 2026-04 unverdicted novelty 6.0

    C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outp...

  27. ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence

    cs.AI 2026-03 unverdicted novelty 6.0

    ARC-AGI-3 is a benchmark where humans solve 100% of tasks but frontier AI systems score below 1% as of March 2026, using efficiency-based scoring grounded in human baselines.

  28. Video models are zero-shot learners and reasoners

    cs.LG 2025-09 unverdicted novelty 6.0

    Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.

  29. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    cs.CL 2025-06 conditional novelty 6.0

    High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

  30. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  31. Deep Vision: A Formal Proof of Wolstenholmes Theorem in Lean 4

    cs.LO 2026-04 accept novelty 5.0

    Wolstenholme's theorem is formally verified in Lean 4 via expansion of a shifted factorial product and vanishing power sums modulo p.

  32. The Rise and Fall of $G$ in AGI

    q-bio.NC 2026-04 unverdicted novelty 5.0

    PCA on AI model benchmarks reveals a general intelligence factor that rises then falls as specialized reasoning models appear, inverting the expected move toward parsimonious mechanisms.

  33. Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency

    cs.LG 2026-04 unverdicted novelty 5.0

    KoPE adds Kuramoto-based oscillatory phase states and synchronization to Vision Transformers, improving training, parameter, and data efficiency on structured vision tasks.

  34. From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

    cs.AI 2026-03 unverdicted novelty 5.0

    An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.

  35. Hierarchical Reasoning Model

    cs.AI 2025-06 unverdicted novelty 5.0

    HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...

  36. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  37. The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

    cs.AI 2026-05 unverdicted novelty 4.0

    Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.

  38. Measuring AI Reasoning: A Guide for Researchers

    cs.AI 2026-05 unverdicted novelty 4.0

    Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

  39. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

  40. Auto-Relational Reasoning

    cs.AI 2026-04 unverdicted novelty 3.0

    A system using auto-relational reasoning solves IQ test problems at 98.03% rate without any prior knowledge, reaching top 1% human performance.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · cited by 39 Pith papers

  1. [1]

    I-athlon: Towards a multidimensional turing test

    Sam S Adams, Guruduth Banavar, and Murray Campbell. I-athlon: Towards a multidimensional turing test. AI Magazine, (1):78–84, 2016

  2. [2]

    The newell test for a theory of cognition

    John R. Anderson and Christian Lebiere. The newell test for a theory of cognition. Behavioral and Brain Sciences, pages 587–601, 2003

  3. [3]

    De Anima

    Aristotle. De Anima. c. 350 BC

  4. [4]

    Cognitive developmental robotics: A survey

    Minoru Asada et al. Cognitive developmental robotics: A survey. IEEE Transactions on Autonomous Mental Development, pages 12–34, 2009

  5. [5]

    ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst

    Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018

  6. [6]

    The arcade learning environment: An evaluation platform for general agents

    Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Int. Res., (1):253–279, May 2013

  7. [7]

    The animal-ai environment: Training and testing animal-like artificial cognition

    Benjamin Beyret, José Hernández-Orallo, Lucy Cheke, Marta Halina, Murray Shanahan, and Matthew Crosby. The animal-ai environment: Training and testing animal-like artificial cognition, 2019

  8. [8]

    Méthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux

    Alfred Binet and Théodore Simon. Méthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux. L'année psychologique, pages 191–244, 1904

  9. [9]

    What is artificial intelligence? psychometric ai as an answer

    Selmer Bringsjord and Bettina Schimanski. What is artificial intelligence? psychometric ai as an answer. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI’03, pages 887–893, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc

  10. [10]

    Sample-efficient reinforcement learning with stochastic ensemble value expansion

    Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion, 2018

  11. [11]

    The 2005 DARPA Grand Challenge: The Great Robot Race

    Martin Buehler, Karl Iagnemma, and Sanjiv Singh. The 2005 DARPA Grand Challenge: The Great Robot Race. Springer Publishing Company, Incorporated, 1st edition, 2007

  12. [12]

    Deep blue

    Murray Campbell, A. Joseph Hoane, Jr., and Feng-hsiung Hsu. Deep blue. Artif. Intell., (1-2):57–83, 2002

  13. [13]

    Raymond B. Cattell. Abilities: Their structure, growth, and action. 1971

  14. [14]

    G. Chaitin. Algorithmic Information Theory. Cambridge University Press, 1987

  15. [15]

    A theory of program size formally identical to information theory

    Gregory J Chaitin. A theory of program size formally identical to information theory. Journal of the ACM (JACM), (3):329–340, 1975

  16. [16]

    Deep Learning with Python

    Francois Chollet. Deep Learning with Python. Manning Publications, 2017

  17. [17]

    Quantifying generalization in reinforcement learning

    Karl Cobbe, Oleg Klimov, Christopher Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. CoRR, 2018

  18. [18]

    Cultural perceptions of human intelligence

    Ebinepre A Cocodia. Cultural perceptions of human intelligence. Journal of Intelligence, 2(4):180–196, 2014

  19. [19]

    Origins of domain specificity: the evolution of functional organization

    L. Cosmides and J. Tooby. Origins of domain specificity: the evolution of functional organization. pages 85–116, 1994

  20. [20]

    Introduction to classical and modern test theory

    Linda Crocker and James Algina. Introduction to classical and modern test theory. ERIC, 1986

  21. [21]

    The Origin of Species

    Charles Darwin. The Origin of Species. 1859

  22. [22]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009

  23. [23]

    D. K. Detterman. A challenge to Watson. Intelligence, pages 77–78, 2011

  24. [24]

    T. G. Evans. A program for the solution of a class of geometric-analogy intelligence-test questions. pages 271–353, 1968

  25. [25]

    What is intelligence?: Beyond the Flynn effect

    James R Flynn. What is intelligence?: Beyond the Flynn effect. Cambridge University Press, 2007

  26. [26]

    A learning machine: Part I

    Richard M Friedberg. A learning machine: Part I. IBM Journal of Research and Development, 2(1):2–13, 1958

  27. [27]

    Beyond the Turing Test (workshop), 2014

    Gary Marcus, Francesca Rossi, and Manuela Veloso. Beyond the Turing Test (workshop), 2014

  28. [28]

    Artificial general intelligence

    B. Goertzel and C. Pennachin, editors. Artificial general intelligence. Springer, New York, 2007

  29. [29]

    Intelligence and computer simulation

    Bert F Green Jr. Intelligence and computer simulation. Transactions of the New York Academy of Sciences, 1964

  30. [30]

    Algorithmic information theory

    Peter D. Grünwald and Paul M. B. Vitányi. Algorithmic information theory. 2008

  31. [31]

    Inductive programming meets the real world

    Sumit Gulwani, José Hernández-Orallo, Emanuel Kitzelmann, Stephen H Muggleton, Ute Schmid, and Benjamin Zorn. Inductive programming meets the real world. Communications of the ACM, 58(11):90–99, 2015

  32. [32]

    Program Synthesis

    Sumit Gulwani, Alex Polozov, and Rishabh Singh. Program Synthesis. 2017

  33. [33]

    William H. Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noburu Kuno, Stephanie Milani, Sharada Prasanna Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, Manuela Veloso, and Phillip Wang. The MineRL competition on sample efficient reinforcement learning using human priors. CoRR, 2019

  34. [34]

    Fundamentals of Item Response Theory

    R. Hambleton, H. Swaminathan, and H. Rogers. Fundamentals of Item Response Theory. Sage Publications, Inc., 1991

  35. [35]

    Deep reinforcement learning that matters

    P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. 2018

  36. [36]

    Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement

    José Hernández-Orallo. Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement. Artificial Intelligence Review, pages 397–447, 2017

  37. [37]

    The Measure of All Minds: Evaluating Natural and Artificial Intelligence

    José Hernández-Orallo. The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press, 2017

  38. [38]

    Measuring universal intelligence: Towards an anytime intelligence test

    José Hernández-Orallo and David L Dowe. Measuring universal intelligence: Towards an anytime intelligence test. Artificial Intelligence, 174(18):1508–1539, 2010

  39. [39]

    Universal psychometrics

    José Hernández-Orallo, David L. Dowe, and M. Victoria Hernández-Lloreda. Universal psychometrics. Cogn. Syst. Res., (C):50–74, March 2014

  40. [40]

    A formal definition of intelligence based on an intensional variant of algorithmic complexity

    José Hernández-Orallo and Neus Minaya-Collado. A formal definition of intelligence based on an intensional variant of algorithmic complexity. 1998

  41. [41]

    G. E. Hinton. How neural networks learn from experience. Mind and brain: Readings from the Scientific American magazine, pages 113–124, 1993

  42. [42]

    Human Nature: or The fundamental Elements of Policie

    Thomas Hobbes. Human Nature: or The fundamental Elements of Policie. 1650

  43. [43]

    Universal artificial intelligence: Sequential decisions based on algorithmic probability

    Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2004

  44. [44]

    D. L. Dowe and J. Hernández-Orallo. IQ tests are not for machines, yet. Intelligence, pages 77–81, 2012

  45. [45]

    Predicting the generalization gap in deep networks with margin distributions

    Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. Predicting the generalization gap in deep networks with margin distributions. ArXiv, 2018

  46. [46]

    Measuring the tendency of CNNs to learn surface statistical regularities

    Jason Jo and Yoshua Bengio. Measuring the tendency of CNNs to learn surface statistical regularities. ArXiv, 2017

  47. [47]

    John Raven. Raven Progressive Matrices. Springer, Boston, MA, 2003

  48. [48]

    The structure of human intelligence: It is verbal, perceptual, and image rotation (vpr), not fluid and crystallized

    Wendy Johnson and Thomas J. Bouchard Jr. The structure of human intelligence: It is verbal, perceptual, and image rotation (VPR), not fluid and crystallized. Intelligence, pages 393–416, 2005

  49. [49]

    Obstacle tower: A generalization challenge in vision, control, and planning

    Arthur Juliani, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Ervin Teng, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. Obstacle tower: A generalization challenge in vision, control, and planning. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Aug 2019

  50. [50]

    Illuminating generalization in deep reinforcement learning through procedural level generation

    Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. arXiv preprint arXiv:1806.10729, 2018

  51. [51]

    Building machines that learn and think like people

    Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. CoRR, 2016

  52. [52]

    Deep learning

    Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, (7553):436, 2015

  53. [53]

    A collection of definitions of intelligence

    Shane Legg and Marcus Hutter. A collection of definitions of intelligence. 2007

  54. [54]

    Universal intelligence: A definition of machine intelligence

    Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds and machines, 17(4):391–444, 2007

  55. [55]

    An introduction to Kolmogorov complexity and its applications, volume 3

    Ming Li, Paul Vitányi, et al. An introduction to Kolmogorov complexity and its applications, volume 3. Springer

  56. [56]

    An Essay Concerning Human Understanding

    John Locke. An Essay Concerning Human Understanding. 1689

  57. [57]

    Human performance on the traveling salesman and related problems: A review

    James Macgregor and Yun Chu. Human performance on the traveling salesman and related problems: A review. The Journal of Problem Solving, 3, 02 2011

  58. [58]

    Human performance on the traveling salesman problem

    James Macgregor and Thomas Ormerod. Human performance on the traveling salesman problem. Perception & psychophysics, 58:527–39, 06 1996

  59. [59]

    Deep Learning: A Critical Appraisal

    Gary Marcus. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018

  60. [60]

    Generality in artificial intelligence

    John McCarthy. Generality in artificial intelligence. Communications of the ACM, 30(12):1030–1035, 1987

  61. [61]

    Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence

    Pamela McCorduck. Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence. AK Peters Ltd, 2004

  62. [62]

    The cattell-horn-carroll theory of cognitive abilities: Past, present, and future

    Kevin McGrew. The Cattell-Horn-Carroll theory of cognitive abilities: Past, present, and future. Contemporary Intellectual Assessment: Theories, Tests, and Issues, 01 2005

  63. [63]

    Society of mind

    Marvin Minsky. Society of mind. Simon and Schuster, 1988

  64. [64]

    Place cells, grid cells, and memory

    May-Britt Moser, David C Rowland, and Edvard I Moser. Place cells, grid cells, and memory. Cold Spring Harbor perspectives in biology, 7(2):a021808, 2015

  65. [65]

    Shane Mueller, Matt Jones, Brandon Minnery, Ph Julia, and M Hiland. The BICA cognitive decathlon: A test suite for biologically-inspired cognitive agents. Proceedings of the 16th Conference on Behavior Representation in Modeling and Simulation, 2007

  66. [66]

    A. Newell. You can't play 20 questions with nature and win: Projective comments on the papers of this symposium. 1973

  67. [67]

    Exploring generalization in deep learning

    Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017

  68. [68]

    Behaviour suite for reinforcement learning

    Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvári, Satinder Singh, et al. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568, 2019

  69. [69]

    P. R. Cohen and A. E. Howe. How evaluation guides AI research: the message still counts more than the medium. AI Mag, page 35, 1988

  70. [70]

    Assessing generalization in deep reinforcement learning

    Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Xiaodong Song. Assessing generalization in deep reinforcement learning. ArXiv, 2018

  71. [71]

    The multi-agent reinforcement learning in MalmÖ (MARLÖ) competition

    Diego Perez-Liebana, Katja Hofmann, Sharada Prasanna Mohanty, Noboru Sean Kuno, Andre Kramer, Sam Devlin, Raluca D. Gaina, and Daniel Ionita. The multi-agent reinforcement learning in MalmÖ (MARLÖ) competition. Technical report, 2019

  72. [72]

    General video game ai: a multi-track framework for evaluating agents, games and content generation algorithms

    Diego Perez-Liebana, Jialin Liu, Ahmed Khalifa, Raluca D Gaina, Julian Togelius, and Simon M Lucas. General video game ai: a multi-track framework for evaluating agents, games and content generation algorithms. arXiv preprint arXiv:1802.10363, 2018

  73. [73]

    Reproducible, Reusable, and Robust Reinforcement Learning, 2018

    Joelle Pineau. Reproducible, Reusable, and Robust Reinforcement Learning, 2018. Neural Information Processing Systems

  74. [74]

    S. Pinker. The blank slate: The modern denial of human nature. Viking, New York, 2002

  75. [75]

    David M. W. Powers. The total Turing test and the Loebner Prize. In New Methods in Language Processing and Computational Natural Language Learning, 1998

  76. [76]

    Towards generalization and simplicity in continuous control

    A. Rajeswaran, K. Lowrey, E. V. Todorov, and S. M. Kakade. Towards generalization and simplicity in continuous control. 2017

  77. [77]

    Promise of AI not so bright, 2006

    Fred Reed. Promise of AI not so bright, 2006

  78. [78]

    Emile, or On Education

    Jean-Jacques Rousseau. Emile, or On Education. 1762

  79. [79]

    Distributed memory and the representation of general and specific information

    J. L. McClelland and D. E. Rumelhart. Distributed memory and the representation of general and specific information. Journal of Experimental Psychology, pages 159–188, 1985

  80. [80]

    A computer program capable of passing IQ tests

    P. Sanghi and D. L. Dowe. A computer program capable of passing IQ tests. pages 570–575, 2003
