pith. machine review for the scientific record.

arxiv: 2603.23964 · v2 · submitted 2026-03-25 · 💻 cs.AI

Recognition: no theorem link

From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

Authors on Pith no claims yet

Pith reviewed 2026-05-15 01:16 UTC · model grok-4.3

classification 💻 cs.AI
keywords reinforcement learning · environments · taxonomy · large language models · paradigm shift · empirical study · cognitive capabilities · generalization

The pith

Reinforcement learning environments are splitting into LLM-driven semantic systems and domain-specific physical generalization systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper conducts a large-scale quantitative analysis of more than 2,000 RL publications to map how training environments have evolved from isolated physical simulations toward generalist language-driven agents. It introduces a multi-dimensional taxonomy that classifies environments according to their application domains and the specific cognitive capabilities they test. Automated semantic and statistical processing of the literature identifies a data-supported bifurcation of the field into a Semantic Prior ecosystem centered on large language models and a Domain-Specific Generalization ecosystem. The study further extracts cognitive fingerprints for each branch to explain patterns of skill transfer, interference, and zero-shot generalization. These results supply a concrete roadmap for building Embodied Semantic Simulators that link continuous physical control with high-level logical reasoning.

Core claim

Automated semantic and statistical analysis of a corpus of over 2,000 RL publications reveals a paradigm shift in which the field bifurcates into a Semantic Prior ecosystem dominated by Large Language Models and a Domain-Specific Generalization ecosystem; each ecosystem carries distinct cognitive fingerprints that govern cross-task synergy, multi-domain interference, and zero-shot generalization.

What carries the argument

A novel multi-dimensional taxonomy that classifies RL benchmarks by application domains and required cognitive capabilities, derived through automated processing of publication data.

If this is right

  • Designers of new agents can target the two ecosystems separately before combining their strengths.
  • Cognitive fingerprints offer a practical way to forecast and improve zero-shot transfer between tasks.
  • Embodied Semantic Simulators can be built by deliberately bridging pixel-level control with language-level reasoning.
  • Environment selection for training can be guided by the quantitative trends rather than qualitative judgment alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Separate training regimes for semantic and physical skills may become standard before integration into single agents.
  • Applying the same automated taxonomy method to other AI domains could expose parallel splits in research focus.
  • New environments released after the study can be classified under the taxonomy to test whether the bifurcation continues.

Load-bearing premise

That programmatically analyzing a large corpus of publications produces an unbiased map of RL environment trends without meaningful selection or interpretation bias in the pipeline.

What would settle it

A manual audit of several hundred recent papers that places the majority outside both the Semantic Prior and Domain-Specific Generalization categories or shows no statistical evidence of bifurcation.
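An audit like this settles the question only with some notion of uncertainty attached to the observed coverage. As a hedged sketch (nothing of the kind appears in the paper; the function name and the 210-of-300 figures below are invented for illustration), a Wilson score interval on the share of audited papers falling inside either ecosystem makes the "majority outside both categories" criterion directly testable:

```python
import math

def coverage_interval(in_either, audited, z=1.96):
    """Wilson score interval for the fraction of audited papers that fall
    inside one of the two claimed ecosystems. If the whole interval sits
    below 0.5, the audit places the majority outside both categories."""
    p = in_either / audited
    denom = 1 + z * z / audited
    center = (p + z * z / (2 * audited)) / denom
    half = z * math.sqrt(p * (1 - p) / audited
                         + z * z / (4 * audited * audited)) / denom
    return center - half, center + half

# Hypothetical audit: 210 of 300 sampled papers land in one of the two branches.
lo, hi = coverage_interval(210, 300)
# The interval stays above 0.5, so this hypothetical audit would NOT refute
# the bifurcation claim; an interval below 0.5 would.
```

The Wilson interval is preferred here over the naive normal approximation because audit subsets are small and proportions near the 0.5 decision boundary are exactly where the naive interval misbehaves.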

Figures

Figures reproduced from arXiv: 2603.23964 by Alexey Gorbatovski, Lijing Luo, Sergey Kovalchuk, Xiaodan Liang, Yiben Luo.

Figure 1. The Evolution of Reinforcement Learning Environments: a chronological visual timeline illustrating … [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]
Figure 2. The Evolutionary Tree of Reinforcement Learning Environments: The Ascent of Cognitive Abstraction. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]
Figure 3. The taxonomy of the multi-dimensional spectrum of reinforcement learning task types. [PITH_FULL_IMAGE:figures/full_fig_p011_3.png]
Figure 4. WebArena: The Frontier of Vision-Language-Action (VLA) Fusion. Representing the modern multimodal landscape, this environment requires agents to ground open-ended natural language instructions into dense visual interfaces. It forces a complex synthesis of image-based visual reasoning, structural analysis of HTML DOM trees, and auto-regressive text generation to produce executable actions. Source: webarena…
Figure 5. The multi-dimensional landscape of requisite agent capabilities. The diagram illustrates the diverse skill set … [PITH_FULL_IMAGE:figures/full_fig_p016_5.png]
Figure 6. A taxonomic overview of diverse reinforcement learning application domains. The progression illustrates … [PITH_FULL_IMAGE:figures/full_fig_p019_6.png]
Figure 7. The CARLA Autonomous Driving Simulator. Illustrating the pinnacle of the Autonomous Systems & Navigation domain, CARLA forces agents to process multimodal, heterogeneous sensor streams (including RGB-D, LiDAR point clouds, and GPS). Operating under severe partial observability (POMDP) and stochastic weather conditions, agents must execute high-frequency continuous control while adhering to strict safety…
Figure 9. AlphaStar Mastering StarCraft II. A landmark achievement in Real-Time Strategy, AlphaStar masters the immense complexity of StarCraft II. By integrating deep neural networks with a multi-agent reinforcement learning league, it overcomes imperfect information and a massive combinatorial action space (≈ 10^26) to execute sophisticated macro-strategies and micro-tactics. Source: DeepMind Blog
Figure 10. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png]
Figure 11. Insilico Medicine's Fully Automated Robotics Laboratory. Representing the frontier of AI-driven drug discovery, this system automates complex wet-lab processes. It integrates reinforcement learning to autonomously optimize experimental strategies and process control using massive biological datasets, ensuring strict reproducibility. Source: EurekAlert
Figure 12. A retrospective analysis of the temporal distribution of RL environments by primary application domain. … [PITH_FULL_IMAGE:figures/full_fig_p024_12.png]
Figure 13. The paradigm shift in modality distribution. The figure illustrates the breakdown of single-modality … [PITH_FULL_IMAGE:figures/full_fig_p024_13.png]
Figure 14. Temporal Evolution of Capability Requirements in LLM Environments. This alluvial plot tracks the shifting cognitive demands of environments utilizing LLMs as active agents, based on their inception dates. The data illustrate a rapid escalation from foundational language understanding in early sandboxes to higher-order faculties, such as deduction, long-horizon planning, and tool utilization, in the post-2…
Figure 15. The evolution of primary application domains in a broader field. The trajectory illustrates a shift from … [PITH_FULL_IMAGE:figures/full_fig_p028_15.png]
Figure 16. Evolutionary Trajectory of Agent Capabilities Across Four RL Eras. This citation-weighted alluvial plot illustrates the longitudinal shift in cognitive and physical requirements of RL environments from 2013 to the present. The temporal axis spans four major algorithmic epochs: (1) Classic DRL & Physics, (2) Scalable Games & MARL, (3) Offline & Pre-training, and (4) LLM Agents & Reasoning. The overarching …
Figure 17. Cognitive Fingerprints of LLM-engaged RL Application Subdomains. This row-normalized clustered heatmap illustrates the proportional distribution of required agent capabilities across various domains. Rows (application subdomains) are ordered via hierarchical clustering to reveal structural similarities in cognitive demands. Color intensity denotes the percentage of environments within a subdomain that re…
Figure 18. Evolutionary Trajectory of Agent Capabilities. This citation-weighted alluvial plot illustrates the shifting cognitive and physical requirements of RL environments across four major algorithmic epochs: Classic DRL & Physics, Scalable Games & MARL, Offline & Pre-training, and LLM Agents & Reasoning. The overarching macro-trend demonstrates a definitive migration from low-level continuous control in physica…
read the original abstract

The remarkable progress of reinforcement learning (RL) is intrinsically tied to the environments used to train and evaluate artificial agents. Moving beyond traditional qualitative reviews, this work presents a large-scale, data-driven empirical investigation into the evolution of RL environments. By programmatically processing a massive corpus of academic literature and rigorously distilling over 2,000 core publications, we propose a quantitative methodology to map the transition from isolated physical simulations to generalist, language-driven foundation agents. Implementing a novel, multi-dimensional taxonomy, we systematically analyze benchmarks against diverse application domains and requisite cognitive capabilities. Our automated semantic and statistical analysis reveals a profound, data-verified paradigm shift: the bifurcation of the field into a "Semantic Prior" ecosystem dominated by Large Language Models (LLMs) and a "Domain-Specific Generalization" ecosystem. Furthermore, we characterize the "cognitive fingerprints" of these distinct domains to uncover the underlying mechanisms of cross-task synergy, multi-domain interference, and zero-shot generalization. Ultimately, this study offers a rigorous, quantitative roadmap for designing the next generation of Embodied Semantic Simulators, bridging the gap between continuous physical control and high-level logical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a large-scale empirical study analyzing over 2,000 publications on reinforcement learning environments. It proposes a multi-dimensional taxonomy to map the evolution from physical simulations to language-driven agents and claims a paradigm shift bifurcating the field into a 'Semantic Prior' ecosystem dominated by LLMs and a 'Domain-Specific Generalization' ecosystem, while characterizing 'cognitive fingerprints' for cross-task synergy and zero-shot generalization.

Significance. If the automated analysis is shown to be robust, the quantitative taxonomy could provide a valuable roadmap for designing next-generation embodied semantic simulators that bridge continuous physical control and high-level reasoning, moving the field beyond purely qualitative reviews.

major comments (2)
  1. [Methodology] The automated semantic and statistical analysis section provides no description of the embedding model, dimensionality reduction technique, clustering algorithm, or any other implementation details used to derive the multi-dimensional taxonomy and the bifurcation into Semantic Prior and Domain-Specific Generalization ecosystems. This prevents assessment of whether the reported paradigm shift is an artifact of the pipeline choices.
  2. [Results] No human-annotated ground-truth dataset, precision/recall metrics, or external benchmark comparisons are reported to validate the automated labels or the claimed cross-task synergy and zero-shot generalization patterns. The central claim of a 'data-verified paradigm shift' therefore rests on an unvalidated process.
minor comments (1)
  1. [Abstract] The abstract introduces 'cognitive fingerprints' without a concise operational definition, which should be clarified early to aid reader comprehension.
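The validation asked for in major comment 2 reduces to a small amount of arithmetic once a human-annotated subset exists. As a hedged sketch (the paper reports no such audit; the function, labels, and data below are hypothetical), precision and recall of the automated labels against human annotation, plus Cohen's kappa as a chance-corrected agreement measure, could be computed as:

```python
from collections import Counter

def validation_report(gold, predicted, positive):
    """Precision/recall of automated labels against human annotations,
    plus Cohen's kappa (observed agreement corrected for chance)."""
    assert len(gold) == len(predicted)
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    n = len(gold)
    p_o = sum(1 for g, p in zip(gold, predicted) if g == p) / n  # observed agreement
    gc, pc = Counter(gold), Counter(predicted)
    p_e = sum(gc[l] * pc[l] for l in set(gold) | set(predicted)) / (n * n)  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
    return precision, recall, kappa

# Hypothetical 5-paper audit: human labels vs. pipeline output.
gold = ["semantic", "semantic", "domain", "domain", "semantic"]
pred = ["semantic", "domain", "domain", "domain", "semantic"]
precision, recall, kappa = validation_report(gold, pred, positive="semantic")
# precision = 1.0, recall ≈ 0.667, kappa ≈ 0.615
```

Reporting kappa alongside precision/recall matters here because the corpus is heavily skewed toward recent LLM-centric papers, and raw agreement on an imbalanced label set can look high by chance alone.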

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical analysis of reinforcement learning environments. The comments highlight important areas for improving methodological transparency and validation, which we address below.

read point-by-point responses
  1. Referee: [Methodology] The automated semantic and statistical analysis section provides no description of the embedding model, dimensionality reduction technique, clustering algorithm, or any other implementation details used to derive the multi-dimensional taxonomy and the bifurcation into Semantic Prior and Domain-Specific Generalization ecosystems. This prevents assessment of whether the reported paradigm shift is an artifact of the pipeline choices.

    Authors: We agree that the original manuscript omits key implementation details of the automated pipeline. In the revised version, we will insert a dedicated 'Implementation Details' subsection describing the embedding model (all-MiniLM-L6-v2 via sentence-transformers), dimensionality reduction (UMAP with n_neighbors=15 and min_dist=0.1), clustering (HDBSCAN with min_cluster_size=5), and the statistical procedures used to detect the bifurcation and cognitive fingerprints. These additions will support reproducibility and allow readers to evaluate whether the observed paradigm shift depends on specific pipeline choices. revision: yes

  2. Referee: [Results] No human-annotated ground-truth dataset, precision/recall metrics, or external benchmark comparisons are reported to validate the automated labels or the claimed cross-task synergy and zero-shot generalization patterns. The central claim of a 'data-verified paradigm shift' therefore rests on an unvalidated process.

    Authors: We acknowledge the absence of quantitative validation in the submitted manuscript. The revised version will add a validation subsection that reports a human annotation study on a random subset of 150 papers, yielding precision/recall figures and inter-annotator agreement against the automated labels. We will also include direct comparisons with prior qualitative taxonomies from the RL literature. While exhaustive ground-truth labeling of the full corpus remains resource-intensive, these targeted validations will provide concrete support for the bifurcation, cross-task synergy, and zero-shot patterns. revision: yes
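The rebuttal names a specific stack (sentence-transformers embeddings, UMAP, HDBSCAN), none of which is verifiable from the submitted manuscript. As a dependency-free illustration of the shape of such a pipeline, not the authors' actual method, a bag-of-words embedding plus plain 2-means clustering is enough to show how a corpus of abstracts could be split into two candidate ecosystems; every abstract text and name below is invented:

```python
import math
from collections import Counter

def embed(text, vocab):
    """Bag-of-words term-frequency vector over a fixed vocabulary
    (a toy stand-in for the sentence-transformer embeddings named above)."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def kmeans2(vectors, iters=20):
    """Plain 2-means clustering, a stand-in for UMAP + HDBSCAN, used only
    to check whether the corpus separates into two clusters."""
    centers = [vectors[0], vectors[-1]]  # deterministic seeding
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min((0, 1), key=lambda c: math.dist(v, centers[c]))
                  for v in vectors]
        for c in (0, 1):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Four invented "abstracts": two LLM-flavoured, two physical-control-flavoured.
abstracts = [
    "llm agents language reasoning",
    "language reasoning llm tools",
    "physics control locomotion",
    "locomotion control physics simulation",
]
vocab = sorted(set(" ".join(abstracts).split()))
vecs = [embed(a, vocab) for a in abstracts]
labels = kmeans2(vecs)
# The two LLM-flavoured abstracts land in one cluster and the two
# control-flavoured ones in the other: a toy "bifurcation".
```

The point of the sketch is the referee's, not the authors': every stage (vocabulary choice, embedding, cluster count) is a pipeline decision that can manufacture or erase a bifurcation, which is why the promised implementation details and ablations are load-bearing.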

Circularity Check

0 steps flagged

No circularity: empirical taxonomy derived from corpus analysis

full rationale

The paper conducts a data-driven literature review by programmatically processing >2000 publications to propose a novel multi-dimensional taxonomy and then applies automated semantic/statistical analysis to identify a bifurcation into Semantic Prior and Domain-Specific Generalization ecosystems. This bifurcation is reported as an output of the analysis on the processed corpus rather than a definitional premise or fitted parameter renamed as a prediction. No equations, self-citations, uniqueness theorems, or ansatzes are invoked in the provided sections that would reduce the central claims to their own inputs by construction. The methodology remains self-contained as an empirical mapping exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Central claims rest on the representativeness of the selected literature corpus and the validity of automated semantic extraction; new categories are introduced without external falsifiable tests.

axioms (1)
  • domain assumption The corpus of over 2,000 core publications accurately represents the evolution of RL environments
    All quantitative findings and the paradigm-shift claim are derived from processing this corpus.
invented entities (3)
  • Semantic Prior ecosystem no independent evidence
    purpose: Label environments dominated by LLMs for semantic understanding
    Introduced as one branch of the bifurcation without independent validation data.
  • Domain-Specific Generalization ecosystem no independent evidence
    purpose: Label environments focused on task-specific generalization
    Introduced as the contrasting branch of the paradigm shift.
  • cognitive fingerprints no independent evidence
    purpose: Describe domain-specific mechanisms of synergy and interference
    New term coined to characterize analysis outputs.

pith-pipeline@v0.9.0 · 5514 in / 1395 out tokens · 57368 ms · 2026-05-15T01:16:04.541139+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

221 extracted references · 47 canonical work pages · 29 internal anchors

  1. [1]

    MIT press, 2018

    Richard S Sutton and Andrew G Barto.Reinforce- ment learning: An introduction. MIT press, 2018

  2. [2]

    Mastering the game of go with deep neural networks and tree search.Nature, 529(7587):484– 489, 2016

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Pan- neershelvam, Marc Lanctot, Sander Dieleman, Do- minik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Ko- ray Kavukcuoglu, Thore Graepel, and Demis Has- sabis. Masterin...

  3. [3]

    Grandmaster level in star- craft ii using multi-agent reinforcement learning

    Oriol Vinyals, Igor Babuschkin, Wojciech M Czar- necki, Michaël Mathieu, Andrew Dudzik, Juny- oung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Hor- gan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P Agapiou, Max Jaderberg, Alexander S Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalib...

  4. [4]

    End-to-end training of deep visuomo- tor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

    Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomo- tor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

  5. [5]

    Rusu, Joel Veness, Marc G

    V olodymyr Mnih, Koray Kavukcuoglu, David Sil- ver, Andrei A. Rusu, Joel Veness, Marc G. Belle- mare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level con- trol through deep reinforc...

  6. [6]

    Deep reinforcement learning that matters

    Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. InPro- ceedings of the AAAI Conference on Artificial Intel- ligence, volume 32, 2018

  7. [7]

    The arcade learning environment: An evaluation platform for general agents.Jour- nal of Artificial Intelligence Research, 47:253–279, 2013

    Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents.Jour- nal of Artificial Intelligence Research, 47:253–279, 2013

  8. [8]

    Mu- JoCo: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mu- JoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelli- gent Robots and Systems, pages 5026–5033. IEEE, 2012

  9. [9]

    Unity: A general platform for intelligent agents.arXiv:1809.02627, 2018

    Arthur Juliani, Vincent-Pierre Berges, Ervin Teng, Andrew Cohen, Jonathan Harper, Chris Elion, Christopher Goy, Yuan Gao, Hunter Henry, Mar- wan Mattar, and Danny Lange. Unity: A general platform for intelligent agents.arXiv:1809.02627, 2018

  10. [10]

    Language models are few-shot learners.Advances in Neural Information Process- ing Systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. Language models are few-shot learners.Advances in Neural Information Process- ing Systems, 33:1877–1901, 2020

  11. [11]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report.arXiv:2303.08774, 2023

  12. [12]

    Neuronlike adaptive elements that can solve difficult learning control problems.IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 1983

    Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems.IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 1983

  13. [13]

    Domain randomization for transferring deep neural networks from simulation to the real world

    Josh Tobin, Rachel Fong, Alex Ray, Jonas Schnei- der, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017

  14. [14]

    Leveraging procedural generation 34 An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments to benchmark reinforcement learning

    Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation 34 An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments to benchmark reinforcement learning. InInter- national Conference on Machine Learning, pages 2048–2056. PMLR, 2020

  15. [15]

    A markovian decision process

    Richard Bellman. A markovian decision process. Journal of Mathematics and Mechanics, pages 679– 684, 1957

  16. [16]

    Textworld: A learning environment for text-based games

    Marc-Alexandre Côté, Akos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla cross El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. Textworld: A learning environment for text-based games. InWorkshop on Computer Games, pages 41–75, 2018

  17. [17]

    Re- act: Synergizing reasoning and acting in language models, 2023

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. Re- act: Synergizing reasoning and acting in language models, 2023

  18. [18]

    Transfer learning for reinforcement learning domains: A survey.Journal of Machine Learning Research, 10(Jul):1633–1685, 2009

    Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey.Journal of Machine Learning Research, 10(Jul):1633–1685, 2009

  19. [19]

    Model-agnostic meta-learning for fast adaptation of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. InInternational Conference on Machine Learning, pages 1126–1135, 2017

  20. [20]

    The episodic buffer: a new com- ponent of working memory?Trends in Cognitive Sciences, 4(11):417–423, 2000

    Alan D Baddeley. The episodic buffer: a new com- ponent of working memory?Trends in Cognitive Sciences, 4(11):417–423, 2000

  21. [21]

    Darwin’s mistake: Explaining the discon- tinuity between human and nonhuman minds.Be- havioral and Brain Sciences, 31(2):109–130, 2008

    Derek C Penn, Keith J Holyoak, and Daniel J Povinelli. Darwin’s mistake: Explaining the discon- tinuity between human and nonhuman minds.Be- havioral and Brain Sciences, 31(2):109–130, 2008

  22. [22]

    How to grow a mind: Statistics, structure, and abstraction.Science, 331(6022):1279–1285, 2011

    Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind: Statistics, structure, and abstraction.Science, 331(6022):1279–1285, 2011

  23. [23]

    Does the chim- panzee have a theory of mind?Behavioral and Brain Sciences, 1(4):515–526, 1978

    David Premack and Guy Woodruff. Does the chim- panzee have a theory of mind?Behavioral and Brain Sciences, 1(4):515–526, 1978

  24. [24]

    Prospective memory: Theoretical considerations and opera- tional definitions.The Cognitive Neuroscience of Memory, pages 112–128, 2007

    Sam J Gilbert and Paul W Burgess. Prospective memory: Theoretical considerations and opera- tional definitions.The Cognitive Neuroscience of Memory, pages 112–128, 2007

  25. [25]

    Abstract representations of numbers in the animal and human brain.Trends in Neurosciences, 21(8):355–361, 1998

    Stanislas Dehaene, Ghislaine Dehaene-Lambertz, and Laurent Cohen. Abstract representations of numbers in the animal and human brain.Trends in Neurosciences, 21(8):355–361, 1998

  26. [26]

    The sensorimotor foundations of higher cognition

    Daniel M Wolpert, Zoubin Ghahramani, and J Ran- dall Flanagan. The sensorimotor foundations of higher cognition. InCommon Minds: Themes from the Philosophy of Philip Pettit. Oxford University Press, 2003

  27. [27]

    Three models for the description of language.IRE Transactions on Information Theory, 2(3):113–124, 1956

    Noam Chomsky. Three models for the description of language.IRE Transactions on Information Theory, 2(3):113–124, 1956

  28. [28]

    Planning and acting in partially observable stochastic domains.Artificial Intelli- gence, 101(1-2):99–134, 1998

    Leslie Pack Kaelbling, Michael L Littman, and An- thony R Cassandra. Planning and acting in partially observable stochastic domains.Artificial Intelli- gence, 101(1-2):99–134, 1998

  29. [29]

    Superhuman AI for multiplayer poker.Science, 365(6456):885– 890, 2019

    Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker.Science, 365(6456):885– 890, 2019

  30. [30]

    Continuous con- trol with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous con- trol with deep reinforcement learning. InInter- national Conference on Learning Representations (ICLR), 2016

  31. [31]

    Deep recur- rent q-learning for partially observable mdps

    Matthew Hausknecht and Peter Stone. Deep recur- rent q-learning for partially observable mdps. In 2015 AAAI Fall Symposium Series, 2015

  32. [32]

    Policy invariance under reward transforma- tions: Theory and application to reward shaping

    Andrew Y Ng, Daishi Harada, and Stuart Rus- sell. Policy invariance under reward transforma- tions: Theory and application to reward shaping. InInternational Conference on Machine Learning (ICML), pages 278–287, 1999

  33. [33]

    Exploration by random network dis- tillation

    Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network dis- tillation. InInternational Conference on Learning Representations (ICLR), 2019

  34. [34]

    A survey of multi- objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

    Diederik M Roijers, Peter Vamplew, Shimon White- son, and Richard Dazeley. A survey of multi- objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

  35. [35]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kel- ton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  36. [36]

    OpenAI Gym

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wo- jciech Zaremba. Openai gym.arXiv:1606.01540, 2016

  37. [37]

    SWE-bench: Can language models resolve real-world GitHub issues? InThe Twelfth International Conference on Learning Rep- resentations, 2024

    Carlos E Jimenez, John Yang Murphy, Alexander Shirinov, Kweon Chen, Austin McMillan, Guil- laume Lample, et al. SWE-bench: Can language models resolve real-world GitHub issues? InThe Twelfth International Conference on Learning Rep- resentations, 2024

  38. [38]

    We- bArena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F Hou, Yikang Cheng, Keisuke Hong, Graham Neubig, and Pengcheng Yin. We- bArena: A realistic web environment for building autonomous agents. InThe Twelfth International Conference on Learning Representations, 2024

  39. [39]

    A com- prehensive survey on safe reinforcement learning

    Javier García and Fernando Fernández. A com- prehensive survey on safe reinforcement learning. 35 An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments Journal of Machine Learning Research, 16(1):1437– 1480, 2015

  40. [40]

    On the Measure of Intelligence

    François Chollet. On the measure of intelligence. arXiv:1911.01547, 2019

  41. [41]

    Processbench: Identifying process errors in mathematical reasoning, 2025

    Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning, 2025

  42. [42]

    Reward is enough.Artificial Intelligence, 299:103535, 2021

    David Silver, Satinder Singh, Doina Precup, and Richard S Sutton. Reward is enough.Artificial Intelligence, 299:103535, 2021

  43. [43]

    Vizdoom: A doom-based ai research platform for visual re- inforcement learning

    Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Ja´skowski. Vizdoom: A doom-based ai research platform for visual re- inforcement learning. In2016 IEEE Conference on Computational Intelligence and Games (CIG), pages 1–8, 2016

  44. [44]

    Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. Deepmind lab. arXiv:1612.03801, 2016

  45. [45]

    Minimalistic gridworld environment for OpenAI Gym

    Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. GitHub repository, 2018

  46. [46]

    Habitat: A platform for embodied AI research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Ma- lik, et al. Habitat: A platform for embodied AI research. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 9339– 9347, 2019

  47. [47]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100, 2020

  48. [48]

    ALFWorld: Aligning text and embod- ied environments for interactive learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embod- ied environments for interactive learning. InInter- national Conference on Learning Representations, 2021

  49. [49]

    Brax–a differentiable physics engine for large scale rigid body simulation

    C Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax–a differentiable physics engine for large scale rigid body simulation. In Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021

  50. [50]

    Minedojo: Building open-ended embodied agents with internet-scale knowledge, 2022

    Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge, 2022

  51. [51]

    WebShop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems, volume 35, pages 20744–20757, 2022

  52. [52]

    Training software engineering agents and verifiers with swe-gym, 2025

    Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2025

  53. [53]

    Sparks of artificial general intelligence: Early experiments with gpt-4, 2023

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023

  54. [54]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Ruiyuan Zhao, Ruisheng Cao, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024

  55. [55]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv:2405.14573, 2024

  56. [56]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv:2410.07095, 2024

  57. [57]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  58. [58]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

  59. [59]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  60. [60]

    The lean 4 theorem prover and programming language

    Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In Automated Deduction–CADE 28, pages 625–635, 2021

  61. [61]

    miniF2F: A cross-system benchmark for formal olympiad-level mathematics

    Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. miniF2F: A cross-system benchmark for formal olympiad-level mathematics. In International Conference on Learning Representations, 2022

  62. [62]

    Gpqa: A graduate-level google-proof q&a benchmark, 2023

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023

  63. [63]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code, 2024. arXiv:2403.07974

  64. [64]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association fo...

  65. [65]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, 2023

  66. [66]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv:2501.12948, 2025

  67. [67]

    From local to global: A graph rag approach to query-focused summarization, 2025

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization, 2025

  68. [68]

    Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls

    Jinyang Li, Binyuan Hui, Chengwei Qu, Binhua Li, Ruiying Geng, Bowen Li, Bailin Wang, Bowen Qin, Ruiyao Dong, Chenhao Zhang, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. In Advances in Neural Information Processing Systems, volume 36, 2023

  69. [69]

    Competition-level code generation with AlphaCode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson...

  70. [70]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, 2024

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, 2024

  71. [71]

    Solving sokoban using hierarchical reinforcement learning with landmarks, 2025

    Sergey Pastukhov. Solving sokoban using hierarchical reinforcement learning with landmarks, 2025

  72. [72]

    Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards, 2025

    Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas Köpf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards, 2025

  73. [73]

    Dynamic programming and optimal control: Vol I

    Dimitri P Bertsekas. Dynamic programming and optimal control: Vol I. Athena scientific, 2012

  74. [74]

    Learning to communicate with deep multi-agent reinforcement learning, 2016

    Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning, 2016

  75. [75]

    A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018

  76. [76]

    Multi-agent actor-critic for mixed cooperative-competitive environments, 2020

    Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments, 2020

  77. [77]

    Magent: A many-agent reinforcement learning platform for artificial collective intelligence, 2017

    Lianmin Zheng, Jiacheng Yang, Han Cai, Weinan Zhang, Jun Wang, and Yong Yu. Magent: A many-agent reinforcement learning platform for artificial collective intelligence, 2017

  78. [78]

    The hanabi challenge: A new frontier for ai research

    Nolan Bard, Jakob N. Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H. Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, Iain Dunning, Shibl Mourad, Hugo Larochelle, Marc G. Bellemare, and Michael Bowling. The hanabi challenge: A new frontier for ai research. Artificial Intelligence, 280:103216, 2020

  79. [79]

    Application of self-play reinforcement learning to a four-player game of imperfect information, 2018

    Henry Charlesworth. Application of self-play reinforcement learning to a four-player game of imperfect information, 2018

  80. [80]

    Google research football: A novel reinforcement learning environment

    Karol Kurach, Anton Raichuk, Piotr Stańczyk, Michał Zając, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, and Sylvain Gelly. Google research football: A novel reinforcement learning environment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4501–4510, 2020

Showing first 80 references.