
arxiv: 2607.00527 · v1 · pith:4ELUVY7Bnew · submitted 2026-07-01 · 💻 cs.AI

AI Native Games: A Survey and Roadmap

Pith reviewed 2026-07-02 12:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI-native games · generative AI · game design · core gameplay loop · taxonomy · narrative games · procedural generation · mechanical invariants

The pith

AI-native games are those in which runtime generative AI is essential to the core play loop: removing it would collapse or fundamentally alter the central form of play.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines AI-native games by a counterfactual test: runtime generative AI must be constitutive of the core loop, such that its removal or trivial replacement would make the central form of play collapse or change fundamentally. This criterion is used to screen candidates and analyze 53 publicly available games and prototypes. A dual-axis taxonomy is introduced to classify them by game type on one axis and dominant AI mechanic on the other. The analysis shows concentration in language-forward designs, identifies the problem of turning semantic openness into stable gameplay, and provides a roadmap for future work.

Core claim

AI-native games are defined by whether runtime generative AI is constitutive of the core loop. The test that separates them from AI-augmented games and other forms is counterfactual: removing or trivially replacing the AI component would collapse or fundamentally change the central form of play. Screening yields 53 examples that cluster around language-forward designs such as narrative adventure, epistemic interaction, and generative narrative, while other categories remain less represented. The central design problem is organizing semantic openness into stable gameplay through mechanical invariants: goals, rules, state, feedback, pacing, and player agency.

What carries the argument

The counterfactual criterion that determines whether runtime generative AI is constitutive of the core loop by checking if its removal or trivial replacement would collapse or fundamentally change the central form of play.
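
One way to make the counterfactual criterion concrete is to record it as a small checklist per candidate artifact. The sketch below is illustrative only: the `Artifact` fields, the `is_ai_native` function, and the three example entries are hypothetical and do not come from the paper, which states the criterion in prose. It encodes one possible reading, in which an artifact passes only if both removal and trivial replacement of the AI would break the core loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Hypothetical record of an analyst's judgments about one candidate."""
    name: str
    uses_runtime_generation: bool     # generative AI runs during play, not only in production
    loop_survives_removal: bool       # core loop still works with the AI removed
    loop_survives_trivial_swap: bool  # core loop still works with a static or rule-based stand-in


def is_ai_native(a: Artifact) -> bool:
    """One reading of the counterfactual test (an assumption, not the paper's procedure)."""
    if not a.uses_runtime_generation:
        return False
    # Both interventions must break the core loop for the AI to count as constitutive.
    return not a.loop_survives_removal and not a.loop_survives_trivial_swap


# Invented placeholder candidates, not entries from the 53-artifact corpus.
candidates = [
    Artifact("open_ended_story_game", True, False, False),
    Artifact("shooter_with_generated_barks", True, True, True),
    Artifact("offline_asset_pipeline", False, True, True),
]

for c in candidates:
    print(f"{c.name}: AI-native={is_ai_native(c)}")
```

The three boolean fields are exactly where analyst judgment enters, which is why the load-bearing premise below matters: the function is deterministic, but its inputs are not.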

If this is right

  • The current corpus is concentrated in language-forward designs such as narrative adventure and generative narrative.
  • Categories such as semantic adjudication, multi-agent simulation, generative construction, and relationship/companion play are underrepresented.
  • Mechanical invariants (goals, rules, state, feedback, pacing, and player agency) are required to make open-ended AI outputs interpretable and consequential.
  • Development priorities include controllable generation, AI-as-mechanic design, multimodal and multi-agent systems, inference economics, evaluation, safety, and regulation.
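
The point about mechanical invariants can be illustrated with a thin validation layer that sits between free-form model output and game state. This is a minimal sketch under stated assumptions, not a design taken from the paper: `ALLOWED_EFFECTS`, `GameState`, and `apply_model_output` are hypothetical names, the rule set is invented, and the model output is stubbed as a plain string rather than produced by any real model API.

```python
import json
from dataclasses import dataclass, field

# Hypothetical fixed rule: only these stats may change, and only by a bounded amount.
ALLOWED_EFFECTS = {"trust": (-2, 2), "suspicion": (-2, 2)}


@dataclass
class GameState:
    trust: int = 0
    suspicion: int = 0
    log: list = field(default_factory=list)


def apply_model_output(state: GameState, raw: str) -> bool:
    """Apply a model-proposed effect only if it fits the fixed rules.

    Returns True if the state changed, False if the proposal was rejected.
    """
    try:
        proposal = json.loads(raw)
    except json.JSONDecodeError:
        state.log.append("rejected: not valid JSON")
        return False
    if not isinstance(proposal, dict):
        state.log.append("rejected: not an object")
        return False
    stat, delta = proposal.get("stat"), proposal.get("delta")
    if not isinstance(stat, str) or stat not in ALLOWED_EFFECTS:
        state.log.append("rejected: unknown stat")
        return False
    if not isinstance(delta, int) or isinstance(delta, bool):
        state.log.append("rejected: delta is not an integer")
        return False
    low, high = ALLOWED_EFFECTS[stat]
    if not low <= delta <= high:
        state.log.append("rejected: delta out of bounds")
        return False
    setattr(state, stat, getattr(state, stat) + delta)
    state.log.append(f"applied: {stat} {delta:+d}")  # feedback the player can read
    return True


state = GameState()
apply_model_output(state, '{"stat": "trust", "delta": 1}')   # fits the rules
apply_model_output(state, '{"stat": "gold", "delta": 100}')  # stat not in the rule set
apply_model_output(state, "The guard smiles warmly.")        # free text, no effect
print(state.trust, state.log)
```

Here the open-ended text can still be shown to the player, but only proposals that pass the fixed rules become consequential, which is one way to read "interpretable and consequential" in the bullet above.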

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The definition could guide developers on when generative AI should be deeply integrated rather than added as a feature.
  • The focus on semantic openness may apply to interactive systems outside games that rely on runtime generation.
  • The roadmap's emphasis on inference economics points to practical barriers for widespread adoption of these games.
  • Safety and regulation issues could affect public deployment of games that depend on open-ended AI outputs.

Load-bearing premise

The counterfactual test of removing or trivially replacing the AI can be applied consistently across games without subjective judgment or selection bias.

What would settle it

A set of games where independent analysts reach conflicting conclusions about whether removing the generative AI changes the core play loop would show the test cannot be applied reliably.

Figures

Figures reproduced from arXiv: 2607.00527 by Clark Verbrugge, Fandi Meng, Jian Zhao, Kaijie Xu, Simon Lucas, Zhiyue Xu.

Figure 1: Representative roadmap of AI-native games and adjacent artifacts from early interactive drama to runtime generative AI systems.
Figure 2: Distribution of the dated corpus (n=53) by game type (G).
Figure 3: Distribution of the dated corpus (n=53) by dominant AI mechanic
Figure 4: Cross matrix of game type (G, columns) and dominant AI mechanic (N, rows) over the 53 dated artifacts. Cell values are game counts;
Original abstract

Generative AI now enables games to produce dialogue, quests, characters, images, and worlds at runtime. Yet generation alone does not make a game AI-native, nor does it guarantee playability. This paper defines AI-native games by whether runtime generative AI is constitutive of the core loop: if the AI component were removed or trivially replaced, the central form of play would collapse or become fundamentally different. This counterfactual criterion separates AI-native games from AI-augmented games, boundary artifacts, chatbots, tavern-style role-play, procedural content generation, and AI-assisted production. Using this definition, we screen candidate artifacts and analyze 53 publicly available AI-native games and prototypes. We introduce a dual-axis G/N taxonomy: the G-axis captures player-facing game type, while the N-axis captures the dominant AI mechanic that makes generative AI indispensable to play. The corpus is concentrated around language-forward designs, especially narrative adventure, epistemic interaction, and generative narrative, while categories such as semantic adjudication, multi-agent simulation, generative construction, and relationship/companion play remain less represented. We argue that the central design problem is organizing semantic openness into stable gameplay. AI-native design depends on mechanical invariants: goals, rules, state, feedback, pacing, and player agency that make open-ended AI outputs interpretable and consequential. We conclude with a roadmap for controllable generation, AI-as-mechanic design, multimodal and multi-agent systems, inference economics, evaluation, safety, and regulation.
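
The dual-axis taxonomy lends itself to a simple tally: each artifact receives one G label and one N label, and the cross matrix counts artifacts per cell. The sketch below shows that bookkeeping only. The labels `G1`..`G3` and `N1`..`N3` and the five example pairs are placeholders invented for illustration; they are not the paper's category codes or its actual assignments.

```python
from collections import Counter

# Hypothetical (G, N) label pairs, one per artifact. Placeholder data only.
labels = [("G1", "N1"), ("G1", "N1"), ("G1", "N2"), ("G2", "N1"), ("G3", "N3")]


def cross_matrix(pairs):
    """Tally artifacts per (G, N) cell.

    Returns (n_rows, g_cols, matrix) where matrix[i][j] is the number of
    artifacts with mechanic n_rows[i] and game type g_cols[j].
    """
    cells = Counter(pairs)
    g_cols = sorted({g for g, _ in pairs})
    n_rows = sorted({n for _, n in pairs})
    matrix = [[cells[(g, n)] for g in g_cols] for n in n_rows]
    return n_rows, g_cols, matrix


n_rows, g_cols, matrix = cross_matrix(labels)
print("     " + "  ".join(g_cols))
for n, row in zip(n_rows, matrix):
    print(n + "   " + "   ".join(str(v) for v in row))

# Share of artifacts in the most populated G column, a rough concentration measure.
column_totals = [sum(col) for col in zip(*matrix)]
print(max(column_totals) / sum(column_totals))
```

The layout follows the orientation described for Figure 4 (G in columns, N in rows). A concentration claim then amounts to a statement about how unevenly the column or row totals are distributed.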

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper defines AI-native games via a counterfactual test: runtime generative AI is constitutive of the core loop if its removal or trivial replacement would collapse or fundamentally alter the central form of play. This separates AI-native games from AI-augmented ones and related categories. The authors apply the definition to screen and analyze a corpus of 53 publicly available games/prototypes, introduce a dual-axis G/N taxonomy (G-axis for player-facing game type, N-axis for dominant AI mechanic), observe concentration in language-forward designs (narrative adventure, epistemic interaction, generative narrative), and outline a roadmap addressing controllable generation, AI-as-mechanic design, multimodal/multi-agent systems, inference economics, evaluation, safety, and regulation.

Significance. If the definition and taxonomy can be applied reproducibly, the work supplies a needed conceptual boundary for an emerging subfield, grounds it in an empirical corpus of 53 artifacts, and identifies the core design challenge of turning semantic openness into stable, interpretable gameplay. The roadmap surfaces concrete open problems (e.g., mechanical invariants for agency and feedback) that could guide subsequent technical and design research.

major comments (2)
  1. Definition and screening process (abstract; the section introducing the counterfactual criterion): the test for whether generative AI is 'constitutive of the core loop' is not accompanied by explicit, reproducible operational criteria for identifying the core loop, assessing 'trivial replacement,' or determining when play 'collapses or becomes fundamentally different.' This judgment is load-bearing for the classification of all 53 games and for the subsequent claim that the corpus is concentrated in particular G/N categories.
  2. Corpus construction and taxonomy application (the section describing the 53-game analysis and G/N taxonomy): no details are provided on how boundary cases were resolved, whether multiple annotators were used, or what inter-rater agreement was obtained when assigning games to G-axis and N-axis categories. Without such information the reported concentration around language-forward designs cannot be evaluated for selection or assignment bias.
minor comments (2)
  1. [Abstract / screening description] The abstract states that the definition 'separates AI-native games from AI-augmented games, boundary artifacts, chatbots, tavern-style role-play, procedural content generation, and AI-assisted production,' but the manuscript does not include a dedicated table or appendix listing the screened-but-excluded candidates and the reasons for exclusion.
  2. [Taxonomy introduction] Notation for the G/N taxonomy is introduced without an accompanying figure that visually maps the two axes and populates them with example games from the corpus.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important opportunities to strengthen the reproducibility of our definition and analysis. We address each major comment below and commit to revisions that improve transparency without altering the core contributions.

Point-by-point responses
  1. Referee: Definition and screening process (abstract; the section introducing the counterfactual criterion): the test for whether generative AI is 'constitutive of the core loop' is not accompanied by explicit, reproducible operational criteria for identifying the core loop, assessing 'trivial replacement,' or determining when play 'collapses or becomes fundamentally different.' This judgment is load-bearing for the classification of all 53 games and for the subsequent claim that the corpus is concentrated in particular G/N categories.

    Authors: We agree that the counterfactual criterion would benefit from explicit operational criteria to support reproducible application. The manuscript presents the definition conceptually, which is typical for introducing a new category in a survey, but the referee correctly identifies that this leaves room for subjective judgment in screening. In the revised manuscript we will expand the relevant section with operational guidelines: (1) core loop identification via primary player actions, goals, and win/lose conditions; (2) trivial replacement assessment by testing whether static or rule-based substitutes preserve equivalent play dynamics; (3) collapse determination via loss of essential features such as runtime semantic adaptation or branching. We will include worked examples from the corpus to illustrate application. revision: yes

  2. Referee: Corpus construction and taxonomy application (the section describing the 53-game analysis and G/N taxonomy): no details are provided on how boundary cases were resolved, whether multiple annotators were used, or what inter-rater agreement was obtained when assigning games to G-axis and N-axis categories. Without such information the reported concentration around language-forward designs cannot be evaluated for selection or assignment bias.

    Authors: We acknowledge the lack of methodological detail on the annotation process. Screening and G/N assignment were conducted by the author team via iterative discussion and consensus, with boundary cases resolved by returning to the counterfactual test. No formal multi-annotator protocol or inter-rater statistics were used, as this is an author-driven survey rather than a crowdsourced annotation effort. In revision we will add a dedicated 'Screening and Taxonomy Methodology' subsection describing the process, boundary-case resolution examples, the complete classification table for all 53 artifacts, and an explicit discussion of potential biases. The full list of games and assignments will be released publicly to enable external verification. While we cannot retroactively compute inter-rater agreement, these additions will substantially improve evaluability of the concentration claims. revision: partial
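
The inter-rater agreement the referee asks about is commonly summarized with Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between two raters and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below shows how an independent re-annotation of the released classification table could compute it. The function name and the two label lists are hypothetical; the rebuttal states that no such statistics were computed for the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented screening decisions for six candidate artifacts (placeholder data).
analyst_1 = ["native", "native", "augmented", "native", "augmented", "native"]
analyst_2 = ["native", "augmented", "augmented", "native", "augmented", "native"]

print(round(cohens_kappa(analyst_1, analyst_2), 3))  # 0.667
```

In this toy case the analysts agree on 5 of 6 items (p_o ≈ 0.833) and chance agreement is 0.5, giving κ ≈ 0.667. The same function applies unchanged to G-axis or N-axis assignments, since it only compares categorical labels.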

Circularity Check

0 steps flagged

No circularity: purely definitional survey with no derivations, fits, or self-referential predictions

full rationale

The paper advances a counterfactual definition of AI-native games and applies it to classify a corpus of 53 examples. No equations, fitted parameters, or predictive claims appear; the work is classificatory and taxonomic. The definition is stated explicitly rather than derived from prior results, and the screening process is presented as an application of that definition without reduction to self-citation chains or constructed inputs. This matches the default expectation of no significant circularity for non-mathematical survey papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that introduces a definition and taxonomy based on analysis of existing games. No free parameters, mathematical axioms, or invented entities are present.

pith-pipeline@v0.9.1-grok · 5801 in / 1025 out tokens · 16199 ms · 2026-07-02T12:57:36.464230+00:00 · methodology

