arxiv: 2606.03979 · v1 · pith:R3S7Z76Pnew · submitted 2026-06-02 · 💻 cs.LG · cs.AI

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

Pith reviewed 2026-06-28 10:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: language models, continual learning, memory consolidation, self-improvement, distillation, reinforcement learning, synthetic data, sleep paradigm

The pith

Language models can consolidate short-term memories into long-term parameters and self-improve during a simulated sleep phase.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a sleep paradigm that lets models move beyond in-context learning by distilling fragile recent memories into stable long-term knowledge. This happens through an upward process called knowledge seeding, where a smaller model trains a larger one using a mix of on-policy distillation and reinforcement learning imitation. A second dreaming stage then lets the model generate its own synthetic data curriculum via reinforcement learning to rehearse and refine what it has learned. The approach draws from human sleep cycles to support continual learning, knowledge incorporation, and few-shot generalization without ongoing human supervision. Experiments on long-horizon tasks indicate that these stages help models retain and build on new information over time.

Core claim

A sleep paradigm with two stages enables language models to continually learn: memory consolidation distills short-term memories from a smaller self into a larger network via generalized distillation (on-policy distillation combined with RL-based imitation), while a dreaming phase uses RL to create a curriculum of synthetic data for unsupervised rehearsal and capability refinement.
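As a rough illustration of what combining on-policy distillation with RL-based imitation could look like, the toy sketch below scores a student distribution (the larger network) against a teacher distribution (the smaller self) over a three-token vocabulary. It is not the paper's Generalized Distillation objective, which this summary does not spell out: the reverse-KL term, the log-probability reward, the mixing weight `alpha`, and all function names are assumptions chosen for illustration.

```python
import math


def reverse_kl(student, teacher):
    """KL(student || teacher) for two categorical distributions.

    Assumes every teacher probability is strictly positive.
    """
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)


def expected_imitation_reward(student, teacher):
    """Expected reward under the student when a token's reward is log teacher prob.

    This reward choice is a hypothetical stand-in for RL-based imitation.
    """
    return sum(s * math.log(t) for s, t in zip(student, teacher) if s > 0)


def mixed_distillation_loss(student, teacher, alpha=0.5):
    """Hypothetical mix: alpha * distillation term minus (1 - alpha) * expected reward."""
    distill = reverse_kl(student, teacher)
    reward = expected_imitation_reward(student, teacher)
    return alpha * distill - (1.0 - alpha) * reward


# Toy distributions over a three-token vocabulary (illustrative values only).
teacher = [0.7, 0.2, 0.1]        # smaller self, holds the short-term memory
student_far = [0.1, 0.2, 0.7]    # larger network before consolidation
student_near = [0.6, 0.3, 0.1]   # larger network after some consolidation

loss_far = mixed_distillation_loss(student_far, teacher)
loss_near = mixed_distillation_loss(student_near, teacher)
print(f"loss far from teacher:  {loss_far:.3f}")
print(f"loss near the teacher:  {loss_near:.3f}")
```

In an actual on-policy setup both terms would be estimated from sequences sampled by the student rather than computed in closed form; the closed form is used here only to keep the example small.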

What carries the argument

The sleep paradigm, consisting of knowledge seeding via generalized distillation for upward memory consolidation and an RL-driven dreaming stage for self-generated curriculum improvement.

If this is right

  • Models gain the ability to transfer temporal in-context knowledge into permanent parameter updates across extended sequences of tasks.
  • Self-improvement occurs recursively as the dreaming stage generates new training signals without external labels.
  • Knowledge incorporation tasks show reduced forgetting because short-term memories are stabilized through the consolidation stage.
  • Few-shot generalization improves because the model rehearses capabilities on its own synthetic data.
  • The process supports long-horizon continual learning by alternating between active use and sleep-based refinement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployed models could periodically enter sleep phases to update themselves from ongoing interactions.
  • The upward distillation step suggests a path for scaling model capacity while preserving prior knowledge.
  • If the dreaming mechanism scales, it could reduce reliance on human-curated datasets for ongoing model development.
  • The paradigm might extend to non-language domains where replay and self-generated curricula could stabilize learning.

Load-bearing premise

The described combination of on-policy distillation with RL imitation and RL-driven synthetic data generation will reliably consolidate memories and produce self-improvement in practice.

What would settle it

Running the proposed sleep stages on a continual learning benchmark and finding no measurable gain in long-term retention or task performance compared with standard fine-tuning or replay baselines.
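One common way to quantify "long-term retention" in such a comparison is to record accuracy on every task after each training stage and then compute the final average accuracy and the average forgetting. The sketch below assumes that setup; the matrices `run_a` and `run_b` hold made-up numbers for illustration only and are not results from the paper.

```python
def average_final_accuracy(acc):
    """Mean accuracy over all tasks after the last training stage."""
    final = acc[-1]
    return sum(final) / len(final)


def average_forgetting(acc):
    """Mean drop from the best earlier accuracy on each task to its final accuracy.

    acc[i][j] is the accuracy on task j measured after training stage i.
    Only entries with j <= i are read; the last task is excluded because
    it has had no chance to be forgotten.
    """
    n = len(acc)
    drops = []
    for j in range(n - 1):
        best_earlier = max(acc[i][j] for i in range(j, n - 1))
        drops.append(best_earlier - acc[-1][j])
    return sum(drops) / len(drops) if drops else 0.0


# Hypothetical accuracy matrices (entries above the diagonal are unused).
run_a = [
    [0.90, 0.00, 0.00],
    [0.60, 0.88, 0.00],
    [0.40, 0.55, 0.91],
]
run_b = [
    [0.90, 0.00, 0.00],
    [0.85, 0.88, 0.00],
    [0.80, 0.82, 0.90],
]

for name, acc in (("run_a", run_a), ("run_b", run_b)):
    print(name, round(average_final_accuracy(acc), 3), round(average_forgetting(acc), 3))
```

A null result in the sense described above would be two methods whose final accuracy and forgetting are indistinguishable across repeated runs.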

Figures

Figures reproduced from arXiv: 2606.03979 by Ali Behrouz, Farnoosh Hashemi, Vahab Mirrokni.

Figure 1: (Conventional Machine Learning vs. Continual Learning) While in conventional machine learning often the lifespan of the model is divided to test and training time, continual learning setup does not have these phases. We suggest that a continual learner need to have different stages of activeness in learning, which we refer to as: (i) Active or Wake Time, and (ii) Sleep Time. Sleep time is not a passive sta…
Figure 2: An overview of Memory Consolidation. The model increases its own number of parameters to enhance its…
Figure 3: Class-incremental learning for text classification is evaluated on the (Left) CLINC dataset (Larson et al.…
Figure 4: Effect of memory levels on in-context learning performance for (Left) MK-NIAH from RULER (Hsieh et al.…
Figure 6: Results on the BABILong benchmark. Red points correspond to fine-tuned models, whereas blue points correspond to zero-shot evaluations of large-scale models…
Figure 7: Multi-frequency memory hierarchy. Updates enter the High-Frequency FFN via repeated Parameter Expansion;…
Figure 8: Memory consolidation by routed expert updates. Across Sleep cycles (left…
Original abstract

The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to continually learn and effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by human learning process, we introduce a "Sleep" paradigm that allows the models to continually learn, distill their short-term fragile memories into stable long-term knowledge with replay, and recursively improve themselves with "Dreaming" process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, where the memories of a smaller-self are distilled into a larger network to provide more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase, where the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a 'Sleep' paradigm for LLMs consisting of two stages: (1) Memory Consolidation via Knowledge Seeding, which uses a Generalized Distillation process (on-policy distillation combined with RL-based imitation learning) to distill short-term memories from a smaller model into a larger one; and (2) Dreaming, an RL-driven phase that generates a synthetic curriculum for rehearsal and self-improvement without human supervision. The approach is claimed to enable continual learning, knowledge consolidation, and recursive capability gains, with supporting experiments on long-horizon continual learning, knowledge incorporation, and few-shot generalization tasks.

Significance. If the central claims hold with concrete evidence, the work would address a key limitation of current LLMs (inability to consolidate in-context knowledge into parameters and achieve unsupervised self-improvement) by providing an explicit mechanism inspired by human sleep. The integration of distillation with RL for on-policy consolidation and synthetic data generation is a potentially useful direction, though the manuscript supplies no quantitative results, baselines, or reward details to assess whether the gains exceed standard continual-learning techniques.

major comments (2)
  1. [Abstract] The claim that the Dreaming stage enables 'recursive self-improvement ... without human supervision' is load-bearing for the central contribution, yet the reward signal or objective used by the RL agent is never specified (no mention of intrinsic motivation, consistency loss, task metric, or other concrete formulation). Because RL outcomes are known to be highly sensitive to reward design, this omission prevents evaluation of whether the process is truly unsupervised or implicitly relies on external signals.
  2. [Abstract] The manuscript states that 'our experiments ... support the importance of the sleep stage' but reports no quantitative results, error bars, baselines, or ablation studies. Without these, it is impossible to determine whether the observed gains on continual-learning tasks are attributable to the proposed stages or to standard fine-tuning effects.
minor comments (2)
  1. The term 'Generalized Distillation' is introduced without a clear positioning against prior distillation methods used in continual learning or knowledge distillation literature.
  2. Notation for the two stages (Knowledge Seeding, Dreaming) is introduced in the abstract but never formalized with equations or pseudocode, making the precise algorithmic flow difficult to reconstruct.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We agree that the current manuscript version requires additional detail on the RL reward formulation and quantitative experimental reporting. We will revise the paper to address both points.

Point-by-point responses
  1. Referee: [Abstract] The claim that the Dreaming stage enables 'recursive self-improvement ... without human supervision' is load-bearing for the central contribution, yet the reward signal or objective used by the RL agent is never specified (no mention of intrinsic motivation, consistency loss, task metric, or other concrete formulation). Because RL outcomes are known to be highly sensitive to reward design, this omission prevents evaluation of whether the process is truly unsupervised or implicitly relies on external signals.

    Authors: We agree the reward signal is not specified in the current manuscript. In the revised version we will add an explicit formulation of the RL objective in the Dreaming stage, including the precise reward function (a weighted combination of self-consistency with consolidated knowledge and an intrinsic novelty term), to allow readers to assess whether the process remains unsupervised. revision: yes

  2. Referee: [Abstract] The manuscript states that 'our experiments ... support the importance of the sleep stage' but reports no quantitative results, error bars, baselines, or ablation studies. Without these, it is impossible to determine whether the observed gains on continual-learning tasks are attributable to the proposed stages or to standard fine-tuning effects.

    Authors: We agree that the manuscript currently contains no quantitative results, baselines, error bars, or ablations. The existing text presents the experimental tasks at a conceptual level only. We will revise the Experiments section to include quantitative metrics, standard continual-learning baselines, multiple-run error bars, and stage-specific ablations. revision: yes
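The reward mentioned in the first response (a weighted combination of self-consistency and an intrinsic novelty term) could take many forms. The sketch below is one hypothetical reading, not the authors' formulation: self-consistency is approximated by majority agreement among sampled answers, novelty by word-level Jaccard distance to earlier samples, and the function names and the `weight` parameter are all assumptions.

```python
from collections import Counter


def self_consistency(answers):
    """Fraction of sampled answers that agree with the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def novelty(sample, seen):
    """One minus the highest Jaccard word overlap with any earlier sample."""
    words = set(sample.split())
    if not seen or not words:
        return 1.0
    best = max(
        len(words & set(s.split())) / len(words | set(s.split())) for s in seen
    )
    return 1.0 - best


def dreaming_reward(answers, sample, seen, weight=0.5):
    """Hypothetical reward: weight * consistency + (1 - weight) * novelty."""
    return weight * self_consistency(answers) + (1.0 - weight) * novelty(sample, seen)


# Toy inputs, invented for illustration.
answers = ["a", "a", "b", "a"]   # answers sampled for one synthetic question
reward = dreaming_reward(answers, "the cat", ["the dog"], weight=0.5)
print(round(reward, 4))
```

Whether such a reward counts as unsupervised depends on where the consistency signal comes from, which is the question the referee raises.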

Circularity Check

0 steps flagged

No derivation chain or equations present; conceptual proposal with experimental support

Full rationale

The provided text (abstract and description) introduces a high-level 'Sleep' paradigm consisting of Knowledge Seeding via Generalized Distillation and an RL-based Dreaming stage for self-improvement. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citations are visible. Claims rest on experimental results for continual learning tasks rather than any closed-form reduction to inputs. This is a standard case of a proposed framework without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

Review based on abstract only; no specific free parameters, axioms, or invented entities can be extracted beyond the high-level concepts introduced. Full paper would be needed to audit these.

invented entities (3)
  • Sleep paradigm (no independent evidence)
    purpose: Framework for continual learning via consolidation and self-improvement
    Newly introduced concept in the paper.
  • Knowledge Seeding (no independent evidence)
    purpose: Upward distillation process from smaller to larger network
    Described as a new generalized distillation process.
  • Dreaming process (no independent evidence)
    purpose: RL-based generation of synthetic data for rehearsal and refinement
    New self-improvement phase without human supervision.

pith-pipeline@v0.9.1-grok · 5779 in / 1150 out tokens · 27122 ms · 2026-06-28T10:53:55.044482+00:00 · methodology

Reference graph

Works this paper leans on

145 extracted references · 4 canonical work pages

  1. [1]

    Loss of recent memory after bilateral hippocampal lesions

    William Beecher Scoville and Brenda Milner. “Loss of recent memory after bilateral hippocampal lesions”. In: Journal of neurology, neurosurgery, and psychiatry20.1 (1957), p. 11

  2. [2]

    Evolutionary Principles in Self-Referential Learning

    Jürgen Schmidhuber. Evolutionary Principles in Self-Referential Learning. 1987. url: https://people.idsia.ch/~juergen/diploma1987ocr.pdf

  3. [3]

    Self-improving reactive agents based on reinforcement learning, planning and teaching

    Long-Ji Lin. “Self-improving reactive agents based on reinforcement learning, planning and teaching”. In:Machine Learning8.3–4 (1992), pp. 293–321

  4. [4]

    Learning to control fast-weight memories: An alternative to recurrent nets

    Juergen Schmidhuber. “Learning to control fast-weight memories: An alternative to recurrent nets”. In: Neural Computation (1992)

  5. [5]

    Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory

    James L. McClelland, Bruce L. McNaughton, and Randall C. O’Reilly. “Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory”. In:Psychological Review102.3 (1995), pp. 419–457

  6. [6]

    Retrograde amnesia and memory consolidation: a neurobiological perspective

    Larry R Squire and Pablo Alvarez. “Retrograde amnesia and memory consolidation: a neurobiological perspective”. In:Current opinion in neurobiology5.2 (1995), pp. 169–177

  7. [7]

    Synaptic tagging and long-term potentiation

    Uwe Frey and Richard GM Morris. “Synaptic tagging and long-term potentiation”. In:Nature385.6616 (1997), pp. 533–536

  8. [8]

    The plastic human brain cortex

    Alvaro Pascual-Leone, Amir Amedi, Felipe Fregni, and Lotfi B Merabet. “The plastic human brain cortex”. In:Annu. Rev. Neurosci.28.1 (2005), pp. 377–401

  9. [9]

    Sleep-dependent memory consolidation

    Robert Stickgold. “Sleep-dependent memory consolidation”. In:Nature437.7063 (2005), pp. 1272–1278

  10. [10]

    Reverse replay of behavioural sequences in hippocampal place cells during the awake state

    David J Foster and Matthew A Wilson. “Reverse replay of behavioural sequences in hippocampal place cells during the awake state”. In:Nature440.7084 (2006), pp. 680–683

  11. [11]

    Sleep function and synaptic homeostasis

    Giulio Tononi and Chiara Cirelli. “Sleep function and synaptic homeostasis”. In:Sleep medicine reviews10.1 (2006), pp. 49–62

  12. [12]

    Dbpedia: A nucleus for a web of open data

    Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. “Dbpedia: A nucleus for a web of open data”. In:international semantic web conference. Springer. 2007, pp. 722–735

  13. [13]

    Coordinated memory replay in the visual cortex and hippocampus during sleep

    Daoyun Ji and Matthew A Wilson. “Coordinated memory replay in the visual cortex and hippocampus during sleep”. In:Nature neuroscience10.1 (2007), pp. 100–107

  14. [14]

    Plasticity in the developing brain: implications for rehabilitation

    Michael V Johnston. “Plasticity in the developing brain: implications for rehabilitation”. In:Developmental disabilities research reviews15.2 (2009), pp. 94–101

  15. [15]

    Replay of rule-learning related neural patterns in the prefrontal cortex during sleep

    Adrien Peyrache, Mehdi Khamassi, Karim Benchenane, Sidney I Wiener, and Francesco P Battaglia. “Replay of rule-learning related neural patterns in the prefrontal cortex during sleep”. In:Nature neuroscience12.7 (2009), pp. 919–926

  16. [16]

    Memory, sleep and dreaming: experiencing consolidation

    Erin J Wamsley and Robert Stickgold. “Memory, sleep and dreaming: experiencing consolidation”. In:Sleep medicine clinics6.1 (2011), p. 97

  17. [17]

    About sleep’s role in memory

    Björn Rasch and Jan Born. “About sleep’s role in memory”. In:Physiological reviews(2013)

  18. [18]

    The role of sleep in emotional brain function

    Andrea N Goldstein and Matthew P Walker. “The role of sleep in emotional brain function”. In:Annual review of clinical psychology10.1 (2014), pp. 679–708

  19. [19]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network”. In:arXiv preprint arXiv:1503.02531(2015)

  20. [20]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. “Human-level control through deep reinforcement learning”. In:Nature518.7540 (2015), pp. 529–533

  21. [21]

    Memory consolidation

    Larry R Squire, Lisa Genzel, John T Wixted, and Richard G Morris. “Memory consolidation”. In:Cold Spring Harbor perspectives in biology7.8 (2015), a021766

  22. [22]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M Rush. “Sequence-level knowledge distillation”. In:Proceedings of the 2016 conference on empirical methods in natural language processing. 2016, pp. 1317–1327

  23. [23]

    What learning systems do intelligent agents need? Complementary learning systems theory updated

    Dharshan Kumaran, Demis Hassabis, and James L. McClelland. “What learning systems do intelligent agents need? Complementary learning systems theory updated”. In:Trends in Cognitive Sciences20.7 (2016), pp. 512–534

  24. [24]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. “SQuAD: 100,000+ Questions for Machine Comprehension of Text”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Ed. by Jian Su, Kevin Duh, and Xavier Carreras. Association for Computational Linguistics, 2016. url: https://aclanthology.org/D16-1264/

  25. [25]

    Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”. In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Proceedings of Machine Learning Research. PMLR, 2017. url: https://proceedings.mlr.press/v70/finn17a.html

  26. [26]

    Neuroscience-inspired artificial intelligence

    Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. “Neuroscience-inspired artificial intelligence”. In:Neuron95.2 (2017), pp. 245–258

  27. [27]

    Overcoming catastrophic forgetting in neural networks

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. “Overcoming catastrophic forgetting in neural networks”. In:Proceedings of the national academy of sciences114.13 (2017), pp. 3521–3526

  28. [28]

    REM sleep selectively prunes and maintains new synapses in development and learning

    Wei Li, Lei Ma, Guang Yang, and Wen-Biao Gan. “REM sleep selectively prunes and maintains new synapses in development and learning”. In:Nature neuroscience20.3 (2017), pp. 427–437

  29. [29]

    Attention is All you Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is All you Need”. In: Advances in Neural Information Processing Systems. Ed. by I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc., 2017. url: htt...

  30. [30]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. “Recurrent world models facilitate policy evolution”. In:Advances in Neural Information Processing Systems (NeurIPS). Vol. 31. 2018, pp. 2451–2463

  31. [31]

    World models

    David Ha and Jürgen Schmidhuber. “World models”. In: arXiv preprint arXiv:1803.10122 2.3 (2018), p. 440

  32. [32]

    Measuring catastrophic forgetting in neural networks

    Ronald Kemker, Marc McClure, Angelina Abitino, Tyler Hayes, and Christopher Kanan. “Measuring catastrophic forgetting in neural networks”. In:Proceedings of the AAAI conference on artificial intelligence. Vol. 32. 1. 2018

  33. [33]

    An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction

    Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. “An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction”. In:Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Inte...

  34. [34]

    Fast transformer decoding: One write-head is all you need

    Noam Shazeer. “Fast transformer decoding: One write-head is all you need”. In:arXiv preprint arXiv:1911.02150 (2019)

  35. [35]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners”. In:Advances in neural information processing systems33 (2020), pp. 1877–1901

  36. [36]

    Efficient Intent Detection with Dual Sentence Encoders

    Inigo Casanueva, Tadas Temcinas, Daniela Gerz, Matthew Henderson, and Ivan Vulic. “Efficient Intent Detection with Dual Sentence Encoders”. In:ACL 2020(2020), p. 38

  37. [37]

    Can sleep protect memories from catastrophic forgetting?

    Oscar C. González, Yury Sokolov, Giri P. Krishnan, Jean Erik Delanois, and Maxim Bazhenov. “Can sleep protect memories from catastrophic forgetting?” In:eLife9 (2020), e51005

  38. [38]

    Dream to control: Learning behaviors by latent imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. “Dream to control: Learning behaviors by latent imagination”. In:International Conference on Learning Representations (ICLR). 2020

  39. [39]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. “Transformers are rnns: Fast autoregressive transformers with linear attention”. In: International conference on machine learning. PMLR. 2020, pp. 5156–5165

  40. [40]

    A dataset of information-seeking questions and answers anchored in research papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. “A dataset of information-seeking questions and answers anchored in research papers”. In: arXiv preprint arXiv:2105.03011 (2021)

  41. [41]

    Stepwise synaptic plasticity events drive the early phase of memory consolidation

    Akihiro Goto, Ayaka Bota, Ken Miya, Jingbo Wang, Suzune Tsukamoto, Xinzhi Jiang, Daichi Hirai, Masanori Murayama, Tomoki Matsuda, Thomas J. McHugh, Takeharu Nagai, and Yasunori Hayashi. “Stepwise synaptic plasticity events drive the early phase of memory consolidation”. In:Science374.6569 (2021), pp. 857–863.doi: 10.1126/science.abj9195 . eprint: https://...

  42. [42]

    Principles of neural science

    Eric R Kandel, John D Koester, Sarah H Mack, and Steven Siegelbaum. Principles of neural science. McGraw-Hill, 2021

  43. [43]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. “The power of scale for parameter-efficient prompt tuning”. In: arXiv preprint arXiv:2104.08691(2021)

  44. [44]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Xiang Lisa Li and Percy Liang. “Prefix-Tuning: Optimizing Continuous Prompts for Generation”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Ed. by Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli. Onlin...

  45. [45]

    What learning algorithm is in-context learning? investigations with linear models

    Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. “What learning algorithm is in-context learning? investigations with linear models”. In:arXiv preprint arXiv:2211.15661(2022)

  46. [46]

    Recurrent memory transformer

    Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. “Recurrent memory transformer”. In:Advances in Neural Information Processing Systems35 (2022), pp. 11079–11091

  47. [47]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. “LoRA: Low-Rank Adaptation of Large Language Models”. In: International Conference on Learning Representations. 2022. url: https://openreview.net/forum?id=nZeVKeeFYf9

  48. [48]

    A modern self-referential weight matrix that learns to modify itself

    Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. “A modern self-referential weight matrix that learns to modify itself”. In: International Conference on Machine Learning. PMLR. 2022. url: https://proceedings.mlr.press/v162/irie22b.html

  49. [49]

    Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks

    Timothy Tadros, Giri P. Krishnan, Ramyaa Ramyaa, and Maxim Bazhenov. “Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks”. In:Nature Communications13.1 (2022), p. 7742

  50. [50]

    STaR: Bootstrapping Reasoning With Reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. “STaR: Bootstrapping Reasoning With Reasoning”. In: Advances in Neural Information Processing Systems. Ed. by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Curran Associates, Inc., 2022. url: https://proceedings.neurips.cc/paper_files/paper/2022/file/639a9a172c044fbb64175b5fad42e9a...

  51. [51]

    Gpt-4 technical report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. “Gpt-4 technical report”. In:arXiv preprint arXiv:2303.08774(2023)

  52. [52]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. “Gqa: Training generalized multi-query transformer models from multi-head checkpoints”. In:arXiv preprint arXiv:2305.13245(2023)

  53. [53]

    Adapting language models to compress contexts

    Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. “Adapting language models to compress contexts”. In:arXiv preprint arXiv:2305.14788(2023)

  54. [54]

    In-context autoencoder for context compression in a large language model

    Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. “In-context autoencoder for context compression in a large language model”. In:arXiv preprint arXiv:2307.06945(2023)

  55. [55]

    Mixture of cluster-conditional lora experts for vision-language instruction tuning

    Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. “Mixture of cluster-conditional lora experts for vision-language instruction tuning”. In:arXiv preprint arXiv:2312.12379(2023)

  56. [56]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. “Mamba: Linear-time sequence modeling with selective state spaces”. In:arXiv preprint arXiv:2312.00752(2023)

  57. [57]

    Lorahub: Efficient cross-task generalization via dynamic lora composition

    Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. “Lorahub: Efficient cross-task generalization via dynamic lora composition”. In:arXiv preprint arXiv:2307.13269(2023)

  58. [58]

    Llmlingua: Compressing prompts for accelerated inference of large language models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. “Llmlingua: Compressing prompts for accelerated inference of large language models”. In:arXiv preprint arXiv:2310.05736(2023)

  59. [59]

    Unlocking context constraints of llms: Enhancing context efficiency of llms with self-information-based content filtering

    Yucheng Li. “Unlocking context constraints of llms: Enhancing context efficiency of llms with self-information-based content filtering”. In:arXiv preprint arXiv:2304.12102(2023)

  60. [60]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. “CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis”. In: The Eleventh International Conference on Learning Representations. 2023. url: https://openreview.net/forum?id=iaYcJKpY2B_

  61. [61]

    Memgpt: Towards llms as operating systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. “Memgpt: Towards llms as operating systems”. In:arXiv preprint arXiv:2310.08560(2023)

  62. [62]

    Are emergent abilities of large language models a mirage?

    Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. “Are emergent abilities of large language models a mirage?” In:Advances in neural information processing systems36 (2023), pp. 55565–55581

  63. [63]

    Visionllm: Large language model is also an open-ended decoder for vision-centric tasks

    Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. “Visionllm: Large language model is also an open-ended decoder for vision-centric tasks”. In:Advances in Neural Information Processing Systems36 (2023), pp. 61501–61513

  64. [64]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. “H2o: Heavy-hitter oracle for efficient generative inference of large language models”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 34661–34710

  65. [65]

    Phi-4 technical report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. “Phi-4 technical report”. In: arXiv preprint arXiv:2412.08905 (2024)

  66. [66]

    On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. “On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes”. In: The Twelfth International Conference on Learning Representations. 2024. url: https://openreview.net/forum?id=3zKtaqxLhW

  67. [67]

    The Surprising Effectiveness of Test-Time Training for Few-Shot Learning

    Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. “The Surprising Effectiveness of Test-Time Training for Few-Shot Learning”. In: Forty-second International Conference on Machine Learning. 2024

  68. [68]

    In-context language learning: Architectures and algorithms

    Ekin Akyürek, Bailin Wang, Yoon Kim, and Jacob Andreas. “In-context language learning: Architectures and algorithms”. In: arXiv preprint arXiv:2401.12973 (2024)

  69. [69]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, James Zou, Atri Rudra, and Christopher Re. “Simple linear attention language models balance the recall-throughput tradeoff”. In: Forty-first International Conference on Machine Learning. 2024. url: https://openreview.net/forum?id=e93ffDcpH3

  70. [70]

    xLSTM: Extended Long Short-Term Memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. “xLSTM: Extended Long Short-Term Memory”. In: arXiv preprint arXiv:2405.04517 (2024)

  71. [71]

    Titans: Learning to memorize at test time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. “Titans: Learning to memorize at test time”. In: arXiv preprint arXiv:2501.00663 (2024)

  72. [72]

    Dated Data: Tracing Knowledge Cutoffs in Large Language Models

    Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. “Dated Data: Tracing Knowledge Cutoffs in Large Language Models”. In: First Conference on Language Modeling. 2024. url: https://openreview.net/forum?id=wS7PxDjy6m

  73. [73]

    A Survey on In-context Learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. “A Survey on In-context Learning”. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Ed. by Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen. Miami, Florida, USA: Asso...

  74. [74]

    The llama 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. “The llama 3 herd of models”. In: arXiv e-prints (2024), arXiv–2407

  75. [75]

    RULER: What’s the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” In: First Conference on Language Modeling. 2024. url: https://openreview.net/forum?id=kIoBbc76Sy

  76. [76]

    Simple and Scalable Strategies to Continually Pre-train Large Language Models

    Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats Leon Richter, Quentin Gregory Anthony, Eugene Belilovsky, Timothée Lesort, and Irina Rish. “Simple and Scalable Strategies to Continually Pre-train Large Language Models”. In: Transactions on Machine Learning Research (2024). issn: 2835-8856. url: https://openreview.net/forum?id=DimPeeCxKO

  77. [77]

    Knowledge injection via prompt distillation

    Kalle Kujanpää, Harri Valpola, and Alexander Ilin. “Knowledge injection via prompt distillation”. In: arXiv preprint arXiv:2412.14964 (2024)

  78. [78]

    Babilong: Testing the limits of llms with long context reasoning-in-a-haystack

    Yury Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. “Babilong: Testing the limits of llms with long context reasoning-in-a-haystack”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 106519–106554

  79. [79]

    Mixlora: Enhancing large language models fine-tuning with lora-based mixture of experts

    Dengchun Li, Yingzi Ma, Naizheng Wang, Zhengmao Ye, Zhiyuan Cheng, Yinghao Tang, Yan Zhang, Lei Duan, Jie Zuo, Cal Yang, et al. “Mixlora: Enhancing large language models fine-tuning with lora-based mixture of experts”. In: arXiv preprint arXiv:2404.15159 (2024)

  80. [80]

    Snapkv: Llm knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. “Snapkv: Llm knows what you are looking for before generation”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 22947–22970
