Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

Ali Behrouz; Farnoosh Hashemi; Vahab Mirrokni

arxiv: 2606.03979 · v1 · pith:R3S7Z76Pnew · submitted 2026-06-02 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 10:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: language models · continual learning · memory consolidation · self-improvement · distillation · reinforcement learning · synthetic data · sleep paradigm

The pith

Language models can consolidate short-term memories into long-term parameters and self-improve during a simulated sleep phase.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a sleep paradigm that lets models move beyond in-context learning by distilling fragile recent memories into stable long-term knowledge. This happens through an upward process called Knowledge Seeding, in which the memories of a smaller model are distilled into a larger one using a mix of on-policy distillation and RL-based imitation learning. A second Dreaming stage then lets the model use reinforcement learning to generate its own curriculum of synthetic data, on which it rehearses and refines what it has learned. The approach draws on human sleep cycles to support continual learning, knowledge incorporation, and few-shot generalization without ongoing human supervision. The reported experiments on long-horizon and continual learning tasks are presented as evidence that these stages help models retain and build on new information over time.

Core claim

A sleep paradigm with two stages enables language models to continually learn: memory consolidation distills short-term memories from a smaller self into a larger network via generalized distillation (on-policy distillation combined with RL-based imitation), while a dreaming phase uses RL to create a curriculum of synthetic data for unsupervised rehearsal and capability refinement.

What carries the argument

The sleep paradigm, consisting of knowledge seeding via generalized distillation for upward memory consolidation and an RL-driven dreaming stage for self-generated curriculum improvement.
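
To make the phrase "on-policy distillation combined with RL-based imitation" concrete, here is a minimal sketch for a single next-token distribution. It is one plausible reading, not the paper's objective, which this page does not reproduce. The function names, the reverse-KL choice, the teacher-log-probability reward, and the mixing weight `alpha` are all assumptions made for illustration. In Knowledge Seeding the teacher would be the smaller model and the student the larger one.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher), a divergence often used for on-policy distillation."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def imitation_reward(token, teacher_probs, eps=1e-12):
    """Hypothetical RL-imitation reward: teacher log-probability of a student sample."""
    return math.log(teacher_probs[token] + eps)


def generalized_distillation_loss(student_logits, teacher_logits,
                                  alpha=0.5, n_samples=256, seed=0):
    """Assumed combined loss: alpha * reverse KL - (1 - alpha) * expected reward.

    The expected reward is estimated from tokens sampled from the student's
    own distribution, which is what makes the estimate on-policy.
    """
    rng = random.Random(seed)
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    kl = reverse_kl(s, t)
    tokens = rng.choices(range(len(s)), weights=s, k=n_samples)
    avg_reward = sum(imitation_reward(tok, t) for tok in tokens) / n_samples
    return alpha * kl - (1.0 - alpha) * avg_reward
```

Setting `alpha=1.0` reduces the sketch to pure on-policy distillation, and `alpha=0.0` to pure reward maximization, which is one way to ablate the two ingredients separately.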

If this is right

Models gain the ability to transfer temporal in-context knowledge into permanent parameter updates across extended sequences of tasks.
Self-improvement occurs recursively as the dreaming stage generates new training signals without external labels.
Knowledge incorporation tasks show reduced forgetting because short-term memories are stabilized through the consolidation stage.
Few-shot generalization improves because the model rehearses capabilities on its own synthetic data.
The process supports long-horizon continual learning by alternating between active use and sleep-based refinement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployed models could periodically enter sleep phases to update themselves from ongoing interactions.
The upward distillation step suggests a path for scaling model capacity while preserving prior knowledge.
If the dreaming mechanism scales, it could reduce reliance on human-curated datasets for ongoing model development.
The paradigm might extend to non-language domains where replay and self-generated curricula could stabilize learning.

Load-bearing premise

The described combination of on-policy distillation with RL imitation and RL-driven synthetic data generation will reliably consolidate memories and produce self-improvement in practice.

What would settle it

Running the proposed sleep stages on a continual learning benchmark and finding no measurable gain in long-term retention or task performance compared with standard fine-tuning or replay baselines.
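
The comparison described above needs a retention measure. The sketch below computes two metrics widely used in the continual-learning literature, final average accuracy and average forgetting, from an accuracy matrix. The paper's own evaluation protocol is not given on this page, so the matrix layout (`acc[i][j]` is accuracy on task `j` after training through task `i`) and the function names are assumptions.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task."""
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def average_forgetting(acc):
    """Mean drop from each earlier task's best past accuracy to its final accuracy.

    acc[i][j] is the accuracy on task j after training through task i.
    The last task is excluded because it has had no chance to be forgotten.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):
        # Best accuracy on task j at any checkpoint before the final one.
        best_before_final = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_before_final - acc[-1][j])
    return sum(drops) / len(drops)
```

Computing both numbers for a sleep-enabled model and for fine-tuning or replay baselines on the same task sequence would show whether the sleep stages change retention at all.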

Figures

Figures reproduced from arXiv: 2606.03979 by Ali Behrouz, Farnoosh Hashemi, Vahab Mirrokni.

**Figure 1.** (Conventional Machine Learning vs. Continual Learning) While in conventional machine learning often the lifespan of the model is divided to test and training time, continual learning setup does not have these phases. We suggest that a continual learner need to have different stages of activeness in learning, which we refer to as: (i) Active or Wake Time, and (ii) Sleep Time. Sleep time is not a passive sta…

**Figure 2.** An overview of Memory Consolidation. The model increases its own number of parameters to enhance its …

**Figure 3.** Class-incremental learning for text classification is evaluated on the (Left) CLINC dataset (Larson et al. …

**Figure 4.** Effect of memory levels on in-context learning performance for (Left) MK-NIAH from RULER (Hsieh et al. …

**Figure 6.** Results on the BABILong benchmark. Red points correspond to fine-tuned models, whereas blue points correspond to zero-shot evaluations of large-scale models …

**Figure 7.** Multi-frequency memory hierarchy. Updates enter the High-Frequency FFN via repeated Parameter Expansion; …

**Figure 8.** Memory consolidation by routed expert updates. Across Sleep cycles (left …

Original abstract

The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to continually learn and effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by human learning process, we introduce a "Sleep" paradigm that allows the models to continually learn, distill their short-term fragile memories into stable long-term knowledge with replay, and recursively improve themselves with "Dreaming" process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, where the memories of a smaller-self are distilled into a larger network to provide more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase, where the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The sleep paradigm frames continual learning via distillation plus RL dreaming, but the reward mechanism for unsupervised self-improvement stays underspecified even in the full text.

The paper introduces a two-stage sleep process for LLMs: Knowledge Seeding distills from a smaller model to a larger one using on-policy distillation combined with RL imitation, then Dreaming uses RL to create synthetic data for rehearsal and refinement. Experiments are reported on continual learning, knowledge incorporation, and few-shot tasks.

What is actually new is the explicit pairing of capacity-increasing distillation with an RL-driven curriculum generator that claims to operate without human supervision. The framing correctly identifies the gap between fragile in-context knowledge and stable parameter updates, and the biological analogy supplies a coherent way to organize replay and self-modification.

The experiments are said to support the stages, which is a start, but the abstract and main sections give no numbers, baselines, or ablation details, so effect sizes remain unclear. The generalized distillation step is described at a workable level of detail.

The soft spot is the Dreaming stage. The reward signal that drives the RL curriculum is not laid out concretely; without an explicit intrinsic objective or consistency term that can be inspected, it is hard to verify that the self-improvement is truly unsupervised rather than dependent on task-specific signals. RL brittleness makes this the load-bearing assumption, and the paper does not close it.

This is for people working on memory and continual adaptation in LLMs. A reader already thinking about replay or distillation methods will find the structure useful to consider. It deserves peer review because the problem is real and the proposal is distinct, though the reward specification and quantitative reporting will need tightening.

Referee Report

2 major / 2 minor

Summary. The paper proposes a 'Sleep' paradigm for LLMs consisting of two stages: (1) Memory Consolidation via Knowledge Seeding, which uses a Generalized Distillation process (on-policy distillation combined with RL-based imitation learning) to distill short-term memories from a smaller model into a larger one; and (2) Dreaming, an RL-driven phase that generates a synthetic curriculum for rehearsal and self-improvement without human supervision. The approach is claimed to enable continual learning, knowledge consolidation, and recursive capability gains, with supporting experiments on long-horizon continual learning, knowledge incorporation, and few-shot generalization tasks.

Significance. If the central claims hold with concrete evidence, the work would address a key limitation of current LLMs (inability to consolidate in-context knowledge into parameters and achieve unsupervised self-improvement) by providing an explicit mechanism inspired by human sleep. The integration of distillation with RL for on-policy consolidation and synthetic data generation is a potentially useful direction, though the manuscript supplies no quantitative results, baselines, or reward details to assess whether the gains exceed standard continual-learning techniques.

major comments (2)

[Abstract] The claim that the Dreaming stage enables 'recursive self-improvement ... without human supervision' is load-bearing for the central contribution, yet the reward signal or objective used by the RL agent is never specified (no mention of intrinsic motivation, consistency loss, task metric, or other concrete formulation). Because RL outcomes are known to be highly sensitive to reward design, this omission prevents evaluation of whether the process is truly unsupervised or implicitly relies on external signals.
[Abstract] The manuscript states that 'our experiments ... support the importance of the sleep stage' but reports no quantitative results, error bars, baselines, or ablation studies. Without these, it is impossible to determine whether the observed gains on continual-learning tasks are attributable to the proposed stages or to standard fine-tuning effects.

minor comments (2)

The term 'Generalized Distillation' is introduced without a clear positioning against prior distillation methods used in continual learning or knowledge distillation literature.
Notation for the two stages (Knowledge Seeding, Dreaming) is introduced in the abstract but never formalized with equations or pseudocode, making the precise algorithmic flow difficult to reconstruct.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We agree that the current manuscript version requires additional detail on the RL reward formulation and quantitative experimental reporting. We will revise the paper to address both points.

Referee: [Abstract] The claim that the Dreaming stage enables 'recursive self-improvement ... without human supervision' is load-bearing for the central contribution, yet the reward signal or objective used by the RL agent is never specified (no mention of intrinsic motivation, consistency loss, task metric, or other concrete formulation). Because RL outcomes are known to be highly sensitive to reward design, this omission prevents evaluation of whether the process is truly unsupervised or implicitly relies on external signals.

Authors: We agree the reward signal is not specified in the current manuscript. In the revised version we will add an explicit formulation of the RL objective in the Dreaming stage, including the precise reward function (a weighted combination of self-consistency with consolidated knowledge and an intrinsic novelty term), to allow readers to assess whether the process remains unsupervised.
revision: yes

Referee: [Abstract] The manuscript states that 'our experiments ... support the importance of the sleep stage' but reports no quantitative results, error bars, baselines, or ablation studies. Without these, it is impossible to determine whether the observed gains on continual-learning tasks are attributable to the proposed stages or to standard fine-tuning effects.

Authors: We agree that the manuscript currently contains no quantitative results, baselines, error bars, or ablations. The existing text presents the experimental tasks at a conceptual level only. We will revise the Experiments section to include quantitative metrics, standard continual-learning baselines, multiple-run error bars, and stage-specific ablations.
revision: yes
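
The first response above describes the Dreaming reward only in words: a weighted combination of a self-consistency term and an intrinsic novelty term. The sketch below shows one way such a reward could be assembled so that it can be inspected. Every detail is a placeholder assumption, since neither the paper nor the rebuttal gives a formula: consistency is taken as majority agreement among sampled answers, novelty as one minus the highest word-level Jaccard similarity to earlier dreamed samples, and `w` is a hypothetical weight.

```python
from collections import Counter


def consistency(answers):
    """Fraction of sampled answers that agree with the most common answer."""
    if not answers:
        return 0.0
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def novelty(sample, history):
    """One minus the similarity to the closest previously dreamed sample."""
    if not history:
        return 1.0
    return 1.0 - max(jaccard(sample, past) for past in history)


def dreaming_reward(sample, answers, history, w=0.7):
    """Assumed reward: w * consistency + (1 - w) * novelty, with w in [0, 1]."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * consistency(answers) + (1.0 - w) * novelty(sample, history)
```

Neither term in this sketch uses external labels, which is the property the referee asks to be made checkable. Whether the authors' actual terms share that property cannot be determined from this page.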

Circularity Check

0 steps flagged

No derivation chain or equations present; conceptual proposal with experimental support

The provided text (abstract and description) introduces a high-level 'Sleep' paradigm consisting of Knowledge Seeding via Generalized Distillation and an RL-based Dreaming stage for self-improvement. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citations are visible. Claims rest on experimental results for continual learning tasks rather than any closed-form reduction to inputs. This is a standard case of a proposed framework without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

Review based on abstract only. No free parameters or axioms could be extracted; the three invented entities listed below are the high-level concepts the abstract introduces. The full paper would be needed to audit these.

invented entities (3)

Sleep paradigm · no independent evidence
purpose: Framework for continual learning via consolidation and self-improvement
Newly introduced concept in the paper.

Knowledge Seeding · no independent evidence
purpose: Upward distillation process from smaller to larger network
Described as a new generalized distillation process.

Dreaming process · no independent evidence
purpose: RL-based generation of synthetic data for rehearsal and refinement
New self-improvement phase without human supervision.

Reference graph

Works this paper leans on

145 extracted references · 4 canonical work pages

[1]

Loss of recent memory after bilateral hippocampal lesions

William Beecher Scoville and Brenda Milner. “Loss of recent memory after bilateral hippocampal lesions”. In: Journal of neurology, neurosurgery, and psychiatry 20.1 (1957), p. 11

1957
[2]

Evolutionary Principles in Self-Referential Learning

Jürgen Schmidhuber. Evolutionary Principles in Self-Referential Learning. 1987. url: https://people.idsia.ch/~juergen/diploma1987ocr.pdf

1987
[3]

Self-improving reactive agents based on reinforcement learning, planning and teaching

Long-Ji Lin. “Self-improving reactive agents based on reinforcement learning, planning and teaching”. In: Machine Learning 8.3–4 (1992), pp. 293–321

1992
[4]

Learning to control fast-weight memories: An alternative to recurrent nets

Juergen Schmidhuber. “Learning to control fast-weight memories: An alternative to recurrent nets”. In: Neural Computation (1992)

1992
[5]

Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory

James L. McClelland, Bruce L. McNaughton, and Randall C. O’Reilly. “Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory”. In: Psychological Review 102.3 (1995), pp. 419–457

1995
[6]

Retrograde amnesia and memory consolidation: a neurobiological perspective

Larry R Squire and Pablo Alvarez. “Retrograde amnesia and memory consolidation: a neurobiological perspective”. In: Current opinion in neurobiology 5.2 (1995), pp. 169–177

1995
[7]

Synaptic tagging and long-term potentiation

Uwe Frey and Richard GM Morris. “Synaptic tagging and long-term potentiation”. In: Nature 385.6616 (1997), pp. 533–536

1997
[8]

The plastic human brain cortex

Alvaro Pascual-Leone, Amir Amedi, Felipe Fregni, and Lotfi B Merabet. “The plastic human brain cortex”. In: Annu. Rev. Neurosci. 28.1 (2005), pp. 377–401

2005
[9]

Sleep-dependent memory consolidation

Robert Stickgold. “Sleep-dependent memory consolidation”. In: Nature 437.7063 (2005), pp. 1272–1278

2005
[10]

Reverse replay of behavioural sequences in hippocampal place cells during the awake state

David J Foster and Matthew A Wilson. “Reverse replay of behavioural sequences in hippocampal place cells during the awake state”. In: Nature 440.7084 (2006), pp. 680–683

2006
[11]

Sleep function and synaptic homeostasis

Giulio Tononi and Chiara Cirelli. “Sleep function and synaptic homeostasis”. In: Sleep medicine reviews 10.1 (2006), pp. 49–62

2006
[12]

Dbpedia: A nucleus for a web of open data

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. “Dbpedia: A nucleus for a web of open data”. In: international semantic web conference. Springer. 2007, pp. 722–735

2007
[13]

Coordinated memory replay in the visual cortex and hippocampus during sleep

Daoyun Ji and Matthew A Wilson. “Coordinated memory replay in the visual cortex and hippocampus during sleep”. In: Nature neuroscience 10.1 (2007), pp. 100–107

2007
[14]

Plasticity in the developing brain: implications for rehabilitation

Michael V Johnston. “Plasticity in the developing brain: implications for rehabilitation”. In: Developmental disabilities research reviews 15.2 (2009), pp. 94–101

2009
[15]

Replay of rule-learning related neural patterns in the prefrontal cortex during sleep

Adrien Peyrache, Mehdi Khamassi, Karim Benchenane, Sidney I Wiener, and Francesco P Battaglia. “Replay of rule-learning related neural patterns in the prefrontal cortex during sleep”. In: Nature neuroscience 12.7 (2009), pp. 919–926

2009
[16]

Memory, sleep and dreaming: experiencing consolidation

Erin J Wamsley and Robert Stickgold. “Memory, sleep and dreaming: experiencing consolidation”. In: Sleep medicine clinics 6.1 (2011), p. 97

2011
[17]

About sleep’s role in memory

Björn Rasch and Jan Born. “About sleep’s role in memory”. In: Physiological reviews (2013)

2013
[18]

The role of sleep in emotional brain function

Andrea N Goldstein and Matthew P Walker. “The role of sleep in emotional brain function”. In: Annual review of clinical psychology 10.1 (2014), pp. 679–708

2014
[19]

Distilling the knowledge in a neural network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network”. In: arXiv preprint arXiv:1503.02531 (2015)

Pith/arXiv arXiv 2015
[20]

Human-level control through deep reinforcement learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (2015), pp. 529–533

2015
[21]

Memory consolidation

Larry R Squire, Lisa Genzel, John T Wixted, and Richard G Morris. “Memory consolidation”. In: Cold Spring Harbor perspectives in biology 7.8 (2015), a021766

2015
[22]

Sequence-level knowledge distillation

Yoon Kim and Alexander M Rush. “Sequence-level knowledge distillation”. In: Proceedings of the 2016 conference on empirical methods in natural language processing. 2016, pp. 1317–1327

2016
[23]

What learning systems do intelligent agents need? Complementary learning systems theory updated

Dharshan Kumaran, Demis Hassabis, and James L. McClelland. “What learning systems do intelligent agents need? Complementary learning systems theory updated”. In: Trends in Cognitive Sciences 20.7 (2016), pp. 512–534

2016
[24]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. “SQuAD: 100,000+ Questions for Machine Comprehension of Text”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Ed. by Jian Su, Kevin Duh, and Xavier Carreras. Association for Computational Linguistics, 2016. url: https://aclanthology.org/D16-1264/

2016
[25]

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”. In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Proceedings of Machine Learning Research. PMLR, 2017. url: https://proceedings.mlr.press/v70/finn17a.html

2017
[26]

Neuroscience-inspired artificial intelligence

Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. “Neuroscience-inspired artificial intelligence”. In: Neuron 95.2 (2017), pp. 245–258

2017
[27]

Overcoming catastrophic forgetting in neural networks

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. “Overcoming catastrophic forgetting in neural networks”. In: Proceedings of the national academy of sciences 114.13 (2017), pp. 3521–3526

2017
[28]

REM sleep selectively prunes and maintains new synapses in development and learning

Wei Li, Lei Ma, Guang Yang, and Wen-Biao Gan. “REM sleep selectively prunes and maintains new synapses in development and learning”. In: Nature neuroscience 20.3 (2017), pp. 427–437

2017
[29]

Attention is All you Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is All you Need”. In: Advances in Neural Information Processing Systems. Ed. by I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc., 2017. url: htt...

2017
[30]

Recurrent world models facilitate policy evolution

David Ha and Jürgen Schmidhuber. “Recurrent world models facilitate policy evolution”. In: Advances in Neural Information Processing Systems (NeurIPS). Vol. 31. 2018, pp. 2451–2463

2018
[31]

World models

David Ha and Jürgen Schmidhuber. “World models”. In: arXiv preprint arXiv:1803.10122 2.3 (2018), p. 440

Pith/arXiv arXiv 2018
[32]

Measuring catastrophic forgetting in neural networks

Ronald Kemker, Marc McClure, Angelina Abitino, Tyler Hayes, and Christopher Kanan. “Measuring catastrophic forgetting in neural networks”. In: Proceedings of the AAAI conference on artificial intelligence. Vol. 32. 1. 2018

2018
[33]

An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction

Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. “An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Inte...

2019
[34]

Fast transformer decoding: One write-head is all you need

Noam Shazeer. “Fast transformer decoding: One write-head is all you need”. In: arXiv preprint arXiv:1911.02150 (2019)

Pith/arXiv arXiv 2019
[35]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901

2020
[36]

Efficient Intent Detection with Dual Sentence Encoders

Inigo Casanueva, Tadas Temcinas, Daniela Gerz, Matthew Henderson, and Ivan Vulic. “Efficient Intent Detection with Dual Sentence Encoders”. In: ACL 2020 (2020), p. 38

2020
[37]

Can sleep protect memories from catastrophic forgetting?

Oscar C. González, Yury Sokolov, Giri P. Krishnan, Jean Erik Delanois, and Maxim Bazhenov. “Can sleep protect memories from catastrophic forgetting?” In: eLife 9 (2020), e51005

2020
[38]

Dream to control: Learning behaviors by latent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. “Dream to control: Learning behaviors by latent imagination”. In: International Conference on Learning Representations (ICLR). 2020

2020
[39]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. “Transformers are rnns: Fast autoregressive transformers with linear attention”. In: International conference on machine learning. PMLR. 2020, pp. 5156–5165

2020
[40]

A dataset of information-seeking questions and answers anchored in research papers

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. “A dataset of information-seeking questions and answers anchored in research papers”. In: arXiv preprint arXiv:2105.03011 (2021)

arXiv 2021
[41]

Stepwise synaptic plasticity events drive the early phase of memory consolidation

Akihiro Goto, Ayaka Bota, Ken Miya, Jingbo Wang, Suzune Tsukamoto, Xinzhi Jiang, Daichi Hirai, Masanori Murayama, Tomoki Matsuda, Thomas J. McHugh, Takeharu Nagai, and Yasunori Hayashi. “Stepwise synaptic plasticity events drive the early phase of memory consolidation”. In: Science 374.6569 (2021), pp. 857–863. doi: 10.1126/science.abj9195. eprint: https://...

work page doi:10.1126/science.abj9195 2021
[42]

Principles of neural science

Eric R Kandel, John D Koester, Sarah H Mack, and Steven Siegelbaum. Principles of neural science. McGraw-Hill, 2021

2021
[43]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. “The power of scale for parameter-efficient prompt tuning”. In: arXiv preprint arXiv:2104.08691 (2021)

Pith/arXiv arXiv 2021
[44]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Xiang Lisa Li and Percy Liang. “Prefix-Tuning: Optimizing Continuous Prompts for Generation”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Ed. by Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli. Onlin...

work page doi:10.18653/v1/2021.acl- 2021
[45]

What learning algorithm is in-context learning? investigations with linear models

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. “What learning algorithm is in-context learning? investigations with linear models”. In: arXiv preprint arXiv:2211.15661 (2022)

Pith/arXiv arXiv 2022
[46]

Recurrent memory transformer

Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. “Recurrent memory transformer”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 11079–11091

2022
[47]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. “LoRA: Low-Rank Adaptation of Large Language Models”. In: International Conference on Learning Representations. 2022. url: https://openreview.net/forum?id=nZeVKeeFYf9

2022
[48]

A modern self-referential weight matrix that learns to modify itself

Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. “A modern self-referential weight matrix that learns to modify itself”. In: International Conference on Machine Learning. PMLR. 2022. url: https://proceedings.mlr.press/v162/irie22b.html

2022
[49]

Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks

Timothy Tadros, Giri P. Krishnan, Ramyaa Ramyaa, and Maxim Bazhenov. “Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks”. In: Nature Communications 13.1 (2022), p. 7742

2022
[50]

STaR: Bootstrapping Reasoning With Reasoning

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. “STaR: Bootstrapping Reasoning With Reasoning”. In: Advances in Neural Information Processing Systems. Ed. by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Curran Associates, Inc., 2022. url: https://proceedings.neurips.cc/paper_files/paper/2022/file/639a9a172c044fbb64175b5fad42e9a...

2022
[51]

Gpt-4 technical report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. “Gpt-4 technical report”. In: arXiv preprint arXiv:2303.08774 (2023)

Pith/arXiv arXiv 2023
[52]

Gqa: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. “Gqa: Training generalized multi-query transformer models from multi-head checkpoints”. In: arXiv preprint arXiv:2305.13245 (2023)

Pith/arXiv arXiv 2023
[53]

Adapting language models to compress contexts

Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. “Adapting language models to compress contexts”. In: arXiv preprint arXiv:2305.14788 (2023)

arXiv 2023
[54]

In-context autoencoder for context compression in a large language model

Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. “In-context autoencoder for context compression in a large language model”. In: arXiv preprint arXiv:2307.06945 (2023)

arXiv 2023
[55]

Mixture of cluster-conditional lora experts for vision-language instruction tuning

Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. “Mixture of cluster-conditional lora experts for vision-language instruction tuning”. In: arXiv preprint arXiv:2312.12379 (2023)

arXiv 2023
[56]

Mamba: Linear-time sequence modeling with selective state spaces

Albert Gu and Tri Dao. “Mamba: Linear-time sequence modeling with selective state spaces”. In: arXiv preprint arXiv:2312.00752 (2023)

Pith/arXiv arXiv 2023
[57]

Lorahub: Efficient cross-task generalization via dynamic lora composition

Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. “Lorahub: Efficient cross-task generalization via dynamic lora composition”. In: arXiv preprint arXiv:2307.13269 (2023)

arXiv 2023
[58]

Llmlingua: Compressing prompts for accelerated inference of large language models

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. “Llmlingua: Compressing prompts for accelerated inference of large language models”. In: arXiv preprint arXiv:2310.05736 (2023)

[59] Yucheng Li. “Unlocking context constraints of llms: Enhancing context efficiency of llms with self-information-based content filtering”. In: arXiv preprint arXiv:2304.12102 (2023)

[60] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. “CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis”. In: The Eleventh International Conference on Learning Representations. 2023. url: https://openreview.net/forum?id=iaYcJKpY2B_

[61] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. “Memgpt: Towards llms as operating systems”. In: arXiv preprint arXiv:2310.08560 (2023)

[62] Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. “Are emergent abilities of large language models a mirage?” In: Advances in Neural Information Processing Systems 36 (2023), pp. 55565–55581

[63] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. “Visionllm: Large language model is also an open-ended decoder for vision-centric tasks”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 61501–61513

[64] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. “H2o: Heavy-hitter oracle for efficient generative inference of large language models”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 34661–34710

[65] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. “Phi-4 technical report”. In: arXiv preprint arXiv:2412.08905 (2024)

[66] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. “On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes”. In: The Twelfth International Conference on Learning Representations. 2024. url: https://openreview.net/forum?id=3zKtaqxLhW

[67] Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. “The Surprising Effectiveness of Test-Time Training for Few-Shot Learning”. In: Forty-second International Conference on Machine Learning. 2024

[68] Ekin Akyürek, Bailin Wang, Yoon Kim, and Jacob Andreas. “In-context language learning: Architectures and algorithms”. In: arXiv preprint arXiv:2401.12973 (2024)

[69] Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, James Zou, Atri Rudra, and Christopher Re. “Simple linear attention language models balance the recall-throughput tradeoff”. In: Forty-first International Conference on Machine Learning. 2024. url: https://openreview.net/forum?id=e93ffDcpH3

[70] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. “xLSTM: Extended Long Short-Term Memory”. In: arXiv preprint arXiv:2405.04517 (2024)

[71] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. “Titans: Learning to memorize at test time”. In: arXiv preprint arXiv:2501.00663 (2024)

[72] Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. “Dated Data: Tracing Knowledge Cutoffs in Large Language Models”. In: First Conference on Language Modeling. 2024. url: https://openreview.net/forum?id=wS7PxDjy6m

[73] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. “A Survey on In-context Learning”. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Ed. by Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen. Miami, Florida, USA: Asso...

[74] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. “The llama 3 herd of models”. In: arXiv e-prints (2024), arXiv–2407

[75] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” In: First Conference on Language Modeling. 2024. url: https://openreview.net/forum?id=kIoBbc76Sy

[76] Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats Leon Richter, Quentin Gregory Anthony, Eugene Belilovsky, Timothée Lesort, and Irina Rish. “Simple and Scalable Strategies to Continually Pre-train Large Language Models”. In: Transactions on Machine Learning Research (2024). issn: 2835-8856. url: https://openreview.net/forum?id=DimPeeCxKO

[77] Kalle Kujanpää, Harri Valpola, and Alexander Ilin. “Knowledge injection via prompt distillation”. In: arXiv preprint arXiv:2412.14964 (2024)

[78] Yury Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. “Babilong: Testing the limits of llms with long context reasoning-in-a-haystack”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 106519–106554

[79] Dengchun Li, Yingzi Ma, Naizheng Wang, Zhengmao Ye, Zhiyuan Cheng, Yinghao Tang, Yan Zhang, Lei Duan, Jie Zuo, Cal Yang, et al. “Mixlora: Enhancing large language models fine-tuning with lora-based mixture of experts”. In: arXiv preprint arXiv:2404.15159 (2024)

[80] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. “Snapkv: Llm knows what you are looking for before generation”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 22947–22970
