Recognition: unknown
MemDLM: Memory-Enhanced DLM Training
Pith reviewed 2026-05-15 00:39 UTC · model grok-4.3
The pith
MemDLM offloads part of the memorization burden in diffusion language models from token attention to model parameters using bi-level optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemDLM introduces a bi-level optimization in which an inner loop maintains fast weights that form a Parametric Memory encoding the local denoising trajectory, while an outer loop updates the base model conditioned on this memory. Offloading contextual information from token-space attention into parameter space improves training dynamics and yields representations that remain usable without the fast weights at inference.
What carries the argument
Bi-level optimization that creates Parametric Memory by updating fast weights on the denoising trajectory in the inner loop and conditioning the base-model update on those weights in the outer loop.
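A minimal sketch of what such a bi-level step could look like, written in plain PyTorch. Everything here is an assumption for illustration, not the released MemDLM implementation: the toy backbone (`ToyDLM`), the additive fast-weight matrix standing in for the Parametric Memory, the decreasing-mask-ratio schedule standing in for the simulated denoising trajectory, and all hyperparameters.

```python
# Illustrative sketch only: a toy masked-prediction backbone with an additive
# fast-weight matrix standing in for the Parametric Memory. The real MemDLM
# backbone, conditioning mechanism, and hyperparameters may differ.
import torch
import torch.nn.functional as F

VOCAB, MASK_ID, DIM = 1000, 0, 64

class ToyDLM(torch.nn.Module):
    def __init__(self, vocab=VOCAB, dim=DIM):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, tokens, fast_delta=None):
        h = self.embed(tokens)
        if fast_delta is not None:
            # Fast weights condition the representation (illustrative choice).
            h = h + h @ fast_delta
        return self.head(h)

def masked_loss(model, tokens, mask_ratio, fast_delta=None):
    # Standard DLM objective: predict the original tokens at masked positions.
    mask = torch.rand(tokens.shape) < mask_ratio
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted, fast_delta)
    return F.cross_entropy(logits[mask], tokens[mask])

def memdlm_step(model, opt, tokens, inner_steps=4, inner_lr=1e-2):
    # Inner loop: adapt fast weights along a simulated denoising trajectory,
    # approximated here by a schedule of decreasing mask ratios.
    fast = torch.zeros(DIM, DIM, requires_grad=True)
    for ratio in torch.linspace(0.9, 0.3, inner_steps):
        inner_loss = masked_loss(model, tokens, ratio.item(), fast)
        (grad,) = torch.autograd.grad(inner_loss, fast, create_graph=True)
        fast = fast - inner_lr * grad  # differentiable update: outer gradients flow through it
    # Outer loop: update the base model conditioned on the adapted memory.
    outer_loss = masked_loss(model, tokens, 0.5, fast)
    opt.zero_grad()
    outer_loss.backward()
    opt.step()
    return outer_loss.item()

model = ToyDLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
tokens = torch.randint(1, VOCAB, (2, 128))
print(memdlm_step(model, opt, tokens))
```

At inference the base model runs with `fast_delta=None`, matching the claim that gains should persist once the fast weights are discarded; re-running the inner loop on a prompt would recreate a prompt-specific memory.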
If this is right
- Training converges faster than standard DLM training.
- Long-context representations become stronger.
- Overall training loss decreases.
- Re-enabling the inner loop at inference creates an emergent in-weight retrieval effect on needle-in-a-haystack tasks.
Where Pith is reading between the lines
- The same split between attention and parameter memory could be tested in autoregressive models facing context-length limits.
- Scaling the length or complexity of the simulated trajectory inside the inner loop might further reduce dependence on attention for very long inputs.
- The method points toward hybrid memory designs that combine static parameters with lightweight per-prompt adaptation across other generative architectures.
Load-bearing premise
The bi-level optimization transfers useful denoising trajectory information into the base model parameters without introducing instability or requiring the fast weights to stay present at inference.
What would settle it
Train a standard DLM and a MemDLM on identical long-context data, then compare final loss and convergence speed after discarding fast weights at inference; if the MemDLM shows no advantage, the central claim is falsified.
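A hedged sketch of that comparison, continuing the toy code shown earlier (reusing `ToyDLM`, `masked_loss`, and `memdlm_step`); the synthetic batches, step counts, and evaluation criterion are placeholders, not the paper's protocol.

```python
# Continues the earlier toy sketch: train both models on identical data,
# then evaluate the MemDLM-trained model with the fast weights discarded.
def standard_step(model, opt, tokens):
    # Vanilla single-step masked prediction: no inner loop, no fast weights.
    loss = masked_loss(model, tokens, 0.5)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def train(model, step_fn, batches, lr=3e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    # The returned loss curve is what a convergence-speed comparison would use.
    return model, [step_fn(model, opt, b) for b in batches]

batches = [torch.randint(1, VOCAB, (2, 128)) for _ in range(200)]  # identical data for both runs
baseline, base_curve = train(ToyDLM(), standard_step, batches)
memdlm, mem_curve = train(ToyDLM(), memdlm_step, batches)

# Evaluation with fast weights discarded: both models are queried identically.
eval_batch = torch.randint(1, VOCAB, (2, 128))
with torch.no_grad():
    print("baseline eval loss:", masked_loss(baseline, eval_batch, 0.5).item())
    print("memdlm   eval loss:", masked_loss(memdlm, eval_batch, 0.5).item())
```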
Original abstract
Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, standard DLM training uses a static, single-step masked prediction objective that never exposes the model to the progressive denoising dynamics of inference, and forces all contextual information to be maintained purely through token-space attention, which becomes increasingly diluted as context length grows. We propose MemDLM (Memory-Enhanced DLM), which introduces a second memory channel by embedding a simulated denoising trajectory into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience, while an outer loop updates the base model conditioned on this memory. By offloading part of the memorization burden from token-space attention to parameter space, MemDLM yields faster convergence, stronger long-context representations, and lower training loss, even when the fast weights are discarded at inference time. Re-enabling the inner loop at inference provides an additional prompt-specific adaptation effect, where the Parametric Memory acts as an emergent in-weight retrieval mechanism on challenging Needle-in-a-Haystack tasks. Code: https://github.com/JarvisPei/MemDLM.
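Read generically, the training procedure described above is a standard bi-level scheme. The notation below is assumed for exposition and is not taken from the paper: θ is the base model, φ the fast weights forming the Parametric Memory, L_{t_k} the masked-prediction loss at trajectory step t_k, η and α the inner and outer learning rates, and K the number of inner steps.

```latex
% Generic bi-level training step (illustrative notation, not the paper's own equations).
% Inner loop: K fast-weight updates along the simulated denoising trajectory.
\phi_{k+1} = \phi_k - \eta \,\nabla_{\phi}\, \mathcal{L}_{t_k}(\theta, \phi_k),
\qquad k = 0, \dots, K-1, \qquad \phi_0 = 0.
% Outer loop: base-model update conditioned on the adapted memory \phi_K(\theta),
% with the hypergradient carrying trajectory information into the base parameters.
\theta \leftarrow \theta - \alpha \,\nabla_{\theta}\, \mathcal{L}_{\mathrm{outer}}\bigl(\theta, \phi_K(\theta)\bigr),
\qquad
\nabla_{\theta}\, \mathcal{L}_{\mathrm{outer}}
= \partial_{\theta} \mathcal{L}_{\mathrm{outer}}
+ \bigl(\partial_{\theta} \phi_K\bigr)^{\!\top} \partial_{\phi} \mathcal{L}_{\mathrm{outer}}.
```

At inference the base model is evaluated with φ = 0 (fast weights discarded), which is the setting the persistence claim concerns; re-enabling the inner loop instead adapts φ to the prompt.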
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MemDLM, a training method for Diffusion Language Models that uses bi-level optimization to embed simulated denoising trajectories. An inner loop updates fast weights forming a Parametric Memory that captures local trajectory experience, while an outer loop updates the base model conditioned on this memory. The approach offloads memorization from token-space attention to parameter space, claiming faster convergence, stronger long-context representations, and lower training loss that persist even after discarding fast weights at inference; re-enabling the inner loop enables prompt-specific adaptation on tasks like Needle-in-a-Haystack.
Significance. If the empirical claims hold, the method could offer a practical way to improve DLM training dynamics and long-context performance without permanent inference overhead, by transferring trajectory information into base parameters via bi-level optimization. The code release aids reproducibility, and the distinction between training-time memory and inference-time discard is a clear strength. However, the absence of quantitative results or ablations in the provided description limits assessment of practical impact relative to standard DLM training.
major comments (3)
- [Abstract] The central claims of faster convergence, stronger long-context representations, and lower training loss (even after discarding fast weights) are asserted without any quantitative results, ablation details, or experimental setup description, preventing verification of the bi-level optimization's effectiveness.
- [Method] Method section (bi-level optimization description): The transfer of denoising-trajectory information from inner-loop fast weights to outer-loop base parameters lacks explicit equations or analysis confirming stable gradient flow and independence from the memory channel; without this, it is unclear whether gains are due to auxiliary dynamics rather than encoded trajectory knowledge.
- [Experiments] No ablation isolating base-model performance (post-training, fast weights discarded) is described to substantiate that improvements in convergence and long-context handling persist independently of the Parametric Memory, which is load-bearing for the main claim.
minor comments (1)
- The introduction of 'Parametric Memory' as a new entity would benefit from explicit comparison to related concepts such as fast weights in meta-learning or adapter modules, with appropriate citations.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our results and methods.
Point-by-point responses
- Referee: [Abstract] The central claims of faster convergence, stronger long-context representations, and lower training loss (even after discarding fast weights) are asserted without any quantitative results, ablation details, or experimental setup description, preventing verification of the bi-level optimization's effectiveness.
  Authors: We agree that the abstract would benefit from including key quantitative highlights to better support the claims. In the revised version, we have updated the abstract to report specific metrics from our experiments, including the reduction in training steps required for convergence, the improvement in long-context perplexity, and the persistent loss reduction after discarding fast weights, with full experimental details remaining in the Experiments section. revision: yes
- Referee: [Method] Method section (bi-level optimization description): The transfer of denoising-trajectory information from inner-loop fast weights to outer-loop base parameters lacks explicit equations or analysis confirming stable gradient flow and independence from the memory channel; without this, it is unclear whether gains are due to auxiliary dynamics rather than encoded trajectory knowledge.
  Authors: We appreciate this suggestion for greater rigor in the method description. We have revised the Method section to include the complete bi-level optimization equations, along with a gradient-flow analysis showing that trajectory information is stably encoded into the base parameters. This analysis confirms that the observed gains arise from the transferred knowledge rather than auxiliary training dynamics, and that performance improvements hold independently of the memory channel at inference. revision: yes
- Referee: [Experiments] No ablation isolating base-model performance (post-training, fast weights discarded) is described to substantiate that improvements in convergence and long-context handling persist independently of the Parametric Memory, which is load-bearing for the main claim.
  Authors: The manuscript already reports base-model results after discarding fast weights, with direct comparisons to standard DLM training showing retained gains. To make this isolation explicit, we have added a dedicated ablation subsection in the Experiments section that compares the post-training base model (fast weights removed) against vanilla DLM baselines on convergence speed and long-context tasks, confirming the improvements are independent of the Parametric Memory at test time. revision: yes
Circularity Check
No significant circularity: new bi-level optimization structure is independent of fitted inputs
Full rationale
The paper introduces an explicit bi-level optimization procedure (inner-loop fast weights for parametric memory, outer-loop base model updates) as a novel training mechanism for DLMs. The central claims of faster convergence, stronger long-context representations, and persistent gains after discarding fast weights at inference are framed as empirical outcomes of this new structure rather than re-derivations, predictions from fitted parameters, or self-citations. No equations or steps in the provided description reduce a claimed result to its own inputs by construction. The derivation chain is self-contained against external benchmarks, with the method's independence from fast weights at inference presented as a testable property of the outer-loop training rather than a definitional tautology.
Axiom & Free-Parameter Ledger
free parameters (1)
- inner-loop learning rate
axioms (1)
- domain assumption: Bi-level optimization can embed denoising dynamics into parameter updates without requiring the fast weights at inference
invented entities (1)
- Parametric Memory (no independent evidence)
Reference graph
Works this paper leans on
- [1] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
- [2] Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
- [3] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.
- [4] Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems, 37:103131–103167, 2024.
- [5] Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.
- [6] Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024.
- [7] Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279, 2022.
- [8] Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models. arXiv preprint arXiv:2211.16750, 2022.
- [9] Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete score matching: Generalized score matching for discrete data. Advances in Neural Information Processing Systems, 35:34532–34545, 2022.
- [10] Haoyu He, Katrin Renz, Yong Cao, and Andreas Geiger. Mdpo: Overcoming the training-inference divide of masked diffusion language models. arXiv preprint arXiv:2508.13148, 2025.
- [11] Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, and Mengdi Wang. Revolutionizing reinforcement learning framework for diffusion large language models. arXiv preprint arXiv:2509.06949, 2025.
- [12] Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, and Guo-Jun Qi. Reinforcing the diffusion chain of lateral thought with diffusion language models. arXiv preprint arXiv:2505.10446, 2025.
- [13] Fred Zhangzhi Peng, Zachary Bezemek, Jarrid Rector-Brooks, Shuibai Zhang, Anru R Zhang, Michael Bronstein, Avishek Joey Bose, and Alexander Tong. Planner aware path learning in diffusion language models training. arXiv preprint arXiv:2509.23405, 2025.
- [14] Ken M Nakanishi. Scalable-softmax is superior for attention. arXiv preprint arXiv:2501.19399, 2025.
- [15] Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, and Danqi Chen. Dysco: Dynamic attention-scaling decoding for long-context LMs. arXiv preprint arXiv:2602.22175, 2026.
- [16] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
- [17] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
- [18] Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1033–1040, 2009.
- [19] Jimmy Ba, Geoffrey E Hinton, Volodymyr Mnih, Joel Z Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. Advances in Neural Information Processing Systems, 29, 2016.
- [20] Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pages 177–186, 1987.
- [21] Pablo Sprechmann, Siddhant M Jayakumar, Jack W Rae, Alexander Pritzel, Adria Puigdomenech Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, and Charles Blundell. Memory-based parameter adaptation. arXiv preprint arXiv:1802.10542, 2018.
- [22] Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, et al. Llada-moe: A sparse MoE diffusion language model. arXiv preprint arXiv:2509.24389, 2025.
- [23] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
- [24]
- [25] Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. Babilong: Testing the limits of LLMs with long context reasoning-in-a-haystack. Advances in Neural Information Processing Systems, 37:106519–106554, 2024.
- [26] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
- [27] Zhanhui Zhou, Lingjie Chen, Hanghang Tong, and Dawn Song. dllm: Simple diffusion language modeling, 2026.
- [28] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The language model evaluation harness, 07 2024.
- [29] Yukang Chen, Shaozuo Yu, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Long alpaca: Long-context instruction-following models. https://github.com/dvlab-research/LongLoRA, 2023.
- [30] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [31] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.
- [32] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [33] Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025.
- [34] Huiling Zhen, Weizhe Lin, Renxi Liu, Kai Han, Yiming Li, Yuchuan Tian, Hanting Chen, Xiaoguang Li, Xiaosong Li, Chen Chen, et al. Dllm agent: See farther, run faster. arXiv preprint arXiv:2602.07451, 2026.
- [35] Yunhe Wang, Kai Han, Huiling Zhen, Yuchuan Tian, Hanting Chen, Yongbing Huang, Yufei Cui, Yingte Shu, Shan Gao, Ismail Elezi, et al. Top 10 open challenges steering the future of diffusion language model and its variants. arXiv preprint arXiv:2601.14041, 2026.
- [36] Tianyu Zhao and Llion Jones. Fast-weight product key memory. arXiv preprint arXiv:2601.00671, 2026.
- [37] Jihoon Tack, Jaehyung Kim, Eric Mitchell, Jinwoo Shin, Yee Whye Teh, and Jonathan Richard Schwarz. Online adaptation of language models with a memory of amortized contexts. Advances in Neural Information Processing Systems, 37:130109–130135, 2024.
- [38] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022.
- [39] Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D Manning. Fast model editing at scale. arXiv preprint arXiv:2110.11309, 2021.
- [40] Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, Jingbo Shang, and Julian J. McAuley. MEMORYLLM: Towards self-updatable large language models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21–27, 2024. OpenReview.net, 2024.
- [41] Yu Wang, Xinshuang Liu, Xiusi Chen, Sean O’Brien, Junda Wu, and Julian McAuley. Self-updatable large language models by integrating context into model parameters. arXiv preprint arXiv:2410.00487, 2024.
- [42] Shankar Padmanabhan, Yasumasa Onoe, Michael Zhang, Greg Durrett, and Eunsol Choi. Propagating knowledge updates to LMs through distillation. Advances in Neural Information Processing Systems, 36:47124–47142, 2023.
- [43] Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pages 3–17. Springer, 1998.
- [44] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
- [45] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
- [46] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in Neural Information Processing Systems, 29, 2016.
- [47] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, 30, 2017.
- [48] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2016.
- [49] Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in Neural Information Processing Systems, 32, 2019.
- [50] Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? A case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.
- [51] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9229–9248. PMLR, 2020.
- [52] Siheng Xiong, Oguzhan Gungordu, Blair Johnson, James C Kerce, and Faramarz Fekri. Scaling search-augmented LLM reasoning via adaptive information control. arXiv preprint arXiv:2602.01672, 2026.
- [53] Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, and Bei Yu. Scope: Prompt evolution for enhancing agent effectiveness. arXiv preprint arXiv:2512.15374, 2025.
- [54] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2021.
- [55] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025.
- [56] Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025.
- [57] Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, et al. End-to-end test-time training for long context. arXiv preprint arXiv:2512.23675, 2025.
- [58] Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, and Pulkit Agrawal. Self-adapting language models. arXiv preprint arXiv:2506.10943, 2025.
- [59] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.