Test-Time Training with KV Binding Is Secretly Linear Attention
Abstract
Test-time training (TTT) with KV binding as a sequence-modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity. Project page: https://research.nvidia.com/labs/sil/projects/tttla/.
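To illustrate the kind of equivalence the abstract claims, here is a minimal numpy sketch, not the paper's exact derivation. It assumes the simplest TTT variant: fast weights W initialized to zero, a squared KV-reconstruction loss 0.5·||W k_t − v_t||², one gradient step per token, and the gradient linearized at W = 0 (dropping the W k_t term). The learning rate `lr` and all tensor shapes are hypothetical. Under those assumptions, the online TTT readout coincides exactly with unnormalized causal linear attention.

```python
# Sketch: single-step TTT on a KV-binding loss vs. unnormalized linear attention.
# All names (Q, K, V, lr) are illustrative, not the paper's notation.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                        # sequence length, head dimension
Q = rng.normal(size=(T, d))        # queries
K = rng.normal(size=(T, d))        # keys
V = rng.normal(size=(T, d))        # values
lr = 0.5                           # hypothetical test-time learning rate

# --- TTT view: fast weights W memorize the key->value binding online. ---
# With W_0 = 0 and the gradient of 0.5*||W k_t - v_t||^2 linearized at W = 0,
# the per-token update is W_t = W_{t-1} + lr * v_t k_t^T.
W = np.zeros((d, d))
ttt_out = []
for t in range(T):
    W = W + lr * np.outer(V[t], K[t])   # linearized TTT update
    ttt_out.append(W @ Q[t])            # read out with the current query
ttt_out = np.stack(ttt_out)

# --- Linear attention view: o_t = lr * sum_{i<=t} v_i (k_i^T q_t). ---
lin_out = np.stack([
    lr * sum(V[i] * (K[i] @ Q[t]) for i in range(t + 1)) for t in range(T)
])

# The two computations agree token by token.
assert np.allclose(ttt_out, lin_out)
print("linearized single-step TTT == unnormalized causal linear attention")
```

The identity follows because W_t = lr · Σ_{i≤t} v_i k_iᵀ, so W_t q_t = lr · Σ_{i≤t} v_i (k_iᵀ q_t); the paper's contribution is showing that a much broader class of TTT layers admits reductions of this kind.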
Forward citations
Cited by 2 Pith papers
- Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
- Fast Spatial Memory with Elastic Test-Time Training
Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.