pith. machine review for the scientific record.

arxiv: 2505.23884 · v1 · submitted 2025-05-29 · 💻 cs.LG · cs.CL · cs.CV

Recognition: 3 theorem links


Test-Time Training Done Right

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords test-time training · large chunk updates · fast weights · long-context modeling · video diffusion · novel view synthesis · GPU utilization · online adaptation

The pith

Large-chunk updates during inference make test-time training efficient enough to scale nonlinear states to 40 percent of model parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that switching test-time training to very large update chunks, from 2K to 1M tokens, raises GPU utilization by orders of magnitude compared with the tiny minibatches used before. Prior approaches kept fast-weight updates so small that hardware sat idle most of the time and state capacity stayed limited. With large chunks the nonlinear state can grow to 40 percent of total parameters, sophisticated optimizers integrate easily, and no custom kernels are required. This yields working 14-billion-parameter autoregressive video diffusion on 56K-token sequences and novel-view synthesis with a 1-million-token context. Readers should care because the change removes the main practical obstacle to deploying test-time adaptation on long, high-dimensional data.

Core claim

LaCT performs test-time weight updates on extremely large chunks of 2K to 1M tokens. This raises hardware utilization by orders of magnitude and lets the fast weights grow to 40 percent of model parameters, increasing state capacity and enabling large-scale applications, such as 14B-parameter autoregressive video diffusion on 56K-token sequences and 1M-token novel view synthesis, without custom kernels.

What carries the argument

Large Chunk Test-Time Training (LaCT), the practice of adapting fast weights on massive token segments instead of small online minibatches to raise utilization and state capacity.
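
The mechanism is easy to sketch. Below is a minimal, illustrative PyTorch rendering of an update-then-apply chunk loop; the two-matrix fast-weight network, the reconstruction objective, and the chunk size and learning rate are assumptions for illustration, not the paper's exact SwiGLU fast weights or training recipe.

```python
import torch
import torch.nn.functional as F

def fast_weight_apply(W1, W2, x):
    # Hypothetical 2-layer fast-weight network (a stand-in for a SwiGLU-style
    # fast weight): hidden = silu(x @ W1), output = hidden @ W2.
    return F.silu(x @ W1) @ W2

def lact_layer(keys, values, queries, W1, W2, chunk=4096, lr=1e-2):
    """Large-chunk test-time training, sketched.

    Instead of updating the fast weights every 16-64 tokens, each online
    gradient step consumes one large chunk (thousands of tokens), so the
    matmuls stay large and the GPU stays busy.
    """
    outputs = []
    n = keys.shape[0]
    for start in range(0, n, chunk):
        k = keys[start:start + chunk]
        v = values[start:start + chunk]
        q = queries[start:start + chunk]

        # One gradient step on the whole chunk (a reconstruction loss is an
        # illustrative choice of test-time objective).
        W1 = W1.detach().requires_grad_(True)
        W2 = W2.detach().requires_grad_(True)
        loss = F.mse_loss(fast_weight_apply(W1, W2, k), v)
        g1, g2 = torch.autograd.grad(loss, (W1, W2))
        W1 = W1 - lr * g1
        W2 = W2 - lr * g2

        # Apply the freshly updated state to this chunk's queries.
        outputs.append(fast_weight_apply(W1, W2, q))
    return torch.cat(outputs), (W1, W2)
```

The point of the chunk loop is that every matrix multiply involves thousands of rows, so the same adaptation that previously ran at a few percent of peak FLOPs becomes a large, well-shaped GEMM.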

If this is right

  • Nonlinear state size can scale to 40 percent of total model parameters without custom kernels.
  • Sophisticated optimizers such as Muon integrate directly into the online update step (see the sketch after this list).
  • Autoregressive video diffusion models reach 14 billion parameters on sequences of 56K tokens.
  • Novel-view synthesis handles context lengths of 1 million tokens on standard hardware.
  • The same chunking approach applies across language, image sets, and video modalities.
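
On the optimizer bullet above: Muon-style updates orthogonalize the momentum with a few Newton-Schulz iterations before applying it, which becomes practical once each online step operates on a whole large chunk. The sketch below is a generic Muon-style step on one fast-weight matrix, not the paper's exact integration; the quintic coefficients and iteration count follow choices common in public Muon implementations, and the learning rate and momentum values are placeholders.

```python
import torch

def orthogonalize(G, steps=5):
    # Newton-Schulz iteration that pushes the singular values of G toward 1.
    # Quintic coefficients as commonly used in public Muon implementations.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon-style online step on a fast-weight matrix (sketch).

    Momentum is accumulated as usual, but the applied update is the
    orthogonalized momentum rather than the raw gradient.
    """
    momentum_buf.mul_(beta).add_(grad)
    update = orthogonalize(momentum_buf)
    W = W - lr * update
    return W, momentum_buf
```

In the LaCT setting, `grad` would be the gradient of the per-chunk test-time objective with respect to that fast-weight matrix.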

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LaCT may reduce reliance on specialized long-context hardware by making high-utilization online adaptation available on ordinary GPUs.
  • The removal of tiny-minibatch constraints could let test-time training handle non-sequential data structures such as point clouds or graphs more directly.
  • Hybrid schedules that mix large and small chunks might be tested to retain strict causality where needed while keeping efficiency gains.
  • Because no custom kernels are required, individual labs can now experiment with state sizes far larger than those previously feasible.

Load-bearing premise

Performing weight updates on extremely large chunks of 2K to 1M tokens preserves or improves modeling quality relative to the fine-grained causal updates used in earlier test-time training work.

What would settle it

A controlled experiment on the same long-sequence task that compares a LaCT model using large chunks against an otherwise identical model updating on 16- or 64-token minibatches; the premise fails if the large-chunk model shows higher loss or lower accuracy.
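
A minimal, self-contained skeleton of that comparison is below: a toy linear fast weight adapted online on synthetic data whose key-to-value mapping drifts over time, with only the chunk size varied between runs. Everything here (the objective, the drift process, the learning rate) is an assumption for illustration; which chunk size wins depends on the task, which is exactly what such an ablation would measure.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def run_ttt(keys, values, queries, targets, chunk, lr=0.05):
    # Toy fast weight: a single linear map adapted online, chunk by chunk.
    d = keys.shape[1]
    W = torch.zeros(d, d)
    loss_sum, count = 0.0, 0
    for s in range(0, keys.shape[0], chunk):
        k, v, q, t = (x[s:s + chunk] for x in (keys, values, queries, targets))
        W = W.detach().requires_grad_(True)
        inner = F.mse_loss(k @ W, v)            # test-time objective on the chunk
        (g,) = torch.autograd.grad(inner, (W,))
        W = W - lr * g                          # one online step per chunk
        loss_sum += F.mse_loss(q @ W, t).item() * len(q)
        count += len(q)
    return loss_sum / count

# Synthetic sequence whose key-to-value map drifts slowly over time.
n, d = 8192, 32
keys = torch.randn(n, d)
drift = torch.cumsum(0.01 * torch.randn(n, d, d), dim=0)
values = torch.einsum('nd,nde->ne', keys, drift)
queries, targets = keys, values

for chunk in (64, 512, 4096):
    print(f"chunk {chunk:>5}: eval loss {run_ttt(keys, values, queries, targets, chunk):.4f}")
```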

read the original abstract

Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference. This fast weight, akin to recurrent states in RNNs, stores temporary memories of past tokens in the current sequence. Existing TTT methods struggled to show effectiveness in handling long-context data, due to their inefficiency on modern GPUs. The TTT layers in many of these approaches operate with extremely low FLOPs utilization (often <5%) because they deliberately apply small online minibatch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small minibatch implies fine-grained block-wise causal dependencies in the data, unsuitable for data beyond 1D ordered sequences, like sets or N-dimensional grids such as images or videos. In contrast, we pursue the opposite direction by using an extremely large chunk update, ranging from 2K to 1M tokens across tasks of varying modalities, which we refer to as Large Chunk Test-Time Training (LaCT). It improves hardware utilization by orders of magnitude, and more importantly, facilitates scaling of nonlinear state size (up to 40% of model parameters), hence substantially improving state capacity, all without requiring cumbersome and error-prone kernel implementations. It also allows easy integration of sophisticated optimizers, e.g. Muon for online updates. We validate our approach across diverse modalities and tasks, including novel view synthesis with image set, language models, and auto-regressive video diffusion. Our approach can scale up to 14B-parameter AR video diffusion model on sequences up to 56K tokens. In our longest sequence experiment, we perform novel view synthesis with 1 million context length. We hope this work will inspire and accelerate new research in the field of long-context modeling and test-time training. Website: https://tianyuanzhang.com/projects/ttt-done-right
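
The abstract's utilization argument can be made concrete with back-of-the-envelope arithmetic intensity. A fast-weight update is dominated by (tokens x d) by (d x d) matmuls; with 16 or 64 tokens per update the kernel moves nearly as many bytes as it computes FLOPs, while multi-thousand-token chunks push it firmly into the compute-bound regime. The hidden size, bf16 byte width, and roughly A100-class peak and bandwidth figures below are illustrative assumptions.

```python
# Back-of-the-envelope arithmetic intensity of one fast-weight matmul,
# (tokens x d) @ (d x d). Rough A100-class assumptions: ~312 TFLOP/s bf16 peak
# and ~2 TB/s HBM bandwidth, so a kernel needs roughly 312e12 / 2e12 ≈ 156
# FLOPs per byte to be compute-bound rather than memory-bound.
def arithmetic_intensity(tokens, d=2048, bytes_per_el=2):
    flops = 2 * tokens * d * d                             # multiply-accumulates
    bytes_moved = bytes_per_el * (tokens * d * 2 + d * d)  # read x, write y, read W
    return flops / bytes_moved

for tokens in (16, 64, 2048, 1_000_000):
    print(f"{tokens:>9} tokens per update: {arithmetic_intensity(tokens):8.1f} FLOPs/byte")
```

With these assumed numbers, 16- and 64-token updates land far below the compute-bound threshold, while chunks of 2K tokens and beyond sit well above it, which is the source of the orders-of-magnitude utilization gap the abstract describes.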

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that prior Test-Time Training (TTT) methods suffer from low GPU utilization (<5% FLOPs) due to small minibatch updates (16-64 tokens) and are unsuitable for non-sequential data. It introduces Large Chunk Test-Time Training (LaCT) using large chunks (2K-1M tokens) for fast-weight updates, which purportedly boosts hardware efficiency by orders of magnitude, enables scaling nonlinear state sizes to 40% of model parameters, supports advanced optimizers like Muon, and scales to a 14B-parameter autoregressive video diffusion model on 56K tokens plus 1M-context novel view synthesis without custom kernels.

Significance. If the empirical claims hold, LaCT could make TTT viable for long-context and multi-modal tasks on standard hardware, substantially increasing state capacity and broadening applicability beyond 1D sequences.

major comments (2)
  1. [Experiments] Experiments section: no side-by-side ablation compares large-chunk LaCT (2K-1M tokens) against fine-grained small-minibatch TTT on identical models, data, and metrics (e.g., perplexity or FID), which is load-bearing for the claim that large chunks preserve modeling quality.
  2. [Results] Results and abstract: concrete scaling claims (14B model, 56K tokens, 1M context, orders-of-magnitude utilization gains) are stated without tables, error bars, or quantitative hardware measurements, preventing verification of the central efficiency and capacity improvements.
minor comments (2)
  1. [Introduction] Clarify the precise definition and parameterization of 'nonlinear state size' on first use.
  2. [Experiments] Add a table summarizing hardware utilization (FLOPs %) for LaCT versus prior TTT baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The points raised highlight opportunities to strengthen the presentation of our experimental comparisons and quantitative results. We address each major comment below and commit to revisions that will improve verifiability without altering the core claims.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: no side-by-side ablation compares large-chunk LaCT (2K-1M tokens) against fine-grained small-minibatch TTT on identical models, data, and metrics (e.g., perplexity or FID), which is load-bearing for the claim that large chunks preserve modeling quality.

    Authors: We agree that an explicit side-by-side ablation on identical models and metrics would provide stronger evidence that large chunks preserve (or improve) modeling quality relative to small-minibatch TTT. The current experiments focus on regimes where small-minibatch TTT is impractical due to GPU utilization constraints and data modality requirements, but we will add a controlled ablation study on a smaller-scale language modeling task, directly comparing LaCT (large chunks) against small-minibatch variants while reporting perplexity and other relevant metrics. revision: yes

  2. Referee: [Results] Results and abstract: concrete scaling claims (14B model, 56K tokens, 1M context, orders-of-magnitude utilization gains) are stated without tables, error bars, or quantitative hardware measurements, preventing verification of the central efficiency and capacity improvements.

    Authors: The scaling results are demonstrated through successful end-to-end training and inference runs described in the experiments section. To enhance verifiability, we will expand the results section with dedicated tables that report quantitative hardware metrics (such as achieved FLOPs utilization percentages and throughput), include error bars from repeated runs where feasible, and provide direct numerical comparisons to baseline TTT utilization figures. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering choice with no derivation chain

full rationale

The paper introduces LaCT as a practical reversal of prior small-minibatch TTT design, justified by hardware utilization gains and empirical scaling results on language, video, and novel-view tasks. No equations, uniqueness theorems, or first-principles derivations are presented that reduce to fitted parameters or self-citations by construction. All performance claims (e.g., 14B model on 56K tokens, 1M-token NVS) rest on reported experiments rather than any self-definitional or fitted-input prediction loop. The work is self-contained against external benchmarks and contains no load-bearing self-citation chains.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on standard assumptions that online gradient updates on large chunks remain stable and that the chosen optimizer (Muon) transfers from offline to online use. No new physical or mathematical entities are postulated.

free parameters (1)
  • chunk size
    Chosen per task (2K–1M tokens); directly controls both efficiency and the granularity of causal dependencies.
axioms (1)
  • domain assumption: Online gradient steps on large chunks preserve modeling quality relative to fine-grained updates.
    Central to claiming that LaCT is not only faster but also effective.

pith-pipeline@v0.9.0 · 5668 in / 1267 out tokens · 26856 ms · 2026-05-16T11:21:09.142492+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.EightTick eight_tick_forces_D3 echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we pursue the opposite direction by using an extremely large chunk update, ranging from 2K to 1M tokens... Large Chunk Test-Time Training (LaCT). It improves hardware utilization by orders of magnitude, and more importantly, facilitates scaling of nonlinear state size (up to 40% of model parameters)

  • IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference. This fast weight, akin to recurrent states in RNNs, stores temporary memories of past tokens

  • IndisputableMonolith.Foundation.DiscretenessForcing discreteness_forced echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we adopt the opposite strategy and introduce Large Chunk Test-Time Training (LaCT). LaCT leverages extremely large chunk (from 2048 to 1M tokens) as the basic unit to update the fast weight

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • echoes: The paper passage shares the mathematical shape or conceptual pattern of a theorem, but is not a direct formal dependency.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Test-Time Training with KV Binding Is Secretly Linear Attention

    cs.LG 2026-02 conditional novelty 8.0

    Test-time training with KV binding reduces to learned linear attention.

  2. TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

    cs.CV 2026-04 unverdicted novelty 7.0

    TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.

  3. Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

    cs.CV 2026-04 unverdicted novelty 7.0

    Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

  4. MemDLM: Memory-Enhanced DLM Training

    cs.CL 2026-03 unverdicted novelty 7.0

    MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.

  5. ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

    cs.CV 2026-03 unverdicted novelty 7.0

    ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.

  6. Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.

  7. Lyra 2.0: Explorable Generative 3D Worlds

    cs.CV 2026-04 unverdicted novelty 6.0

    Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

  8. Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

  9. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  10. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  11. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  12. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  13. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  14. Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    cs.CV 2025-09 unverdicted novelty 6.0

    Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.

  15. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    cs.CV 2025-06 unverdicted novelty 6.0

    Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...

  16. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  17. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  18. Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

    cs.DC 2026-03 unverdicted novelty 5.0

    Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 18 Pith papers · 19 internal anchors

  1. [1]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  2. [2]

    Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024

  3. [3]

    Linear transformers are secretly fast weight programmers

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pages 9355–9366. PMLR, 2021

  4. [4]

    Test-time regression: a unifying framework for designing sequence models with associative memory

    Ke Alexander Wang, Jiaxin Shi, and Emily B. Fox. Test-time regression: a unifying framework for designing sequence models with associative memory, 2025

  5. [5]

    Titans: Learning to Memorize at Test Time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024

  6. [6]

    It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization, 2025

    Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization, 2025

  7. [7]

    Lattice: Learning to efficiently compress the memory

    Mahdi Karami and Vahab Mirrokni. Lattice: Learning to efficiently compress the memory. arXiv preprint arXiv:2504.05646, 2025

  8. [8]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

  9. [9]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

  10. [10]

    Parallelizing linear transformers with the delta rule over sequence length

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. arXiv preprint arXiv:2406.06484, 2024

  11. [11]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

  12. [12]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  13. [13]

    Gated linear attention transformers with hardware-efficient training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In International Conference on Machine Learning, pages 56501–56523. PMLR, 2024

  14. [14]

    Various lengths, constant speed: Efficient language modeling with lightning attention

    Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Various lengths, constant speed: Efficient language modeling with lightning attention. In Forty-first International Conference on Machine Learning, 2024

  15. [15]

    Parallelizing linear transformers with the delta rule over sequence length

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  16. [16]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  17. [17]

    Online normalizer calculation for softmax

    Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018

  18. [18]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022

  19. [19]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arxiv e-prints. arXiv preprint arXiv:1512.03385, 10:9, 2015

  20. [20]

    Llama 2: Open foundation and fine-tuned chat models, 2023

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  21. [21]

    Glu variants improve transformer, 2020

    Noam Shazeer. Glu variants improve transformer, 2020

  22. [22]

    Weight normalization: A simple reparameterization to accelerate training of deep neural networks

    Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in neural information processing systems, 29, 2016

  23. [23]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024

  24. [24]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668, 2024

  25. [25]

    Transformer quality in linear time

    Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer quality in linear time. In International conference on machine learning, pages 9099–9117. PMLR, 2022

  26. [26]

    Leave no context behind: Efficient infinite context transformers with infini-attention, 2024

    Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention, 2024

  27. [27]

    Plenoptic modeling: An image-based rendering system

    Leonard McMillan and Gary Bishop. Plenoptic modeling: An image-based rendering system. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 433–440. 2023

  28. [28]

    Light field rendering

    Marc Levoy and Pat Hanrahan. Light field rendering. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 441–452. 2023

  29. [29]

    Lvsm: A large view synthesis model with minimal 3d inductive bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. arXiv preprint arXiv:2410.17242, 2024

  30. [30]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  31. [31]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023

  32. [32]

    Gs-lrm: Large reconstruction model for 3d gaussian splatting

    Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. In European Conference on Computer Vision, pages 1–19. Springer, 2024

  33. [33]

    Google scanned objects: A high-quality dataset of 3d scanned household items

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022

  34. [34]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  35. [35]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651–

  36. [36]

    Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats

    Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. arXiv preprint arXiv:2410.12781, 2024

  37. [37]

    Long data collections database, 2024

    Together AI. Long data collections database, 2024

  38. [38]

    Forgetting transformer: Softmax attention with a forget gate

    Zhixuan Lin, Evgenii Nikishin, Xu He, and Aaron Courville. Forgetting transformer: Softmax attention with a forget gate. In The Thirteenth International Conference on Learning Representations, 2025

  39. [39]

    Ruler: What’s the real context size of your long-context language models? In First Conference on Language Modeling, 2024

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? In First Conference on Language Modeling, 2024

  40. [40]

    Effective long-context scaling of foundation models, 2023

    Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models, 2023

  41. [41]

    Base of rope bounds context length

    Xin Men, Mingyu Xu, Bingning Wang, Qingyu Zhang, Hongyu Lin, Xianpei Han, and Weipeng Chen. Base of rope bounds context length. arXiv preprint arXiv:2405.14591, 2024

  42. [42]

    Roformer: Enhanced transformer with rotary position embedding, 2023

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023

  43. [43]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  44. [44]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206, 2, 2024

  45. [45]

    It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization

    Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization. arXiv preprint arXiv:2504.13173, 2025

  46. [46]

    Transformer quality in linear time

    Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc V. Le. Transformer quality in linear time, 2022

  47. [47]

    Mega: Moving average equipped gated attention

    Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. In The Eleventh International Conference on Learning Representations, 2023

  48. [48]

    Megalodon: Efficient LLM pretraining and inference with unlimited context length

    Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, LILI YU, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, and Chunting Zhou. Megalodon: Efficient LLM pretraining and inference with unlimited context length. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  49. [49]

    Block-recurrent transformers

    DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. Block-recurrent transformers. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

  50. [50]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  51. [51]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023

  52. [52]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

  53. [53]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  54. [54]

    Movie Gen: A Cast of Media Foundation Models

    A Polyak, A Zohar, A Brown, A Tjandra, A Sinha, A Lee, A Vyas, B Shi, CY Ma, CY Chuang, et al. Movie gen: A cast of media foundation models, 2025. URL https://arxiv.org/abs/2410.13720, page 51, 2024

  55. [55]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. pmlr, 2015

  56. [56]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

  57. [57]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  58. [58]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems, 37:58757–58791, 2024

  59. [59]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

  60. [60]

    Diffusion Models Are Real-Time Game Engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024

  61. [61]

    Rolling diffusion models

    David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models. In International Conference on Machine Learning, pages 42818–42835. PMLR, 2024

  62. [62]

    From slow bidirectional to fast causal video generators

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast causal video generators. arXiv preprint arXiv:2412.07772, 2024

  63. [63]

    History-Guided Video Diffusion

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. arXiv preprint arXiv:2502.06764, 2025

  64. [64]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  65. [65]

    Magi-1: Autoregressive video generation at scale, 2025

    Sand-AI. Magi-1: Autoregressive video generation at scale, 2025

  66. [66]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  67. [67]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020

  68. [68]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  69. [69]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  70. [70]

    The mamba in the llama: Distilling and accelerating hybrid models

    Junxiong Wang, Daniele Paliotta, Avner May, Alexander Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems, 37:62432–62457, 2024

  71. [71]

    Flexattention: The flexibility of pytorch with the performance of flashattention

    Horace He, Driss Guessous, Yanbo Liang, and Joy Dong. Flexattention: The flexibility of pytorch with the performance of flashattention. PyTorch Blog, 8, 2024

  72. [72]

    Unipc: A unified predictor-corrector framework for fast sampling of diffusion models

    Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. Advances in Neural Information Processing Systems, 36:49842–49869, 2023

  73. [73]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  74. [74]

    Usp: A unified sequence parallelism approach for long context generative AI

    Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719, 2024

  75. [75]

    needle in a haystack

    Internal anchor into the paper's appendix, where the chunk and attention-window sizes are kept above 2048 tokens for better utilization and state-size scaling, and the TTT layer is placed in parallel with sliding-window attention to save model parameters and computation FLOPs.