pith. machine review for the scientific record.

arxiv: 2505.23884 · v1 · submitted 2025-05-29 · 💻 cs.LG · cs.CL · cs.CV

Recognition: 3 theorem links


Test-Time Training Done Right

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords test-time training · large chunk updates · fast weights · long-context modeling · video diffusion · novel view synthesis · GPU utilization · online adaptation

The pith

Large-chunk updates during inference make test-time training efficient enough to scale nonlinear states to 40 percent of model parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that switching test-time training to very large update chunks, from 2K to 1M tokens, raises GPU utilization by orders of magnitude compared with the tiny minibatches used before. Prior approaches kept fast-weight updates so small that hardware sat idle most of the time and state capacity stayed limited. With large chunks the nonlinear state can grow to 40 percent of total parameters, sophisticated optimizers integrate easily, and no custom kernels are required. This yields working 14-billion-parameter autoregressive video diffusion on 56K-token sequences and novel-view synthesis with a 1-million-token context. Readers should care because the change removes the main practical obstacle to deploying test-time adaptation on long, high-dimensional data.

Core claim

LaCT performs test-time weight updates on extremely large chunks of 2K to 1M tokens. This raises hardware utilization by orders of magnitude and lets the fast weights grow to 40 percent of model parameters, increasing state capacity and enabling large-scale applications, such as 14B-parameter autoregressive video diffusion on 56K-token sequences and 1M-token novel view synthesis, without custom kernels.

What carries the argument

Large Chunk Test-Time Training (LaCT), the practice of adapting fast weights on massive token segments instead of small online minibatches to raise utilization and state capacity.
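
The mechanism is easy to sketch. Below is a minimal, illustrative PyTorch rendering of an update-then-apply chunk loop; the two-matrix fast-weight network, the reconstruction objective, and the chunk size and learning rate are assumptions for illustration, not the paper's exact SwiGLU fast weights or training recipe.

```python
import torch
import torch.nn.functional as F

def fast_weight_apply(W1, W2, x):
    # Hypothetical 2-layer fast-weight network (a stand-in for a SwiGLU-style
    # fast weight): hidden = silu(x @ W1), output = hidden @ W2.
    return F.silu(x @ W1) @ W2

def lact_layer(keys, values, queries, W1, W2, chunk=4096, lr=1e-2):
    """Large-chunk test-time training, sketched.

    Instead of updating the fast weights every 16-64 tokens, each online
    gradient step consumes one large chunk (thousands of tokens), so the
    matmuls stay large and the GPU stays busy.
    """
    outputs = []
    n = keys.shape[0]
    for start in range(0, n, chunk):
        k = keys[start:start + chunk]
        v = values[start:start + chunk]
        q = queries[start:start + chunk]

        # One gradient step on the whole chunk (a reconstruction loss is an
        # illustrative choice of test-time objective).
        W1 = W1.detach().requires_grad_(True)
        W2 = W2.detach().requires_grad_(True)
        loss = F.mse_loss(fast_weight_apply(W1, W2, k), v)
        g1, g2 = torch.autograd.grad(loss, (W1, W2))
        W1 = W1 - lr * g1
        W2 = W2 - lr * g2

        # Apply the freshly updated state to this chunk's queries.
        outputs.append(fast_weight_apply(W1, W2, q))
    return torch.cat(outputs), (W1, W2)
```

The point of the chunk loop is that every matrix multiply involves thousands of rows, so the same adaptation that previously ran at a few percent of peak FLOPs becomes a large, well-shaped GEMM.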

If this is right

  • Nonlinear state size can scale to 40 percent of total model parameters without custom kernels.
  • Sophisticated optimizers such as Muon integrate directly into the online update step (see the sketch after this list).
  • Autoregressive video diffusion models reach 14 billion parameters on sequences of 56K tokens.
  • Novel-view synthesis handles context lengths of 1 million tokens on standard hardware.
  • The same chunking approach applies across language, image sets, and video modalities.
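
On the optimizer bullet above: Muon-style updates orthogonalize the momentum with a few Newton-Schulz iterations before applying it, which becomes practical once each online step operates on a whole large chunk. The sketch below is a generic Muon-style step on one fast-weight matrix, not the paper's exact integration; the quintic coefficients and iteration count follow choices common in public Muon implementations, and the learning rate and momentum values are placeholders.

```python
import torch

def orthogonalize(G, steps=5):
    # Newton-Schulz iteration that pushes the singular values of G toward 1.
    # Quintic coefficients as commonly used in public Muon implementations.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon-style online step on a fast-weight matrix (sketch).

    Momentum is accumulated as usual, but the applied update is the
    orthogonalized momentum rather than the raw gradient.
    """
    momentum_buf.mul_(beta).add_(grad)
    update = orthogonalize(momentum_buf)
    W = W - lr * update
    return W, momentum_buf
```

In the LaCT setting, `grad` would be the gradient of the per-chunk test-time objective with respect to that fast-weight matrix.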

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LaCT may reduce reliance on specialized long-context hardware by making high-utilization online adaptation available on ordinary GPUs.
  • The removal of tiny-minibatch constraints could let test-time training handle non-sequential data structures such as point clouds or graphs more directly.
  • Hybrid schedules that mix large and small chunks might be tested to retain strict causality where needed while keeping efficiency gains.
  • Because no custom kernels are required, individual labs can now experiment with state sizes far larger than those previously feasible.

Load-bearing premise

Performing weight updates on extremely large chunks of 2K to 1M tokens preserves or improves modeling quality relative to the fine-grained causal updates used in earlier test-time training work.

What would settle it

A controlled experiment on the same long-sequence task that compares a LaCT model using large chunks against an otherwise identical model updating on 16- or 64-token minibatches; the premise fails if the large-chunk model shows higher loss or lower accuracy.
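
A minimal, self-contained skeleton of that comparison is below: a toy linear fast weight adapted online on synthetic data whose key-to-value mapping drifts over time, with only the chunk size varied between runs. Everything here (the objective, the drift process, the learning rate) is an assumption for illustration; which chunk size wins depends on the task, which is exactly what such an ablation would measure.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def run_ttt(keys, values, queries, targets, chunk, lr=0.05):
    # Toy fast weight: a single linear map adapted online, chunk by chunk.
    d = keys.shape[1]
    W = torch.zeros(d, d)
    loss_sum, count = 0.0, 0
    for s in range(0, keys.shape[0], chunk):
        k, v, q, t = (x[s:s + chunk] for x in (keys, values, queries, targets))
        W = W.detach().requires_grad_(True)
        inner = F.mse_loss(k @ W, v)            # test-time objective on the chunk
        (g,) = torch.autograd.grad(inner, (W,))
        W = W - lr * g                          # one online step per chunk
        loss_sum += F.mse_loss(q @ W, t).item() * len(q)
        count += len(q)
    return loss_sum / count

# Synthetic sequence whose key-to-value map drifts slowly over time.
n, d = 8192, 32
keys = torch.randn(n, d)
drift = torch.cumsum(0.01 * torch.randn(n, d, d), dim=0)
values = torch.einsum('nd,nde->ne', keys, drift)
queries, targets = keys, values

for chunk in (64, 512, 4096):
    print(f"chunk {chunk:>5}: eval loss {run_ttt(keys, values, queries, targets, chunk):.4f}")
```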

read the original abstract

Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference. This fast weight, akin to recurrent states in RNNs, stores temporary memories of past tokens in the current sequence. Existing TTT methods struggled to show effectiveness in handling long-context data, due to their inefficiency on modern GPUs. The TTT layers in many of these approaches operate with extremely low FLOPs utilization (often <5%) because they deliberately apply small online minibatch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small minibatch implies fine-grained block-wise causal dependencies in the data, unsuitable for data beyond 1D ordered sequences, like sets or N-dimensional grids such as images or videos. In contrast, we pursue the opposite direction by using an extremely large chunk update, ranging from 2K to 1M tokens across tasks of varying modalities, which we refer to as Large Chunk Test-Time Training (LaCT). It improves hardware utilization by orders of magnitude, and more importantly, facilitates scaling of nonlinear state size (up to 40% of model parameters), hence substantially improving state capacity, all without requiring cumbersome and error-prone kernel implementations. It also allows easy integration of sophisticated optimizers, e.g. Muon for online updates. We validate our approach across diverse modalities and tasks, including novel view synthesis with image set, language models, and auto-regressive video diffusion. Our approach can scale up to 14B-parameter AR video diffusion model on sequences up to 56K tokens. In our longest sequence experiment, we perform novel view synthesis with 1 million context length. We hope this work will inspire and accelerate new research in the field of long-context modeling and test-time training. Website: https://tianyuanzhang.com/projects/ttt-done-right
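
The abstract's utilization argument can be made concrete with back-of-the-envelope arithmetic intensity. A fast-weight update is dominated by (tokens x d) by (d x d) matmuls; with 16 or 64 tokens per update the kernel moves nearly as many bytes as it computes FLOPs, while multi-thousand-token chunks push it firmly into the compute-bound regime. The hidden size, bf16 byte width, and roughly A100-class peak and bandwidth figures below are illustrative assumptions.

```python
# Back-of-the-envelope arithmetic intensity of one fast-weight matmul,
# (tokens x d) @ (d x d). Rough A100-class assumptions: ~312 TFLOP/s bf16 peak
# and ~2 TB/s HBM bandwidth, so a kernel needs roughly 312e12 / 2e12 ≈ 156
# FLOPs per byte to be compute-bound rather than memory-bound.
def arithmetic_intensity(tokens, d=2048, bytes_per_el=2):
    flops = 2 * tokens * d * d                             # multiply-accumulates
    bytes_moved = bytes_per_el * (tokens * d * 2 + d * d)  # read x, write y, read W
    return flops / bytes_moved

for tokens in (16, 64, 2048, 1_000_000):
    print(f"{tokens:>9} tokens per update: {arithmetic_intensity(tokens):8.1f} FLOPs/byte")
```

With these assumed numbers, 16- and 64-token updates land far below the compute-bound threshold, while chunks of 2K tokens and beyond sit well above it, which is the source of the orders-of-magnitude utilization gap the abstract describes.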

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that prior Test-Time Training (TTT) methods suffer from low GPU utilization (<5% FLOPs) due to small minibatch updates (16-64 tokens) and are unsuitable for non-sequential data. It introduces Large Chunk Test-Time Training (LaCT) using large chunks (2K-1M tokens) for fast-weight updates, which purportedly boosts hardware efficiency by orders of magnitude, enables scaling nonlinear state sizes to 40% of model parameters, supports advanced optimizers like Muon, and scales to a 14B-parameter autoregressive video diffusion model on 56K tokens plus 1M-context novel view synthesis without custom kernels.

Significance. If the empirical claims hold, LaCT could make TTT viable for long-context and multi-modal tasks on standard hardware, substantially increasing state capacity and broadening applicability beyond 1D sequences.

major comments (2)
  1. [Experiments] Experiments section: no side-by-side ablation compares large-chunk LaCT (2K-1M tokens) against fine-grained small-minibatch TTT on identical models, data, and metrics (e.g., perplexity or FID), which is load-bearing for the claim that large chunks preserve modeling quality.
  2. [Results] Results and abstract: concrete scaling claims (14B model, 56K tokens, 1M context, orders-of-magnitude utilization gains) are stated without tables, error bars, or quantitative hardware measurements, preventing verification of the central efficiency and capacity improvements.
minor comments (2)
  1. [Introduction] Clarify the precise definition and parameterization of 'nonlinear state size' on first use.
  2. [Experiments] Add a table summarizing hardware utilization (FLOPs %) for LaCT versus prior TTT baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The points raised highlight opportunities to strengthen the presentation of our experimental comparisons and quantitative results. We address each major comment below and commit to revisions that will improve verifiability without altering the core claims.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: no side-by-side ablation compares large-chunk LaCT (2K-1M tokens) against fine-grained small-minibatch TTT on identical models, data, and metrics (e.g., perplexity or FID), which is load-bearing for the claim that large chunks preserve modeling quality.

    Authors: We agree that an explicit side-by-side ablation on identical models and metrics would provide stronger evidence that large chunks preserve (or improve) modeling quality relative to small-minibatch TTT. The current experiments focus on regimes where small-minibatch TTT is impractical due to GPU utilization constraints and data modality requirements, but we will add a controlled ablation study on a smaller-scale language modeling task, directly comparing LaCT (large chunks) against small-minibatch variants while reporting perplexity and other relevant metrics. revision: yes

  2. Referee: [Results] Results and abstract: concrete scaling claims (14B model, 56K tokens, 1M context, orders-of-magnitude utilization gains) are stated without tables, error bars, or quantitative hardware measurements, preventing verification of the central efficiency and capacity improvements.

    Authors: The scaling results are demonstrated through successful end-to-end training and inference runs described in the experiments section. To enhance verifiability, we will expand the results section with dedicated tables that report quantitative hardware metrics (such as achieved FLOPs utilization percentages and throughput), include error bars from repeated runs where feasible, and provide direct numerical comparisons to baseline TTT utilization figures. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering choice with no derivation chain

full rationale

The paper introduces LaCT as a practical reversal of prior small-minibatch TTT design, justified by hardware utilization gains and empirical scaling results on language, video, and novel-view tasks. No equations, uniqueness theorems, or first-principles derivations are presented that reduce to fitted parameters or self-citations by construction. All performance claims (e.g., 14B model on 56K tokens, 1M-token NVS) rest on reported experiments rather than any self-definitional or fitted-input prediction loop. The work is self-contained against external benchmarks and contains no load-bearing self-citation chains.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on standard assumptions that online gradient updates on large chunks remain stable and that the chosen optimizer (Muon) transfers from offline to online use. No new physical or mathematical entities are postulated.

free parameters (1)
  • chunk size
    Chosen per task (2K–1M tokens); directly controls both efficiency and the granularity of causal dependencies.
axioms (1)
  • domain assumption: Online gradient steps on large chunks preserve modeling quality relative to fine-grained updates.
    Central to claiming that LaCT is not only faster but also effective.

pith-pipeline@v0.9.0 · 5668 in / 1267 out tokens · 26856 ms · 2026-05-16T11:21:09.142492+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.EightTick eight_tick_forces_D3 echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we pursue the opposite direction by using an extremely large chunk update, ranging from 2K to 1M tokens... Large Chunk Test-Time Training (LaCT). It improves hardware utilization by orders of magnitude, and more importantly, facilitates scaling of nonlinear state size (up to 40% of model parameters)

  • IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference. This fast weight, akin to recurrent states in RNNs, stores temporary memories of past tokens

  • IndisputableMonolith.Foundation.DiscretenessForcing discreteness_forced echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we adopt the opposite strategy and introduce Large Chunk Test-Time Training (LaCT). LaCT leverages extremely large chunk (from 2048 to 1M tokens) as the basic unit to update the fast weight

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • echoes: The paper passage shares the mathematical shape or conceptual pattern of a theorem, but is not a direct formal dependency.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Test-Time Training with KV Binding Is Secretly Linear Attention

    cs.LG 2026-02 conditional novelty 8.0

    Test-time training with KV binding reduces to learned linear attention.

  2. TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

    cs.CV 2026-04 unverdicted novelty 7.0

    TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.

  3. Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

    cs.CV 2026-04 unverdicted novelty 7.0

    Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

  4. MemDLM: Memory-Enhanced DLM Training

    cs.CL 2026-03 unverdicted novelty 7.0

    MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.

  5. ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

    cs.CV 2026-03 unverdicted novelty 7.0

    ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.

  6. Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.

  7. Lyra 2.0: Explorable Generative 3D Worlds

    cs.CV 2026-04 unverdicted novelty 6.0

    Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

  8. Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

  9. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  10. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  11. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  12. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  13. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  14. Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    cs.CV 2025-09 unverdicted novelty 6.0

    Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.

  15. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    cs.CV 2025-06 unverdicted novelty 6.0

    Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...

  16. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  17. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  18. Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

    cs.DC 2026-03 unverdicted novelty 5.0

    Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 18 Pith papers · 19 internal anchors

  1. [1]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  2. [2]

    Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024

  3. [3]

    Linear transformers are secretly fast weight programmers

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pages 9355–9366. PMLR, 2021

  4. [4]

    Test-time regression: a unifying framework for designing sequence models with associative memory

    Ke Alexander Wang, Jiaxin Shi, and Emily B. Fox. Test-time regression: a unifying framework for designing sequence models with associative memory, 2025

  5. [5]

    Titans: Learning to Memorize at Test Time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024

  6. [6]

    It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization, 2025

    Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization, 2025

  7. [7]

    Lattice: Learning to efficiently compress the memory

    Mahdi Karami and Vahab Mirrokni. Lattice: Learning to efficiently compress the memory. arXiv preprint arXiv:2504.05646, 2025

  8. [8]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

  9. [9]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

  10. [10]

    Parallelizing linear transformers with the delta rule over sequence length

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. arXiv preprint arXiv:2406.06484, 2024

  11. [11]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

  12. [12]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  13. [13]

    Gated linear attention transformers with hardware-efficient training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In International Conference on Machine Learning, pages 56501–56523. PMLR, 2024

  14. [14]

    Various lengths, constant speed: Efficient language modeling with lightning attention

    Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Various lengths, constant speed: Efficient language modeling with lightning attention. In Forty-first International Conference on Machine Learning, 2024

  15. [15]

    Parallelizing linear transformers with the delta rule over sequence length

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  16. [16]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  17. [17]

    Online normalizer calculation for softmax

    Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018

  18. [18]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022

  19. [19]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arxiv e-prints. arXiv preprint arXiv:1512.03385, 10:9, 2015

  20. [20]

    Llama 2: Open foundation and fine-tuned chat models, 2023

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  21. [21]

    Glu variants improve transformer, 2020

    Noam Shazeer. Glu variants improve transformer, 2020

  22. [22]

    Weight normalization: A simple reparameterization to accelerate training of deep neural networks

    Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in neural information processing systems, 29, 2016

  23. [23]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024

  24. [24]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668, 2024

  25. [25]

    Transformer quality in linear time

    Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer quality in linear time. In International conference on machine learning, pages 9099–9117. PMLR, 2022

  26. [26]

    Leave no context behind: Efficient infinite context transformers with infini-attention, 2024

    Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention, 2024

  27. [27]

    Plenoptic modeling: An image-based rendering system

    Leonard McMillan and Gary Bishop. Plenoptic modeling: An image-based rendering system. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 433–440. 2023

  28. [28]

    Light field rendering

    Marc Levoy and Pat Hanrahan. Light field rendering. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 441–452. 2023

  29. [29]

    Lvsm: A large view synthesis model with minimal 3d inductive bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. arXiv preprint arXiv:2410.17242, 2024

  30. [30]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  31. [31]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023

  32. [32]

    Gs-lrm: Large reconstruction model for 3d gaussian splatting

    Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. In European Conference on Computer Vision, pages 1–19. Springer, 2024

  33. [33]

    Google scanned objects: A high-quality dataset of 3d scanned household items

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022

  34. [34]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  35. [35]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651–

  36. [36]

    Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats

    Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. arXiv preprint arXiv:2410.12781, 2024

  37. [37]

    Long data collections database, 2024

    Together AI. Long data collections database, 2024

  38. [38]

    Forgetting transformer: Softmax attention with a forget gate

    Zhixuan Lin, Evgenii Nikishin, Xu He, and Aaron Courville. Forgetting transformer: Softmax attention with a forget gate. In The Thirteenth International Conference on Learning Representations, 2025

  39. [39]

    Ruler: What’s the real context size of your long-context language models? In First Conference on Language Modeling, 2024

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? In First Conference on Language Modeling, 2024

  40. [40]

    Effective long-context scaling of foundation models, 2023

    Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models, 2023

  41. [41]

    Base of rope bounds context length

    Xin Men, Mingyu Xu, Bingning Wang, Qingyu Zhang, Hongyu Lin, Xianpei Han, and Weipeng Chen. Base of rope bounds context length. arXiv preprint arXiv:2405.14591, 2024

  42. [42]

    Roformer: Enhanced transformer with rotary position embedding, 2023

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023

  43. [43]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  44. [44]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206, 2, 2024

  45. [45]

    It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization

    Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization. arXiv preprint arXiv:2504.13173, 2025

  46. [46]

    Transformer quality in linear time

    Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc V. Le. Transformer quality in linear time, 2022

  47. [47]

    Mega: Moving average equipped gated attention

    Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. In The Eleventh International Conference on Learning Representations, 2023

  48. [48]

    Megalodon: Efficient LLM pretraining and inference with unlimited context length

    Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, LILI YU, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, and Chunting Zhou. Megalodon: Efficient LLM pretraining and inference with unlimited context length. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  49. [49]

    Block-recurrent transformers

    DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. Block-recurrent transformers. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

  50. [50]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  51. [51]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023

  52. [52]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

  53. [53]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  54. [54]

    Movie Gen: A Cast of Media Foundation Models

    A Polyak, A Zohar, A Brown, A Tjandra, A Sinha, A Lee, A Vyas, B Shi, CY Ma, CY Chuang, et al. Movie gen: A cast of media foundation models, 2025. URL https://arxiv.org/abs/2410.13720, page 51, 2024

  55. [55]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. pmlr, 2015

  56. [56]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

  57. [57]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  58. [58]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems, 37:58757–58791, 2024

  59. [59]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

  60. [60]

    Diffusion Models Are Real-Time Game Engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024

  61. [61]

    Rolling diffusion models

    David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models. In International Conference on Machine Learning, pages 42818–42835. PMLR, 2024

  62. [62]

    From slow bidirectional to fast causal video generators

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast causal video generators. arXiv preprint arXiv:2412.07772, 2024

  63. [63]

    History-Guided Video Diffusion

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. arXiv preprint arXiv:2502.06764, 2025

  64. [64]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  65. [65]

    Magi-1: Autoregressive video generation at scale, 2025

    Sand-AI. Magi-1: Autoregressive video generation at scale, 2025

  66. [66]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  67. [67]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020

  68. [68]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  69. [69]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  70. [70]

    The mamba in the llama: Distilling and accelerating hybrid models

    Junxiong Wang, Daniele Paliotta, Avner May, Alexander Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems, 37:62432–62457, 2024

  71. [71]

    Flexattention: The flexibility of pytorch with the performance of flashattention

    Horace He, Driss Guessous, Yanbo Liang, and Joy Dong. Flexattention: The flexibility of pytorch with the performance of flashattention. PyTorch Blog, 8, 2024

  72. [72]

    Unipc: A unified predictor-corrector framework for fast sampling of diffusion models

    Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. Advances in Neural Information Processing Systems, 36:49842–49869, 2023

  73. [73]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  74. [74]

    Usp: A unified sequence parallelism approach for long context generative AI

    Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719, 2024

  75. [75]

    needle in a haystack

    Internal anchor into the paper's appendix, where the chunk and attention-window sizes are kept above 2048 tokens for better utilization and state-size scaling, and the TTT layer is placed in parallel with sliding-window attention to save model parameters and computation FLOPs.