Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption
Pith reviewed 2026-07-02 14:10 UTC · model grok-4.3
The pith
Instance-specific weight modulations let autoregressive video models switch layers to local attention and halve their KV cache with near-lossless quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
During a warmup phase, output discrepancies between global and local attention are monitored; at the transition point a closed-form least-squares problem yields an instance-specific weight modulation that compensates for the missing historical context, allowing selected layers to switch from full attention to local attention and thereby removing up to 50% of the KV cache without significant temporal flickering or identity loss.
What carries the argument
Instance-Specific Parametric Absorption (ISPA), which uses a least-squares solution on warmup discrepancies to modulate weights and enable the shift to local attention.
If this is right
- KV cache size can be reduced by up to 50% on models ranging from 1.3B to 14B parameters.
- Visual quality remains near-lossless in autoregressive video generation.
- The approach applies across different autoregressive architectures for streaming video.
- Cache management can shift from discarding tokens to distilling context into weights.
Where Pith is reading between the lines
- The same absorption step could be tested on non-video long-context tasks such as audio or text generation.
- Allowing the transition point to move dynamically during inference rather than at a fixed warmup end might further reduce average cache size.
- Repeating the least-squares adjustment at multiple points in a very long sequence could handle cases where early compensation drifts.
Load-bearing premise
The closed-form least-squares weight adjustment derived from the warmup phase will continue to compensate for missing historical context throughout generation without introducing new temporal artifacts or identity drift.
What would settle it
Generate long videos with the method and check whether temporal flickering or identity changes appear after the warmup phase on varied content.
Figures
read the original abstract
Autoregressive (AR) streaming models have emerged as a powerful paradigm for long video generation. However, the linearly growing Key-Value (KV) cache poses a significant bottleneck, leading to memory overload and degraded inference throughput. A common compression method is to drop redundant KV tokens, which often breaks long-range dependencies, resulting in temporal flickering and identity loss. In this paper, we propose Instance-Specific Parametric Absorption (ISPA), a novel framework that shifts the KV cache compression from discarding to distilling. The core idea is to transit a subset of layers from Full-Attention (F-Layers) to memory-efficient Local-Attention (L-Layers) by "absorbing" historical context into the model's weights. Specifically, during a brief warmup phase, ISPA monitors the output discrepancy between global and local attention. At the transition point, we solve a closed-form least-squares problem to compute an instance-specific weight modulation that compensates for the missing history. Experiments across architectures (1.3B to 14B) demonstrate that ISPA can remove up to 50\% of the KV cache with near-lossless visual quality. We hope this perspective encourages future work to explore parametric memory consolidation beyond external token-level cache management for streaming generative models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Instance-Specific Parametric Absorption (ISPA) to address KV cache growth in autoregressive video generation. During a brief warmup, output discrepancies between full-attention and local-attention layers are monitored; at transition, a closed-form least-squares problem yields an instance-specific weight modulation that allows selected layers to switch to local attention while absorbing historical context. Experiments on 1.3B–14B models report up to 50% KV cache reduction with near-lossless visual quality.
Significance. If the central claim holds, ISPA offers a parametric alternative to token-dropping KV compression that preserves long-range dependencies without external cache management. The closed-form least-squares computation and scaling experiments across model sizes (1.3B to 14B) are positive features that could be reproducible if the formulation is supplied.
major comments (2)
- [Abstract] Abstract and method description: the central claim that a single closed-form least-squares modulation computed from warmup-phase discrepancies suffices to compensate for missing KV history for the remainder of an arbitrarily long video lacks any derivation, stationarity bound, or post-transition validation. This directly undermines the 'near-lossless' guarantee when content, motion, or identity statistics evolve after the transition point.
- [Abstract] Abstract: no explicit formulation of the least-squares objective, the precise discrepancy metric, the variables being solved for, or how the resulting modulation is injected into the layer weights is provided. Without these, the reported results cannot be verified or reproduced, rendering the 'closed-form' and 'parameter-free' aspects unverifiable.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for greater clarity and validation in the ISPA method. We address each point below and will revise the manuscript to improve reproducibility and strengthen the empirical support for the central claims.
read point-by-point responses
-
Referee: [Abstract] Abstract and method description: the central claim that a single closed-form least-squares modulation computed from warmup-phase discrepancies suffices to compensate for missing KV history for the remainder of an arbitrarily long video lacks any derivation, stationarity bound, or post-transition validation. This directly undermines the 'near-lossless' guarantee when content, motion, or identity statistics evolve after the transition point.
Authors: We agree that the manuscript does not include a formal derivation or stationarity bound proving the modulation remains effective for arbitrarily long videos under evolving statistics. The provided experiments demonstrate near-lossless quality on sequences substantially longer than the warmup phase across multiple model scales, but these are empirical. We will add explicit post-transition validation on extended videos (e.g., 2x–4x the original lengths) with varying motion and identity changes, plus a short discussion of the implicit stationarity assumption during the brief warmup. A theoretical bound is beyond the current scope but could be explored in future work. revision: partial
-
Referee: [Abstract] Abstract: no explicit formulation of the least-squares objective, the precise discrepancy metric, the variables being solved for, or how the resulting modulation is injected into the layer weights is provided. Without these, the reported results cannot be verified or reproduced, rendering the 'closed-form' and 'parameter-free' aspects unverifiable.
Authors: The method section derives the closed-form solution by solving a least-squares problem that minimizes the L2 output discrepancy between full-attention and local-attention layers over the warmup tokens; the variables solved are an instance-specific diagonal scaling matrix applied multiplicatively to the query and key projection weights of the transitioned layer. We acknowledge the abstract omits these details and the method description could be more self-contained. We will revise the abstract to include a concise formulation statement and expand the method section with the exact objective, discrepancy metric, solved variables, and injection procedure to enable direct reproduction. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's central mechanism computes an instance-specific weight modulation via closed-form least-squares on output discrepancies observed only during an explicit warmup phase, then applies the fixed modulation to enable local attention for the remainder of generation. This is an empirical proposal whose reported success (near-lossless quality after KV cache reduction) is measured on held-out video content rather than being identical to the warmup discrepancies by construction. No equations reduce the final quality metric to the fitted parameters, no self-citations are invoked as load-bearing uniqueness theorems, and the derivation chain contains no self-definitional steps or renamed known results. The method is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Output discrepancy measured during warmup is a sufficient signal to compute a weight modulation that fully compensates for history loss.
Reference graph
Works this paper leans on
-
[1]
In: International Conference on Machine Learning
Behnam,P.,Fu,Y.,Zhao,R.,Tsai,P.A.,Yu,Z.,Tumanov,A.:Rocketkv:Accelerat- ing long-context llm inference via two-stage kv cache compression. In: International Conference on Machine Learning. pp. 3358–3392. PMLR (2025)
2025
-
[2]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
SkyReels-V2: Infinite-length Film Generative Model
Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
In: The Twelfth International Conference on Learning Representations
Dao, T.: Flashattention-2: Faster attention with better parallelism and work par- titioning. In: The Twelfth International Conference on Learning Representations
-
[6]
Advances in neural information pro- cessing systems35, 16344–16359 (2022)
Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: Flashattention: Fast and memory- efficient exact attention with io-awareness. Advances in neural information pro- cessing systems35, 16344–16359 (2022)
2022
-
[7]
In: Forty-first international conference on machine learning (2024)
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)
2024
-
[8]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Fu, X., Li, J.: Tcfg: Truncated classifier-free guidance for efficient and scalable text- to-image acceleration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18552–18562 (2025)
2025
-
[9]
In: The Thirteenth International Conference on Learning Representations
Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., Lin, M.: When attention sink emerges in language models: An empirical view. In: The Thirteenth International Conference on Learning Representations
-
[10]
arXiv preprint arXiv:2601.20499 (2026)
Guo, H., Jia, Z., Li, J., Li, B., Cai, Y., Wang, J., Li, Y., Lu, Y.: Efficient au- toregressive video diffusion with dummy head. arXiv preprint arXiv:2601.20499 (2026)
-
[11]
Advances in neural information processing systems33, 6840–6851 (2020)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)
2020
-
[12]
In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Hooper, C.R.C., Kim, S., Mohammadzadeh, H., Maheswaran, M., Zhao, S., Paik, J., Mahoney, M.W., Keutzer, K., Gholami, A.: Squeezed attention: Accelerating long context length llm inference. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 32631–32652 (2025)
2025
-
[13]
arXiv preprint arXiv:2602.03747 (2026)
Huang, J., Ye, Z., Hu, X., He, T., Zhang, G., Shi, S., Bian, J., Jiang, L.: Live: Long-horizon interactive video world modeling. arXiv preprint arXiv:2602.03747 (2026)
-
[14]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Huang, Y., Guo, H., Wu, F., Zhang, S., Huang, S., Gan, Q., Liu, L., Zhao, S., Chen, E., Liu, J., et al.: Live avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677 (2025) Memory-Efficient Video Generation by Parametric Absorption 17
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video gener- ative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)
2024
-
[17]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Li, J., Fu, X., Gao, Y., Wang, J., Wang, X., So, H.K.H.: Rethinking conditioning in diffusion models: Dynamic token scheduling for efficient and aligned text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4160–4169 (2026)
2026
-
[18]
arXiv preprint arXiv:2602.14027 (2026)
Li, J., Fu, X., Peng, X., Chen, W., Zheng, Y., Zhao, T., Wang, J., Chen, F., Wang, X., So, H.K.H.: Train short, inference long: Training-free horizon extension for autoregressive video generation. arXiv preprint arXiv:2602.14027 (2026)
-
[19]
arXiv preprint arXiv:2510.09212 (2025)
Li, W., Pan, W., Luan, P.C., Gao, Y., Alahi, A.: Stable video infinity: Infinite- length video generation with error recycling. arXiv preprint arXiv:2510.09212 (2025)
-
[20]
Advances in Neural Information Processing Systems37, 22947–22970 (2024)
Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., Chen, D.: Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems37, 22947–22970 (2024)
2024
-
[21]
In: The Eleventh International Conference on Learning Rep- resentations
Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Rep- resentations
-
[22]
Advances in Neural Information Processing Systems37, 139997–140031 (2024)
Liu, A., Liu, J., Pan, Z., He, Y., Haffari, G., Zhuang, B.: Minicache: Kv cache compression in depth dimension for large language models. Advances in Neural Information Processing Systems37, 139997–140031 (2024)
2024
-
[23]
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
In: The Eleventh International Conference on Learning Representations
Liu, X., Gong, C., et al.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: The Eleventh International Conference on Learning Representations
-
[25]
Lu, Y., Zeng, Y., Li, H., Ouyang, H., Wang, Q., Cheng, K.L., Zhu, J., Cao, H., Zhang, Z., Zhu, X., et al.: Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
In: European Conference on Computer Vision
Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)
2024
-
[27]
Latte: Latent Diffusion Transformer for Video Generation
Ma, X., Wang, Y., Chen, X., Jia, G., Liu, Z., Li, Y.F., Chen, C., Qiao, Y.: Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
arXiv preprint arXiv:2602.10825 (2026)
Ma, Y., Zheng, X., Xu, J., Xu, X., Ling, F., Zheng, X., Kuang, H., Li, H., Wang, X., Xiao, X., et al.: Flow caching for autoregressive video generation. arXiv preprint arXiv:2602.10825 (2026)
-
[29]
com/krea-ai/realtime-video
Millon,E.:Krearealtime14b:Real-timevideogeneration(2025),https://github. com/krea-ai/realtime-video
2025
-
[30]
Peebles,W.,Xie,S.:Scalablediffusionmodelswithtransformers.In:Proceedingsof the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)
2023
-
[31]
Movie Gen: A Cast of Media Foundation Models
Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.Y., Chuang, C.Y., et al.: Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720 (2024) 18 X. Fu et al
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
2022
-
[33]
In: The Twelfth International Conference on Learning Representations
Song, Y., Dhariwal, P.: Improved techniques for training consistency models. In: The Twelfth International Conference on Learning Representations
-
[34]
In: Interna- tional Conference on Machine Learning
Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: Interna- tional Conference on Machine Learning. pp. 32211–32252. PMLR (2023)
2023
-
[35]
In: Forty-first International Con- ference on Machine Learning
Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., Han, S.: Quest: Query-aware sparsity for efficient long-context llm inference. In: Forty-first International Con- ference on Machine Learning
-
[36]
Advancing Open-source World Models
Team, R., Gao, Z., Wang, Q., Zeng, Y., Zhu, J., Cheng, K.L., Li, Y., Wang, H., Xu, Y., Ma, S., et al.: Advancing open-source world models. arXiv preprint arXiv:2601.20540 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[37]
MAGI-1: Autoregressive Video Generation at Scale
Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W., Luo, W., et al.: Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
In: International Conference on Learning Representations
Villegas, R., Babaeizadeh, M., Kindermans, P.J., Moraldo, H., Zhang, H., Saffar, M.T., Castro, S., Kunze, J., Erhan, D.: Phenaki: Variable length video generation from open domain textual descriptions. In: International Conference on Learning Representations
-
[39]
Wan: Open and Advanced Large-Scale Video Generative Models
Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
In: The Thirty-ninth Annual Conference on Neural Infor- mation Processing Systems
Wang, A., Chen, H., Tan, J., Zhang, K., Cai, X., Lin, Z., Han, J., Ding, G.: Pre- fixkv: Adaptive prefix kv cache is what vision instruction-following models need for efficient generation. In: The Thirty-ninth Annual Conference on Neural Infor- mation Processing Systems
-
[41]
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry forcing: Mar- rying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [42]
-
[43]
Efficient Streaming Language Models with Attention Sinks
Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
LongLive: Real-time Interactive Long Video Generation
Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., et al.: Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
arXiv preprint arXiv:2511.20649 (2025)
Yesiltepe, H., Meral, T.H.S., Akan, A.K., Oktay, K., Yanardag, P.: Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self- rollout. arXiv preprint arXiv:2511.20649 (2025)
-
[47]
arXiv preprint arXiv:2512.05081 (2025)
Yi, J., Jang, W., Cho, P.H., Nam, J., Yoon, H., Kim, S.: Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081 (2025)
-
[48]
Advances in neural information processing systems37, 47455–47487 (2024) Memory-Efficient Video Generation by Parametric Absorption 19
Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, B.: Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems37, 47455–47487 (2024) Memory-Efficient Video Generation by Parametric Absorption 19
2024
-
[49]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6613– 6623 (2024)
2024
-
[50]
In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference
Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference. pp. 22963– 22974 (2025)
2025
-
[51]
In: International Conference on Machine Learning
Yu, Z., Wang, Z., Fu, Y., Shi, H., Shaikh, K., Lin, Y.C.: Unveiling and harness- ing hidden attention sinks: Enhancing large language models without training through attention calibration. In: International Conference on Machine Learning. pp. 57659–57677. PMLR (2024)
2024
-
[52]
In: International Conference on Machine Learning
Zhang,J.,Huang,H.,Zhang,P.,Wei,J.,Zhu,J.,Chen,J.:Sageattention2:Efficient attention with thorough outlier smoothing and per-thread int4 quantization. In: International Conference on Machine Learning. pp. 75097–75119. PMLR (2025)
2025
-
[53]
In: The Thirteenth International Conference on Learning Representations
Zhang, J., Zhang, P., Zhu, J., Chen, J., et al.: Sageattention: Accurate 8-bit at- tention for plug-and-play inference acceleration. In: The Thirteenth International Conference on Learning Representations
-
[54]
Advances in Neural Information Processing Systems36, 34661–34710 (2023)
Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., et al.: H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems36, 34661–34710 (2023)
2023
-
[55]
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Gu, L., Zhang, Y., He, J., Zheng, W.S., et al.: Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
Zhu, H., Zhao, M., He, G., Su, H., Li, C., Zhu, J.: Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video genera- tion. arXiv preprint arXiv:2602.02214 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.