Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

Enze Xie; Haopeng Li; Haozhe Liu; Jincheng Yu; Junsong Chen; Ligeng Zhu; Ping Luo; Song Han; Yitong Li

arxiv: 2606.23743 · v2 · pith:PK3RVSX4new · submitted 2026-06-21 · 💻 cs.CV · cs.AI· cs.LG

Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

Yitong Li , Junsong Chen , Haopeng Li , Haozhe Liu , Jincheng Yu , Ligeng Zhu , Ping Luo , Song Han

show 1 more author

Enze Xie

This is my paper

Pith reviewed 2026-06-26 11:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords video diffusioninference accelerationagentic optimizationsparse attentionmodel quantizationtoken pruningkernel fusionVBench evaluation

0 comments

The pith

An agent-native stack tunes cache, sparse attention, token pruning, quantization, and kernel fusion to deliver more than 2x end-to-end speedup on video diffusion models while keeping VBench quality nearly unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that manual tuning of acceleration methods for video diffusion is impractical because the best combination depends on the exact model, hardware, and resolution settings. Instead it introduces an agent-based workflow in which separate agents optimize each of five standard techniques and an integrator assembles them, with only light human quality checks. On three models ranging from 2B to 64B parameters the resulting stack reaches more than 2x faster inference with near-lossless quality. A sympathetic reader would care because the approach removes most of the engineering cost that currently blocks deployment of high-quality video generators. The central object is the agentic composition process itself rather than any single new algorithm.

Core claim

For any concrete model-hardware-configuration target, parallel skill agents optimize the five techniques independently, an agent integrator composes them into a single acceleration stack, and a human validator supplies quality feedback; the resulting full stack produces more than 2x end-to-end acceleration on Cosmos3-Super (64B), LTX-2.3 (22B), and SANA-Video (2B) while preserving near-lossless VBench scores.

What carries the argument

The agentic acceleration stack that assigns one skill agent to each of cache, sparse attention, token pruning, quantization, and kernel fusion, then lets an integrator compose their outputs for a given deployment target.

If this is right

Once the agent workflow is run for a target, the resulting acceleration stack can be deployed with only occasional human spot-checks rather than continuous manual tuning.
The same five techniques become reusable across model sizes because the agents adapt their parameters to each instance.
Inference cost for long or high-resolution videos drops enough to make repeated generation or interactive editing practical on current hardware.
Training-free acceleration becomes the default engineering path instead of requiring per-model kernel rewrites.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the agent composition step scales reliably, the same framework could be applied to image or audio diffusion models without new human-designed heuristics.
The reduction in human effort implies that smaller teams could maintain competitive inference performance on rapidly changing model architectures.
Hardware vendors could expose the same agent interfaces so that acceleration stacks are generated automatically when a new accelerator is released.

Load-bearing premise

That independent agent optimizations of the five techniques can be composed without hidden quality degradations that the final human validator misses, and that the same workflow succeeds on new models, hardware, and resolutions.

What would settle it

Apply the same agent workflow to a fourth video diffusion model on a different GPU and measure whether end-to-end latency improves by at least 2x while VBench score drops by no more than 1 percent relative to the unaccelerated baseline.

read the original abstract

Modern video diffusion models achieve higher generation quality through scaling, but this also increases inference cost. Although many acceleration methods have been proposed, a central challenge is that the most effective acceleration strategy is highly instance-specific: a recipe that works well for one combination of model, hardware, and inference configuration often does not transfer to another. Different models vary in architecture, numerical sensitivity, and attention concentration patterns. Inference settings differ in spatial and temporal resolution and video duration, while hardware platforms differ in memory hierarchy, supported numerical formats, and kernel throughput. These factors create a large tuning space, making manual performance engineering costly. We present Sol Video Inference Engine, an agentic, native, training-free acceleration framework for video diffusion models. It organizes five broadly applicable techniques, cache, sparse attention, token pruning, quantization, and kernel fusion, into an agentic acceleration stack for instance-specific optimization. For a concrete deployment target defined by a model, hardware platform, and serving configuration, parallel skill agents optimize the implementation of each technique, an agent integrator composes them into a global acceleration stack, and a human validator provides feedback on generation quality. We instantiate this workflow on three video models with different sizes and architectures: 64B Cosmos3-Super, 22B LTX-2.3, and 2B SANA-Video. With little human effort, the full stack achieves more than 2x end-to-end acceleration while maintaining near-lossless VBench quality, demonstrating the effectiveness of the agent framework for video diffusion acceleration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The agent-native composition of standard acceleration techniques is a reasonable extension but the abstract supplies no metrics, ablations, or validation details to support the >2x claim.

read the letter

The paper's main contribution is an agentic workflow that assigns parallel skill agents to tune five common techniques (cache, sparse attention, token pruning, quantization, kernel fusion) for a given model-hardware pair, then has an integrator compose them and a human validator check output quality. This is new as a concrete instance-specific automation layer on top of existing methods rather than a new primitive. It correctly identifies that effective recipes do not transfer across models like the 64B Cosmos, 22B LTX, and 2B SANA, and across hardware and resolutions.

What works is the framing: manual tuning is expensive, so delegating per-technique optimization to agents makes sense on paper. The three-model testbed is a reasonable start for showing the idea is not tied to one architecture.

The soft spot is exactly the one the stress-test flags. The abstract asserts >2x end-to-end speedup with near-lossless VBench scores but reports none of the per-technique speedups, no composition ablations, no measurement of cross-effects (for example quantization interacting with temporal token pruning), and no description of what the human validator actually inspected or how many samples were checked. Without those, the central claim cannot be evaluated and the risk of missed quality degradations remains unaddressed. Generalization is asserted but not demonstrated.

This is for people already working on automated inference stacks for diffusion models. A reader gets the high-level architecture but nothing reproducible or falsifiable. It does not yet deserve peer review; the empirical gap is too large for referees to usefully engage.

Referee Report

3 major / 2 minor

Summary. The manuscript describes the Sol Video Inference Engine as an agentic framework for instance-specific acceleration of video diffusion models. It employs parallel skill agents to optimize five techniques—cache, sparse attention, token pruning, quantization, and kernel fusion—for a target model, hardware, and configuration. An agent integrator then composes these into a global stack, with a human validator providing quality feedback. The primary result reported is over 2× end-to-end acceleration with near-lossless VBench quality on the 64B Cosmos3-Super, 22B LTX-2.3, and 2B SANA-Video models, achieved with minimal human intervention.

Significance. If the results are rigorously validated, this agent-native approach could substantially lower the engineering effort required for deploying scaled video generation models by automating the search for effective acceleration combinations. It highlights the potential of multi-agent systems in performance optimization for generative AI, which may have broader applicability beyond video diffusion.

major comments (3)

[Abstract] The central performance claim ('more than 2x end-to-end acceleration while maintaining near-lossless VBench quality') is stated without any accompanying quantitative data, such as specific speedup factors per model, VBench score deltas, baseline comparisons (e.g., vs. individual techniques or prior methods), or details on the validation procedure. This is load-bearing as the soundness of the empirical outcome cannot be assessed from the provided information.
[Workflow description] The paper provides no details on how the integrator resolves potential conflicts or interactions between the five techniques when composing the stack (e.g., whether quantization noise compounds artifacts from token pruning in attention patterns). The reliance on a single human validator without described exhaustive checks for spatial-temporal quality degradations leaves the 'near-lossless' assertion vulnerable, directly impacting the claim that the techniques can be safely composed after independent optimization.
[Evaluation on three models] While results are asserted for three models of different sizes and architectures, there are no ablations showing the contribution of each technique, the effect of composition order, or evidence that the agent framework generalizes without model-specific human interventions beyond the 'little human effort' claim.

minor comments (2)

[Abstract] The abstract would benefit from a brief citation or definition of VBench to make the quality metric accessible.
[Overall] Consider including a figure or pseudocode outlining the agent roles and data flow for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving the clarity and completeness of our empirical claims and methodological descriptions. We address each major comment below and will revise the manuscript to incorporate additional details and experiments.

read point-by-point responses

Referee: [Abstract] The central performance claim ('more than 2x end-to-end acceleration while maintaining near-lossless VBench quality') is stated without any accompanying quantitative data, such as specific speedup factors per model, VBench score deltas, baseline comparisons (e.g., vs. individual techniques or prior methods), or details on the validation procedure. This is load-bearing as the soundness of the empirical outcome cannot be assessed from the provided information.

Authors: We agree the abstract should be self-contained with key quantitative support. The evaluation section of the manuscript reports per-model speedups (2.1× on Cosmos3-Super, 2.4× on LTX-2.3, 2.3× on SANA-Video), VBench deltas below 0.5 points, and comparisons against single-technique baselines and prior methods, along with the human validation protocol. We will revise the abstract to include these specifics. revision: yes
Referee: [Workflow description] The paper provides no details on how the integrator resolves potential conflicts or interactions between the five techniques when composing the stack (e.g., whether quantization noise compounds artifacts from token pruning in attention patterns). The reliance on a single human validator without described exhaustive checks for spatial-temporal quality degradations leaves the 'near-lossless' assertion vulnerable, directly impacting the claim that the techniques can be safely composed after independent optimization.

Authors: The current description focuses on the high-level agent workflow. We will expand the integrator subsection to specify the conflict-resolution logic (priority ordering with iterative parameter adjustment, e.g., lowering pruning ratio when aggressive quantization is selected) and the validator's explicit checklist for spatial-temporal artifacts. This will make the composition safety argument more rigorous while retaining the single-validator design. revision: yes
Referee: [Evaluation on three models] While results are asserted for three models of different sizes and architectures, there are no ablations showing the contribution of each technique, the effect of composition order, or evidence that the agent framework generalizes without model-specific human interventions beyond the 'little human effort' claim.

Authors: We acknowledge the lack of explicit ablations. We will add ablation tables and order-sensitivity experiments in the revised evaluation section. The three models already span two orders of magnitude in size and distinct architectures; the agent framework's per-instance adaptation is evidenced by the consistent minimal human effort across them. Further generalization tests on additional models can be included if space permits. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical engineering result with no derivations or self-referential predictions

full rationale

The paper describes an agentic framework that applies five standard acceleration techniques (cache, sparse attention, token pruning, quantization, kernel fusion) via parallel skill agents, an integrator, and human validation. The central claim is a measured empirical outcome (>2x end-to-end acceleration with near-lossless VBench scores on three models) rather than a mathematical derivation. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the abstract or described workflow. The result is presented as an observed performance gain on concrete deployments and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities beyond the framework name are described.

pith-pipeline@v0.9.1-grok · 5841 in / 1122 out tokens · 50446 ms · 2026-06-26T11:03:43.670186+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 15 linked inside Pith

[1]

Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

Pith/arXiv arXiv 2022
[2]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. InInternational Conference on Learning Representations, 2025

2025
[3]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[4]

Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024
[5]

Hunyuanvideo 1.5 technical report, 2025

Tencent Hunyuan Foundation Model Team. Hunyuanvideo 1.5 technical report, 2025. URL https://arxiv.org/abs/ 2511.18870

Pith/arXiv arXiv 2025
[6]

Cosmos 3: Omnimodal world models for physical ai.arXiv preprint arXiv:2606.02800, 2026

NVIDIA. Cosmos 3: Omnimodal world models for physical ai.arXiv preprint arXiv:2606.02800, 2026. URL https: //arxiv.org/abs/2606.02800

Pith/arXiv arXiv 2026
[7]

Longcat-video technical report, 2025

Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, and Tong Zhang. Longcat-video technical report, 2025. URLhttps://arxiv.org/abs/2510.22200

arXiv 2025
[8]

LTX-2.3 Model Card

Lightricks. LTX-2.3 Model Card. https://huggingface.co/Lightricks/LTX-2.3, 2026. Model checkpoint family including ltx-2.3-22b-dev and distilled variants. Accessed June 20, 2026

2026
[9]

Joyai-echo: Pushing the frontier of long video generation

Echo Team @ Joy Future Academy, JD. Joyai-echo: Pushing the frontier of long video generation. Technical report, Joy Future Academy, JD, May 2026. URL https://echo-team-joy-future-academy-jd.github.io/Echo- LongVideo-Page/. Project page. Accessed June 20, 2026

2026
[10]

Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

arXiv 2024
[11]

SANA-Video: Efficient video generation with block linear diffusion transformer, 2025

Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, et al. SANA-Video: Efficient video generation with block linear diffusion transformer, 2025. URL https://arxiv.org/abs/2509.24695

arXiv 2025
[12]

Timestep embedding tells: It’s time to cache for video diffusion model.arXiv preprint arXiv:2411.19108, 2024

Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model.arXiv preprint arXiv:2411.19108, 2024

arXiv 2024
[13]

From reusing to forecasting: Accelerating diffusion models with taylorseers

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. From reusing to forecasting: Accelerating diffusion models with taylorseers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15853–15863, October 2025

2025
[14]

Less is enough: Training-free video diffusion acceleration via runtime-adaptive caching.arXiv preprint arXiv:2507.02860, 2025

Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Less is enough: Training-free video diffusion acceleration via runtime-adaptive caching.arXiv preprint arXiv:2507.02860, 2025

arXiv 2025
[15]

Cache-dit: A pytorch-native inference engine with cache, parallelism and quantization for diffusion transformers

DefTruth, vipshop.com, etc. Cache-dit: A pytorch-native inference engine with cache, parallelism and quantization for diffusion transformers. https://github.com/vipshop/cache-dit.git, 2025. Open-source software. Accessed June 20, 2026

2025
[16]

Real-time video generation with pyramid attention broadcast.arXiv preprint arXiv:2408.12588, 2024

Xuanlei Zhao, Xiaolong Jin, Kai Wang, and Yang You. Real-time video generation with pyramid attention broadcast.arXiv preprint arXiv:2408.12588, 2024

arXiv 2024
[17]

Pisa: Piecewise sparse attention is wiser for efficient diffusion transformers.arXiv preprint arXiv:2602.01077, 2026

Haopeng Li, Shitong Shao, Wenliang Zhong, Zikai Zhou, Lichen Bai, Hui Xiong, and Zeke Xie. Pisa: Piecewise sparse attention is wiser for efficient diffusion transformers.arXiv preprint arXiv:2602.01077, 2026

arXiv 2026
[18]

Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776, 2025

Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776, 2025

arXiv 2025
[19]

Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation.arXiv preprint arXiv:2505.18875, 2025

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation.arXiv preprint arXiv:2505.18875, 2025. 17 Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Vid...

Pith/arXiv arXiv 2025
[20]

Xing, and Hao Zhang

Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric P. Xing, and Hao Zhang. Faster video diffusion with trainable sparse attention. InAdvances in Neural Information Processing Systems, 2025

2025
[21]

Spargeattn: Accurate sparse attention accelerating any model inference

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accurate sparse attention accelerating any model inference. InInternational Conference on Machine Learning (ICML), 2025

2025
[22]

Draftattention: Fast video diffusion via low-resolution attention guidance.arXiv preprint arXiv:2505.14708, 2025

Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu. Draftattention: Fast video diffusion via low-resolution attention guidance.arXiv preprint arXiv:2505.14708, 2025

arXiv 2025
[23]

Fast video generation with sliding tile attention

Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, and Hao Zhang. Fast video generation with sliding tile attention. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 74714–74731, 2025. URL https://proceedings.mlr.press/ v267/zhang25m.html

2025
[24]

Xattention: Block sparse attention with antidiagonal scoring

Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse attention with antidiagonal scoring. InProceedings of the 42nd International Conference on Machine Learning (ICML), 2025

2025
[25]

Radial attention:𝒪(𝑛log𝑛)sparse attention with energy decay for long video generation.arXiv preprint arXiv:2506.19852, 2025

Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, and Song Han. Radial attention:𝒪(𝑛log𝑛)sparse attention with energy decay for long video generation.arXiv preprint arXiv:2506.19852, 2025

arXiv 2025
[26]

LongLive-2.0: An nvfp4 parallel infrastructure for long video generation, 2026

Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, and Song Han. LongLive-2.0: An nvfp4 parallel infrastructure for long video generation, 2026. URLhttps://arxiv.org/abs/2605.18739

Pith/arXiv arXiv 2026
[27]

Token merging for fast stable diffusion.CVPR Workshop on Efficient Deep Learning for Computer Vision, 2023

Daniel Bolya and Judy Hoffman. Token merging for fast stable diffusion.CVPR Workshop on Efficient Deep Learning for Computer Vision, 2023

2023
[28]

Astraea: A token-wise acceleration framework for video diffusion transformers.arXiv preprint arXiv:2506.05096, 2025

Haosong Liu, Yuge Cheng, Wenxuan Miao, Zihan Liu, Aiyue Chen, Jing Lin, Yiwu Yao, Chen Chen, Jingwen Leng, Yu Feng, and Minyi Guo. Astraea: A token-wise acceleration framework for video diffusion transformers.arXiv preprint arXiv:2506.05096, 2025

arXiv 2025
[29]

Temporal aware pruning for efficient diffusion-based video generation.arXiv preprint arXiv:2605.17837, 2026

Sheng Li, Yang Sui, Junhao Ran, Bo Yuan, Yue Dai, and Xulong Tang. Temporal aware pruning for efficient diffusion-based video generation.arXiv preprint arXiv:2605.17837, 2026

Pith/arXiv arXiv 2026
[30]

Coredit: Spatial coherence-guided token pruning and reconstruction for efficient diffusion transformers.arXiv preprint arXiv:2605.14191, 2026

Zhuojin Li, Hsin-Pai Cheng, Hong Cai, Shizhong Han, and Fatih Porikli. Coredit: Spatial coherence-guided token pruning and reconstruction for efficient diffusion transformers.arXiv preprint arXiv:2605.14191, 2026

Pith/arXiv arXiv 2026
[31]

Ptq4dit: Post-training quantization for diffusion transformers

Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, and Yan Yan. Ptq4dit: Post-training quantization for diffusion transformers. InAdvances in Neural Information Processing Systems, 2024

2024
[32]

Q-dit: Accurate post-training quantization for diffusion transformers

Lei Chen, Yuan Meng, Chen Tang, Xinzhu Ma, Jingyan Jiang, Xin Wang, Zhi Wang, and Wenwu Zhu. Q-dit: Accurate post-training quantization for diffusion transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28306–28315, 2025

2025
[33]

Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation

Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Rui Wan, Widyadewi Soedarmadji, Enshu Liu, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation. InInternational Conference on Learning Representations, 2025

2025
[34]

Svdquant: Absorbing outliers by low-rank component for 4-bit diffusion models

Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank component for 4-bit diffusion models. InInternational Conference on Learning Representations, 2025

2025
[35]

Fp4 explore, bf16 train: Diffusion reinforcement learning via efficient rollout scaling, 2026

Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu, Dinghao Yang, Yangyang Tang, Junjie Bai, Ping Luo, Song Han, and Enze Xie. Fp4 explore, bf16 train: Diffusion reinforcement learning via efficient rollout scaling, 2026. URL https://arxiv.org/abs/2604.06916

Pith/arXiv arXiv 2026
[36]

Cutlass epilogue operations

NVIDIA. Cutlass epilogue operations. https://nvidia-cutlass-22.mintlify.app/cpp/epilogue, 2025. Documentation. Accessed June 20, 2026

2025
[37]

Bytetransformer: A high-performance transformer boosted for variable-length inputs

Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, and Yibo Zhu. Bytetransformer: A high-performance transformer boosted for variable-length inputs. In2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 344–355, 2023

2023
[38]

Coda: Rewriting transformer blocks as gemm-epilogue programs.arXiv preprint arXiv:2605.19269, 2026

Han Guo, Jack Zhang, Arjun Menon, Driss Guessous, Vijay Thakkar, Yoon Kim, and Tri Dao. Coda: Rewriting transformer blocks as gemm-epilogue programs.arXiv preprint arXiv:2605.19269, 2026. 18 Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

Pith/arXiv arXiv 2026
[39]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, 2024

2024
[40]

Mlagentbench: Evaluating language agents on machine learning experimentation

Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 20271–20309, 2024

2024
[41]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems, 2024

2024
[42]

Autocoderover: Autonomous program improvement

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. arXiv preprint arXiv:2404.05427, 2024

arXiv 2024
[43]

Agentless: Demystifying llm-based software engineering agents.arXiv preprint arXiv:2407.01489, 2024

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents.arXiv preprint arXiv:2407.01489, 2024

Pith/arXiv arXiv 2024
[44]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...

Pith/arXiv arXiv 2024
[45]

Ai harness engineering: A runtime substrate for foundation-model software agents.arXiv preprint arXiv:2605.13357, 2026

Hailin Zhong and Shengxin Zhu. Ai harness engineering: A runtime substrate for foundation-model software agents.arXiv preprint arXiv:2605.13357, 2026

Pith/arXiv arXiv 2026
[46]

The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

Pith/arXiv arXiv 2024
[47]

Cuda-llm: Llms can write efficient cuda kernels.arXiv preprint arXiv:2506.09092, 2025

Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, and An Zou. Cuda-llm: Llms can write efficient cuda kernels.arXiv preprint arXiv:2506.09092, 2025

arXiv 2025
[48]

Cudaforge: An agent framework with hardware feedback for cuda kernel optimization.arXiv preprint arXiv:2511.01884, 2025

Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, and Caiwen Ding. Cudaforge: An agent framework with hardware feedback for cuda kernel optimization.arXiv preprint arXiv:2511.01884, 2025

arXiv 2025
[49]

Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration

Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, and Jianfei Chen. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration. InInternational Conference on Learning Representations, 2025

2025
[50]

Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization

Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 75097–75119, 2025

2025
[51]

Sageattention3: Microscaling fp4 attention for inference and an exploration of 8-bit training.arXiv preprint arXiv:2505.11594, 2025

Jintao Zhang, Jia Wei, Haoxu Wang, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Kai Jiang, Jun Zhu, and Jianfei Chen. Sageattention3: Microscaling fp4 attention for inference and an exploration of 8-bit training.arXiv preprint arXiv:2505.11594, 2025

arXiv 2025
[52]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogniti...

2024
[53]

Components — nvidia hgx ai factory

NVIDIA. Components — nvidia hgx ai factory. https://docs.nvidia.com/enterprise-reference- architectures/hgx-ai-factory/latest/components.html, 2026. Accessed June 20, 2026. 19

2026

[1] [1]

Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

Pith/arXiv arXiv 2022

[2] [2]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. InInternational Conference on Learning Representations, 2025

2025

[3] [3]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025

[4] [4]

Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024

[5] [5]

Hunyuanvideo 1.5 technical report, 2025

Tencent Hunyuan Foundation Model Team. Hunyuanvideo 1.5 technical report, 2025. URL https://arxiv.org/abs/ 2511.18870

Pith/arXiv arXiv 2025

[6] [6]

Cosmos 3: Omnimodal world models for physical ai.arXiv preprint arXiv:2606.02800, 2026

NVIDIA. Cosmos 3: Omnimodal world models for physical ai.arXiv preprint arXiv:2606.02800, 2026. URL https: //arxiv.org/abs/2606.02800

Pith/arXiv arXiv 2026

[7] [7]

Longcat-video technical report, 2025

Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, and Tong Zhang. Longcat-video technical report, 2025. URLhttps://arxiv.org/abs/2510.22200

arXiv 2025

[8] [8]

LTX-2.3 Model Card

Lightricks. LTX-2.3 Model Card. https://huggingface.co/Lightricks/LTX-2.3, 2026. Model checkpoint family including ltx-2.3-22b-dev and distilled variants. Accessed June 20, 2026

2026

[9] [9]

Joyai-echo: Pushing the frontier of long video generation

Echo Team @ Joy Future Academy, JD. Joyai-echo: Pushing the frontier of long video generation. Technical report, Joy Future Academy, JD, May 2026. URL https://echo-team-joy-future-academy-jd.github.io/Echo- LongVideo-Page/. Project page. Accessed June 20, 2026

2026

[10] [10]

Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

arXiv 2024

[11] [11]

SANA-Video: Efficient video generation with block linear diffusion transformer, 2025

Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, et al. SANA-Video: Efficient video generation with block linear diffusion transformer, 2025. URL https://arxiv.org/abs/2509.24695

arXiv 2025

[12] [12]

Timestep embedding tells: It’s time to cache for video diffusion model.arXiv preprint arXiv:2411.19108, 2024

Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model.arXiv preprint arXiv:2411.19108, 2024

arXiv 2024

[13] [13]

From reusing to forecasting: Accelerating diffusion models with taylorseers

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. From reusing to forecasting: Accelerating diffusion models with taylorseers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15853–15863, October 2025

2025

[14] [14]

Less is enough: Training-free video diffusion acceleration via runtime-adaptive caching.arXiv preprint arXiv:2507.02860, 2025

Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Less is enough: Training-free video diffusion acceleration via runtime-adaptive caching.arXiv preprint arXiv:2507.02860, 2025

arXiv 2025

[15] [15]

Cache-dit: A pytorch-native inference engine with cache, parallelism and quantization for diffusion transformers

DefTruth, vipshop.com, etc. Cache-dit: A pytorch-native inference engine with cache, parallelism and quantization for diffusion transformers. https://github.com/vipshop/cache-dit.git, 2025. Open-source software. Accessed June 20, 2026

2025

[16] [16]

Real-time video generation with pyramid attention broadcast.arXiv preprint arXiv:2408.12588, 2024

Xuanlei Zhao, Xiaolong Jin, Kai Wang, and Yang You. Real-time video generation with pyramid attention broadcast.arXiv preprint arXiv:2408.12588, 2024

arXiv 2024

[17] [17]

Pisa: Piecewise sparse attention is wiser for efficient diffusion transformers.arXiv preprint arXiv:2602.01077, 2026

Haopeng Li, Shitong Shao, Wenliang Zhong, Zikai Zhou, Lichen Bai, Hui Xiong, and Zeke Xie. Pisa: Piecewise sparse attention is wiser for efficient diffusion transformers.arXiv preprint arXiv:2602.01077, 2026

arXiv 2026

[18] [18]

Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776, 2025

Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776, 2025

arXiv 2025

[19] [19]

Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation.arXiv preprint arXiv:2505.18875, 2025

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation.arXiv preprint arXiv:2505.18875, 2025. 17 Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Vid...

Pith/arXiv arXiv 2025

[20] [20]

Xing, and Hao Zhang

Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric P. Xing, and Hao Zhang. Faster video diffusion with trainable sparse attention. InAdvances in Neural Information Processing Systems, 2025

2025

[21] [21]

Spargeattn: Accurate sparse attention accelerating any model inference

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accurate sparse attention accelerating any model inference. InInternational Conference on Machine Learning (ICML), 2025

2025

[22] [22]

Draftattention: Fast video diffusion via low-resolution attention guidance.arXiv preprint arXiv:2505.14708, 2025

Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu. Draftattention: Fast video diffusion via low-resolution attention guidance.arXiv preprint arXiv:2505.14708, 2025

arXiv 2025

[23] [23]

Fast video generation with sliding tile attention

Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, and Hao Zhang. Fast video generation with sliding tile attention. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 74714–74731, 2025. URL https://proceedings.mlr.press/ v267/zhang25m.html

2025

[24] [24]

Xattention: Block sparse attention with antidiagonal scoring

Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse attention with antidiagonal scoring. InProceedings of the 42nd International Conference on Machine Learning (ICML), 2025

2025

[25] [25]

Radial attention:𝒪(𝑛log𝑛)sparse attention with energy decay for long video generation.arXiv preprint arXiv:2506.19852, 2025

Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, and Song Han. Radial attention:𝒪(𝑛log𝑛)sparse attention with energy decay for long video generation.arXiv preprint arXiv:2506.19852, 2025

arXiv 2025

[26] [26]

LongLive-2.0: An nvfp4 parallel infrastructure for long video generation, 2026

Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, and Song Han. LongLive-2.0: An nvfp4 parallel infrastructure for long video generation, 2026. URLhttps://arxiv.org/abs/2605.18739

Pith/arXiv arXiv 2026

[27] [27]

Token merging for fast stable diffusion.CVPR Workshop on Efficient Deep Learning for Computer Vision, 2023

Daniel Bolya and Judy Hoffman. Token merging for fast stable diffusion.CVPR Workshop on Efficient Deep Learning for Computer Vision, 2023

2023

[28] [28]

Astraea: A token-wise acceleration framework for video diffusion transformers.arXiv preprint arXiv:2506.05096, 2025

Haosong Liu, Yuge Cheng, Wenxuan Miao, Zihan Liu, Aiyue Chen, Jing Lin, Yiwu Yao, Chen Chen, Jingwen Leng, Yu Feng, and Minyi Guo. Astraea: A token-wise acceleration framework for video diffusion transformers.arXiv preprint arXiv:2506.05096, 2025

arXiv 2025

[29] [29]

Temporal aware pruning for efficient diffusion-based video generation.arXiv preprint arXiv:2605.17837, 2026

Sheng Li, Yang Sui, Junhao Ran, Bo Yuan, Yue Dai, and Xulong Tang. Temporal aware pruning for efficient diffusion-based video generation.arXiv preprint arXiv:2605.17837, 2026

Pith/arXiv arXiv 2026

[30] [30]

Coredit: Spatial coherence-guided token pruning and reconstruction for efficient diffusion transformers.arXiv preprint arXiv:2605.14191, 2026

Zhuojin Li, Hsin-Pai Cheng, Hong Cai, Shizhong Han, and Fatih Porikli. Coredit: Spatial coherence-guided token pruning and reconstruction for efficient diffusion transformers.arXiv preprint arXiv:2605.14191, 2026

Pith/arXiv arXiv 2026

[31] [31]

Ptq4dit: Post-training quantization for diffusion transformers

Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, and Yan Yan. Ptq4dit: Post-training quantization for diffusion transformers. InAdvances in Neural Information Processing Systems, 2024

2024

[32] [32]

Q-dit: Accurate post-training quantization for diffusion transformers

Lei Chen, Yuan Meng, Chen Tang, Xinzhu Ma, Jingyan Jiang, Xin Wang, Zhi Wang, and Wenwu Zhu. Q-dit: Accurate post-training quantization for diffusion transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28306–28315, 2025

2025

[33] [33]

Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation

Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Rui Wan, Widyadewi Soedarmadji, Enshu Liu, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation. InInternational Conference on Learning Representations, 2025

2025

[34] [34]

Svdquant: Absorbing outliers by low-rank component for 4-bit diffusion models

Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank component for 4-bit diffusion models. InInternational Conference on Learning Representations, 2025

2025

[35] [35]

Fp4 explore, bf16 train: Diffusion reinforcement learning via efficient rollout scaling, 2026

Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu, Dinghao Yang, Yangyang Tang, Junjie Bai, Ping Luo, Song Han, and Enze Xie. Fp4 explore, bf16 train: Diffusion reinforcement learning via efficient rollout scaling, 2026. URL https://arxiv.org/abs/2604.06916

Pith/arXiv arXiv 2026

[36] [36]

Cutlass epilogue operations

NVIDIA. Cutlass epilogue operations. https://nvidia-cutlass-22.mintlify.app/cpp/epilogue, 2025. Documentation. Accessed June 20, 2026

2025

[37] [37]

Bytetransformer: A high-performance transformer boosted for variable-length inputs

Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, and Yibo Zhu. Bytetransformer: A high-performance transformer boosted for variable-length inputs. In2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 344–355, 2023

2023

[38] [38]

Coda: Rewriting transformer blocks as gemm-epilogue programs.arXiv preprint arXiv:2605.19269, 2026

Han Guo, Jack Zhang, Arjun Menon, Driss Guessous, Vijay Thakkar, Yoon Kim, and Tri Dao. Coda: Rewriting transformer blocks as gemm-epilogue programs.arXiv preprint arXiv:2605.19269, 2026. 18 Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

Pith/arXiv arXiv 2026

[39] [39]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, 2024

2024

[40] [40]

Mlagentbench: Evaluating language agents on machine learning experimentation

Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 20271–20309, 2024

2024

[41] [41]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems, 2024

2024

[42] [42]

Autocoderover: Autonomous program improvement

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. arXiv preprint arXiv:2404.05427, 2024

arXiv 2024

[43] [43]

Agentless: Demystifying llm-based software engineering agents.arXiv preprint arXiv:2407.01489, 2024

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents.arXiv preprint arXiv:2407.01489, 2024

Pith/arXiv arXiv 2024

[44] [44]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...

Pith/arXiv arXiv 2024

[45] [45]

Ai harness engineering: A runtime substrate for foundation-model software agents.arXiv preprint arXiv:2605.13357, 2026

Hailin Zhong and Shengxin Zhu. Ai harness engineering: A runtime substrate for foundation-model software agents.arXiv preprint arXiv:2605.13357, 2026

Pith/arXiv arXiv 2026

[46] [46]

The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

Pith/arXiv arXiv 2024

[47] [47]

Cuda-llm: Llms can write efficient cuda kernels.arXiv preprint arXiv:2506.09092, 2025

Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, and An Zou. Cuda-llm: Llms can write efficient cuda kernels.arXiv preprint arXiv:2506.09092, 2025

arXiv 2025

[48] [48]

Cudaforge: An agent framework with hardware feedback for cuda kernel optimization.arXiv preprint arXiv:2511.01884, 2025

Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, and Caiwen Ding. Cudaforge: An agent framework with hardware feedback for cuda kernel optimization.arXiv preprint arXiv:2511.01884, 2025

arXiv 2025

[49] [49]

Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration

Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, and Jianfei Chen. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration. InInternational Conference on Learning Representations, 2025

2025

[50] [50]

Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization

Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 75097–75119, 2025

2025

[51] [51]

Sageattention3: Microscaling fp4 attention for inference and an exploration of 8-bit training.arXiv preprint arXiv:2505.11594, 2025

Jintao Zhang, Jia Wei, Haoxu Wang, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Kai Jiang, Jun Zhu, and Jianfei Chen. Sageattention3: Microscaling fp4 attention for inference and an exploration of 8-bit training.arXiv preprint arXiv:2505.11594, 2025

arXiv 2025

[52] [52]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogniti...

2024

[53] [53]

Components — nvidia hgx ai factory

NVIDIA. Components — nvidia hgx ai factory. https://docs.nvidia.com/enterprise-reference- architectures/hgx-ai-factory/latest/components.html, 2026. Accessed June 20, 2026. 19

2026