AoiZora: Topology-Aware Auto-Parallel Optimization for Inference of Diffusion Transformers
Pith reviewed 2026-06-26 23:04 UTC · model grok-4.3
The pith
AoiZora reconnects logical sharding to the physical TPU interconnect layout during compilation, cutting Wan 2.1 one-step denoising latency on TPU v5e sub-slices by up to 1.42x relative to existing solutions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AoiZora is a compiler-mediated topology planner that reconnects logical sharding with physical placement by drawing on different points in the compilation flow: it eliminates weak sharding candidates from inexpensive pre-compilation IRs, compiles only the survivors, and orders their physical placements using compiled HLO together with a topology-aware communication model. The winning plan is realized along the ordinary compiler path, leaving model code, compiler lowering, collective kernels, and network routing entirely intact.
What carries the argument
Two-stage planner that prunes sharding candidates via pre-compilation IR then ranks physical placements via a topology-aware communication model built from compiled HLO.
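The two-stage flow can be sketched in a few lines of Python. This is a minimal illustration, not AoiZora's implementation: `Plan`, `two_stage_plan`, `cheap_score`, `comm_cost`, and the `keep` cutoff are hypothetical names, and the scores are made-up numbers standing in for the pre-compilation IR estimate and the HLO-based communication model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Plan:
    sharding: str   # hypothetical label for a logical sharding candidate
    placement: str  # hypothetical label for a physical device ordering


def two_stage_plan(
    shardings: Iterable[str],
    placements_for: Callable[[str], Iterable[str]],
    cheap_score: Callable[[str], float],     # stand-in for a pre-compilation IR estimate
    comm_cost: Callable[[str, str], float],  # stand-in for the HLO + topology cost model
    keep: int,                               # assumed pruning budget
) -> Tuple[Plan, List[str]]:
    # Stage 1: prune. Only the `keep` cheapest-looking shardings survive.
    survivors = sorted(shardings, key=cheap_score)[:keep]
    # Stage 2: rank every placement of every survivor by modeled communication cost.
    candidates = [Plan(s, p) for s in survivors for p in placements_for(s)]
    best = min(candidates, key=lambda plan: comm_cost(plan.sharding, plan.placement))
    return best, survivors


# Toy inputs (illustrative numbers only, lower is better).
cheap = {"A": 3.0, "B": 1.0, "C": 2.0, "D": 9.0}
cost = {("B", "row"): 5.0, ("B", "snake"): 4.0, ("C", "row"): 3.5, ("C", "snake"): 6.0}

best, survivors = two_stage_plan(
    shardings=cheap,
    placements_for=lambda s: ["row", "snake"],
    cheap_score=lambda s: cheap[s],
    comm_cost=lambda s, p: cost[(s, p)],
    keep=2,
)
print(best, survivors)
```

With these toy numbers, shardings A and D are dropped in stage 1 and never reach the more expensive stage 2, which is the point of filtering early.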
If this is right
- Distributed video diffusion inference runs with lower communication overhead on existing TPU sub-slice fabrics.
- No modifications are required to model source, lowering passes, collective kernels, or network routing.
- Early IR filtering shrinks the set of plans that must be fully compiled.
- Latency gains appear directly in one-step denoising without changes to the serving stack.
Where Pith is reading between the lines
- The same early-filter-plus-HLO-model approach could be applied to GPU clusters whose interconnect topology is also known at compile time.
- Extending the planner to multi-step denoising pipelines would test whether the ranking remains stable across successive iterations.
- The method implies that other auto-parallel systems could close similar gaps by exposing physical topology information earlier in their search.
Load-bearing premise
The topology-aware communication model built from compiled HLO accurately ranks physical placements without needing full end-to-end execution or runtime measurements on the target hardware.
What would settle it
Execute every surviving placement plan on TPU v5e sub-slices and check whether the model's highest-ranked plan matches the lowest measured one-step denoising latency.
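That check is easy to script once both numbers exist for every surviving plan. The sketch below assumes two dictionaries keyed by plan name (lower is better in both). The function name and all values are hypothetical, not measurements from the paper.

```python
def top1_agreement(predicted_cost, measured_latency):
    """Return (match, regret) for the model's top-ranked plan.

    match  : True if the model's pick is also the fastest measured plan.
    regret : measured latency of the model's pick divided by the best
             measured latency (1.0 means no loss).
    """
    if predicted_cost.keys() != measured_latency.keys():
        raise ValueError("both dicts must cover the same plans")
    model_pick = min(predicted_cost, key=predicted_cost.get)
    true_best = min(measured_latency, key=measured_latency.get)
    regret = measured_latency[model_pick] / measured_latency[true_best]
    return model_pick == true_best, regret


# Made-up example: the model prefers p1, but p2 is fastest when executed.
predicted = {"p0": 10.0, "p1": 7.0, "p2": 8.0}
measured = {"p0": 1.30, "p1": 1.10, "p2": 1.00}
match, regret = top1_agreement(predicted, measured)
print(match, regret)
```

Reporting the regret alongside the match matters because a mismatch between two nearly identical plans is harmless, while a mismatch with large regret would undercut the premise.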
Original abstract
Video diffusion has quickly grown into a key generative serving workload, yet producing each clip demands many denoising iterations over large spatio-temporal latents, which puts low-latency inference out of reach on a single device. A denoising step is therefore typically distributed across multiple accelerators, and TPU sub-slices have become an attractive and practical fabric for doing so. Current auto-parallel systems, however, search almost exclusively over logical device meshes and disregard how a chosen sharding is actually laid out on the physical TPU interconnect -- an oversight that leaves large, topology-dependent performance on the table. We address this gap with AoiZora, a compiler-mediated topology planner built for low-latency video diffusion inference on TPU sub-slices. Its guiding principle is to reconnect logical sharding with physical placement by drawing on different points in the compilation flow: AoiZora first eliminates weak sharding candidates from inexpensive pre-compilation IRs, then compiles only the ones that survive and orders their physical placements using compiled HLO together with a topology-aware communication model. The winning plan is realized along the ordinary compiler path, leaving model code, compiler lowering, collective kernels, and network routing entirely intact. On TPU v5e sub-slices, AoiZora reduces Wan 2.1 one-step denoising latency by as much as 1.42x relative to existing solutions.
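To see why physical layout matters even when the logical mesh is fixed, consider a toy case. The sketch assumes a 2x4 grid of chips with nearest-neighbor links, no wrap-around links, and hop count equal to Manhattan distance. These are simplifying assumptions for illustration; the link structure of a real TPU v5e sub-slice and AoiZora's actual cost model may differ.

```python
def manhattan(a, b):
    """Hop count between two chips on a grid without wrap-around links."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def ring_hops(order):
    """Hops for each edge of a logical ring, including the closing edge."""
    n = len(order)
    return [manhattan(order[i], order[(i + 1) % n]) for i in range(n)]


rows, cols = 2, 4

# Placement 1: logical ring position i maps to chips in row-major order.
row_major = [(r, c) for r in range(rows) for c in range(cols)]

# Placement 2: "snake" order, reversing every other row so ring neighbors stay adjacent.
snake = [
    (r, c)
    for r in range(rows)
    for c in (range(cols) if r % 2 == 0 else reversed(range(cols)))
]

for name, order in [("row-major", row_major), ("snake", snake)]:
    hops = ring_hops(order)
    print(name, "total hops:", sum(hops), "worst edge:", max(hops))
```

Both placements realize the same logical 8-device ring, yet under these assumptions the row-major layout has two ring edges that span four hops each, while the snake layout keeps every edge at one hop.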
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents AoiZora, a compiler-mediated topology planner for low-latency inference of video diffusion transformers on TPU sub-slices. It eliminates weak logical sharding candidates via inexpensive pre-compilation IRs, then compiles survivors and ranks their physical placements using compiled HLO together with a topology-aware communication model; the winning plan is realized through the standard compiler path without altering model code, lowering, kernels, or routing. The central empirical claim is a reduction in Wan 2.1 one-step denoising latency by as much as 1.42× on TPU v5e sub-slices relative to existing auto-parallel solutions.
Significance. If the central performance claim holds, the work addresses a practical gap in auto-parallel systems by reconnecting logical sharding decisions with physical TPU interconnect topology, which is relevant for latency-sensitive distributed inference of large generative models. The pragmatic design that reuses existing compilation artifacts and leaves the rest of the stack unchanged is a strength that could facilitate adoption.
major comments (3)
- [Evaluation] Evaluation section: the 1.42× latency reduction for Wan 2.1 is reported as the outcome of selecting a physical placement via the compiled-HLO communication model, yet the manuscript supplies no quantitative evidence (correlation coefficient, top-k accuracy, or ranking agreement) that the model's predicted ordering matches measured end-to-end execution times on TPU v5e; this validation is load-bearing for the speedup claim because the model never executes the full program or takes runtime measurements.
- [§3] §3 (topology-aware communication model): the model is constructed solely from compiled HLO without runtime measurements on the target interconnect, but the manuscript does not demonstrate that its cost estimates correctly rank placements under the specific TPU v5e sub-slice topology; if collective costs are systematically mis-estimated, the selected plan need not be optimal and the reported 1.42× figure would not materialize.
- [Evaluation] Experimental setup (throughout Evaluation): the abstract and reported results omit baseline descriptions, number of trials, variance statistics, and ablation data on the communication model itself, preventing assessment of whether the observed improvement is attributable to topology awareness rather than other factors.
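The ranking-agreement metrics requested in the first comment could be computed as below. This is a dependency-free sketch: `kendall_tau` implements the tau-a variant without tie correction, `top_k_overlap` is a hypothetical helper name, and the inputs are illustrative values rather than data from the manuscript.

```python
from itertools import combinations


def kendall_tau(pred, meas):
    """Kendall tau-a between two equal-length score lists (no tie correction)."""
    if len(pred) != len(meas) or len(pred) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(pred)), 2):
        s = (pred[i] - pred[j]) * (meas[i] - meas[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / pairs


def top_k_overlap(pred, meas, k):
    """Fraction of the k fastest measured plans that the model also puts in its top k."""
    def best(xs):
        return set(sorted(range(len(xs)), key=xs.__getitem__)[:k])
    return len(best(pred) & best(meas)) / k


# Illustrative values: index i is one placement; lower is better in both lists.
pred = [10.0, 7.0, 8.0, 12.0]   # modeled communication cost
meas = [1.30, 1.10, 1.00, 1.50]  # measured latency
print(kendall_tau(pred, meas), top_k_overlap(pred, meas, k=2))
```

A tau near 1 would indicate that the static model orders placements almost exactly as the hardware does; the top-k overlap is the more forgiving metric when only the shortlist needs to be right.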
minor comments (2)
- [Abstract] The abstract would benefit from a brief statement of the number of candidate shardings considered and the hardware configuration used for the 1.42× measurement.
- [§2] Notation for logical vs. physical meshes is introduced without a dedicated table or diagram early in the paper, which could improve readability for readers unfamiliar with TPU sub-slice fabrics.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly.
- Referee: [Evaluation] Evaluation section: the 1.42× latency reduction for Wan 2.1 is reported as the outcome of selecting a physical placement via the compiled-HLO communication model, yet the manuscript supplies no quantitative evidence (correlation coefficient, top-k accuracy, or ranking agreement) that the model's predicted ordering matches measured end-to-end execution times on TPU v5e; this validation is load-bearing for the speedup claim because the model never executes the full program or takes runtime measurements.
Authors: We agree that the manuscript does not supply quantitative validation (e.g., correlation or ranking agreement) of the communication model's ordering against measured TPU v5e execution times. In the revised version we will add a dedicated validation subsection reporting these metrics on a representative set of placements. revision: yes
- Referee: [§3] §3 (topology-aware communication model): the model is constructed solely from compiled HLO without runtime measurements on the target interconnect, but the manuscript does not demonstrate that its cost estimates correctly rank placements under the specific TPU v5e sub-slice topology; if collective costs are systematically mis-estimated, the selected plan need not be optimal and the reported 1.42× figure would not materialize.
Authors: The referee correctly notes the absence of direct evidence that the static HLO-derived model ranks placements accurately on TPU v5e. We will add experiments in the revision that compare model-estimated collective costs against measured performance on the target sub-slice topology. revision: yes
- Referee: [Evaluation] Experimental setup (throughout Evaluation): the abstract and reported results omit baseline descriptions, number of trials, variance statistics, and ablation data on the communication model itself, preventing assessment of whether the observed improvement is attributable to topology awareness rather than other factors.
Authors: We agree that the experimental description is incomplete. The revised Evaluation section will explicitly list the baselines, report the number of trials and variance, and include an ablation isolating the topology-aware component. revision: yes
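The trial counts and variance statistics promised in the last response amount to a small summary per configuration. The sketch below uses Python's standard `statistics` module. The helper names and latency values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean, stdev


def summarize(latencies_ms):
    """Trial count, mean, and sample standard deviation for one configuration."""
    return {
        "n": len(latencies_ms),
        "mean": mean(latencies_ms),
        "stdev": stdev(latencies_ms),  # sample standard deviation, needs n >= 2
    }


def speedup(baseline_ms, candidate_ms):
    """Ratio of mean latencies; values above 1 mean the candidate is faster."""
    return mean(baseline_ms) / mean(candidate_ms)


# Made-up per-trial latencies in milliseconds.
baseline = [120.0, 118.0, 122.0]
candidate = [100.0, 98.0, 102.0]

print(summarize(baseline))
print(summarize(candidate))
print(speedup(baseline, candidate))
```

Running the same summary for a variant with the topology-aware ranking disabled would give the ablation the referee asks for, since any remaining speedup could then be attributed to other factors.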
Circularity Check
No circularity; empirical systems technique with independent validation path
The paper describes a compiler-mediated placement search that filters candidates via pre-compilation IRs then ranks survivors using compiled HLO plus a topology-aware communication model. No equations, fitted parameters, or derived quantities are presented as predictions; the 1.42x latency result is an end-to-end measured outcome on TPU v5e hardware. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text. The contribution is therefore self-contained against external benchmarks (actual execution times) and receives the default non-circularity finding.