pith. machine review for the scientific record.

arxiv: 2604.12163 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Nucleus-Image: Sparse MoE for Image Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image generation · sparse mixture of experts · diffusion transformer · efficient inference · open-source model · image synthesis · MoE scaling

The pith

Sparse MoE diffusion transformers can match leading text-to-image models on quality benchmarks while activating only about 2 billion parameters per forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Nucleus-Image, a text-to-image generation model that employs a sparse mixture-of-experts architecture within a diffusion transformer framework. Expert routing activates only roughly 2 billion of the model's 17 billion total parameters per forward pass, yet it matches or exceeds leading models on GenEval, DPG-Bench, and OneIG-Bench. The architecture excludes text tokens from the transformer backbone and incorporates joint attention along with decoupled routing for timestep modulation. A large-scale dataset of 1.5 billion high-quality image-text pairs is used with a progressive training curriculum that moves from low to high resolution while increasing sparsity. The approach achieves these results without any reinforcement learning, preference optimization, or human preference tuning.

Core claim

Nucleus-Image employs a sparse mixture-of-experts diffusion transformer with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer while activating only approximately 2B parameters per forward pass, matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench. Its streamlined architecture excludes text tokens from the transformer backbone, uses joint attention for text KV sharing, and adopts decoupled routing for stability under timestep modulation. The model is trained on 1.5B high-quality pairs with progressive resolution and sparsification, using the Muon optimizer, and demonstrates high-quality generation without post-training.
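To make the routing mechanism concrete, here is a minimal sketch of Expert-Choice Routing in the style of Zhou et al. (2022, reference [20]), where each expert selects its top-scoring tokens rather than each token selecting experts. The shapes, softmax placement, and capacity arithmetic are illustrative assumptions, not the paper's implementation.

```python
import torch

def expert_choice_route(tokens, router_weights, capacity_factor=1.0):
    """Sketch of Expert-Choice Routing: each expert picks its top-k tokens.

    tokens: (n_tokens, d_model); router_weights: (d_model, n_experts).
    The capacity factor sets each expert's token budget, so lowering it
    sparsifies compute without changing total parameter count.
    """
    n_tokens = tokens.shape[0]
    n_experts = router_weights.shape[1]
    # Affinity of every token for every expert.
    scores = torch.softmax(tokens @ router_weights, dim=-1)  # (n_tokens, n_experts)
    # Fixed per-expert budget: capacity_factor * tokens / experts.
    k = max(1, int(capacity_factor * n_tokens / n_experts))
    # Each expert selects its k highest-scoring tokens.
    gate, idx = torch.topk(scores.T, k, dim=-1)  # both (n_experts, k)
    return gate, idx  # expert e processes tokens idx[e], weighted by gate[e]
```

Because every expert takes a fixed budget, load balancing is automatic; the trade-off is that some tokens may be selected by many experts and others by none.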

What carries the argument

Sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing and decoupled timestep-aware expert assignment.
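The decoupling of timestep-aware assignment from timestep-conditioned computation admits one concrete reading, sketched below: the router scores un-modulated hidden states plus a timestep embedding, while the experts consume the AdaLN-modulated states. This is a hypothetical reconstruction; the layer sizes, the dense mixture, and the modulation form are all assumptions.

```python
import torch
import torch.nn as nn

class DecoupledMoELayer(nn.Module):
    """Hypothetical decoupled-routing MoE block for a diffusion transformer.

    Assignment path: raw hidden state + timestep embedding -> router scores.
    Computation path: timestep-modulated (AdaLN-style) state -> experts.
    The two paths share no activations, so routing decisions need not drift
    with the modulation statistics across denoising steps.
    """

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.t_proj = nn.Linear(d_model, d_model)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x, t_emb, scale, shift):
        # Timestep-aware assignment, computed on the un-modulated state.
        scores = torch.softmax(self.router(x + self.t_proj(t_emb)), dim=-1)
        # Timestep-conditioned computation, on the modulated state.
        x_mod = x * (1 + scale) + shift
        # Dense mixture for clarity; the real model routes sparsely.
        out = torch.stack([e(x_mod) for e in self.experts], dim=-2)
        return (scores.unsqueeze(-1) * out).sum(dim=-2)
```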

Load-bearing premise

The reported benchmark scores reflect genuine quality gains from the described architecture and training rather than from undisclosed data curation, evaluation choices, or implementation details.

What would settle it

An independent replication that retrains the released model code on a substitute training dataset and obtains substantially lower scores on the public benchmarks would show the results depend more on data than on the MoE design.

Figures

Figures reproduced from arXiv: 2604.12163 by Ajay Modukuri, Chandan Akiti, Gunavardhan Akiti, Haozhe Liu, Murali Nandan Nagarapu.

Figure 1: Nucleus-Image generations of human subjects and portraits, spanning diverse cultures, ages, and artistic …
Figure 2: Nucleus-Image generations spanning fantasy, surrealism, animation, and the natural world.
Figure 3: Nucleus-Image generations across product photography, architecture, typography, food, and world culture.
Figure 4: Overall performance computed as the average of GenEval, DPG-Bench, and OneIG-Bench benchmark scores.
Figure 5: Dataset retention across the data pipeline. Block height indicates retained corpus size, and ring markers …
Figure 6: Representative quality tiers for real images. Real-image samples are ranked using aesthetic scoring together …
Figure 7: Representative quality tiers for synthetic images. Synthetic samples bypass aesthetic scoring and are assigned …
Figure 8: Average caption length across quality tiers and episodic buckets. Higher quality tiers generally carry longer …
Figure 9: Joint view of static quality tiers and episodic buckets. Columns denote quality tiers A1-A5 and rows denote …
Figure 10: Overview of the Nucleus-Image architecture.
Figure 11: GenEval overall scores for top-performing models. Nucleus-Image matches Qwen-Image at 0.87 and leads …
Figure 12: DPG-Bench overall scores for top-performing models. Nucleus-Image achieves the highest overall score of …
Figure 13: OneIG-Bench overall scores for top-performing models. Nucleus-Image scores 0.522, surpassing Imagen4 …
Figure 14: Expert allocation and diversity across three domains. Each column shows a different generation: stylized text rendering (left), photorealistic scene composition (center), and portrait photography (right). Top: generated images. Middle: normalized expert allocation aggregated over all 29 MoE layers and 50 denoising steps; bright regions attract more expert capacity. Bottom: expert diversity, measuring the n…
Figure 15: Timestep evolution of expert allocation at layer 17. Each row shows the number-of-experts-per-token heatmap overlaid on the generated image, sampled at 10 evenly spaced denoising steps (left to right, top to bottom: steps 0, 5, 11, 17, 22, 27, 33, 38, 44, 49). Early steps exhibit diffuse, spatially unstructured allocation; mid-steps develop coarse semantic structure; late steps produce the sharpest, most …
Original abstract

We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.
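The abstract's "parameter grouping recipe" for Muon is not spelled out on this page; the sketch below shows the common convention it likely refines, with Muon on 2-D hidden-layer weight matrices and AdamW on everything else. The name-based tests are illustrative assumptions, not the paper's rules.

```python
import torch.nn as nn

def muon_param_groups(model: nn.Module):
    """Split parameters into a Muon group and an AdamW group.

    Convention: Muon updates 2-D hidden-layer weight matrices; embeddings,
    biases, norms, and timestep-modulation (AdaLN) parameters stay on AdamW.
    The substring tests below are assumptions for illustration.
    """
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        is_matrix = p.ndim == 2
        is_special = any(tag in name for tag in ("embed", "ada_ln", "modulation"))
        (muon_params if is_matrix and not is_special else adamw_params).append(p)
    return muon_params, adamw_params
```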

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Nucleus-Image, a sparse mixture-of-experts (MoE) diffusion transformer for text-to-image generation. It claims to set a new Pareto frontier in quality versus efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only ~2B parameters per forward pass from a total capacity of 17B parameters (64 routed experts per layer). Key elements include Expert-Choice Routing, decoupled timestep-aware assignment, joint attention for text KV sharing (excluding text tokens from the backbone), a 1.5B-pair training corpus built via multi-stage filtering/deduplication/aesthetic tiering, progressive resolution curriculum (256→512→1024) with multi-aspect bucketing and progressive sparsification, the Muon optimizer, and a parameter-grouping recipe for timestep-modulated diffusion models. Results are reported without post-training (no RL, DPO, or preference tuning), and the training recipe is released.
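The multi-aspect-ratio bucketing mentioned in the curriculum admits a simple sketch: snap each image to the nearest aspect-ratio bucket at a fixed pixel budget, so every image in a batch shares a latent grid shape without extreme cropping. The bucket ratios, pixel budget, and 64-pixel alignment below are assumptions, not the paper's values.

```python
import math

def bucket_shape(width, height, area=1024 * 1024,
                 ratios=(0.5, 9 / 16, 0.75, 1.0, 4 / 3, 16 / 9, 2.0)):
    """Snap an image to the nearest aspect-ratio bucket at a fixed pixel budget.

    Returns (bucket_width, bucket_height) aligned to 64 pixels so all images
    in a bucket share a shape. Ratios, budget, and alignment are illustrative.
    """
    r = width / height
    best = min(ratios, key=lambda b: abs(math.log(r / b)))  # nearest in log space
    h = int(math.sqrt(area / best) // 64) * 64
    w = int(math.sqrt(area * best) // 64) * 64
    return w, h

# e.g. a 1920x1080 photo lands in the 16:9 bucket: bucket_shape(1920, 1080) -> (1344, 768)
```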

Significance. If the benchmark numbers hold under scrutiny and the performance gains are attributable to the described MoE architecture and training choices rather than data or implementation artifacts, the work would be significant for demonstrating that sparse MoE scaling can deliver high-quality image generation at substantially lower inference cost than dense models with comparable active parameters. The explicit release of the full training recipe, including the Muon optimizer grouping tailored for diffusion models, is a concrete strength that supports reproducibility and community follow-up.

major comments (3)
  1. [§5 / abstract] The central Pareto-frontier claim (abstract and §5) rests on benchmark scores that are presented without error bars, standard deviations, or details on the number of evaluation runs or seeds. This makes it impossible to assess whether the reported matching/exceeding of leading models on GenEval, DPG-Bench, and OneIG-Bench reflects a reliable improvement or could be within run-to-run variance.
  2. [§3.2 / §4] No ablation studies are provided for the key architectural innovations (Expert-Choice Routing, decoupled timestep-aware assignment, joint attention for text KV sharing). For example, the claim that decoupled routing improves stability under timestep modulation (abstract and §3.2) would be strengthened by a direct comparison to a standard timestep-modulated MoE baseline on at least one benchmark and a routing-stability metric.
  3. [§4.1] The training corpus construction (§4.1) is described at a high level (1.5B pairs, 700M unique images, multi-stage filtering, aesthetic tiering). To support the claim that results are achieved without undisclosed data advantages, the paper should report quantitative statistics on the filtering thresholds, deduplication method, and aesthetic score distribution, plus an ablation on a smaller corpus without tiering (an embedding-similarity deduplication sketch follows the minor comments below).
minor comments (3)
  1. [§3] Notation for expert capacity factor and routing probabilities is introduced without a consolidated table of symbols; adding one would improve readability when comparing the progressive sparsification schedule across stages.
  2. [Figure 1 / §3] Figure captions for the architecture diagram and routing visualization should explicitly state the active-parameter count per forward pass and the total parameter count to make the efficiency claim immediately verifiable from the figure.
  3. [§2 / Table 2] The paper cites prior MoE diffusion works but does not include a direct comparison table against other open MoE image models with similar active-parameter budgets; adding this would clarify the novelty of the 2B-active / 17B-total tradeoff.
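Major comment 3 asks the paper to specify its deduplication method; in practice this is usually embedding-similarity filtering. A minimal, hypothetical sketch with an assumed cosine threshold (0.92 is illustrative, not the paper's value):

```python
import numpy as np

def dedup_by_embedding(embs: np.ndarray, threshold: float = 0.92):
    """Greedy near-duplicate filter over image embeddings.

    Keeps an image only if its cosine similarity to every already-kept image
    is below the threshold. O(n^2) for clarity; production pipelines use
    approximate-nearest-neighbor indexes instead.
    """
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    kept = []
    for i, e in enumerate(embs):
        if all(float(e @ embs[j]) < threshold for j in kept):
            kept.append(i)
    return kept  # indices of retained, near-unique images
```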

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below. Where the suggestions identify areas for greater rigor or transparency, we have revised the manuscript accordingly to strengthen the work.

Point-by-point responses
  1. Referee: The central Pareto-frontier claim (abstract and §5) rests on benchmark scores that are presented without error bars, standard deviations, or details on the number of evaluation runs or seeds. This makes it impossible to assess whether the reported matching/exceeding of leading models on GenEval, DPG-Bench, and OneIG-Bench reflects a reliable improvement or could be within run-to-run variance.

    Authors: We agree that statistical variability measures are necessary to substantiate the reliability of the reported benchmark results. In the revised manuscript, we will add error bars and standard deviations for all scores on GenEval, DPG-Bench, and OneIG-Bench. These will be computed from multiple independent evaluation runs (at least three) using distinct random seeds, and we will explicitly state the number of runs and seeds in the evaluation section. revision: yes

  2. Referee: No ablation studies are provided for the key architectural innovations (Expert-Choice Routing, decoupled timestep-aware assignment, joint attention for text KV sharing). For example, the claim that decoupled routing improves stability under timestep modulation (abstract and §3.2) would be strengthened by a direct comparison to a standard timestep-modulated MoE baseline on at least one benchmark and a routing-stability metric.

    Authors: We acknowledge that dedicated ablations would more convincingly isolate the benefits of the proposed components. In the revision, we will add ablation studies comparing the full Nucleus-Image architecture against variants that replace decoupled timestep-aware assignment with standard timestep-modulated MoE routing. These will report performance on GenEval and include a routing-stability metric such as expert activation variance across timesteps (one such metric is sketched after these responses). Parallel ablations will cover Expert-Choice Routing and joint attention for text KV sharing. revision: yes

  3. Referee: The training corpus construction (§4.1) is described at a high level (1.5B pairs, 700M unique images, multi-stage filtering, aesthetic tiering). To support the claim that results are achieved without undisclosed data advantages, the paper should report quantitative statistics on the filtering thresholds, deduplication method, and aesthetic score distribution, plus an ablation on a smaller corpus without tiering.

    Authors: We recognize the value of quantitative transparency in data curation. The revised §4.1 will include specific filtering thresholds (e.g., minimum aesthetic and quality scores), the deduplication method (e.g., embedding similarity threshold), and summary statistics or distributions of aesthetic scores in the final 1.5B-pair corpus. We will also add an ablation training a smaller model on a non-tiered version of the corpus and compare its benchmark performance to the tiered version. revision: yes
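Response 2 promises "a routing-stability metric such as expert activation variance across timesteps" without defining it. One natural instantiation, under assumed tensor shapes, is the average per-expert variance of token load across denoising steps:

```python
import torch

def routing_stability(load: torch.Tensor) -> float:
    """Routing-stability score from a (timesteps, experts) load matrix.

    'load' holds the number (or fraction) of tokens each expert processed at
    each denoising step. Lower average per-expert variance across timesteps
    means routing decisions change less as denoising progresses.
    """
    load = load / load.sum(dim=-1, keepdim=True)  # normalize each step
    return load.var(dim=0).mean().item()
```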

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical architecture description, training procedure, and benchmark results for a sparse MoE diffusion model. No mathematical derivations, predictions, or first-principles claims are made that reduce to self-defined quantities, fitted inputs renamed as outputs, or self-citation chains. All performance claims rest on external benchmarks (GenEval, DPG-Bench, OneIG-Bench) and the stated active-parameter count, with no internal reduction of results to inputs by construction. The work is self-contained as an engineering report.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work is empirical and introduces design choices rather than new theoretical axioms or entities; free parameters are primarily architectural hyperparameters chosen to target efficiency.

free parameters (2)
  • number of routed experts
    Design choice of 64 experts per layer to reach 17B total capacity while keeping active parameters near 2B.
  • expert capacity factor
    Progressive sparsification schedule that controls how many tokens each expert processes.
axioms (1)
  • domain assumption: Diffusion transformers remain stable under sparse expert routing when timestep modulation is decoupled from expert assignment.
    Invoked to justify the decoupled routing design for training stability.
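The two free parameters interact directly under Expert-Choice Routing: the expert count and capacity factor jointly fix per-step compute. A small worked sketch with illustrative numbers (the schedule values are assumptions, not the paper's):

```python
def tokens_per_expert(n_tokens: int, n_experts: int, capacity_factor: float) -> int:
    """Per-expert token budget under Expert-Choice Routing."""
    return int(capacity_factor * n_tokens / n_experts)

# Progressive sparsification: the budget shrinks as the factor is lowered.
for cf in (2.0, 1.5, 1.0):
    print(f"capacity_factor={cf}: {tokens_per_expert(4096, 64, cf)} tokens/expert")
# -> 128, 96, 64 tokens per expert for a 4096-token image at 64 experts
```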

pith-pipeline@v0.9.0 · 5632 in / 1238 out tokens · 32555 ms · 2026-05-10T15:25:41.428613+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Normalizing Trajectory Models

cs.CV · 2026-05 · unverdicted · novelty 7.0

    NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.

  2. Normalizing Trajectory Models

cs.CV · 2026-05 · unverdicted · novelty 7.0

    NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

Reference graph

Works this paper leans on

69 extracted references · 23 canonical work pages · cited by 1 Pith paper · 16 internal anchors

  1. [1]

    Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

  2. [2]

    Learning transferable visual models from natural language supervision, 2021

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021

  3. [3]

    Meta clip 2: A worldwide scaling recipe, 2025

    Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, and Hu Xu. Meta clip 2: A worldwide scaling recipe, 2025

  4. [4]

    Improved techniques for training single-image gans, 2020

    Tobias Hinz, Matthew Fisher, Oliver Wang, and Stefan Wermter. Improved techniques for training single-image gans, 2020

  5. [5]

    NVIDIA DALI: A gpu-accelerated data loading library, 2024

    Nvidia. NVIDIA DALI: A gpu-accelerated data loading library, 2024

  6. [6]

    NeMo Curator: GPU-accelerated data curation for large language models, 2024

    Joseph Jennings, Mostofa Patwary, et al. NeMo Curator: GPU-accelerated data curation for large language models, 2024

  7. [7]

    Unveiling and mitigating memorization in text-to-image diffusion models through cross attention, 2025

    Jie Ren, Yaxin Li, Shenglai Zeng, Han Xu, Lingjuan Lyu, Yue Xing, and Jiliang Tang. Unveiling and mitigating memorization in text-to-image diffusion models through cross attention, 2025

  8. [8]

    Captain: Semantic feature injection for memorization mitigation in text-to-image diffusion models, 2025

    Tong Zhang, Carlos Hinojosa, and Bernard Ghanem. Captain: Semantic feature injection for memorization mitigation in text-to-image diffusion models, 2025

  9. [9]

    Waon: Large-scale and high-quality japanese image-text pair dataset for vision-language models, 2025

    Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe, and Naoaki Okazaki. Waon: Large-scale and high-quality japanese image-text pair dataset for vision-language models, 2025

  10. [10]

    The algorithmic gaze of image quality assessment: An audit and trace ethnography of the laion-aesthetics predictor, 2026

    Jordan Taylor, William Agnew, Maarten Sap, Sarah E. Fox, and Haiyi Zhu. The algorithmic gaze of image quality assessment: An audit and trace ethnography of the laion-aesthetics predictor, 2026

  11. [11]

    Lumina-image 2.0: A unified and efficient image generative framework, 2025

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Manyuan Zhang, Will Beddow, Erwann Millon, Victor Perez, Wenhai Wang, Conghui He, Bo Zhang, Xiaohong Liu, Hongsheng Li, Yu Qiao, Chang Xu, and Peng Gao. Lumina-image 2.0: A unified and efficient image generative framework, 2025

  12. [12]

    Playground v3: Improving text-to-image alignment with deep-fusion large language models, 2024

    Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to-image alignment with deep-fusion large language models, 2024

  13. [13]

    Qwen-image technical report, 2025

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

  14. [14]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  15. [15]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023

  16. [16]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  17. [17]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints.arXiv:2305.13245, 2023

  18. [18]

    Scaling diffusion transformers to 16 billion parameters, 2024

    Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. Scaling diffusion transformers to 16 billion parameters, 2024

  19. [19]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer.arXiv:2002.05202, 2020

  20. [20]

    Mixture-of-experts with expert choice routing.Advances in Neural Information Processing Systems, 35:7103–7114, 2022

    Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing.Advances in Neural Information Processing Systems, 35:7103–7114, 2022

  21. [21]

    EC-DIT: Scaling diffusion transformers with adaptive expert-choice routing, 2024

    Haotian Sun, Tao Lei, Bowen Zhang, Yanghao Li, Haoshuo Huang, Ruoming Pang, Bo Dai, and Nan Du. EC-DIT: Scaling diffusion transformers with adaptive expert-choice routing, 2024

  22. [22]

    Liger kernel: Efficient triton kernels for LLM training, 2024

    Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yanning Chen. Liger kernel: Efficient triton kernels for LLM training, 2024

  23. [23]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision, 2024

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.arXiv:2407.08608, 2024

  24. [24]

    Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

  25. [25]

    TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training, 2024

    Wanchao Liang, et al. TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training, 2024

  26. [26]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  27. [27]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InICML, 2024

  28. [28]

    ST-MoE: Designing stable and transferable sparse expert models, 2022

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models, 2022

  29. [29]

    Advancing expert specialization for better MoE, 2025

    Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Xinye Cao, Sicong Leng, Qimei Cui, and Xudong Jiang. Advancing expert specialization for better MoE, 2025

  30. [30]

    ERNIE 4.5 technical report

    Baidu-ERNIE-Team. ERNIE 4.5 technical report. https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf, 2025

  31. [31]

    Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models, 2025

    Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models, 2025

  32. [32]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan. Muon: An optimizer for hidden layers in neural networks, 2024

  33. [33]

    WSM: Decay-free learning rate schedule via checkpoint merging for LLM pre-training, 2025

    Changxin Tian, Peng Wang, et al. WSM: Decay-free learning rate schedule via checkpoint merging for LLM pre-training.arXiv preprint arXiv:2507.17634, 2025

  34. [34]

    Muon is scalable for LLM training, 2025

    Jingyuan Liu, Jianlin Zeng, et al. Muon is scalable for LLM training, 2025

  35. [35]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019

  36. [36]

    High-resolution image synthesis with latent diffusion models, 2021

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

  37. [37]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv:2207.12598, 2022

  38. [38]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277, 2023

  39. [39]

    DeepEP: an efficient expert-parallel communication library, 2025

    DeepSeek-AI. DeepEP: an efficient expert-parallel communication library, 2025

  40. [40]

    Flux.https://github.com/black-forest-labs/flux, 2024

    BlackForest. Flux.https://github.com/black-forest-labs/flux, 2024

  41. [41]

    Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  42. [42]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024

  43. [43]

    Oneig-bench: Omni-dimensional nuanced evaluation for image generation

    Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation.arXiv preprint arXiv:2506.07977, 2025

  44. [44]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

  45. [45]

    Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InICLR, 2024

  46. [46]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

  47. [47]

    Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7739–7751, 2025

  48. [48]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  49. [49]

    Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

  50. [50]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025

  51. [51]

    Gpt-image-1, 2025

    OpenAI. Gpt-image-1, 2025

  52. [52]

    Lumina-next: Making lumina-t2x stronger and faster with next-dit.Advances in Neural Information Processing Systems, 37:131278–131315, 2024

    Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit.Advances in Neural Information Processing Systems, 37:131278–131315, 2024

  53. [53]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

  54. [54]

    Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation.arXiv preprint arXiv:2402.17245, 2024

  55. [55]

    Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...

  56. [56]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

  57. [57]

    Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024

  58. [58]

    DALL·E 3

    OpenAI. DALL·E 3. https://openai.com/research/dall-e-3, September 2023

  59. [59]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025

  60. [60]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025

  61. [61]

    Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer

    Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Chengyue Wu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer.arXiv preprint arXiv:2501.18427, 2025

  62. [62]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  63. [63]

    Kolors2.0

    Kuaishou Kolors team. Kolors2.0. https://app.klingai.com/cn/, 2025

  64. [64]

    OmniGen2: Exploration to Advanced Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025

  65. [65]

    Cogview4, 2025

    THUKEG Z.ai. Cogview4, 2025

  66. [66]

    Imagen 3, 2024

    Imagen Team Google. Imagen 3, 2024

  67. [67]

    Recraft v3.https://www.recraft.ai/, 2024

    Recraft. Recraft v3.https://www.recraft.ai/, 2024

  68. [68]

    Imagen, 2025

    Google. Imagen, 2025

  69. [69]

    Mithril Cloud, 2025

    Mithril Cloud. Mithril Cloud, 2025