Nucleus-Image: Sparse MoE for Image Generation
Pith reviewed 2026-05-10 15:25 UTC · model grok-4.3
The pith
Sparse MoE diffusion transformers can match leading text-to-image models on quality benchmarks while activating only about 2 billion parameters per forward pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Nucleus-Image employs a sparse mixture-of-experts diffusion transformer with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer while activating only about 2B parameters per forward pass, matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench. It uses a streamlined architecture that excludes text tokens from the transformer backbone, joint attention that enables text KV sharing across timesteps, and a decoupled routing design that keeps expert assignment stable under timestep modulation. The model is trained on 1.5B high-quality pairs with a progressive resolution and sparsification curriculum, using the Muon optimizer, and reaches this quality without any post-training.
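The central mechanism, Expert-Choice Routing, inverts the usual token-choice scheme: experts pick tokens rather than tokens picking experts, so per-expert load is balanced by construction. A minimal sketch of the idea; the dot-product router, shapes, and capacity factor here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def expert_choice_route(token_feats, expert_embs, capacity_factor=2.0):
    """Expert-choice routing sketch: each EXPERT selects its top-k tokens,
    so every expert is exactly at capacity (no auxiliary balance loss)."""
    n_tokens, _ = token_feats.shape
    n_experts = expert_embs.shape[0]
    scores = token_feats @ expert_embs.T                  # (n_tokens, n_experts)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)             # softmax over experts
    k = int(capacity_factor * n_tokens / n_experts)       # per-expert capacity
    assignment = np.zeros((n_experts, n_tokens), dtype=bool)
    for e in range(n_experts):
        chosen = np.argsort(-probs[:, e])[:k]             # expert e picks its tokens
        assignment[e, chosen] = True
    return assignment, k

rng = np.random.default_rng(0)
assignment, k = expert_choice_route(rng.normal(size=(64, 16)),
                                    rng.normal(size=(8, 16)))
assert (assignment.sum(axis=1) == k).all()   # every expert exactly at capacity
```

Lowering `capacity_factor` shrinks `k` and hence active compute, which is the knob the paper's progressive-sparsification schedule would turn during training.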
What carries the argument
Sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing and decoupled timestep-aware expert assignment.
Load-bearing premise
The reported benchmark scores reflect genuine quality gains from the described architecture and training rather than from undisclosed data curation, evaluation choices, or implementation details.
What would settle it
An independent replication that applies the same model code and weights to the public benchmarks but substitutes a different training dataset and obtains substantially lower scores would show the results depend more on data than on the MoE design.
Original abstract
We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.
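One efficiency choice in the abstract, excluding text tokens from the backbone so their keys and values depend only on the prompt, implies the text K/V projections can be computed once and reused at every denoising step. A toy sketch of that caching; the class, shapes, and step count are assumptions, not the released code:

```python
import numpy as np

class SharedTextKV:
    """Sketch of text-KV sharing across denoising timesteps: text K/V
    depend only on the prompt, so they are projected once and reused
    by joint attention at every sampling step."""
    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.projections = 0          # counts how often we actually project
        self._cache = None
    def kv(self, text_emb):
        if self._cache is None:
            self.projections += 1     # only the first step pays this cost
            self._cache = (text_emb @ self.w_k, text_emb @ self.w_v)
        return self._cache

rng = np.random.default_rng(1)
shared = SharedTextKV(rng.normal(size=(16, 8)), rng.normal(size=(16, 8)))
text = rng.normal(size=(10, 16))      # 10 text tokens, dim 16
for _t in range(28):                  # 28 sampling steps, one projection total
    k, v = shared.kv(text)
assert shared.projections == 1
```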
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Nucleus-Image, a sparse mixture-of-experts (MoE) diffusion transformer for text-to-image generation. It claims to set a new Pareto frontier in quality versus efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only ~2B parameters per forward pass from a total capacity of 17B parameters (64 routed experts per layer). Key elements include Expert-Choice Routing, decoupled timestep-aware assignment, joint attention for text KV sharing (excluding text tokens from the backbone), a 1.5B-pair training corpus built via multi-stage filtering/deduplication/aesthetic tiering, progressive resolution curriculum (256→512→1024) with multi-aspect bucketing and progressive sparsification, the Muon optimizer, and a parameter-grouping recipe for timestep-modulated diffusion models. Results are reported without post-training (no RL, DPO, or preference tuning), and the training recipe is released.
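The multi-aspect-ratio bucketing in the curriculum above can be sketched simply: each image is batched with others of the nearest aspect ratio so a batch shares one shape at every resolution stage. The bucket set below is illustrative, not the paper's:

```python
def bucket_for(width, height,
               buckets=((1, 1), (4, 3), (3, 4), (16, 9), (9, 16))):
    """Assign an image to the aspect-ratio bucket closest to its own
    ratio; batches are then drawn within a bucket at each curriculum
    stage (256/512/1024) so all samples in a batch share a shape."""
    ar = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ar))

assert bucket_for(1920, 1080) == (16, 9)   # landscape -> 16:9 bucket
assert bucket_for(1080, 1920) == (9, 16)   # portrait  -> 9:16 bucket
```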
Significance. If the benchmark numbers hold under scrutiny and the performance gains are attributable to the described MoE architecture and training choices rather than data or implementation artifacts, the work would be significant for demonstrating that sparse MoE scaling can deliver high-quality image generation at substantially lower inference cost than dense models with comparable active parameters. The explicit release of the full training recipe, including the Muon optimizer grouping tailored for diffusion models, is a concrete strength that supports reproducibility and community follow-up.
major comments (3)
- [§5 / abstract] The central Pareto-frontier claim (abstract and §5) rests on benchmark scores that are presented without error bars, standard deviations, or details on the number of evaluation runs or seeds. This makes it impossible to assess whether the reported matching/exceeding of leading models on GenEval, DPG-Bench, and OneIG-Bench reflects a reliable improvement or could be within run-to-run variance.
- [§3.2 / §4] No ablation studies are provided for the key architectural innovations (Expert-Choice Routing, decoupled timestep-aware assignment, joint attention for text KV sharing). For example, the claim that decoupled routing improves stability under timestep modulation (abstract and §3.2) would be strengthened by a direct comparison to a standard timestep-modulated MoE baseline on at least one benchmark and a routing-stability metric.
- [§4.1] The training corpus construction (§4.1) is described at a high level (1.5B pairs, 700M unique images, multi-stage filtering, aesthetic tiering). To support the claim that results are achieved without undisclosed data advantages, the paper should report quantitative statistics on the filtering thresholds, deduplication method, and aesthetic score distribution, plus an ablation on a smaller corpus without tiering.
minor comments (3)
- [§3] Notation for expert capacity factor and routing probabilities is introduced without a consolidated table of symbols; adding one would improve readability when comparing the progressive sparsification schedule across stages.
- [Figure 1 / §3] Figure captions for the architecture diagram and routing visualization should explicitly state the active-parameter count per forward pass and the total parameter count to make the efficiency claim immediately verifiable from the figure.
- [§2 / Table 2] The paper cites prior MoE diffusion works but does not include a direct comparison table against other open MoE image models with similar active-parameter budgets; adding this would clarify the novelty of the 2B-active / 17B-total tradeoff.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below. Where the suggestions identify areas for greater rigor or transparency, we have revised the manuscript accordingly to strengthen the work.
Point-by-point responses
-
Referee: The central Pareto-frontier claim (abstract and §5) rests on benchmark scores that are presented without error bars, standard deviations, or details on the number of evaluation runs or seeds. This makes it impossible to assess whether the reported matching/exceeding of leading models on GenEval, DPG-Bench, and OneIG-Bench reflects a reliable improvement or could be within run-to-run variance.
Authors: We agree that statistical variability measures are necessary to substantiate the reliability of the reported benchmark results. In the revised manuscript, we will add error bars and standard deviations for all scores on GenEval, DPG-Bench, and OneIG-Bench. These will be computed from multiple independent evaluation runs (at least three) using distinct random seeds, and we will explicitly state the number of runs and seeds in the evaluation section. revision: yes
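The reporting promised here can be very lightweight: mean plus sample standard deviation over independent seeded runs. A sketch with purely hypothetical scores, not the paper's numbers:

```python
import statistics

def summarize_runs(scores):
    """Mean ± sample standard deviation over independent evaluation
    runs with distinct seeds, as the referee requests."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

geneval_runs = [0.83, 0.84, 0.82]   # hypothetical scores from 3 seeds
mean, std = summarize_runs(geneval_runs)
print(f"GenEval: {mean:.3f} ± {std:.3f} (n={len(geneval_runs)} seeds)")
```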
-
Referee: No ablation studies are provided for the key architectural innovations (Expert-Choice Routing, decoupled timestep-aware assignment, joint attention for text KV sharing). For example, the claim that decoupled routing improves stability under timestep modulation (abstract and §3.2) would be strengthened by a direct comparison to a standard timestep-modulated MoE baseline on at least one benchmark and a routing-stability metric.
Authors: We acknowledge that dedicated ablations would more convincingly isolate the benefits of the proposed components. In the revision, we will add ablation studies comparing the full Nucleus-Image architecture against variants that replace decoupled timestep-aware assignment with standard timestep-modulated MoE routing. These will report performance on GenEval and include a routing-stability metric such as expert activation variance across timesteps. Parallel ablations will cover Expert-Choice Routing and joint attention for text KV sharing. revision: yes
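The routing-stability metric named in this response, expert activation variance across timesteps, could be computed as below. The input layout (timestep bins × experts) is an assumption for illustration:

```python
import numpy as np

def expert_activation_variance(activation_counts):
    """Variance of each expert's activation share across timestep bins,
    averaged over experts. Lower means routing is more stable under
    timestep modulation. Input shape: (n_timestep_bins, n_experts)."""
    shares = activation_counts / activation_counts.sum(axis=1, keepdims=True)
    return float(shares.var(axis=0).mean())

# Perfectly stable routing: identical expert shares in every timestep bin.
stable = np.tile([[30.0, 10.0, 20.0, 40.0]], (5, 1))
assert expert_activation_variance(stable) == 0.0
# Drifting routing: shares change across bins, so the metric is positive.
drifting = np.array([[40, 10, 20, 30],
                     [10, 40, 30, 20],
                     [25, 25, 25, 25]], dtype=float)
assert expert_activation_variance(drifting) > 0.0
```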
-
Referee: The training corpus construction (§4.1) is described at a high level (1.5B pairs, 700M unique images, multi-stage filtering, aesthetic tiering). To support the claim that results are achieved without undisclosed data advantages, the paper should report quantitative statistics on the filtering thresholds, deduplication method, and aesthetic score distribution, plus an ablation on a smaller corpus without tiering.
Authors: We recognize the value of quantitative transparency in data curation. The revised §4.1 will include specific filtering thresholds (e.g., minimum aesthetic and quality scores), the deduplication method (e.g., embedding similarity threshold), and summary statistics or distributions of aesthetic scores in the final 1.5B-pair corpus. We will also add an ablation training a smaller model on a non-tiered version of the corpus and compare its benchmark performance to the tiered version. revision: yes
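The two curation steps this response commits to documenting can be sketched together; every threshold below (aesthetic floor, cosine cutoff) is hypothetical, chosen only to illustrate the kind of statistics §4.1 would report:

```python
import numpy as np

def curate(pairs, embs, aesthetic_min=5.0, dup_cos=0.95):
    """Toy curation pipeline: drop pairs below an aesthetic-score floor,
    then greedily deduplicate by embedding cosine similarity against
    everything already kept. Returns indices of surviving pairs."""
    kept_idx, kept_embs = [], []
    for i, (_caption, score) in enumerate(pairs):
        if score < aesthetic_min:
            continue                                    # aesthetic filter
        e = embs[i] / np.linalg.norm(embs[i])
        if any(float(e @ k) >= dup_cos for k in kept_embs):
            continue                                    # near-duplicate
        kept_idx.append(i)
        kept_embs.append(e)
    return kept_idx

embs = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [1.0, 0.0]])
pairs = [("a", 6.1), ("b", 6.0), ("c", 5.5), ("d", 7.0)]
# "b" and "d" are near-duplicates of "a"; "c" passes both checks.
assert curate(pairs, embs) == [0, 2]
```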
Circularity Check
No significant circularity
full rationale
The paper presents an empirical architecture description, training procedure, and benchmark results for a sparse MoE diffusion model. No mathematical derivations, predictions, or first-principles claims are made that reduce to self-defined quantities, fitted inputs renamed as outputs, or self-citation chains. All performance claims rest on external benchmarks (GenEval, DPG-Bench, OneIG-Bench) and the stated active-parameter count, with no internal reduction of results to inputs by construction. The work is self-contained as an engineering report.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of routed experts
- expert capacity factor
axioms (1)
- domain assumption: Diffusion transformers remain stable under sparse expert routing when timestep modulation is decoupled from expert assignment.
Forward citations
Cited by 2 Pith papers
-
Normalizing Trajectory Models
NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.
-
Normalizing Trajectory Models
NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
Reference graph
Works this paper leans on
-
[1]
Direct preference optimization: Your language model is secretly a reward model, 2023
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023
2023
-
[2]
Learning transferable visual models from natural language supervision, 2021
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021
2021
-
[3]
Meta clip 2: A worldwide scaling recipe, 2025
Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, and Hu Xu. Meta clip 2: A worldwide scaling recipe, 2025
2025
-
[4]
Improved techniques for training single-image gans, 2020
Tobias Hinz, Matthew Fisher, Oliver Wang, and Stefan Wermter. Improved techniques for training single-image gans, 2020
2020
-
[5]
NVIDIA DALI: A gpu-accelerated data loading library, 2024
Nvidia. NVIDIA DALI: A gpu-accelerated data loading library, 2024
2024
-
[6]
NeMo Curator: GPU-accelerated data curation for large language models, 2024
Joseph Jennings, Mostofa Patwary Bhandwaldar, Vibhu Jawa Elazar, Ayush Dattagupta Ryan, Jiwei Liu Zeng, Shankar Rao Nithin, Jared Casper, Ashwath Aithal Gonzalez, et al. NeMo Curator: GPU-accelerated data curation for large language models, 2024
2024
-
[7]
Unveiling and mitigating memorization in text-to-image diffusion models through cross attention, 2025
Jie Ren, Yaxin Li, Shenglai Zeng, Han Xu, Lingjuan Lyu, Yue Xing, and Jiliang Tang. Unveiling and mitigating memorization in text-to-image diffusion models through cross attention, 2025
2025
-
[8]
Captain: Semantic feature injection for memorization mitigation in text-to-image diffusion models, 2025
Tong Zhang, Carlos Hinojosa, and Bernard Ghanem. Captain: Semantic feature injection for memorization mitigation in text-to-image diffusion models, 2025
2025
-
[9]
Waon: Large-scale and high-quality japanese image-text pair dataset for vision-language models, 2025
Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe, and Naoaki Okazaki. Waon: Large-scale and high-quality japanese image-text pair dataset for vision-language models, 2025
2025
-
[10]
The algorithmic gaze of image quality assessment: An audit and trace ethnography of the laion-aesthetics predictor, 2026
Jordan Taylor, William Agnew, Maarten Sap, Sarah E. Fox, and Haiyi Zhu. The algorithmic gaze of image quality assessment: An audit and trace ethnography of the laion-aesthetics predictor, 2026
2026
-
[11]
Lumina-image 2.0: A unified and efficient image generative framework, 2025
Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Manyuan Zhang, Will Beddow, Erwann Millon, Victor Perez, Wenhai Wang, Conghui He, Bo Zhang, Xiaohong Liu, Hongsheng Li, Yu Qiao, Chang Xu, and Peng Gao. Lumina-image 2.0: A unified and efficient image generative framework, 2025
2025
-
[12]
Playground v3: Improving text-to-image alignment with deep-fusion large language models, 2024
Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to-image alignment with deep-fusion large language models, 2024
2024
-
[13]
Qwen-image technical report, 2025
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...
2025
-
[14]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025
2025
-
[15]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023
2023
-
[16]
Qwen3 technical report
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
2025
-
[17]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv:2305.13245, 2023
2023
-
[18]
Scaling diffusion transformers to 16 billion parameters, 2024
Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. Scaling diffusion transformers to 16 billion parameters, 2024
2024
-
[19]
GLU Variants Improve Transformer
Noam Shazeer. GLU variants improve transformer. arXiv:2002.05202, 2020
2020
-
[20]
Mixture-of-experts with expert choice routing, 2022
Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022
2022
-
[21]
EC-DIT: Scaling diffusion transformers with adaptive expert-choice routing, 2024
Haotian Sun, Tao Lei, Bowen Zhang, Yanghao Li, Haoshuo Huang, Ruoming Pang, Bo Dai, and Nan Du. EC-DIT: Scaling diffusion transformers with adaptive expert-choice routing, 2024
2024
-
[22]
Liger kernel: Efficient triton kernels for LLM training, 2024
Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yanning Chen. Liger kernel: Efficient triton kernels for LLM training, 2024
2024
-
[23]
Flashattention-3: Fast and accurate attention with asynchrony and low-precision, 2024
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. arXiv:2407.08608, 2024
-
[24]
Root mean square layer normalization, 2019
Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019
2019
-
[25]
TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training, 2024
Wanchao Huang, Zuchao Luk, Patrick Blöbaum, Shiyang Zeng, Tian Ge, Peng Deng, Himanshu Chauhan, Jian Li, Deven Lim, Helen Lai, Will Deng, Vignesh Bom, Boyuan Roh, et al. TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training, 2024
2024
-
[26]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022
2022
-
[27]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024
2024
-
[28]
ST-MoE: Designing stable and transferable sparse expert models, 2022
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models, 2022
2022
-
[29]
Advancing expert specialization for better MoE, 2025
Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Xinye Cao, Sicong Leng, Qimei Cui, and Xudong Jiang. Advancing expert specialization for better MoE, 2025
2025
-
[30]
ERNIE 4.5 technical report
Baidu-ERNIE-Team. ERNIE 4.5 technical report. https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf, 2025
2025
-
[31]
Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models, 2025
Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models, 2025
2025
-
[32]
Muon: An optimizer for hidden layers in neural networks, 2024
Keller Jordan. Muon: An optimizer for hidden layers in neural networks, 2024
2024
-
[33]
WSM: Decay-free learning rate schedule via checkpoint merging for LLM pre-training, 2025
Changxin Tian, Peng Wang, et al. WSM: Decay-free learning rate schedule via checkpoint merging for LLM pre-training. arXiv preprint arXiv:2507.17634, 2025
2025
-
[34]
Muon is scalable for LLM training, 2025
Jingyuan Liu, Jianlin Zeng, et al. Muon is scalable for LLM training, 2025
2025
-
[35]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019
2019
-
[36]
High-resolution image synthesis with latent diffusion models, 2021
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021
2021
-
[37]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv:2207.12598, 2022
2022
-
[38]
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023
2023
-
[39]
DeepEP: an efficient expert-parallel communication library, 2025
DeepSeek-AI. DeepEP: an efficient expert-parallel communication library, 2025
2025
-
[40]
Flux. https://github.com/black-forest-labs/flux, 2024
BlackForest. Flux. https://github.com/black-forest-labs/flux, 2024
2024
-
[41]
Geneval: An object-focused framework for evaluating text-to-image alignment, 2023
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023
2023
-
[42]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. ELLA: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024
2024
-
[43]
Oneig-bench: Omni-dimensional nuanced evaluation for image generation
Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977, 2025
-
[44]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024
2024
-
[45]
Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR, 2024
2024
-
[46]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024
2024
-
[47]
Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation
Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7739–7751, 2025
2025
-
[48]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025
2025
-
[49]
Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer
Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer. arXiv preprint arXiv:2505.22705, 2025
2025
-
[50]
Seedream 3.0 technical report
Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025
2025
-
[51]
Gpt-image-1, 2025
OpenAI. Gpt-image-1, 2025
2025
-
[52]
Lumina-next: Making lumina-t2x stronger and faster with next-dit, 2024
Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit. Advances in Neural Information Processing Systems, 37:131278–131315, 2024
2024
-
[53]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
2023
-
[54]
Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024
-
[55]
Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024
Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...
2024
-
[56]
Janus: Decoupling visual encoding for unified multimodal understanding and generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025
2025
-
[57]
Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2024
2024
-
[58]
DALL·E 3
OpenAI. DALL·E 3. https://openai.com/research/dall-e-3, September 2023
2023
-
[59]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025
2025
-
[60]
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025
2025
-
[61]
Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer
Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Chengyue Wu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025
-
[62]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025
2025
-
[63]
Kolors2.0
Kuaishou Kolors team. Kolors2.0. https://app.klingai.com/cn/, 2025
2025
-
[64]
OmniGen2: Towards Instruction-Aligned Multimodal Generation
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025
2025
-
[65]
Cogview4, 2025
THUKEG Z.ai. Cogview4, 2025
2025
-
[66]
Imagen 3, 2024
Imagen Team Google. Imagen 3, 2024
2024
-
[67]
Recraft v3. https://www.recraft.ai/, 2024
Recraft. Recraft v3. https://www.recraft.ai/, 2024
2024
-
[68]
Imagen, 2025
Google. Imagen, 2025
2025
-
[69]
Mithril Cloud, 2025
Mithril Cloud. Mithril Cloud, 2025
2025