Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis
Pith reviewed 2026-06-30 06:05 UTC · model grok-4.3
The pith
A token-editing step at inference and a grouped loss fix self-correction and sparsity problems in masked discrete diffusion for text-to-image generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By adding a token-editing mechanism that allows dynamic revision of unmasked tokens during inference and a Grouped Cross-Entropy objective that supplies positive learning signals to tokens neighboring the ground truth in embedding space, together with a fused operator that reduces VRAM consumption, masked discrete diffusion models overcome their lack of self-correction and training-signal sparsity, yielding improved efficiency and higher image fidelity on high-resolution text-to-image tasks.
What carries the argument
The token-editing mechanism, which revises already-unmasked discrete tokens at inference time, paired with the Grouped Cross-Entropy loss that rewards embedding-space neighbors of the correct token.
If this is right
- Discrete models can now iteratively refine an image in the same way continuous models progressively denoise the full latent.
- Larger token vocabularies become practical for generation because the loss no longer starves most tokens of gradient.
- Training runs require less memory in high-vocabulary regimes thanks to the fused operator.
- The resulting generators reach 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.
Where Pith is reading between the lines
- The editing step might reduce the number of sampling iterations needed to reach a given quality level.
- The same grouped-loss idea could be tested on other discrete generative tasks such as audio or video token sequences.
- If token editing works reliably, future work could explore learned policies for when and which tokens to rewrite.
Load-bearing premise
The token-editing mechanism can be applied at inference without introducing new inconsistencies or artifacts that cancel out the self-correction benefit.
What would settle it
Generate the same prompts with and without the token-editing step and find that human preference or GenEval scores do not rise or that visible artifacts increase when editing is enabled.
read the original abstract
We propose Nemotron-Labs-Diffusion-Image, a state-of-the-art masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. Compared with prior work on masked image generation, Nemotron-Labs-Diffusion-Image addresses two key challenges. First, unlike continuous diffusion models which progressively refine latent representations across the entire image, standard MDMs lack self-correcting capability because discrete tokens cannot be modified once they are unmasked. Second, although increasing the vocabulary size of discrete image tokenizers improves reconstruction fidelity, it introduces optimization difficulties for generative modeling as the per-token training signal becomes increasingly sparse. To address the first challenge, Nemotron-Labs-Diffusion-Image incorporates a token-editing mechanism that enables the model to dynamically revise already-unmasked tokens during inference, similar to how a sculptor iteratively refines their work. To tackle the second challenge, we propose a Grouped Cross-Entropy (GCE) objective that assigns positive learning signals to tokens neighboring the ground truth in embedding space, thereby alleviating signal sparsity. To further improve training efficiency, we implement a custom fused operator for GCE that significantly reduces VRAM usage in large-vocabulary settings. Experimental results demonstrate that these innovations substantially improve both training efficiency and image fidelity of masked discrete image generators, achieving a score of 0.90 on GenEval, 86.9 on DPG and 10.76 of HPSv3.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Nemotron-Labs-Diffusion-Image, a masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. It identifies two challenges in prior MDMs: lack of self-correction because unmasked discrete tokens cannot be revised, and optimization difficulties from sparse per-token signals with large vocabularies. The work introduces a token-editing mechanism for dynamic revision of unmasked tokens at inference and a Grouped Cross-Entropy (GCE) objective that assigns positive signals to embedding-space neighbors of the ground-truth token, plus a fused operator to reduce VRAM usage. It reports scores of 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3 as evidence of improved efficiency and fidelity.
Significance. If the claims are substantiated, the token-editing mechanism and GCE objective would represent targeted advances for MDMs, addressing self-correction and sparsity issues that currently limit discrete models relative to continuous diffusion approaches. The fused operator for GCE could offer a practical efficiency gain in large-vocabulary regimes. These elements, if validated with proper controls, would be of interest to the image synthesis community.
major comments (3)
- [Abstract] Abstract: The reported benchmark scores (0.90 GenEval, 86.9 DPG, 10.76 HPSv3) are presented with no experimental details, baselines, ablations, training configurations, dataset information, or error analysis, making it impossible to determine whether the proposed mechanisms drive the claimed improvements.
- [Abstract] Abstract (first challenge paragraph): The token-editing mechanism is described as allowing the model to revise already-unmasked tokens at inference, yet the training is characterized as standard masked discrete diffusion with no auxiliary losses or edit supervision mentioned; this leaves an unverified train-inference gap that risks out-of-distribution states and artifacts rather than reliable self-correction.
- [Abstract] Abstract (second challenge paragraph): The Grouped Cross-Entropy (GCE) objective is introduced to mitigate signal sparsity by assigning positive signals to neighboring tokens, but no equations, pseudocode, or ablation results are supplied to demonstrate its formulation, gradient behavior, or quantitative effect on optimization or final metrics.
minor comments (1)
- [Abstract] Abstract: The abstract asserts 'state-of-the-art' performance without naming the specific prior MDM baselines or metric definitions used for comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that the abstract is overly concise and will revise it to provide more context on experimental details, the mechanisms, and their validation while preserving brevity. Below we respond point-by-point to the major comments.
read point-by-point responses
-
Referee: [Abstract] Abstract: The reported benchmark scores (0.90 GenEval, 86.9 DPG, 10.76 HPSv3) are presented with no experimental details, baselines, ablations, training configurations, dataset information, or error analysis, making it impossible to determine whether the proposed mechanisms drive the claimed improvements.
Authors: The abstract is intentionally brief. The full manuscript contains a dedicated Experiments section (Section 4) that reports all requested information: training configurations, datasets (including LAION and internal high-res data), baselines (e.g., comparisons to prior MDMs and continuous diffusion models), ablations isolating token-editing and GCE, and error analysis via qualitative examples and metric breakdowns. We will revise the abstract to include a short clause referencing these controls and noting that the gains are measured against strong baselines. revision: yes
-
Referee: [Abstract] Abstract (first challenge paragraph): The token-editing mechanism is described as allowing the model to revise already-unmasked tokens at inference, yet the training is characterized as standard masked discrete diffusion with no auxiliary losses or edit supervision mentioned; this leaves an unverified train-inference gap that risks out-of-distribution states and artifacts rather than reliable self-correction.
Authors: The token-editing procedure is an inference-only technique that iteratively re-masks and re-predicts selected tokens using the same trained MDM; no auxiliary losses or edit-specific supervision are required because the model was already trained to denoise arbitrary partial masks. This is analogous to iterative refinement in continuous diffusion. We acknowledge the referee's concern about potential distribution shift and will add a clarifying sentence in the revised abstract plus a short discussion in Section 3.1 on why the mechanism stays in-distribution. We will also include an ablation measuring artifact rates with and without editing. revision: partial
-
Referee: [Abstract] Abstract (second challenge paragraph): The Grouped Cross-Entropy (GCE) objective is introduced to mitigate signal sparsity by assigning positive signals to neighboring tokens, but no equations, pseudocode, or ablation results are supplied to demonstrate its formulation, gradient behavior, or quantitative effect on optimization or final metrics.
Authors: The abstract summarizes GCE at a high level. The full formulation (including the mathematical definition, grouping strategy in embedding space, gradient analysis, and the fused operator) appears in Section 3.2, with pseudocode in Algorithm 1 and ablations in Section 4.3 quantifying its impact on convergence speed and final metrics. We will update the abstract to briefly state the core idea and direct readers to the detailed treatment in the methods section. revision: yes
Circularity Check
No circularity in claimed derivation
full rationale
The paper introduces a token-editing mechanism and Grouped Cross-Entropy objective as innovations for masked discrete diffusion, then reports empirical scores on GenEval, DPG, and HPSv3. No equations, fitted parameters, or self-citations are shown that reduce these outcomes to the inputs by construction. The derivation chain consists of standard MDM training augmented by the proposed components, with results presented as experimental outcomes rather than tautological predictions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Standard MDMs lack self-correcting capability because discrete tokens cannot be modified once unmasked.
- domain assumption Larger vocabulary sizes introduce optimization difficulties due to increasingly sparse per-token training signals.
invented entities (2)
-
token-editing mechanism
no independent evidence
-
Grouped Cross-Entropy (GCE) objective
no independent evidence
Reference graph
Works this paper leans on
-
[1]
OpenAI. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Flux.https://github.com/black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024
2024
-
[3]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 11 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis
2022
-
[6]
Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020
2020
-
[7]
Jinbin Bai, Tian Ye, Wei Chow, Enxin Song, Xiangtai Li, Zhen Dong, Lei Zhu, and Shuicheng Yan. Meissonic: Revitalizing masked generative transformers for efficient high-resolution text-to-image synthesis.arXiv preprint arXiv:2410.08261, 2024
-
[8]
Maskgit: Masked generative image transformer
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022
2022
-
[9]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, and Jason Kuen. Lavida-o: Elastic masked diffusion models for unified multimodal understanding and generation.arXiv preprint arXiv:2509.19244, 2025
-
[11]
Shufan Li, Yuchen Zhu, Jiuxiang Gu, Kangning Liu, Zhe Lin, Yongxin Chen, Molei Tao, Aditya Grover, and Jason Kuen. Lavida-r1: Advancing reasoning for unified multimodal diffusion language models.arXiv preprint arXiv:2602.14147, 2026
-
[12]
MMaDA: Multimodal Large Diffusion Language Models
Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, et al. Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model.arXiv preprint arXiv:2505.23606, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Mario Michael Krell, Matej Kosec, Sergio P Perez, and Andrew Fitzgibbon. Efficient sequence packing with- out cross-contamination: Accelerating large language models without impacting performance.arXiv preprint arXiv:2107.02027, 2021
-
[15]
Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, and Jason Kuen. Sparse-lavida: Sparse multimodal discrete diffusion language models.arXiv preprint arXiv:2512.14008, 2025
-
[16]
dkv-cache: The cache for diffusion language models
Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. dkv-cache: The cache for diffusion language models. arXiv preprint arXiv:2505.15781, 2025
-
[17]
Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jin- sheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Scaling the codebook size of vq-gan to 100,000 with a utilization rate of 99%.Advances in Neural Information Processing Systems, 37:12612–12635, 2024
Lei Zhu, Fangyun Wei, Yanye Lu, and Dong Chen. Scaling the codebook size of vq-gan to 100,000 with a utilization rate of 99%.Advances in Neural Information Processing Systems, 37:12612–12635, 2024
2024
-
[20]
Yifan Chang, Jie Qin, Limeng Qiao, Xiaofeng Wang, Zheng Zhu, Lin Ma, and Xingang Wang. Scalable training for vector-quantized networks with 100% codebook utilization.arXiv preprint arXiv:2509.10140, 2025
-
[21]
Scalable image tokenization with index backpropagation quantization
Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16037–16046, 2025
2025
-
[22]
Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Aditya Grover, and Jason Kuen. Snce: Geometry-aware supervision for scalable discrete image generation.arXiv preprint arXiv:2603.15150, 2026
-
[23]
Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023
2023
-
[24]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu Ella. Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 5(7):16, 2024. 12 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024
2024
-
[26]
Finite Scalar Quantization: VQ-VAE Made Simple
Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple.arXiv preprint arXiv:2309.15505, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Neural discrete representation learning.Advances in neural information processing systems, 30, 2017
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017
2017
-
[29]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[30]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[31]
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Unified discrete diffusion for simultaneous vision-language generation.arXiv, 2022
Minghui Hu, Chuanxia Zheng, Heliang Zheng, Tat-Jen Cham, Chaoyue Wang, Zuopeng Yang, Dacheng Tao, and Ponnuthurai N Suganthan. Unified discrete diffusion for simultaneous vision-language generation.arXiv, 2022
2022
-
[34]
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference.arXiv preprint arXiv:2508.02193, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Edit flows: Flow matching with edit operations,
Marton Havasi, Brian Karrer, Itai Gat, and Ricky TQ Chen. Edit flows: Flow matching with edit operations. arXiv preprint arXiv:2506.09018, 2025
- [36]
-
[37]
Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026
Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026
-
[38]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
Nemotron-labs-diffusion: A tri-mode language model unifying autoregressive, diffusion, and self-speculation decoding.preprint, May 2026
Yonggan Fu, Lexington Whalen, Abhinav Garg, Chengyue Wu, Maksim Khadkevich, Nicolai Oswald, Enze Xie, Daniel Egert, Sharath Turuvekere Sreenivas, Shizhe Diao, Chenhan Yu, Ye Yu, Weijia Chen, Sajad Norouzi, Shiyi Lan, Ligeng Zhu, Jin Wang, Jindong Jiang, Morteza Mardani, Mehran Maghoumi, Song Han, Ante Jukic, Nima Tajbakhsh, Jan Kautz, and Pavlo Molchanov....
2026
-
[40]
Beyond masks: Efficient, flexible diffusion language models via deletion-insertion processes
Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, et al. Beyond masks: Efficient, flexible diffusion language models via deletion-insertion processes. arXiv preprint arXiv:2603.23507, 2026
-
[41]
arXiv preprint arXiv:2512.15596 , year =
Shuibai Zhang, Fred Zhangzhi Peng, Yiheng Zhang, Jin Pan, and Grigorios G Chrysos. Corrective diffusion language models.arXiv preprint arXiv:2512.15596, 2025
-
[42]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 13 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Hpsv3: Towards wide-spectrum human preference score
Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025
2025
-
[45]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024
2024
-
[46]
Dall·e 3.https://openai.com/index/dall-e-3/, 2023
OpenAI. Dall·e 3.https://openai.com/index/dall-e-3/, 2023
2023
-
[47]
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023
2023
-
[49]
Imagenet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015
2015
-
[50]
Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024
Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024
2024
-
[51]
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning.arXiv preprint arXiv:2505.16933, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[53]
Lavida: A large diffusion language model for multimodal understanding
Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. Lavida: A large diffusion language model for multimodal understanding. arXiv preprint arXiv:2505.16839, 2025
-
[54]
Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022
2022
-
[55]
Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022
Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022
2022
-
[56]
Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025
-
[57]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021
2021
-
[58]
Laion-aesthetics
Christoph Schuhmann. Laion-aesthetics. https://laion.ai/blog/laion-aesthetics/, 2022. Accessed: 2024 - 03 - 06
2022
-
[59]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 14 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis A. Additional Technical Details A.1. Formulation of M...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
(Nvidia Open Model License) Datasets:LAION [ 54] (MIT), COYO [55] (CC-BY-4.0), MJHQ [25] (CC-BY-4.0), BLIP3o-60k [43] (Apache-2.0), and ShareGPT4o-Image [56] (CC-BY-4.0). 20 Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis a beautiful sunset, bright and colourful, ultra realistic, UHD, 8k fluffy white ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.