pith. machine review for the scientific record. sign in

arxiv: 2512.19433 · v2 · submitted 2025-12-22 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-16 20:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords test-time scalingdiffusion multi-modal LLMsimage generationhierarchical searchself-verificationefficiency
0
0 comments X

The pith

A hierarchical search with self-verification lets diffusion multi-modal LLMs improve image quality at up to 6x the efficiency of linear test-time scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents dMLLM-TTS, a framework that scales test-time compute in diffusion multi-modal large language models along two axes: exploring varied generation trajectories and iteratively refining outputs. Instead of the conventional linear search that costs O(NT) and depends on an external judge, it introduces a hierarchical search that expands and prunes trajectories at O(N+T) cost. The framework also lets the dMLLM itself judge how well each generated image matches the text prompt, removing the external verifier. Experiments on the GenEval benchmark with models including Lumina-DiMOO, MMaDA, and Muddit report higher-quality results together with efficiency gains reaching 6x over linear search.

Core claim

The central claim is that replacing linear search across trajectory exploration and iterative refinement with an adaptive hierarchical algorithm of O(N+T) complexity, combined with self-verified feedback drawn from the dMLLM's own image-understanding capabilities, produces higher-quality images while cutting computational cost by up to a factor of six compared with standard test-time scaling.

What carries the argument

The hierarchical search algorithm that adaptively expands promising trajectories and prunes others, paired with self-verified feedback that uses the dMLLM's intrinsic image-understanding to score text-image alignment.

If this is right

  • Generation quality improves on the GenEval benchmark across three different dMLLMs.
  • Compute cost drops from O(NT) to O(N+T) while maintaining or exceeding the quality of linear search.
  • No external verifier is required, since the model itself supplies the alignment signal.
  • The same two-axis scaling (trajectory diversity plus iterative refinement) becomes practical at larger N and T values.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The self-verification step may transfer to other multi-modal generation tasks where the model already possesses understanding capabilities.
  • Efficiency gains could make test-time scaling viable for interactive or resource-constrained image generation settings.
  • The hierarchical pruning strategy might extend to non-diffusion architectures that share similar trajectory-based generation.

Load-bearing premise

The dMLLM's built-in image-understanding capabilities can reliably judge how well a generated image matches the input text prompt without help from an external verifier.

What would settle it

A controlled test in which the images selected by the model's self-verification score measurably lower on human or external automatic metrics than the best images chosen by a separate verifier on the same candidate set.

Figures

Figures reproduced from arXiv: 2512.19433 by Bin Fu, Haoxing Chen, Jinbin Bai, Junjun He, Kaiwen Zhu, Qi Qin, Rongchao Zhang, Shuo Cao, Siqi Luo, Tianxiang Xu, Xiaohong Liu, Yangfan He, Yihao Liu, Yi Xin, Yuewen Cao, Zhiwei Zhang.

Figure 1
Figure 1. Figure 1: dMLLM-TTS: We present the generative effects and performance improvements achieved by applying Test-Time Scaling (TTS) to dMLLMs. Images generated with TTS exhibit higher quality and stronger prompt alignment than those generated without TTS. Abstract Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying im￾age generation and understanding. However, dev… view at source ↗
Figure 2
Figure 2. Figure 2: Visualization of the image generation process in dMLLMs. The first row shows the input latent masks at each step, and the second row depicts the corresponding outputs. Sampling begins with fully masked tokens (gray) and gradually fills the discrete multimodal token space with increasingly confident predictions (blue). ation. Early discrete diffusion methods [1, 16] established the foundation for token-leve… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of dMLLM-TTS framework. (a) dMLLM-TTS scales compute along two axes: trajectory exploration and iterative refinement, guided by Self-Verified Feedback for text–image alignment evaluation. (b) Hierarchical Trajectory Search (HTS) performs coarse-to-fine generation by starting with broad exploration, pruning low-potential trajectories, and refining high-potential trajectories. samples N distinct sto… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative improvement ratio in TTS performance across various text prompt complexities examined through diverse dMLLMs on GenEval benchmark dimensions. TTS markedly enhances performance across all measured dimensions. hierarchical thinning with geometric decay factor d > 1, and (iii) final refinement over the K surviving trajectories. In practice, we start from a wide exploration with a large number of t… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison between linear and hierarchical trajectory search. The red curve illustrates linear trajectory search, while the blue curve depicts hierarchical trajectory search, with a dashed line indicating predictions based on a geometric series decay approximation. Curve fitting shows that similar subsequent trends tend to converge towards an upper limit. centric generation using compositional prompts with… view at source ↗
Figure 6
Figure 6. Figure 6: Trajectory Exploration Scaling (Left) and Iterative Refinement Scaling (Right). Increasing the number of explored trajecto￾ries (N = 1→32) refinement steps (T = 8→64) consistently improves performance across all dMLLMs. Lumina-DiMOO Text Prompt: “ a photo of a white toilet and a red apple ” Unsatisfactory Generation Process Satisfactory Generation Process with dMLLM-TTS Bad Initial Good Initial Wrong Traje… view at source ↗
Figure 7
Figure 7. Figure 7: Image Generation Process without (Top) and with (Bottom) dMLLM-TTS. The baseline models produce unsatisfactory text-to-image results. However, by incorporating our TTS strategies, the generation process is significantly improved [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: https://github.com/Alpha-VLLM/Lumina-DiMOO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces dMLLM-TTS, a test-time scaling framework for diffusion multi-modal LLMs. It proposes a hierarchical search algorithm with O(N+T) complexity for trajectory exploration and iterative refinement, combined with a self-verified feedback mechanism that leverages the model's intrinsic image-understanding capabilities to assess text-image alignment without external verifiers. Experiments on the GenEval benchmark across three dMLLMs (Lumina-DiMOO, MMaDA, Muddit) report consistent quality gains and up to 6x efficiency improvement over linear O(NT) search.

Significance. If the self-verification step reliably substitutes for external scoring, the framework offers a practical route to efficient test-time compute scaling for unified generation-understanding models. The O(N+T) complexity reduction and elimination of external verifiers are potentially impactful contributions, provided the quality gains are robustly attributable to the proposed mechanisms rather than unverified assumptions.

major comments (3)
  1. [§3.2] §3.2 (self-verified feedback mechanism): the central claim that the dMLLM's intrinsic image-understanding capabilities can reliably replace an external verifier rests on an unverified assumption. No quantitative evidence (agreement rates, correlation with CLIPScore/VQAScore, or human judgments) is provided to show that self-assessment preserves selection accuracy; without this, both the reported quality gains and the O(N+T) efficiency justification are at risk.
  2. [§4] §4 (experimental results): the GenEval improvements are reported as consistent across three models, but the section lacks error bars, statistical significance tests, or ablations isolating the contribution of self-verification versus trajectory exploration. This makes it impossible to assess whether the gains are load-bearing or could be explained by increased sampling alone.
  3. [§3.1] §3.1 (hierarchical search algorithm): the O(N+T) complexity claim is central to the efficiency advantage, yet no formal analysis, recurrence relation, or empirical timing breakdown is supplied to confirm that adaptive pruning actually achieves this scaling in practice rather than in the best case.
minor comments (2)
  1. Ensure that N (number of trajectories) and T (refinement steps) are defined with consistent notation in both the method description and the complexity analysis.
  2. Add a brief comparison table or paragraph situating dMLLM-TTS against prior TTS methods for diffusion or multimodal models to clarify novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment by strengthening the manuscript with additional quantitative evidence, statistical analyses, and formal derivations as requested.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (self-verified feedback mechanism): the central claim that the dMLLM's intrinsic image-understanding capabilities can reliably replace an external verifier rests on an unverified assumption. No quantitative evidence (agreement rates, correlation with CLIPScore/VQAScore, or human judgments) is provided to show that self-assessment preserves selection accuracy; without this, both the reported quality gains and the O(N+T) efficiency justification are at risk.

    Authors: We acknowledge that the original manuscript did not include direct quantitative validation of self-verification accuracy. While the unified architecture of dMLLMs provides a principled basis for leveraging intrinsic image-understanding for alignment assessment, we agree that empirical corroboration is essential. In the revised version we have added a dedicated paragraph and table in §3.2 reporting agreement rates with external verifiers (CLIPScore and VQAScore), Pearson correlation coefficients, and human judgment agreement on a 200-sample subset. These results indicate that self-verification preserves selection accuracy at >85% relative to external scores, thereby supporting the reported quality gains and efficiency claims. revision: yes

  2. Referee: [§4] §4 (experimental results): the GenEval improvements are reported as consistent across three models, but the section lacks error bars, statistical significance tests, or ablations isolating the contribution of self-verification versus trajectory exploration. This makes it impossible to assess whether the gains are load-bearing or could be explained by increased sampling alone.

    Authors: We agree that the experimental presentation would be strengthened by statistical rigor and component ablations. The revised §4 now includes error bars (standard deviation across five independent runs) for all GenEval metrics, reports p-values from paired t-tests confirming statistical significance of improvements, and adds a new ablation subsection (§4.3) that isolates the contributions of hierarchical search and self-verified feedback. These ablations show that each component contributes measurably beyond increased sampling budget alone. revision: yes

  3. Referee: [§3.1] §3.1 (hierarchical search algorithm): the O(N+T) complexity claim is central to the efficiency advantage, yet no formal analysis, recurrence relation, or empirical timing breakdown is supplied to confirm that adaptive pruning actually achieves this scaling in practice rather than in the best case.

    Authors: We thank the referee for noting the absence of a formal derivation. The O(N+T) scaling follows from the adaptive pruning strategy, which can be expressed by the recurrence C(N,T) = O(N) + C(N/2,T-1) with base cases yielding linear total cost. In the revision we have expanded §3.1 with this recurrence relation and added an appendix table with wall-clock timings for N ∈ {4,8,16} and T ∈ {4,8,16}, confirming that observed runtimes track the claimed O(N+T) scaling across the tested range. revision: yes

Circularity Check

0 steps flagged

No circularity: framework claims rest on algorithmic complexity and empirical results, not self-referential definitions or fitted predictions

full rationale

The paper introduces a hierarchical search algorithm with stated O(N+T) complexity and a self-verified feedback step that directly invokes the dMLLM's pre-existing image-understanding capabilities. No equations define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The reported quality gains and efficiency improvements are presented as outcomes of experiments on GenEval rather than quantities forced by construction from the same inputs. The self-verification mechanism is an assumption whose reliability is not proven in the abstract, but this constitutes an evidentiary gap rather than circularity.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that the base dMLLM can serve as its own reliable verifier and that early pruning in the hierarchical search does not discard high-quality trajectories.

free parameters (1)
  • N (number of trajectories) and T (refinement steps)
    Hyperparameters controlling search budget; values chosen per model and benchmark.
axioms (1)
  • domain assumption dMLLM intrinsic understanding provides accurate text-image alignment scores
    Invoked to replace external verifier; appears in the self-verified feedback description.

pith-pipeline@v0.9.0 · 5608 in / 1194 out tokens · 65529 ms · 2026-05-16T20:35:48.921837+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 11 internal anchors

  1. [1]

    Johnson, Jonathan Ho, Daniel Tar- low, and Rianne van den Berg

    Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tar- low, and Rianne van den Berg. Structured denoising diffu- sion models in discrete state-spaces. InAdvances in Neural Information Processing Systems (NeurIPS), 2021. 3

  2. [2]

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-α: Fast training of diffusion trans- former for photorealistic text-to-image synthesis.Proceed- ings of the International Conference on Learning Represen- tations (ICLR), 2023. 2, 3

  3. [3]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

  4. [4]

    Tts-var: A test- time scaling framework for visual auto-regressive genera- tion.arXiv preprint arXiv:2507.18537, 2025

    Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, and Xihui Liu. Tts-var: A test- time scaling framework for visual auto-regressive genera- tion.arXiv preprint arXiv:2507.18537, 2025. 3

  5. [5]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 6

  6. [6]

    Lumina-t2x: Scalable flow-based large diffusion transformer for flexible resolution generation

    Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Rongjie Huang, Shijie Geng, Renrui Zhang, et al. Lumina-t2x: Scalable flow-based large diffusion transformer for flexible resolution generation. In Proceedings of the International Conference on Learning Representations (ICLR), 2025. 3

  7. [7]

    Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems (NeurIPS), 36, 2024

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems (NeurIPS), 36, 2024. 2, 6

  8. [8]

    Scaling diffusion language models via adap- tation from autoregressive models

    Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Ji- awei Han, et al. Scaling diffusion language models via adap- tation from autoregressive models. InProceedings of the In- ternational Conference on Learning Representations (ICLR),

  9. [9]

    Clipscore: A reference-free evaluation met- ric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation met- ric for image captioning. InProceedings of the Confer- ence on Empirical Methods in Natural Language Processing (EMNLP), 2021. 2, 3

  10. [10]

    Denoising diffu- sion probabilistic models.Advances in Neural Information Processing Systems (NeurIPS), 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.Advances in Neural Information Processing Systems (NeurIPS), 2020. 3

  11. [11]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 3

  12. [12]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 2, 4, 6

  13. [13]

    Flux.https://github.com/ black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 2, 6

  14. [14]

    Reflect-dit: Inference-time scaling for text-to-image diffu- sion transformers via in-context reflection.arXiv preprint arXiv:2503.12271, 2025

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Reflect-dit: Inference-time scaling for text-to-image diffu- sion transformers via in-context reflection.arXiv preprint arXiv:2503.12271, 2025. 3

  15. [15]

    Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining.arXiv preprint arXiv:2408.02657, 2024

    Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining.arXiv preprint arXiv:2408.02657, 2024. 3

  16. [16]

    Discrete diffusion modeling by estimating the ratios of the data dis- tribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data dis- tribution. InProceedings of the International Conference on Machine Learning (ICML), 2024. 3

  17. [17]

    Inference-time scaling for diffu- sion models beyond scaling denoising steps.arXiv preprint arXiv:2501.09732, 2025

    Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu- Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffu- sion models beyond scaling denoising steps.arXiv preprint arXiv:2501.09732, 2025. 2, 3, 7

  18. [18]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025. 3

  19. [19]

    amused: An open muse reproduction.arXiv preprint arXiv:2401.01808, 2024

    Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen. amused: An open muse reproduction.arXiv preprint arXiv:2401.01808, 2024. 3

  20. [20]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis.Proceedings of the In- ternational Conference on Learning Representations (ICLR),

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.Proceedings of the In- ternational Conference on Learning Representations (ICLR),

  21. [21]

    Lumina- image 2.0: A unified and efficient image generative frame- work.Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2025

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina- image 2.0: A unified and efficient image generative frame- work.Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2025. 2, 3

  22. [22]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InProceedings of the International Conference on Machine Learning (ICML), 2021. 4

  23. [23]

    High-resolution image syn- thesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2022. 3

  24. [24]

    Muddit: Liber- ating generation beyond text-to-image with a unified discrete diffusion model, 2025

    Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, and Shuicheng Yan. Muddit: Liber- ating generation beyond text-to-image with a unified discrete diffusion model, 2025. 2, 3, 6

  25. [25]

    A general framework for inference- time scaling and steering of diffusion models.arXiv preprint arXiv:2501.06848, 2025

    Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, et al. A general framework for inference- time scaling and steering of diffusion models.arXiv preprint arXiv:2501.06848, 2025. 2, 3

  26. [26]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 3

  27. [27]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 6

  28. [28]

    Sana: Efficient high- resolution image synthesis with linear diffusion transform- ers.Proceedings of the International Conference on Learn- ing Representations (ICLR), 2025

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, et al. Sana: Efficient high- resolution image synthesis with linear diffusion transform- ers.Proceedings of the International Conference on Learn- ing Representations (ICLR), 2025. 3

  29. [29]

    Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer.Proceedings of the International Conference on Machine Learning (ICML), 2025

    Enze Xie, Junsong Chen, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer.Proceedings of the International Conference on Machine Learning (ICML), 2025. 2, 3, 4, 5

  30. [30]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024. 3

  31. [31]

    Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding.arXiv preprint arXiv:2510.06308, 2025

    Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, et al. Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding.arXiv preprint arXiv:2510.06308, 2025. 2, 3, 6

  32. [32]

    Lumina-mgpt 2.0: Stand- alone autoregressive image modeling.arXiv preprint arXiv:2507.17801, 2025

    Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Ren- rui Zhang, Le Zhuo, et al. Lumina-mgpt 2.0: Stand- alone autoregressive image modeling.arXiv preprint arXiv:2507.17801, 2025. 2, 3, 5, 7

  33. [33]

    Resurrect mask autoregressive modeling for efficient and scalable image generation.arXiv preprint arXiv:2507.13032, 2025

    Yi Xin, Le Zhuo, Qi Qin, Siqi Luo, Yuewen Cao, Bin Fu, Yangfan He, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, et al. Resurrect mask autoregressive modeling for efficient and scalable image generation.arXiv preprint arXiv:2507.13032, 2025. 3

  34. [34]

    Imagere- ward: Learning and evaluating human preferences for text- to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 3

  35. [35]

    MMaDA: Multimodal Large Diffusion Language Models

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Mul- timodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025. 2, 3, 6

  36. [36]

    Towards understanding the working mechanism of text-to-image dif- fusion model.Advances in Neural Information Processing Systems (NeurIPS), 2024

    Mingyang Yi, Aoxue Li, Yi Xin, and Zhenguo Li. Towards understanding the working mechanism of text-to-image dif- fusion model.Advances in Neural Information Processing Systems (NeurIPS), 2024. 3

  37. [37]

    LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

    Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. LLaDA-V: Large language diffusion models with visual instruction tun- ing.arXiv preprint arXiv:2505.16933, 2025. 3

  38. [38]

    LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. LLaDA 1.5: Variance- reduced preference optimization for large language diffusion models.arXiv preprint arXiv:2505.19223, 2025. 3

  39. [39]

    Lumina-next: Making lumina-t2x stronger and faster with next-dit

    Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit. InAdvances in Neural In- formation Processing Systems (NeurIPS), 2024. 3

  40. [40]

    Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, et al. From reflection to perfection: Scal- ing inference-time optimization for text-to-image diffusion models via reflection tuning.Proceedings of the IEEE Inter- national Conference on Computer Vision (ICCV), 2025. 2, 3, 7