arxiv: 2512.19433 · v2 · submitted 2025-12-22 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

Yi Xin , Siqi Luo , Tianxiang Xu , Qi Qin , Haoxing Chen , Kaiwen Zhu , Zhiwei Zhang , Yangfan He

show 8 more authors

Rongchao Zhang Jinbin Bai Shuo Cao Bin Fu Junjun He Yihao Liu Yuewen Cao Xiaohong Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-16 20:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords test-time scalingdiffusion multi-modal LLMsimage generationhierarchical searchself-verificationefficiency

0 comments

The pith

A hierarchical search with self-verification lets diffusion multi-modal LLMs improve image quality at up to 6x the efficiency of linear test-time scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents dMLLM-TTS, a framework that scales test-time compute in diffusion multi-modal large language models along two axes: exploring varied generation trajectories and iteratively refining outputs. Instead of the conventional linear search that costs O(NT) and depends on an external judge, it introduces a hierarchical search that expands and prunes trajectories at O(N+T) cost. The framework also lets the dMLLM itself judge how well each generated image matches the text prompt, removing the external verifier. Experiments on the GenEval benchmark with models including Lumina-DiMOO, MMaDA, and Muddit report higher-quality results together with efficiency gains reaching 6x over linear search.

Core claim

The central claim is that replacing linear search across trajectory exploration and iterative refinement with an adaptive hierarchical algorithm of O(N+T) complexity, combined with self-verified feedback drawn from the dMLLM's own image-understanding capabilities, produces higher-quality images while cutting computational cost by up to a factor of six compared with standard test-time scaling.

What carries the argument

The hierarchical search algorithm that adaptively expands promising trajectories and prunes others, paired with self-verified feedback that uses the dMLLM's intrinsic image-understanding to score text-image alignment.

If this is right

Generation quality improves on the GenEval benchmark across three different dMLLMs.
Compute cost drops from O(NT) to O(N+T) while maintaining or exceeding the quality of linear search.
No external verifier is required, since the model itself supplies the alignment signal.
The same two-axis scaling (trajectory diversity plus iterative refinement) becomes practical at larger N and T values.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The self-verification step may transfer to other multi-modal generation tasks where the model already possesses understanding capabilities.
Efficiency gains could make test-time scaling viable for interactive or resource-constrained image generation settings.
The hierarchical pruning strategy might extend to non-diffusion architectures that share similar trajectory-based generation.

Load-bearing premise

The dMLLM's built-in image-understanding capabilities can reliably judge how well a generated image matches the input text prompt without help from an external verifier.

What would settle it

A controlled test in which the images selected by the model's self-verification score measurably lower on human or external automatic metrics than the best images chosen by a separate verifier on the same candidate set.

Figures

Figures reproduced from arXiv: 2512.19433 by Bin Fu, Haoxing Chen, Jinbin Bai, Junjun He, Kaiwen Zhu, Qi Qin, Rongchao Zhang, Shuo Cao, Siqi Luo, Tianxiang Xu, Xiaohong Liu, Yangfan He, Yihao Liu, Yi Xin, Yuewen Cao, Zhiwei Zhang.

**Figure 1.** Figure 1: dMLLM-TTS: We present the generative effects and performance improvements achieved by applying Test-Time Scaling (TTS) to dMLLMs. Images generated with TTS exhibit higher quality and stronger prompt alignment than those generated without TTS. Abstract Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, dev… view at source ↗

**Figure 2.** Figure 2: Visualization of the image generation process in dMLLMs. The first row shows the input latent masks at each step, and the second row depicts the corresponding outputs. Sampling begins with fully masked tokens (gray) and gradually fills the discrete multimodal token space with increasingly confident predictions (blue). ation. Early discrete diffusion methods [1, 16] established the foundation for token-leve… view at source ↗

**Figure 3.** Figure 3: Overview of dMLLM-TTS framework. (a) dMLLM-TTS scales compute along two axes: trajectory exploration and iterative refinement, guided by Self-Verified Feedback for text–image alignment evaluation. (b) Hierarchical Trajectory Search (HTS) performs coarse-to-fine generation by starting with broad exploration, pruning low-potential trajectories, and refining high-potential trajectories. samples N distinct sto… view at source ↗

**Figure 4.** Figure 4: Qualitative improvement ratio in TTS performance across various text prompt complexities examined through diverse dMLLMs on GenEval benchmark dimensions. TTS markedly enhances performance across all measured dimensions. hierarchical thinning with geometric decay factor d > 1, and (iii) final refinement over the K surviving trajectories. In practice, we start from a wide exploration with a large number of t… view at source ↗

**Figure 5.** Figure 5: Comparison between linear and hierarchical trajectory search. The red curve illustrates linear trajectory search, while the blue curve depicts hierarchical trajectory search, with a dashed line indicating predictions based on a geometric series decay approximation. Curve fitting shows that similar subsequent trends tend to converge towards an upper limit. centric generation using compositional prompts with… view at source ↗

**Figure 6.** Figure 6: Trajectory Exploration Scaling (Left) and Iterative Refinement Scaling (Right). Increasing the number of explored trajectories (N = 1→32) refinement steps (T = 8→64) consistently improves performance across all dMLLMs. Lumina-DiMOO Text Prompt: “ a photo of a white toilet and a red apple ” Unsatisfactory Generation Process Satisfactory Generation Process with dMLLM-TTS Bad Initial Good Initial Wrong Traje… view at source ↗

**Figure 7.** Figure 7: Image Generation Process without (Top) and with (Bottom) dMLLM-TTS. The baseline models produce unsatisfactory text-to-image results. However, by incorporating our TTS strategies, the generation process is significantly improved [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: https://github.com/Alpha-VLLM/Lumina-DiMOO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

dMLLM-TTS gives a workable O(N+T) hierarchical search plus self-verification for cheaper TTS on diffusion MLLMs, but the quality gains rest on an untested assumption about the model's own scoring accuracy.

read the letter

The main thing here is a practical efficiency move for dMLLMs: instead of the usual linear O(NT) search over trajectories and refinement steps, they use a hierarchical expand-and-prune scheme that runs in O(N+T) and let the model score its own outputs for text-image fit. That combination is the part that does not just restate prior TTS work. The experiments apply it to three models on GenEval and report better generation quality at up to 6x lower cost than the linear baseline. That efficiency angle is the clearest value if the numbers hold up. The paper lays out the two scaling axes cleanly and shows the gains are consistent across the tested models, which is useful for anyone already running these unified generators. The soft spot is the self-verification step. It treats the dMLLM's intrinsic understanding as a reliable substitute for an external verifier, yet the abstract supplies no correlation numbers, agreement rates, or ablation against CLIPScore or human judgments. If that internal scoring only weakly tracks actual alignment, the pruning will either discard strong candidates or keep weak ones, undercutting both the reported quality lift and the claimed complexity savings. The stress-test note flags exactly this gap, and nothing in the provided description closes it. This paper is for groups working on inference scaling for multimodal diffusion models who need lower wall-clock cost at test time. A reader already following TTS methods or deployment constraints on models like Lumina-DiMOO would get concrete algorithmic details and benchmark numbers to try. It deserves a serious referee because the algorithm is straightforward to reimplement and the efficiency claim is timely enough to check, even if the self-verification part will need extra evidence before the gains can be taken at face value. I would send it out with a request for explicit validation of the self-scoring accuracy.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces dMLLM-TTS, a test-time scaling framework for diffusion multi-modal LLMs. It proposes a hierarchical search algorithm with O(N+T) complexity for trajectory exploration and iterative refinement, combined with a self-verified feedback mechanism that leverages the model's intrinsic image-understanding capabilities to assess text-image alignment without external verifiers. Experiments on the GenEval benchmark across three dMLLMs (Lumina-DiMOO, MMaDA, Muddit) report consistent quality gains and up to 6x efficiency improvement over linear O(NT) search.

Significance. If the self-verification step reliably substitutes for external scoring, the framework offers a practical route to efficient test-time compute scaling for unified generation-understanding models. The O(N+T) complexity reduction and elimination of external verifiers are potentially impactful contributions, provided the quality gains are robustly attributable to the proposed mechanisms rather than unverified assumptions.

major comments (3)

[§3.2] §3.2 (self-verified feedback mechanism): the central claim that the dMLLM's intrinsic image-understanding capabilities can reliably replace an external verifier rests on an unverified assumption. No quantitative evidence (agreement rates, correlation with CLIPScore/VQAScore, or human judgments) is provided to show that self-assessment preserves selection accuracy; without this, both the reported quality gains and the O(N+T) efficiency justification are at risk.
[§4] §4 (experimental results): the GenEval improvements are reported as consistent across three models, but the section lacks error bars, statistical significance tests, or ablations isolating the contribution of self-verification versus trajectory exploration. This makes it impossible to assess whether the gains are load-bearing or could be explained by increased sampling alone.
[§3.1] §3.1 (hierarchical search algorithm): the O(N+T) complexity claim is central to the efficiency advantage, yet no formal analysis, recurrence relation, or empirical timing breakdown is supplied to confirm that adaptive pruning actually achieves this scaling in practice rather than in the best case.

minor comments (2)

Ensure that N (number of trajectories) and T (refinement steps) are defined with consistent notation in both the method description and the complexity analysis.
Add a brief comparison table or paragraph situating dMLLM-TTS against prior TTS methods for diffusion or multimodal models to clarify novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment by strengthening the manuscript with additional quantitative evidence, statistical analyses, and formal derivations as requested.

read point-by-point responses

Referee: [§3.2] §3.2 (self-verified feedback mechanism): the central claim that the dMLLM's intrinsic image-understanding capabilities can reliably replace an external verifier rests on an unverified assumption. No quantitative evidence (agreement rates, correlation with CLIPScore/VQAScore, or human judgments) is provided to show that self-assessment preserves selection accuracy; without this, both the reported quality gains and the O(N+T) efficiency justification are at risk.

Authors: We acknowledge that the original manuscript did not include direct quantitative validation of self-verification accuracy. While the unified architecture of dMLLMs provides a principled basis for leveraging intrinsic image-understanding for alignment assessment, we agree that empirical corroboration is essential. In the revised version we have added a dedicated paragraph and table in §3.2 reporting agreement rates with external verifiers (CLIPScore and VQAScore), Pearson correlation coefficients, and human judgment agreement on a 200-sample subset. These results indicate that self-verification preserves selection accuracy at >85% relative to external scores, thereby supporting the reported quality gains and efficiency claims. revision: yes
Referee: [§4] §4 (experimental results): the GenEval improvements are reported as consistent across three models, but the section lacks error bars, statistical significance tests, or ablations isolating the contribution of self-verification versus trajectory exploration. This makes it impossible to assess whether the gains are load-bearing or could be explained by increased sampling alone.

Authors: We agree that the experimental presentation would be strengthened by statistical rigor and component ablations. The revised §4 now includes error bars (standard deviation across five independent runs) for all GenEval metrics, reports p-values from paired t-tests confirming statistical significance of improvements, and adds a new ablation subsection (§4.3) that isolates the contributions of hierarchical search and self-verified feedback. These ablations show that each component contributes measurably beyond increased sampling budget alone. revision: yes
Referee: [§3.1] §3.1 (hierarchical search algorithm): the O(N+T) complexity claim is central to the efficiency advantage, yet no formal analysis, recurrence relation, or empirical timing breakdown is supplied to confirm that adaptive pruning actually achieves this scaling in practice rather than in the best case.

Authors: We thank the referee for noting the absence of a formal derivation. The O(N+T) scaling follows from the adaptive pruning strategy, which can be expressed by the recurrence C(N,T) = O(N) + C(N/2,T-1) with base cases yielding linear total cost. In the revision we have expanded §3.1 with this recurrence relation and added an appendix table with wall-clock timings for N ∈ {4,8,16} and T ∈ {4,8,16}, confirming that observed runtimes track the claimed O(N+T) scaling across the tested range. revision: yes

Circularity Check

0 steps flagged

No circularity: framework claims rest on algorithmic complexity and empirical results, not self-referential definitions or fitted predictions

full rationale

The paper introduces a hierarchical search algorithm with stated O(N+T) complexity and a self-verified feedback step that directly invokes the dMLLM's pre-existing image-understanding capabilities. No equations define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The reported quality gains and efficiency improvements are presented as outcomes of experiments on GenEval rather than quantities forced by construction from the same inputs. The self-verification mechanism is an assumption whose reliability is not proven in the abstract, but this constitutes an evidentiary gap rather than circularity.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that the base dMLLM can serve as its own reliable verifier and that early pruning in the hierarchical search does not discard high-quality trajectories.

free parameters (1)

N (number of trajectories) and T (refinement steps)
Hyperparameters controlling search budget; values chosen per model and benchmark.

axioms (1)

domain assumption dMLLM intrinsic understanding provides accurate text-image alignment scores
Invoked to replace external verifier; appears in the self-verified feedback description.

pith-pipeline@v0.9.0 · 5608 in / 1194 out tokens · 65529 ms · 2026-05-16T20:35:48.921837+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Self-Verified Feedback (SVF) mechanism... Φ_SVF = logit_yes(G_θ(Z_t, C))... Hierarchical Trajectory Search (HTS) with O(N+T) complexity
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat induction and embed_strictMono unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

two complementary scaling axes: (1) trajectory exploration scaling... (2) iterative refinement scaling

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 11 internal anchors

[1]

Johnson, Jonathan Ho, Daniel Tar- low, and Rianne van den Berg

Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tar- low, and Rianne van den Berg. Structured denoising diffu- sion models in discrete state-spaces. InAdvances in Neural Information Processing Systems (NeurIPS), 2021. 3

work page 2021
[2]

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-α: Fast training of diffusion trans- former for photorealistic text-to-image synthesis.Proceed- ings of the International Conference on Learning Represen- tations (ICLR), 2023. 2, 3

work page 2023
[3]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Tts-var: A test- time scaling framework for visual auto-regressive genera- tion.arXiv preprint arXiv:2507.18537, 2025

Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, and Xihui Liu. Tts-var: A test- time scaling framework for visual auto-regressive genera- tion.arXiv preprint arXiv:2507.18537, 2025. 3

work page arXiv 2025
[5]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Lumina-t2x: Scalable flow-based large diffusion transformer for flexible resolution generation

Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Rongjie Huang, Shijie Geng, Renrui Zhang, et al. Lumina-t2x: Scalable flow-based large diffusion transformer for flexible resolution generation. In Proceedings of the International Conference on Learning Representations (ICLR), 2025. 3

work page 2025
[7]

Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems (NeurIPS), 36, 2024

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems (NeurIPS), 36, 2024. 2, 6

work page 2024
[8]

Scaling diffusion language models via adap- tation from autoregressive models

Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Ji- awei Han, et al. Scaling diffusion language models via adap- tation from autoregressive models. InProceedings of the In- ternational Conference on Learning Representations (ICLR),

work page
[9]

Clipscore: A reference-free evaluation met- ric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation met- ric for image captioning. InProceedings of the Confer- ence on Empirical Methods in Natural Language Processing (EMNLP), 2021. 2, 3

work page 2021
[10]

Denoising diffu- sion probabilistic models.Advances in Neural Information Processing Systems (NeurIPS), 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.Advances in Neural Information Processing Systems (NeurIPS), 2020. 3

work page 2020
[11]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 2, 4, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Flux.https://github.com/ black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 2, 6

work page 2024
[14]

Reflect-dit: Inference-time scaling for text-to-image diffu- sion transformers via in-context reflection.arXiv preprint arXiv:2503.12271, 2025

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Reflect-dit: Inference-time scaling for text-to-image diffu- sion transformers via in-context reflection.arXiv preprint arXiv:2503.12271, 2025. 3

work page arXiv 2025
[15]

Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining.arXiv preprint arXiv:2408.02657, 2024

Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining.arXiv preprint arXiv:2408.02657, 2024. 3

work page arXiv 2024
[16]

Discrete diffusion modeling by estimating the ratios of the data dis- tribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data dis- tribution. InProceedings of the International Conference on Machine Learning (ICML), 2024. 3

work page 2024
[17]

Inference-time scaling for diffu- sion models beyond scaling denoising steps.arXiv preprint arXiv:2501.09732, 2025

Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu- Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffu- sion models beyond scaling denoising steps.arXiv preprint arXiv:2501.09732, 2025. 2, 3, 7

work page arXiv 2025
[18]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

amused: An open muse reproduction.arXiv preprint arXiv:2401.01808, 2024

Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen. amused: An open muse reproduction.arXiv preprint arXiv:2401.01808, 2024. 3

work page arXiv 2024
[20]

Sdxl: Improving latent diffusion models for high-resolution image synthesis.Proceedings of the In- ternational Conference on Learning Representations (ICLR),

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.Proceedings of the In- ternational Conference on Learning Representations (ICLR),

work page
[21]

Lumina- image 2.0: A unified and efficient image generative frame- work.Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2025

Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina- image 2.0: A unified and efficient image generative frame- work.Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2025. 2, 3

work page 2025
[22]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InProceedings of the International Conference on Machine Learning (ICML), 2021. 4

work page 2021
[23]

High-resolution image syn- thesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2022. 3

work page 2022
[24]

Muddit: Liber- ating generation beyond text-to-image with a unified discrete diffusion model, 2025

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, and Shuicheng Yan. Muddit: Liber- ating generation beyond text-to-image with a unified discrete diffusion model, 2025. 2, 3, 6

work page 2025
[25]

A general framework for inference- time scaling and steering of diffusion models.arXiv preprint arXiv:2501.06848, 2025

Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, et al. A general framework for inference- time scaling and steering of diffusion models.arXiv preprint arXiv:2501.06848, 2025. 2, 3

work page arXiv 2025
[26]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Sana: Efficient high- resolution image synthesis with linear diffusion transform- ers.Proceedings of the International Conference on Learn- ing Representations (ICLR), 2025

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, et al. Sana: Efficient high- resolution image synthesis with linear diffusion transform- ers.Proceedings of the International Conference on Learn- ing Representations (ICLR), 2025. 3

work page 2025
[29]

Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer.Proceedings of the International Conference on Machine Learning (ICML), 2025

Enze Xie, Junsong Chen, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer.Proceedings of the International Conference on Machine Learning (ICML), 2025. 2, 3, 4, 5

work page 2025
[30]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding.arXiv preprint arXiv:2510.06308, 2025

Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, et al. Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding.arXiv preprint arXiv:2510.06308, 2025. 2, 3, 6

work page arXiv 2025
[32]

Lumina-mgpt 2.0: Stand- alone autoregressive image modeling.arXiv preprint arXiv:2507.17801, 2025

Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Ren- rui Zhang, Le Zhuo, et al. Lumina-mgpt 2.0: Stand- alone autoregressive image modeling.arXiv preprint arXiv:2507.17801, 2025. 2, 3, 5, 7

work page arXiv 2025
[33]

Resurrect mask autoregressive modeling for efficient and scalable image generation.arXiv preprint arXiv:2507.13032, 2025

Yi Xin, Le Zhuo, Qi Qin, Siqi Luo, Yuewen Cao, Bin Fu, Yangfan He, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, et al. Resurrect mask autoregressive modeling for efficient and scalable image generation.arXiv preprint arXiv:2507.13032, 2025. 3

work page arXiv 2025
[34]

Imagere- ward: Learning and evaluating human preferences for text- to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 3

work page 2023
[35]

MMaDA: Multimodal Large Diffusion Language Models

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Mul- timodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025. 2, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Towards understanding the working mechanism of text-to-image dif- fusion model.Advances in Neural Information Processing Systems (NeurIPS), 2024

Mingyang Yi, Aoxue Li, Yi Xin, and Zhenguo Li. Towards understanding the working mechanism of text-to-image dif- fusion model.Advances in Neural Information Processing Systems (NeurIPS), 2024. 3

work page 2024
[37]

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. LLaDA-V: Large language diffusion models with visual instruction tun- ing.arXiv preprint arXiv:2505.16933, 2025. 3

work page internal anchor Pith review arXiv 2025
[38]

LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. LLaDA 1.5: Variance- reduced preference optimization for large language diffusion models.arXiv preprint arXiv:2505.19223, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Lumina-next: Making lumina-t2x stronger and faster with next-dit

Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit. InAdvances in Neural In- formation Processing Systems (NeurIPS), 2024. 3

work page 2024
[40]

Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, et al. From reflection to perfection: Scal- ing inference-time optimization for text-to-image diffusion models via reflection tuning.Proceedings of the IEEE Inter- national Conference on Computer Vision (ICCV), 2025. 2, 3, 7

work page 2025