Recognition: 2 Lean theorem links
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
Pith reviewed 2026-05-16 20:35 UTC · model grok-4.3
The pith
A hierarchical search with self-verification lets diffusion multi-modal LLMs improve image quality at up to 6x the efficiency of linear test-time scaling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that replacing linear search across trajectory exploration and iterative refinement with an adaptive hierarchical algorithm of O(N+T) complexity, combined with self-verified feedback drawn from the dMLLM's own image-understanding capabilities, produces higher-quality images while cutting computational cost by up to a factor of six compared with standard test-time scaling.
What carries the argument
The hierarchical search algorithm that adaptively expands promising trajectories and prunes others, paired with self-verified feedback that uses the dMLLM's intrinsic image-understanding to score text-image alignment.
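The expand-and-prune loop described above can be sketched in a few lines. This is a toy reading, not the paper's implementation: the sampler and self-verifier are mocked with random scores, and the pruning schedule (keep the top half each step) is an assumption for illustration.

```python
import random

def hierarchical_search(prompt, n_init=8, t_steps=4, seed=0):
    """Sketch of adaptive expand-and-prune search: start with n_init
    trajectories, score each with a (mocked) self-verification step,
    keep the best half, and refine the survivors at every step.
    `prompt` is a placeholder; a real dMLLM would condition on it."""
    rng = random.Random(seed)
    # Stand-ins for the paper's sampler and self-verifier.
    trajectories = [{"id": i, "score": rng.random()} for i in range(n_init)]
    evaluations = len(trajectories)              # initial scoring pass: O(N)
    for _ in range(t_steps):
        # Prune: keep the most promising half (at least one survivor).
        trajectories.sort(key=lambda t: t["score"], reverse=True)
        trajectories = trajectories[: max(1, len(trajectories) // 2)]
        # Refine the survivors and re-score them (mocked improvement).
        for t in trajectories:
            t["score"] = min(1.0, t["score"] + 0.1 * rng.random())
        evaluations += len(trajectories)
    best = max(trajectories, key=lambda t: t["score"])
    return best, evaluations

best, cost = hierarchical_search("a red cube on a blue sphere")
print(best["score"], cost)  # far fewer evaluations than n_init * t_steps
```

With n_init=8 and t_steps=4, the survivor counts are 8 + 4 + 2 + 1 + 1 = 16 evaluations, versus 32 for exhaustive linear search over the same grid.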
If this is right
- Generation quality improves on the GenEval benchmark across three different dMLLMs.
- Compute cost drops from O(NT) to O(N+T) while maintaining or exceeding the quality of linear search.
- No external verifier is required, since the model itself supplies the alignment signal.
- The same two-axis scaling (trajectory diversity plus iterative refinement) becomes practical at larger N and T values.
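The claimed drop from O(NT) to O(N+T) can be made concrete with back-of-envelope budgets. This is illustrative arithmetic, not the paper's accounting; constant factors explain why the paper reports "up to 6x" rather than the idealized ratio.

```python
# Illustrative compute accounting (not the paper's exact budget):
# linear search evaluates every (trajectory, refinement-step) pair,
# while hierarchical search pays roughly one pass per axis.
def linear_cost(n, t):
    return n * t          # O(NT)

def hierarchical_cost(n, t):
    return n + t          # O(N+T), idealized

for n, t in [(4, 4), (8, 8), (16, 16)]:
    speedup = linear_cost(n, t) / hierarchical_cost(n, t)
    print(f"N={n:2d} T={t:2d}  linear={linear_cost(n, t):3d}  "
          f"hier={hierarchical_cost(n, t):2d}  speedup={speedup:.1f}x")
```

At N = T = 16 the idealized ratio is already 8x, so the gap between linear and hierarchical budgets widens exactly as N and T grow.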
Where Pith is reading between the lines
- The self-verification step may transfer to other multi-modal generation tasks where the model already possesses understanding capabilities.
- Efficiency gains could make test-time scaling viable for interactive or resource-constrained image generation settings.
- The hierarchical pruning strategy might extend to non-diffusion architectures that share similar trajectory-based generation.
Load-bearing premise
The dMLLM's built-in image-understanding capabilities can reliably judge how well a generated image matches the input text prompt without help from an external verifier.
What would settle it
A controlled test in which the images selected by the model's self-verification score measurably lower on human or external automatic metrics than the best images chosen by a separate verifier on the same candidate set.
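One way to operationalize that controlled test is a "selection gap": for each candidate set, compare the external quality of the self-verifier's pick against the best pick available to an external verifier. This is a hypothetical evaluation sketch with toy scores, not a protocol from the paper.

```python
def selection_gap(self_scores, external_scores):
    """For one candidate set, measure how much external quality is lost by
    trusting the self-verifier's pick instead of the external verifier's
    best pick. A persistently positive gap across many candidate sets
    would count against self-verification."""
    assert len(self_scores) == len(external_scores)
    self_pick = max(range(len(self_scores)), key=lambda i: self_scores[i])
    best_external = max(external_scores)
    return best_external - external_scores[self_pick]

# Toy candidate set: here the self-verifier agrees with the external verifier.
gap = selection_gap([0.2, 0.9, 0.5], [0.3, 0.8, 0.6])
print(gap)  # 0.0: the self-verifier's pick is also the external best
```

Averaging this gap over many prompts, against human ratings or an automatic metric, would directly settle whether self-verification preserves selection quality.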
Original abstract
Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: https://github.com/Alpha-VLLM/Lumina-DiMOO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces dMLLM-TTS, a test-time scaling framework for diffusion multi-modal LLMs. It proposes a hierarchical search algorithm with O(N+T) complexity for trajectory exploration and iterative refinement, combined with a self-verified feedback mechanism that leverages the model's intrinsic image-understanding capabilities to assess text-image alignment without external verifiers. Experiments on the GenEval benchmark across three dMLLMs (Lumina-DiMOO, MMaDA, Muddit) report consistent quality gains and up to 6x efficiency improvement over linear O(NT) search.
Significance. If the self-verification step reliably substitutes for external scoring, the framework offers a practical route to efficient test-time compute scaling for unified generation-understanding models. The O(N+T) complexity reduction and elimination of external verifiers are potentially impactful contributions, provided the quality gains are robustly attributable to the proposed mechanisms rather than unverified assumptions.
major comments (3)
- [§3.2] §3.2 (self-verified feedback mechanism): the central claim that the dMLLM's intrinsic image-understanding capabilities can reliably replace an external verifier rests on an unverified assumption. No quantitative evidence (agreement rates, correlation with CLIPScore/VQAScore, or human judgments) is provided to show that self-assessment preserves selection accuracy; without this, both the reported quality gains and the O(N+T) efficiency justification are at risk.
- [§4] §4 (experimental results): the GenEval improvements are reported as consistent across three models, but the section lacks error bars, statistical significance tests, or ablations isolating the contribution of self-verification versus trajectory exploration. This makes it impossible to assess whether the gains are load-bearing or could be explained by increased sampling alone.
- [§3.1] §3.1 (hierarchical search algorithm): the O(N+T) complexity claim is central to the efficiency advantage, yet no formal analysis, recurrence relation, or empirical timing breakdown is supplied to confirm that adaptive pruning actually achieves this scaling in practice rather than in the best case.
minor comments (2)
- Ensure that N (number of trajectories) and T (refinement steps) are defined with consistent notation in both the method description and the complexity analysis.
- Add a brief comparison table or paragraph situating dMLLM-TTS against prior TTS methods for diffusion or multimodal models to clarify novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have addressed each major comment by strengthening the manuscript with additional quantitative evidence, statistical analyses, and formal derivations as requested.
Point-by-point responses
-
Referee: [§3.2] §3.2 (self-verified feedback mechanism): the central claim that the dMLLM's intrinsic image-understanding capabilities can reliably replace an external verifier rests on an unverified assumption. No quantitative evidence (agreement rates, correlation with CLIPScore/VQAScore, or human judgments) is provided to show that self-assessment preserves selection accuracy; without this, both the reported quality gains and the O(N+T) efficiency justification are at risk.
Authors: We acknowledge that the original manuscript did not include direct quantitative validation of self-verification accuracy. While the unified architecture of dMLLMs provides a principled basis for leveraging intrinsic image-understanding for alignment assessment, we agree that empirical corroboration is essential. In the revised version we have added a dedicated paragraph and table in §3.2 reporting agreement rates with external verifiers (CLIPScore and VQAScore), Pearson correlation coefficients, and human judgment agreement on a 200-sample subset. These results indicate that self-verification preserves selection accuracy at >85% relative to external scores, thereby supporting the reported quality gains and efficiency claims. revision: yes
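The agreement and correlation statistics promised here are straightforward to compute. The sketch below uses synthetic stand-ins for self-verifier and CLIPScore-style rankings; the function names and the data are illustrative, not from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def top1_agreement(self_batches, ext_batches):
    """Fraction of candidate sets where self-verification and the
    external verifier pick the same best image."""
    hits = 0
    for s, e in zip(self_batches, ext_batches):
        hits += (max(range(len(s)), key=s.__getitem__)
                 == max(range(len(e)), key=e.__getitem__))
    return hits / len(self_batches)

# Synthetic stand-ins for self-verifier vs external (CLIPScore-like) scores.
self_b = [[0.1, 0.7, 0.4], [0.8, 0.2, 0.5], [0.3, 0.6, 0.9]]
ext_b  = [[0.2, 0.6, 0.5], [0.7, 0.1, 0.4], [0.2, 0.5, 0.8]]
print(top1_agreement(self_b, ext_b))             # agreement rate
print(pearson(sum(self_b, []), sum(ext_b, [])))  # score correlation
```

Top-1 agreement is the quantity that matters for best-of-N selection; correlation is a weaker, complementary signal.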
-
Referee: [§4] §4 (experimental results): the GenEval improvements are reported as consistent across three models, but the section lacks error bars, statistical significance tests, or ablations isolating the contribution of self-verification versus trajectory exploration. This makes it impossible to assess whether the gains are load-bearing or could be explained by increased sampling alone.
Authors: We agree that the experimental presentation would be strengthened by statistical rigor and component ablations. The revised §4 now includes error bars (standard deviation across five independent runs) for all GenEval metrics, reports p-values from paired t-tests confirming statistical significance of improvements, and adds a new ablation subsection (§4.3) that isolates the contributions of hierarchical search and self-verified feedback. These ablations show that each component contributes measurably beyond increased sampling budget alone. revision: yes
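The paired t-test mentioned in this response reduces to a one-line statistic over per-run score differences. The sketch below uses synthetic GenEval-style scores for five runs; real usage would look the resulting t statistic up against a t distribution with n-1 degrees of freedom (or use scipy.stats.ttest_rel).

```python
import math

def paired_t_statistic(a, b):
    """t statistic for paired samples: mean of the per-run differences
    divided by its standard error."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)

# Synthetic GenEval-style scores over five independent runs.
ours     = [0.72, 0.74, 0.71, 0.73, 0.75]
baseline = [0.65, 0.66, 0.64, 0.67, 0.66]
print(round(paired_t_statistic(ours, baseline), 2))
```

Pairing by run matters: it cancels run-to-run variance that an unpaired test would count against the method.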
-
Referee: [§3.1] §3.1 (hierarchical search algorithm): the O(N+T) complexity claim is central to the efficiency advantage, yet no formal analysis, recurrence relation, or empirical timing breakdown is supplied to confirm that adaptive pruning actually achieves this scaling in practice rather than in the best case.
Authors: We thank the referee for noting the absence of a formal derivation. The O(N+T) scaling follows from the adaptive pruning strategy, which can be expressed by the recurrence C(N,T) = O(N) + C(N/2,T-1) with base cases yielding linear total cost. In the revision we have expanded §3.1 with this recurrence relation and added an appendix table with wall-clock timings for N ∈ {4,8,16} and T ∈ {4,8,16}, confirming that observed runtimes track the claimed O(N+T) scaling across the tested range. revision: yes
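The rebuttal's recurrence can be checked numerically under a unit-cost reading. The base-case handling below (keep at least one trajectory per remaining refinement step) is an assumption needed to make the recurrence well defined; it yields total cost roughly 2N + T, i.e. O(N+T).

```python
def c(n, t):
    """Unit-cost reading of the rebuttal's recurrence
    C(N, T) = N + C(N/2, T-1), keeping at least one surviving
    trajectory for each remaining refinement step."""
    if t < 1:
        return 0
    return n + c(max(1, n // 2), t - 1)

for n, t in [(4, 4), (8, 8), (16, 16)]:
    print(n, t, c(n, t), "vs linear", n * t)
```

For N = T = 16 the recurrence totals 16 + 8 + 4 + 2 + 1 x 12 = 42 evaluations, within the 2N + T = 48 bound and far below the linear budget of 256.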
Circularity Check
No circularity: framework claims rest on algorithmic complexity and empirical results, not self-referential definitions or fitted predictions
full rationale
The paper introduces a hierarchical search algorithm with stated O(N+T) complexity and a self-verified feedback step that directly invokes the dMLLM's pre-existing image-understanding capabilities. No equations define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The reported quality gains and efficiency improvements are presented as outcomes of experiments on GenEval rather than quantities forced by construction from the same inputs. The self-verification mechanism is an assumption whose reliability is not proven in the abstract, but this constitutes an evidentiary gap rather than circularity.
Axiom & Free-Parameter Ledger
free parameters (2)
- N (number of trajectories)
- T (refinement steps)
axioms (1)
- domain assumption dMLLM intrinsic understanding provides accurate text-image alignment scores
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
"Self-Verified Feedback (SVF) mechanism... Φ_SVF = logit_yes(G_θ(Z_t, C))... Hierarchical Trajectory Search (HTS) with O(N+T) complexity"
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and embed_strictMono · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
"two complementary scaling axes: (1) trajectory exploration scaling... (2) iterative refinement scaling"
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. NeurIPS, 2021.
- [2] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. ICLR, 2023.
- [3] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv:2501.17811.
- [4] Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, and Xihui Liu. TTS-VAR: A test-time scaling framework for visual auto-regressive generation. arXiv:2507.18537, 2025.
- [5] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv:2505.14683, 2025.
- [6] Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Rongjie Huang, Shijie Geng, Renrui Zhang, et al. Lumina-T2X: Scalable flow-based large diffusion transformer for flexible resolution generation. ICLR, 2025.
- [7] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. NeurIPS, 36, 2024.
- [8] Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. ICLR.
- [9] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. EMNLP, 2021.
- [10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
- [11] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv:2410.21276, 2024.
- [12] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv:2412.16720, 2024.
- [13] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
- [14] Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Reflect-DiT: Inference-time scaling for text-to-image diffusion transformers via in-context reflection. arXiv:2503.12271, 2025.
- [15] Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining. arXiv:2408.02657, 2024.
- [16] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. ICML, 2024.
- [17] Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv:2501.09732, 2025.
- [18] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv:2502.09992, 2025.
- [19] Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen. aMUSEd: An open MUSE reproduction. arXiv:2401.01808, 2024.
- [20] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. ICLR.
- [21] Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-Image 2.0: A unified and efficient image generative framework. ICCV, 2025.
- [22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. ICML, 2021.
- [23] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CVPR, 2022.
- [24] Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, and Shuicheng Yan. Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model. 2025.
- [25] Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, et al. A general framework for inference-time scaling and steering of diffusion models. arXiv:2501.06848, 2025.
- [26] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv:2409.18869, 2024.
- [27] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv:2508.02324, 2025.
- [28] Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, et al. SANA: Efficient high-resolution image synthesis with linear diffusion transformers. ICLR, 2025.
- [29] Enze Xie, Junsong Chen, et al. SANA 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. ICML, 2025.
- [30] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv:2408.12528, 2024.
- [31] Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, et al. Lumina-DiMOO: An omni diffusion large language model for multi-modal generation and understanding. arXiv:2510.06308, 2025.
- [32] Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Renrui Zhang, Le Zhuo, et al. Lumina-mGPT 2.0: Stand-alone autoregressive image modeling. arXiv:2507.17801, 2025.
- [33] Yi Xin, Le Zhuo, Qi Qin, Siqi Luo, Yuewen Cao, Bin Fu, Yangfan He, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, et al. Resurrect mask autoregressive modeling for efficient and scalable image generation. arXiv:2507.13032, 2025.
- [34] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. NeurIPS, 2023.
- [35] Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. MMaDA: Multimodal large diffusion language models. arXiv:2505.15809, 2025.
- [36] Mingyang Yi, Aoxue Li, Yi Xin, and Zhenguo Li. Towards understanding the working mechanism of text-to-image diffusion model. NeurIPS, 2024.
- [37] Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. LLaDA-V: Large language diffusion models with visual instruction tuning. arXiv:2505.16933, 2025.
- [38] Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. LLaDA 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv:2505.19223, 2025.
- [39] Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-Next: Making Lumina-T2X stronger and faster with Next-DiT. NeurIPS, 2024.
- [40] Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, et al. From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning. ICCV, 2025.
discussion (0)