Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
Pith reviewed 2026-05-10 03:02 UTC · model grok-4.3
The pith
Patch-level timesteps and difficulty prediction let diffusion models advance easy image regions first to inform harder ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Moving from global to patch-level timesteps, controlled by a sampler that limits maximum patch information, already improves generation. Augmenting the model with a lightweight per-patch difficulty head enables adaptive allocation of denoising steps. Combined with noise levels that vary over both space and diffusion time, this yields Patch Forcing, which advances easier regions earlier so they can provide context for harder ones and achieves superior results on class-conditional ImageNet while scaling to text-to-image synthesis.
What carries the argument
Patch Forcing (PF), the framework that pairs spatially varying timesteps with an adaptive sampler driven by a per-patch difficulty head to prioritize context from easy patches before refining hard ones.
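As a rough illustration of the kind of sampler described here (a minimal sketch, not the authors' implementation; the function name, the `t_min` floor, and the clipping rule are all assumptions), one could cap the information any single patch exposes during training by bounding its noise level from below:

```python
import numpy as np

def sample_patch_timesteps(num_patches, t_min, rng=None):
    """Draw per-patch diffusion times in [0, 1] (0 = clean, 1 = pure noise),
    while enforcing that no patch is more informative than a global floor.

    Hypothetical sketch: clipping i.i.d. uniform draws at t_min bounds the
    maximum patch-level information available during training.
    """
    rng = np.random.default_rng() if rng is None else rng
    t = rng.uniform(0.0, 1.0, size=num_patches)
    return np.maximum(t, t_min)  # no patch cleaner than t_min

# Usage: 256 patches, with every patch retaining at least noise level 0.3.
t = sample_patch_timesteps(num_patches=256, t_min=0.3, rng=np.random.default_rng(0))
assert t.min() >= 0.3 and t.max() <= 1.0
```

The key design point the paper argues for is that the training-time distribution of per-patch times must avoid "overly informative" joint states that never occur at inference; the clip above is only one crude way to express such a constraint.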
If this is right
- Superior sample quality on class-conditional ImageNet generation compared with uniform-timestep baselines.
- Compatibility with existing representation-alignment and classifier-free guidance methods.
- Successful scaling from class-conditional to text-to-image synthesis without architectural overhaul.
- Patch-level denoising schedules form a foundation for further adaptive image generation techniques.
Where Pith is reading between the lines
- The same per-patch difficulty signal could be reused at inference to decide where to spend extra function evaluations, potentially cutting total compute.
- The approach may transfer to video or 3D diffusion where spatial and temporal heterogeneity is even stronger.
- Combining Patch Forcing with model distillation could compound efficiency gains by reducing steps while preserving the adaptive schedule.
Load-bearing premise
The assumption that a specially designed timestep sampler can prevent the model from seeing patch-wise noise combinations during training that never appear at inference time.
What would settle it
Train the same architecture with random per-patch timesteps but without the proposed sampler, then compare FID or perceptual quality on ImageNet against both uniform-timestep baselines and the full Patch Forcing method. This would directly test whether the sampler is necessary.
Original abstract
Diffusion- and flow-based models usually allocate compute uniformly across space, updating all patches with the same timestep and number of function evaluations. While convenient, this ignores the heterogeneity of natural images: some regions are easy to denoise, whereas others benefit from more refinement or additional context. Motivated by this, we explore patch-level noise scales for image synthesis. We find that naively varying timesteps across image tokens performs poorly, as it exposes the model to overly informative training states that do not occur at inference. We therefore introduce a timestep sampler that explicitly controls the maximum patch-level information available during training, and show that moving from global to patch-level timesteps already improves image generation over standard baselines. By further augmenting the model with a lightweight per-patch difficulty head, we enable adaptive samplers that allocate compute dynamically where it is most needed. Combined with noise levels varying over both space and diffusion time, this yields Patch Forcing (PF), a framework that advances easier regions earlier so they can provide context for harder ones. PF achieves superior results on class-conditional ImageNet, remains orthogonal to representation alignment and guidance methods, and scales to text-to-image synthesis. Our results suggest that patch-level denoising schedules provide a promising foundation for adaptive image generation.
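The patch-level noising the abstract describes can be sketched as a rectified-flow-style linear interpolation with a separate time per patch (illustrative only; the paper's exact schedule, parameterization, and patch layout may differ):

```python
import numpy as np

def patchwise_noise(x, t_patch, rng=None):
    """Apply the interpolation x_t = (1 - t) * x + t * eps with a separate
    time t for every patch. Shapes: x is (P, D) patch features, t_patch is
    (P,) per-patch times in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(x.shape)
    t = t_patch[:, None]  # broadcast each patch's time over its feature dim
    return (1.0 - t) * x + t * eps

# Four patches at different noise levels: clean, half-noised, pure noise, light.
x = np.ones((4, 8))
x_t = patchwise_noise(x, np.array([0.0, 0.5, 1.0, 0.25]), rng=np.random.default_rng(1))
assert np.allclose(x_t[0], x[0])  # t = 0 leaves the patch untouched
```

A uniform-timestep baseline is the special case where `t_patch` is constant across patches; the paper's contribution is making it spatially varying while controlling which joint states the model sees.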
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Patch Forcing (PF), a framework for adaptive image generation in diffusion models. It identifies that uniform timestep allocation across patches ignores spatial heterogeneity in denoising difficulty. The authors introduce a timestep sampler that controls the maximum patch-level information during training to avoid overly informative states absent at inference, augment the model with a lightweight per-patch difficulty head, and enable adaptive samplers that advance easier regions first to provide context for harder ones. Combined with spatially and temporally varying noise, PF is claimed to yield superior results on class-conditional ImageNet, remain orthogonal to representation alignment and guidance methods, and scale to text-to-image synthesis.
Significance. If the central claims are substantiated with rigorous experiments, the work could meaningfully advance efficient generative modeling by exploiting per-patch difficulty heterogeneity rather than uniform compute allocation. The orthogonality to existing techniques would make it a useful complement, and the emphasis on aligning training and inference distributions addresses a common pitfall in adaptive sampling methods.
major comments (1)
- [Method description of the timestep sampler and adaptive inference procedure] The skeptic concern is load-bearing: the central claim that gains arise because easier patches provide context for harder ones requires that the timestep sampler produce a joint distribution over patch timesteps (including spatial correlations and conditional dependencies) that is statistically close to the distribution encountered under difficulty-driven adaptive inference. The manuscript does not appear to verify this alignment beyond bounding per-patch maxima; without such verification (e.g., via distribution distance metrics or ablation on higher-order statistics), observed improvements could stem from the added difficulty head or increased conditioning capacity rather than the claimed mechanism.
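One concrete form the requested alignment check could take (a sketch, not from the manuscript; `train_t` and `infer_t` are placeholder data standing in for sampler draws and recorded inference trajectories): compare a second-order statistic of per-patch timesteps between training and inference distributions.

```python
import numpy as np

def mean_abs_offdiag_corr(timesteps):
    """Mean absolute off-diagonal correlation between patch timesteps.
    timesteps: (N, P) array of N sampled timestep maps over P patches."""
    c = np.corrcoef(timesteps, rowvar=False)      # (P, P) correlation matrix
    off = c[~np.eye(c.shape[0], dtype=bool)]      # drop the diagonal
    return float(np.mean(np.abs(off)))

rng = np.random.default_rng(0)
# Stand-ins: draws from the training-time sampler vs. timestep maps
# recorded along difficulty-driven adaptive inference trajectories.
train_t = rng.uniform(size=(1000, 16))
infer_t = rng.uniform(size=(1000, 16))
gap = abs(mean_abs_offdiag_corr(train_t) - mean_abs_offdiag_corr(infer_t))
```

A large `gap` would indicate that bounding per-patch maxima alone leaves the spatial-correlation structure mismatched, which is exactly the failure mode the comment raises.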
minor comments (2)
- The abstract asserts performance gains and orthogonality but supplies no quantitative metrics, baselines, or ablation details; the full manuscript should include these in the results section to allow assessment of effect sizes.
- Notation for patch-level timesteps and the difficulty head should be introduced with explicit equations early in the method section to improve clarity for readers unfamiliar with the framework.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for acknowledging the potential of Patch Forcing to advance adaptive sampling in diffusion models by exploiting per-patch difficulty heterogeneity. We address the major comment below and outline revisions that will strengthen the empirical support for the claimed mechanism.
Point-by-point responses
-
Referee: [Method description of the timestep sampler and adaptive inference procedure] The skeptic concern is load-bearing: the central claim that gains arise because easier patches provide context for harder ones requires that the timestep sampler produce a joint distribution over patch timesteps (including spatial correlations and conditional dependencies) that is statistically close to the distribution encountered under difficulty-driven adaptive inference. The manuscript does not appear to verify this alignment beyond bounding per-patch maxima; without such verification (e.g., via distribution distance metrics or ablation on higher-order statistics), observed improvements could stem from the added difficulty head or increased conditioning capacity rather than the claimed mechanism.
Authors: We agree that explicit verification of the joint distribution alignment is necessary to isolate the contribution of the context-providing mechanism. The timestep sampler is constructed to enforce a per-patch upper bound on information content (equivalently, a lower bound on noise level) that mirrors the states reachable under difficulty-driven adaptive inference, where easier patches are denoised first. This bound, combined with the spatially varying noise schedule, is intended to preclude training states in which all patches are simultaneously at low noise while others remain noisy. Nevertheless, we acknowledge that bounding per-patch maxima alone does not automatically guarantee matching higher-order statistics such as spatial correlations or conditional dependencies. In the revised manuscript we will therefore add: (i) an ablation that trains with the difficulty head but disables the adaptive sampler at inference (reverting to uniform timesteps), and (ii) quantitative distribution-alignment diagnostics, including marginal histograms of per-patch timesteps and a simple measure of pairwise spatial correlation between patch timesteps sampled from the training procedure versus trajectories simulated from the adaptive inference policy. These additions will directly test whether the observed gains are attributable to the intended mechanism rather than auxiliary model capacity. revision: yes
Circularity Check
No circularity: empirical method with explicit controls and experimental validation
full rationale
The paper introduces a timestep sampler and per-patch difficulty head as explicit training mechanisms to address observed mismatches between uniform and patch-level denoising. These are not derived from fitted parameters or self-referential equations but are motivated by empirical findings and validated through ablation and benchmark results on ImageNet and text-to-image tasks. No load-bearing step reduces to a self-citation chain, ansatz smuggling, or renaming of known results; the central claims rest on the introduced controls and their measured performance rather than tautological definitions.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
Inline Critic Steers Image Editing
Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
Reference graph
Works this paper leans on
-
[1]
Self-rectifying diffusion sampling with perturbed-attention guidance
Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, and Seungryong Kim. Self-rectifying diffusion sampling with perturbed-attention guidance. In European Conference on Computer Vision, pages 1–17. Springer,
-
[2]
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797,
-
[3]
Patchmatch: A randomized correspondence algorithm for structural image editing
Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph., 28(3):24, 2009.
2009
-
[4]
Image inpainting
Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424, 2000.
2000
-
[5]
Improving image generation with better captions
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
2023
-
[6]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
2020
-
[7]
Coyo-700m: Image-text pair dataset
Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
2022
-
[8]
Maskgit: Masked generative image transformer
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022.
2022
-
[9]
Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models
Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
2023
-
[10]
Self-supervised flow matching for scalable multi-modal synthesis, 2026
Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja, Vinh Tong, Antonio Torralba, and Robin Rombach. Self-supervised flow matching for scalable multi-modal synthesis. arXiv preprint arXiv:2603.06507, 2026.
-
[11]
Diffusion forcing: Next-token prediction meets full-sequence diffusion
Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024.
2024
-
[12]
Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,
-
[13]
Diffusion models beat gans on image synthesis
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
2021
-
[14]
Image quilting for texture synthesis and transfer
Alexei A Efros and William T Freeman. Image quilting for texture synthesis and transfer. In Seminal graphics papers: pushing the boundaries, volume 2, pages 571–576. 2023.
2023
-
[15]
Structure and content-guided video synthesis with diffusion models
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7346–7356, 2023.
2023
-
[16]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
2024
-
[17]
Training-free structured diffusion guidance for compositional text-to-image synthesis, 2023
Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis, 2023.
2023
-
[18]
Geneval: An object-focused framework for evaluating text-to-image alignment
Dhruba Ghosh, Hanna Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment, 2023.
2023
-
[19]
Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, and Björn Ommer. Adapting self-supervised representations as a latent space for efficient generation. arXiv preprint arXiv:2510.14630, 2025.
2025
-
[20]
Masked autoencoders are scalable vision learners
K He, X Chen, S Xie, Y Li, P Dollár, and R Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2021.
2021
-
[21]
Classifier-free diffusion guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
2021
-
[22]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
2020
-
[23]
Improving sample quality of diffusion models using self-attention guidance
Susung Hong, Gyuseong Lee, Wooseok Jang, and Seungryong Kim. Improving sample quality of diffusion models using self-attention guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7462–7471, 2023.
2023
-
[24]
T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation
Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
2023
-
[25]
Diffusion model-based image editing: A survey
Yi Huang, Jiancheng Huang, Yifan Liu, Mingfu Yan, Jiaxi Lv, Jianzhuang Liu, Wei Xiong, He Zhang, Liangliang Cao, and Shifeng Chen. Diffusion model-based image editing: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
2025
-
[26]
Entropy rectifying guidance for diffusion and flow models
Tariq Berrada Ifriqi, Adriana Romero-Soriano, Michal Drozdzal, Jakob Verbeek, and Karteek Alahari. Entropy rectifying guidance for diffusion and flow models. arXiv preprint arXiv:2504.13987, 2025.
-
[27]
A comprehensive review of past and present image inpainting methods
Jireh Jam, Connah Kendrick, Kevin Walker, Vincent Drouard, Jison Gee-Sern Hsu, and Moi Hoon Yap. A comprehensive review of past and present image inpainting methods. Computer vision and image understanding, 203:103147, 2021.
2021
-
[28]
Myunsoo Kim, Donghyeon Ki, Seong-Woong Shim, and Byung-Jun Lee. Adaptive non-uniform timestep sampling for diffusion model training. arXiv preprint arXiv:2411.09998, 2024.
-
[29]
Flowedit: Inversion-free text-based editing using pre-trained flow models
Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. Flowedit: Inversion-free text-based editing using pre-trained flow models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19721–19730, 2025.
2025
-
[30]
FLUX.2: Frontier Visual Intelligence
Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025.
2025
-
[31]
Improved masked image generation with token-critic
José Lezama, Huiwen Chang, Lu Jiang, and Irfan Essa. Improved masked image generation with token-critic. In European Conference on Computer Vision, pages 70–86. Springer, 2022.
2022
-
[32]
Mingxin Li, Yanzhao Zhang, Dingkun Long, Chen Keqin, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720, 2026.
2026
-
[33]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
2022
-
[34]
Compositional visual generation with composable diffusion models
Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models, 2023.
2023
-
[35]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
2022
-
[36]
Patchscaler: An efficient patch-independent diffusion model for image super-resolution
Yong Liu, Hang Dong, Jinshan Pan, Qingji Dong, Kai Chen, Rongxiang Zhang, Lean Fu, and Fei Wang. Patchscaler: An efficient patch-independent diffusion model for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11283–11293, 2025.
2025
-
[37]
Region-adaptive sampling for diffusion transformers
Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, and Yuqing Yang. Region-adaptive sampling for diffusion transformers. arXiv preprint arXiv:2502.10389, 2025.
-
[38]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
2017
-
[39]
Repaint: Inpainting using denoising diffusion probabilistic models
Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471, 2022.
2022
-
[40]
Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, and Yu Liu. Exploring the role of large language models in prompt encoding for diffusion models. arXiv preprint arXiv:2406.11831, 2024.
-
[41]
Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers
Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740,
-
[42]
Sdedit: Guided image synthesis and editing with stochastic differential equations
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
2022
-
[43]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR,
-
[44]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205,
-
[45]
Sdxl: Improving latent diffusion models for high-resolution image synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024.
2024
-
[46]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125,
-
[47]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685. IEEE, 2022.
2022
-
[48]
Imagenet large scale visual recognition challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
2015
-
[49]
On the pitfalls of heteroscedastic uncertainty estimation with probabilistic neural networks
Maximilian Seitzer, Arash Tavakoli, Dimitrije Antic, and Georg Martius. On the pitfalls of heteroscedastic uncertainty estimation with probabilistic neural networks. In International Conference on Learning Representations.
-
[50]
Denoising diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
2021
-
[51]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
2020
-
[52]
Roformer: Enhanced transformer with rotary position embedding, 2021
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2021.
2021
-
[53]
Journeydb: A benchmark for generative image understanding
Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. Journeydb: A benchmark for generative image understanding. Advances in neural information processing systems, 36:49659–49678, 2023.
2023
-
[54]
Autoregressive model beats diffusion: Llama for scalable image generation, 2024
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation, 2024.
2024
-
[55]
Emu3: Next-token prediction is all you need, 2024
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, and Zhongyuan Wang. Emu3: Next-token prediction is all you need, 2024.
2024
-
[56]
Spatial reasoning with denoising models
Christopher Wewer, Bart Pogodzinski, Bernt Schiele, and Jan Eric Lenssen. Spatial reasoning with denoising models. arXiv preprint arXiv:2502.21075, 2025.
-
[57]
Representation alignment for generation: Training diffusion transformers is easier than you think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In The Thirteenth International Conference on Learning Representations.
-
[58]
Adadiff: Adaptive step selection for fast diffusion models
Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, and Yu-Gang Jiang. Adadiff: Adaptive step selection for fast diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9914–9922, 2025.
2025
-
[59]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023.
2023
-
[60]
Diffusion Transformers with Representation Autoencoders
Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.
2025
-
[61]
Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.
-
[62]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
2025
-
[63]
Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, and Chang Wen Chen. Sd-dit: Unleashing the power of self-supervised discrimination in diffusion transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8435–8445, 2024.