arxiv: 2501.09732 · v1 · pith:EMM5433Enew · submitted 2025-01-16 · 💻 cs.CV

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Pith reviewed 2026-05-20 11:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · inference-time scaling · noise search · image generation · verifiers · sampling algorithms · generative models

The pith

Diffusion models improve image quality by searching for better starting noises with extra inference compute rather than more denoising steps alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores scaling diffusion model performance at inference time by going beyond the usual practice of adding more denoising steps. It recasts the selection of the initial noise as a search problem that uses verifiers to score candidate noises and algorithms to locate better ones. Experiments across class-conditioned and text-conditioned image benchmarks show that this extra computation produces higher-quality samples even after denoising-step gains have flattened. A sympathetic reader would care because the work suggests a practical route to test-time scaling in generative models without retraining or enlarging the network.

Core claim

By treating noise selection for the diffusion sampling process as a search problem along the two axes of verifiers and search algorithms, increasing inference-time compute leads to substantial improvements in the quality of generated samples on class-conditioned and text-conditioned image generation benchmarks.

What carries the argument

A search procedure over initial noise inputs that combines verifiers to score candidates with algorithms that explore the space to select superior noises for diffusion sampling.
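
As a rough illustration of what such a procedure can look like, the sketch below implements best-of-N random search, about the simplest form of verifier-guided noise selection. It is not taken from the paper: `toy_sampler` and `toy_verifier` are hypothetical stand-ins for a frozen diffusion sampler and a real verifier, and every name here is an assumption made for the example.

```python
import random


def sample_noise(rng, dim):
    """Draw one candidate initial noise vector from a standard Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_sampler(noise):
    """Hypothetical stand-in for a frozen diffusion sampler (noise -> sample)."""
    return [0.5 * x for x in noise]


def toy_verifier(sample):
    """Hypothetical stand-in for a verifier; a higher score means 'better'."""
    return -sum(x * x for x in sample)


def random_search(n_candidates, dim=8, seed=0):
    """Best-of-N search: score n_candidates noises and keep the top one."""
    rng = random.Random(seed)
    best_score, best_noise = float("-inf"), None
    for _ in range(n_candidates):
        noise = sample_noise(rng, dim)
        score = toy_verifier(toy_sampler(noise))
        if score > best_score:
            best_score, best_noise = score, noise
    return best_noise, best_score
```

With a fixed seed, raising `n_candidates` can only keep or improve the best verifier score, since the earlier candidates are drawn again in the same order. Whether a higher verifier score means a better image is a separate question that this sketch does not address.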

If this is right

  • Sample quality keeps rising as more compute is allocated to noise search after denoising steps plateau.
  • Different verifier-algorithm pairs can be selected to match specific application needs.
  • Gains appear consistently on both class-conditional and text-conditional image tasks.
  • The method provides a way to scale generation performance without changing the trained model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Verifier design may become an important research direction for aligning scores more closely with human judgment.
  • The same search idea could extend to diffusion models used for video or 3D content.
  • System designers might trade off model size against inference-time search budget in future scaling curves.

Load-bearing premise

The verifiers used to score noise candidates give feedback that aligns with actual improvements in image quality rather than rewarding artifacts or overfitting to the verifier.

What would settle it

A controlled human preference study or downstream-task evaluation in which samples chosen by the noise-search method are not rated higher than samples from standard sampling given the same total inference compute.
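
One way to make "the same total inference compute" concrete is to count network function evaluations (NFEs). The sketch below assumes, purely for illustration, that each candidate costs a fixed number of denoising steps plus a fixed verifier cost expressed in the same unit. `candidates_for_budget` and its parameters are hypothetical names, not something the paper defines.

```python
def candidates_for_budget(total_nfe, steps_per_sample, verifier_cost=0):
    """Return how many noise candidates fit into a fixed NFE budget.

    Assumes each candidate costs `steps_per_sample` denoising NFEs plus
    `verifier_cost` NFE-equivalents for scoring (both are assumptions).
    """
    per_candidate = steps_per_sample + verifier_cost
    if per_candidate <= 0:
        raise ValueError("per-candidate cost must be positive")
    return total_nfe // per_candidate
```

Under this accounting, a baseline that spends its whole budget on one long sampling run would be compared against a search run that spends the same budget on `candidates_for_budget(...)` shorter runs.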

Original abstract

Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and with the complicated nature of images, combinations of the components in the framework can be specifically chosen to conform with different application scenario.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that diffusion models exhibit inference-time scaling beyond increasing denoising steps by framing initial noise selection as a search problem over candidates scored by external verifiers (such as CLIP similarity for text-to-image or class-conditional classifiers) and solved via various search algorithms; extensive experiments on class-conditioned and text-conditioned image generation benchmarks show that additional compute yields substantial quality gains, with component choices adaptable to different scenarios.

Significance. If the gains prove robust to independent quality measures, the work would usefully extend scaling ideas from LLMs to diffusion models and supply a practical modular framework for trading inference compute for sample quality. The modular design along verifiers and algorithms, together with the reported benchmark improvements, supplies a concrete starting point for follow-on research on test-time compute in generative vision.

major comments (2)
  1. [§4] §4 (Experiments): the reported quality improvements are obtained by ranking candidates with the same verifiers used for search (CLIP, classifiers); no correlation analysis against verifier-independent metrics (e.g., FID computed without the search verifier) or held-out human preference studies is described, leaving open whether the scaling reflects genuine perceptual improvement or verifier exploitation.
  2. [§3] §3 (Framework): the central claim that search yields better samples rests on the assumption that verifier scores reliably track downstream quality; the manuscript provides no robustness checks against known verifier failure modes such as favoring high-frequency artifacts or prompt shortcuts, which directly affects whether the observed compute-quality curve is load-bearing.
minor comments (2)
  1. [Abstract] Abstract: the quantitative magnitude of the reported gains (e.g., FID or IS deltas) is not stated, making it harder to gauge practical impact.
  2. [§4] Figure captions in §4: several comparison figures would benefit from explicit annotation of which search/verifier combination produced each row to aid reproducibility.
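
The correlation analysis requested in major comment 1 could be sketched as a Spearman rank correlation between the search verifier's scores and a verifier-independent metric. The example below is an assumption about how such a check might be set up, not the paper's protocol. It treats each list entry as one compute budget (for instance, the mean verifier score and a held-out metric measured at that budget), and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))
```

A coefficient near +1 would indicate that the two measures rank the budgets the same way; for a lower-is-better metric such as FID, the sign flips. A weak or wrong-signed correlation would support the verifier-exploitation concern.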

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work exploring inference-time scaling in diffusion models. We address the major concerns regarding evaluation and verifier reliability below, and we will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): the reported quality improvements are obtained by ranking candidates with the same verifiers used for search (CLIP, classifiers); no correlation analysis against verifier-independent metrics (e.g., FID computed without the search verifier) or held-out human preference studies is described, leaving open whether the scaling reflects genuine perceptual improvement or verifier exploitation.

    Authors: We appreciate this observation. Our experiments primarily rely on the verifiers for guiding the search, but the final quality is assessed using benchmark-standard metrics such as FID for class-conditional generation and CLIP similarity for text-to-image. To directly address the concern about verifier-independent evaluation, we will add a correlation analysis in the revised version, where we compute FID scores independently and examine their correlation with the verifier scores used in search. For human preference studies, while not included in the current manuscript due to resource constraints, we acknowledge their value and will include a discussion of this as a limitation along with preliminary results if feasible, or outline plans for future validation. revision: partial

  2. Referee: [§3] §3 (Framework): the central claim that search yields better samples rests on the assumption that verifier scores reliably track downstream quality; the manuscript provides no robustness checks against known verifier failure modes such as favoring high-frequency artifacts or prompt shortcuts, which directly affects whether the observed compute-quality curve is load-bearing.

    Authors: We agree that verifying the robustness of the verifiers is important for the validity of the scaling claims. The manuscript selects verifiers based on their established use in prior work for similar tasks. However, we did not include explicit checks for failure modes like artifact favoring or shortcut exploitation. In the revision, we will add a new subsection discussing potential verifier limitations and include additional experiments, such as evaluating samples under verifier perturbations or comparing with alternative scoring methods, to demonstrate that the observed improvements hold under these considerations. revision: yes
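
One simple form the promised comparison with alternative scoring methods could take is a top-k agreement check: score the same candidates with two scorers and measure how much their top-k selections overlap. This is a hedged sketch, not the authors' planned experiment, and `top_k_overlap` is a hypothetical helper.

```python
def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the top-k candidates (by score) shared by two scorers.

    scores_a[i] and scores_b[i] must refer to the same candidate i;
    higher scores are assumed to be better for both scorers.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    if not 1 <= k <= len(scores_a):
        raise ValueError("k must be between 1 and the number of candidates")
    indices = range(len(scores_a))
    top_a = set(sorted(indices, key=lambda i: scores_a[i], reverse=True)[:k])
    top_b = set(sorted(indices, key=lambda i: scores_b[i], reverse=True)[:k])
    return len(top_a & top_b) / k
```

Low overlap between the search verifier and an independent or perturbed scorer would suggest that the selected noises depend on quirks of one verifier rather than on properties both scorers reward.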

Circularity Check

0 steps flagged

No circularity: empirical search framework with external verifiers

Full rationale

The paper presents an empirical exploration of inference-time scaling via a search over noise candidates scored by external verifiers (e.g., CLIP or classifiers) and search algorithms. No equations, derivations, or self-referential definitions appear in the provided abstract or description that reduce the reported quality improvements to fitted parameters or inputs by construction. The central findings rest on benchmark experiments, which function as independent external validation rather than self-citation chains or ansatz smuggling. This is the most common honest outcome for an applied empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that verifier-guided search improves samples; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5776 in / 1008 out tokens · 30676 ms · 2026-05-20T11:39:52.193723+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Do generative video models understand physical principles?

    cs.CV 2025-01 unverdicted novelty 8.0

    Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.

  2. TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

  3. Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

  4. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  5. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 7.0

    FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.

  6. PhaseFlow4D: Physically Constrained 4D Beam Reconstruction via Feedback-Guided Latent Diffusion

    physics.acc-ph 2026-04 unverdicted novelty 7.0

    PhaseFlow4D reconstructs and tracks 4D beam phase space from 2D projections via a latent diffusion model with built-in physics constraints, achieving 11000x speedup over full simulations while following time-varying c...

  7. Reflective Flow Sampling Enhancement

    cs.CV 2026-03 unverdicted novelty 7.0

    RF-Sampling enhances flow matching models by implicitly performing gradient ascent on text-image alignment scores via linear textual combinations and flow inversion.

  8. It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models

    cs.CV 2025-12 unverdicted novelty 7.0

    Noise optimization during sampling recovers diversity in mode-collapsed diffusion models while preserving output fidelity.

  9. Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution

    cs.CV 2025-12 unverdicted novelty 7.0

    IAFS is a training-free iterative inference-time scaling framework that uses adaptive frequency-aware particle fusion to resolve the perception-fidelity conflict in diffusion super-resolution models, outperforming pri...

  10. dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

    cs.CV 2025-12 conditional novelty 7.0

    dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.

  11. Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement

    cs.LG 2025-07 conditional novelty 7.0

    PG-DLM applies particle Gibbs sampling over full trajectories in diffusion language models to enable iterative refinement, yielding higher accuracy on reward-guided generation with theoretical convergence guarantees.

  12. In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

    cs.CV 2025-04 unverdicted novelty 7.0

    ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.

  13. LatentBox: An Efficient Latent-First Storage System for AI-Generated Images

    cs.DC 2026-05 conditional novelty 6.0

    LatentBox is a latent-first storage system that stores compressed latents as durable objects with on-demand GPU reconstruction and dynamic image/latent caching, achieving 78.7% storage reduction with competitive laten...

  14. Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

    cs.CV 2026-05 unverdicted novelty 6.0

    MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on...

  15. EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.

  16. Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.

  17. FASTER: Value-Guided Sampling for Fast RL

    cs.LG 2026-04 unverdicted novelty 6.0

    FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

  18. FRIGID: Scaling Diffusion-Based Molecular Generation from Mass Spectra at Training and Inference Time

    cs.LG 2026-04 unverdicted novelty 6.0

    FRIGID scales a diffusion-based model for de novo molecular structure generation from mass spectra, reaching over 18% top-1 accuracy on MassSpecGym and tripling prior bests on NPLIB1 via large unlabeled training and i...

  19. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 6.0

    VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...

  20. CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

    cs.LG 2026-03 unverdicted novelty 6.0

    CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...

  21. RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

    cs.CV 2025-10 unverdicted novelty 6.0

    RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

  22. Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

    cs.CV 2026-05 unverdicted novelty 5.0

    DeScore decouples explicit CoT reasoning from reward regression in video reward models via a two-stage cold-start plus dual-objective RL training pipeline.

  23. The Serial Scaling Hypothesis

    cs.LG 2025-07 unverdicted novelty 5.0

    The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 21 Pith papers · 22 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    D. Ahn, J. Kang, S. Lee, J. Min, M. Kim, W. Jang, H. Cho, S. Paul, S. Kim, E. Cha, K. H. Jin, and S. Kim. A noise is worth diffusion guidance.arXiv preprint arXiv:2412.03895, 2024

  3. [3]

    Buildingnormalizingflowswithstochasticinterpolants

    M.S.AlbergoandE.Vanden-Eijnden. Buildingnormalizingflowswithstochasticinterpolants. In The Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview. net/forum?id=li7qeBbCR1t

  4. [4]

    B. D. Anderson. Reverse-time diffusion equation models.Stochastic Processes and their Applications, 12(3):313–326, 1982

  5. [5]

    Ben-Hamu, O

    H. Ben-Hamu, O. Puny, I. Gat, B. Karrer, U. Singer, and Y. Lipman. D-flow: Differentiating through flows for controlled generation.arXiv preprint arXiv:2402.14017, 2024

  6. [6]

    Training Diffusion Models with Reinforcement Learning

    K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023

  7. [7]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

  8. [8]

    J. Chen, C. Ge, E. Xie, Y. Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li. Pixart-\sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation.arXiv preprint arXiv:2403.04692, 2024

  9. [9]

    T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost.arXiv preprint arXiv:1604.06174, 2016

  10. [10]

    Z. Chen, Y. Du, Z. Wen, Y. Zhou, C. Cui, Z. Weng, H. Tu, C. Wang, Z. Tong, Q. Huang, et al. Mj-bench: Is your multimodal reward model really a good judge for text-to-image generation?arXiv preprint arXiv:2407.04842, 2024

  11. [11]

    Clark, P

    K. Clark, P. Vicol, K. Swersky, and D. J. Fleet. Directly fine-tuning diffusion models on differentiable rewards. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps: //openreview.net/forum?id=1vmSEVL19f

  12. [12]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 17

  13. [13]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  14. [14]

    Domingo-Enrich, M

    C. Domingo-Enrich, M. Drozdzal, B. Karrer, and R. T. Q. Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control.arXiv preprint arXiv:2409.08861, 2024

  15. [15]

    Esser, S

    P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learning, 2024

  16. [16]

    Eyring, S

    L. Eyring, S. Karthik, K. Roth, A. Dosovitskiy, and Z. Akata. Reno: Enhancing one-step text-to-image models through reward-based noise optimization.arXiv preprint arXiv:2406.04312, 2024

  17. [17]

    Fan and K

    Y. Fan and K. Lee. Optimizing DDPM sampling with shortcut fine-tuning. InICML, volume 202 of Proceedings of Machine Learning Research, pages 9623–9639. PMLR, 2023

  18. [18]

    Y. Fan, O. Watkins, Y. Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee. Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36, 2024

  19. [19]

    A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient.arXiv preprint cs/0408007, 2004

  20. [20]

    Gandhi, D

    K. Gandhi, D. Lee, G. Grand, M. Liu, W. Cheng, A. Sharma, and N. D. Goodman. Stream of search (sos): Learning to search in language.arXiv preprint arXiv:2404.03683, 2024

  21. [21]

    L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization. InInternational Conference on Machine Learning, pages 10835–10866. PMLR, 2023

  22. [22]

    Goodfellow, J

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets.Advances in neural information processing systems, 27, 2014

  23. [23]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi. Clipscore: A reference-free evaluation metric for image captioning.arXiv preprint arXiv:2104.08718, 2021

  24. [24]

    Heusel, H

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time- scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

  25. [25]

    Classifier-Free Diffusion Guidance

    J. Ho and T. Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  26. [26]

    Denoisingdiffusionprobabilisticmodels

    J.Ho,A.Jain,andP.Abbeel. Denoisingdiffusionprobabilisticmodels. AdvancesinNeuralInformation Processing Systems, 33:6840–6851, 2020

  27. [27]

    Training Compute-Optimal Large Language Models

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

  28. [28]

    X. Hu, T. He, and D. Wipf. New desiderata for direct preference optimization.arXiv preprint arXiv:2407.09072, 2024. 18

  29. [29]

    Y. Hu, B. Liu, J. Kasai, Y. Wang, M. Ostendorf, R. Krishna, and N. A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023

  30. [30]

    Huang, K

    K. Huang, K. Sun, E. Xie, Z. Li, and X. Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation.Advances in Neural Information Processing Systems, 36:78723–78747, 2023

  31. [31]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  32. [32]

    Karras, M

    T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors,Advances in Neural Information Processing Systems, 2022. URLhttps://openreview.net/forum?id=k7FuTOWMOc7

  33. [33]

    Karras, M

    T. Karras, M. Aittala, J. Lehtinen, J. Hellsten, T. Aila, and S. Laine. Analyzing and improving the training dynamics of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024

  34. [34]

    Karthik, K

    S. Karthik, K. Roth, M. Mancini, and Z. Akata. If at first you don’t succeed, try, try again: Faithful diffusion-based text-to-image generation by selection.arXiv preprint arXiv:2305.13308, 2023

  35. [35]

    Karunratanakul, K

    K. Karunratanakul, K. Preechakul, E. Aksan, T. Beeler, S. Suwajanakorn, and S. Tang. Optimizing diffusion noise can serve as universal motion priors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1334–1345, 2024

  36. [36]

    Khader, G

    F. Khader, G. Mueller-Franzes, S. T. Arasteh, T. Han, C. Haarburger, M. Schulze-Hagen, P. Schad, S. Engelhardt, B. Baessler, S. Foersch, et al. Medical diffusion: denoising diffusion probabilistic models for 3d medical image generation.arXiv preprint arXiv:2211.03364, 2022

  37. [37]

    K. Kim, J. Jeong, M. An, M. Ghavamzadeh, K. Dvijotham, J. Shin, and K. Lee. Confidence-aware reward optimization for fine-tuning text-to-image models.arXiv preprint arXiv:2404.01863, 2024

  38. [38]

    D. P. Kingma. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

  39. [39]

    Kirstain, A

    Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:36652–36663, 2023

  40. [40]

    Kynkäänniemi, T

    T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila. Improved precision and recall metric for assessing generative models.Advances in neural information processing systems, 32, 2019

  41. [41]

    B. F. Labs. Flux.1 [dev].https://blackforestlabs.ai/

  42. [42]

    T. Lee, M. Yasunaga, C. Meng, Y. Mai, J. S. Park, A. Gupta, Y. Zhang, D. Narayanan, H. Teufel, M. Bellagente, et al. Holistic evaluation of text-to-image models.Advances in Neural Information Processing Systems, 36, 2024

  43. [43]

    J. Li, D. Li, C. Xiong, and S. Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022. 19

  44. [44]

    Process reward model with q- value rankings.arXiv preprint arXiv:2410.11287, 2024

    W. Li and Y. Li. Process reward model with q-value rankings.arXiv preprint arXiv:2410.11287, 2024

  45. [45]

    Let's Verify Step by Step

    H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

  46. [46]

    Lipman, R

    Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps: //openreview.net/forum?id=PqvMRDCJT9t

  47. [47]

    X. Liu, C. Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XVjTT1nw5z

  48. [48]

    Y. Liu, Y. Zhang, T. Jaakkola, and S. Chang. Correcting diffusion generation through resampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8713–8723, 2024

  49. [49]

    Y. Lu, X. Yang, X. Li, X. E. Wang, and W. Y. Wang. Llmscore: Unveiling the power of large language models in text-to-image synthesis evaluation.Advances in Neural Information Processing Systems, 36, 2024

  50. [50]

    N. Ma, M. Goldstein, M. S. Albergo, N. M. Boffi, E. Vanden-Eijnden, and S. Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers.arXiv preprint arXiv:2401.08740, 2024

  51. [51]

    B. Na, Y. Kim, M. Park, D. Shin, W. Kang, and I.-C. Moon. Diffusion rejection sampling.arXiv preprint arXiv:2405.17880, 2024

  52. [52]

    Novack, J

    Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan. Ditto: Diffusion inference-time t- optimization for music generation.arXiv preprint arXiv:2401.12179, 2024

  53. [53]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  54. [54]

    Ouyang, J

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  55. [55]

    A. Pan, K. Bhatia, and J. Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models.arXiv preprint arXiv:2201.03544, 2022

  56. [56]

    Movie Gen: A Cast of Media Foundation Models

    A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y. Ma, C.-Y. Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

  57. [57]

    Z. Qi, L. Bai, H. Xiong, et al. Not all noises are created equally: Diffusion noise selection and optimization. arXiv preprint arXiv:2407.14041, 2024

  58. [58]

    Radford, J

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. InInterna- tional conference on machine learning, pages 8748–8763. PMLR, 2021. 20

  59. [59]

    Rafailov, A

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems, 2023

  60. [60]

    A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021

  61. [61]

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  62. [62]

    S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016

  63. [63]

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

  64. [64]

    T. Salimans and J. Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI

  65. [65]

    T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016

  66. [66]

    D. Samuel, R. Ben-Ari, S. Raviv, N. Darshan, and G. Chechik. Generating images of rare concepts using pre-trained diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4695–4703, 2024

  67. [67]

    F. Schneider. Archisound: Audio generation with diffusion. arXiv preprint arXiv:2301.13267, 2023

  68. [68]

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022

  69. [69]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  70. [70]

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015

  71. [71]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP

  72. [72]

    Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS

  73. [73]

    Y. Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023

  74. [74]

    D. Su, S. Sukhbaatar, M. Rabbat, Y. Tian, and Q. Zheng. Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces. arXiv preprint arXiv:2410.09918, 2024

  75. [75]

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016

  76. [76]

    Z. Tan, X. Yang, L. Qin, M. Yang, C. Zhang, and H. Li. Evalalign: Supervised fine-tuning multimodal llms with human-aligned data for evaluating text-to-image models. arXiv preprint arXiv:2406.16562, 2024

  77. [77]

    L. Tang, N. Ruiz, Q. Chu, Y. Li, A. Holynski, D. E. Jacobs, B. Hariharan, Y. Pritch, N. Wadhwa, K. Aberman, et al. Realfill: Reference-driven generation for authentic image completion. ACM Transactions on Graphics (TOG), 43(4):1–12, 2024

  78. [78]

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  79. [79]

    K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In Advances in Neural Information Processing Systems, 2024

  80. [80]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

Showing first 80 references.