Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Haolin Jia; Hexiang Hu; Mingda Zhang; Nanye Ma; Saining Xie; Shangyuan Tong; Tommi Jaakkola; Xuan Yang; Xuhui Jia; Yandong Li

arxiv: 2501.09732 · v1 · pith:EMM5433Enew · submitted 2025-01-16 · 💻 cs.CV

Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, Saining Xie

Pith reviewed 2026-05-20 11:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion models · inference-time scaling · noise search · image generation · verifiers · sampling algorithms · generative models

The pith

Diffusion models improve image quality by searching for better starting noises with extra inference compute rather than more denoising steps alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores scaling diffusion model performance at inference time by going beyond the usual practice of adding more denoising steps. It recasts the selection of the initial noise as a search problem that uses verifiers to score candidate noises and algorithms to locate better ones. Experiments across class-conditioned and text-conditioned image benchmarks show that this extra computation produces higher-quality samples even after denoising-step gains have flattened. A sympathetic reader would care because the work suggests a practical route to test-time scaling in generative models without retraining or enlarging the network.

Core claim

By treating noise selection for the diffusion sampling process as a search problem along the two axes of verifiers and search algorithms, increasing inference-time compute leads to substantial improvements in the quality of generated samples on class-conditioned and text-conditioned image generation benchmarks.

What carries the argument

A search procedure over initial noise inputs that combines verifiers to score candidates with algorithms that explore the space to select superior noises for diffusion sampling.
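As a rough illustration of the idea, the simplest form of such a search is best-of-N selection: draw several Gaussian noises, run the sampler on each, and keep the candidate the verifier scores highest. The sketch below is not the paper's implementation. `toy_sampler` and `toy_verifier` are hypothetical stand-ins for a real diffusion sampler and a real verifier, and the function name `best_of_n_noise_search` is invented for this example.

```python
import math
import random
from typing import Callable, List, Tuple

Noise = List[float]
Sample = List[float]


def toy_sampler(noise: Noise) -> Sample:
    """Hypothetical stand-in for a deterministic diffusion sampler."""
    return [math.tanh(z) for z in noise]


def toy_verifier(sample: Sample) -> float:
    """Hypothetical stand-in for a verifier; higher scores are better."""
    return -sum((x - 0.5) ** 2 for x in sample)


def best_of_n_noise_search(
    sampler: Callable[[Noise], Sample],
    verifier: Callable[[Sample], float],
    dim: int,
    n_candidates: int,
    seed: int = 0,
) -> Tuple[Noise, Sample, float]:
    """Draw n_candidates Gaussian noises and keep the best-scoring one."""
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    rng = random.Random(seed)
    best_noise: Noise = []
    best_sample: Sample = []
    best_score = -math.inf
    for _ in range(n_candidates):
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sample = sampler(noise)      # one full sampling run per candidate
        score = verifier(sample)     # verifier feedback on the final sample
        if score > best_score:
            best_noise, best_sample, best_score = noise, sample, score
    return best_noise, best_sample, best_score


if __name__ == "__main__":
    # With a fixed seed, larger N reuses the same leading candidates,
    # so the best verifier score can only stay the same or improve.
    for n in (1, 4, 16, 64):
        _, _, score = best_of_n_noise_search(
            toy_sampler, toy_verifier, dim=8, n_candidates=n
        )
        print(f"N={n:3d}  best verifier score={score:.4f}")
```

The score that improves here is the verifier's own, which is exactly why the premise discussed below matters: a rising verifier score does not by itself show that samples look better.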

If this is right

Sample quality keeps rising as more compute is allocated to noise search after denoising steps plateau.
Different verifier-algorithm pairs can be selected to match specific application needs.
Gains appear consistently on both class-conditional and text-conditional image tasks.
The method provides a way to scale generation performance without changing the trained model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Verifier design may become an important research direction for aligning scores more closely with human judgment.
The same search idea could extend to diffusion models used for video or 3D content.
System designers might trade off model size against inference-time search budget in future scaling curves.

Load-bearing premise

The verifiers used to score noise candidates give feedback that aligns with actual improvements in image quality rather than rewarding artifacts or overfitting to the verifier.

What would settle it

A controlled human preference study or downstream-task evaluation in which samples chosen by the noise-search method are not rated higher than samples from standard sampling given the same total inference compute.
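To make "the same total inference compute" concrete, one possible accounting is sketched below. It assumes that compute is counted in network function evaluations (NFEs), that each denoising step costs one NFE, and that the verifier's cost per candidate is expressed in NFE-equivalents. These conventions and both function names are assumptions made for illustration, not details taken from the paper.

```python
import math


def search_cost_nfe(
    n_candidates: int, denoise_steps: int, verifier_cost: float = 0.0
) -> float:
    """Total cost of sampling and scoring n_candidates, in NFE-equivalents.

    Assumes one network evaluation per denoising step (an assumption;
    guidance or higher-order solvers would change this factor).
    """
    if n_candidates < 1 or denoise_steps < 1 or verifier_cost < 0:
        raise ValueError("invalid budget parameters")
    return n_candidates * (denoise_steps + verifier_cost)


def matched_baseline_steps(
    n_candidates: int, denoise_steps: int, verifier_cost: float = 0.0
) -> int:
    """Denoising steps a single no-search run could use under the same budget."""
    return math.floor(search_cost_nfe(n_candidates, denoise_steps, verifier_cost))


if __name__ == "__main__":
    # 8 candidates at 50 steps each, verifier treated as free: 400 NFEs,
    # so the fair baseline is one sample drawn with 400 denoising steps.
    print(search_cost_nfe(8, 50))
    print(matched_baseline_steps(8, 50))
```

Under this accounting, the comparison that would settle the question pits the searched sample against a single sample given all of the budget as denoising steps, or against several unsearched samples at the original step count.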

Original abstract

Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and with the complicated nature of images, combinations of the components in the framework can be specifically chosen to conform with different application scenario.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague.

Searching over initial noises with verifiers gives diffusion models extra inference-time gains beyond more denoising steps, but the results rest on whether those verifiers track real quality.

The main thing to know is that this paper treats the starting noise as a searchable variable and shows that extra inference compute spent on finding better noises can improve sample quality in diffusion models after denoising steps stop helping much. The authors organize the space around verifiers for scoring candidates and algorithms for searching, then test combinations on class-conditional and text-to-image benchmarks. The experiments report clear improvements as compute increases, which is a direct and useful demonstration of the scaling effect they claim.

This framing is new relative to prior work that mostly varied step count or model size. The setup is straightforward, and the benchmark results give a concrete sense of what the gains look like in practice.

The soft spot is the verifier step. The stress-test note flags that common verifier choices such as CLIP or class classifiers can favor artifacts or shortcuts instead of human-visible quality, a risk the abstract does not address. No details appear on human preference studies, correlation with independent metrics such as FID computed outside the verifier, or robustness ablations. If those checks are missing from the full paper, the reported scaling benefit could be partly illusory. The central claim still holds up on its own terms as an engineering result on the benchmarks they ran.

This is for groups working on test-time compute for generative models or anyone trying to squeeze more performance out of existing diffusion checkpoints without retraining. A reader focused on inference scaling would get practical value from the framework and the reported trade-offs. It deserves a serious referee because the idea is timely, the experiments are on standard tasks, and the design space is laid out clearly enough to review.

Referee Report

2 major / 2 minor

Summary. The paper claims that diffusion models exhibit inference-time scaling beyond increasing denoising steps by framing initial noise selection as a search problem over candidates scored by external verifiers (such as CLIP similarity for text-to-image or class-conditional classifiers) and solved via various search algorithms; extensive experiments on class-conditioned and text-conditioned image generation benchmarks show that additional compute yields substantial quality gains, with component choices adaptable to different scenarios.

Significance. If the gains prove robust to independent quality measures, the work would usefully extend scaling ideas from LLMs to diffusion models and supply a practical modular framework for trading inference compute for sample quality. The modular design along verifiers and algorithms, together with the reported benchmark improvements, supplies a concrete starting point for follow-on research on test-time compute in generative vision.

major comments (2)

[§4] §4 (Experiments): the reported quality improvements are obtained by ranking candidates with the same verifiers used for search (CLIP, classifiers); no correlation analysis against verifier-independent metrics (e.g., FID computed without the search verifier) or held-out human preference studies is described, leaving open whether the scaling reflects genuine perceptual improvement or verifier exploitation.
[§3] §3 (Framework): the central claim that search yields better samples rests on the assumption that verifier scores reliably track downstream quality; the manuscript provides no robustness checks against known verifier failure modes such as favoring high-frequency artifacts or prompt shortcuts, which directly affects whether the observed compute-quality curve is load-bearing.
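The correlation analysis requested in the first major comment could take roughly the following form: collect one verifier score and one verifier-independent score per evaluated item (for example, per search configuration), then compute a rank correlation between the two. The sketch below implements Spearman's rank correlation with the standard library only. The function names are invented for this example, and it is only one plausible way to run such a check, not a procedure described in the manuscript.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Synthetic placeholder values, not results from the paper.
    verifier_scores = [0.1, 0.4, 0.35, 0.8]
    independent_scores = [1.0, 2.0, 3.0, 4.0]
    print(spearman(verifier_scores, independent_scores))
```

For metrics where lower is better, such as FID, the independent scores would need to be negated first so that both sequences point in the same direction. A correlation near 1 would support the claim that search improves real quality; a weak or negative one would point to verifier exploitation.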

minor comments (2)

[Abstract] Abstract: the quantitative magnitude of the reported gains (e.g., FID or IS deltas) is not stated, making it harder to gauge practical impact.
[§4] Figure captions in §4: several comparison figures would benefit from explicit annotation of which search/verifier combination produced each row to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work exploring inference-time scaling in diffusion models. We address the major concerns regarding evaluation and verifier reliability below, and we will incorporate revisions to strengthen the manuscript.

Referee: [§4] §4 (Experiments): the reported quality improvements are obtained by ranking candidates with the same verifiers used for search (CLIP, classifiers); no correlation analysis against verifier-independent metrics (e.g., FID computed without the search verifier) or held-out human preference studies is described, leaving open whether the scaling reflects genuine perceptual improvement or verifier exploitation.

Authors: We appreciate this observation. Our experiments primarily rely on the verifiers for guiding the search, but the final quality is assessed using benchmark-standard metrics such as FID for class-conditional generation and CLIP similarity for text-to-image. To directly address the concern about verifier-independent evaluation, we will add a correlation analysis in the revised version, where we compute FID scores independently and examine their correlation with the verifier scores used in search. For human preference studies, while not included in the current manuscript due to resource constraints, we acknowledge their value and will include a discussion of this as a limitation along with preliminary results if feasible, or outline plans for future validation.

revision: partial

Referee: [§3] §3 (Framework): the central claim that search yields better samples rests on the assumption that verifier scores reliably track downstream quality; the manuscript provides no robustness checks against known verifier failure modes such as favoring high-frequency artifacts or prompt shortcuts, which directly affects whether the observed compute-quality curve is load-bearing.

Authors: We agree that verifying the robustness of the verifiers is important for the validity of the scaling claims. The manuscript selects verifiers based on their established use in prior work for similar tasks. However, we did not include explicit checks for failure modes like artifact favoring or shortcut exploitation. In the revision, we will add a new subsection discussing potential verifier limitations and include additional experiments, such as evaluating samples under verifier perturbations or comparing with alternative scoring methods, to demonstrate that the observed improvements hold under these considerations.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical search framework with external verifiers

The paper presents an empirical exploration of inference-time scaling via a search over noise candidates scored by external verifiers (e.g., CLIP or classifiers) and search algorithms. No equations, derivations, or self-referential definitions appear in the provided abstract or description that reduce the reported quality improvements to fitted parameters or inputs by construction. The central findings rest on benchmark experiments, which function as independent external validation rather than self-citation chains or ansatz smuggling. This is the most common honest outcome for an applied empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that verifier-guided search improves samples; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5776 in / 1008 out tokens · 30676 ms · 2026-05-20T11:39:52.193723+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do generative video models understand physical principles?
cs.CV 2025-01 unverdicted novelty 8.0

Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
cs.CV 2026-05 unverdicted novelty 7.0

TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
cs.CV 2026-04 unverdicted novelty 7.0

Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

Personalizing Text-to-Image Generation to Individual Taste
cs.CV 2026-04 unverdicted novelty 7.0

PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
cs.AI 2026-04 unverdicted novelty 7.0

FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.

PhaseFlow4D: Physically Constrained 4D Beam Reconstruction via Feedback-Guided Latent Diffusion
physics.acc-ph 2026-04 unverdicted novelty 7.0

PhaseFlow4D reconstructs and tracks 4D beam phase space from 2D projections via a latent diffusion model with built-in physics constraints, achieving 11000x speedup over full simulations while following time-varying c...

Reflective Flow Sampling Enhancement
cs.CV 2026-03 unverdicted novelty 7.0

RF-Sampling enhances flow matching models by implicitly performing gradient ascent on text-image alignment scores via linear textual combinations and flow inversion.

It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
cs.CV 2025-12 unverdicted novelty 7.0

Noise optimization during sampling recovers diversity in mode-collapsed diffusion models while preserving output fidelity.

Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution
cs.CV 2025-12 unverdicted novelty 7.0

IAFS is a training-free iterative inference-time scaling framework that uses adaptive frequency-aware particle fusion to resolve the perception-fidelity conflict in diffusion super-resolution models, outperforming pri...

dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
cs.CV 2025-12 conditional novelty 7.0

dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.

Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement
cs.LG 2025-07 conditional novelty 7.0

PG-DLM applies particle Gibbs sampling over full trajectories in diffusion language models to enable iterative refinement, yielding higher accuracy on reward-guided generation with theoretical convergence guarantees.

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
cs.CV 2025-04 unverdicted novelty 7.0

ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.

LatentBox: An Efficient Latent-First Storage System for AI-Generated Images
cs.DC 2026-05 conditional novelty 6.0

LatentBox is a latent-first storage system that stores compressed latents as durable objects with on-demand GPU reconstruction and dynamic image/latent caching, achieving 78.7% storage reduction with competitive laten...

Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos
cs.CV 2026-05 unverdicted novelty 6.0

MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on...

EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
cs.CV 2026-05 unverdicted novelty 6.0

DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.

FASTER: Value-Guided Sampling for Fast RL
cs.LG 2026-04 unverdicted novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

FRIGID: Scaling Diffusion-Based Molecular Generation from Mass Spectra at Training and Inference Time
cs.LG 2026-04 unverdicted novelty 6.0

FRIGID scales a diffusion-based model for de novo molecular structure generation from mass spectra, reaching over 18% top-1 accuracy on MassSpecGym and tripling prior bests on NPLIB1 via large unlabeled training and i...

VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
cs.AI 2026-04 unverdicted novelty 6.0

VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...

CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
cs.LG 2026-03 unverdicted novelty 6.0

CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
cs.CV 2025-10 unverdicted novelty 6.0

RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
cs.CV 2026-05 unverdicted novelty 5.0

DeScore decouples explicit CoT reasoning from reward regression in video reward models via a two-stage cold-start plus dual-objective RL training pipeline.

The Serial Scaling Hypothesis
cs.LG 2025-07 unverdicted novelty 5.0

The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 21 Pith papers · 22 internal anchors

[1]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

D. Ahn, J. Kang, S. Lee, J. Min, M. Kim, W. Jang, H. Cho, S. Paul, S. Kim, E. Cha, K. H. Jin, and S. Kim. A noise is worth diffusion guidance.arXiv preprint arXiv:2412.03895, 2024

work page arXiv 2024
[3]

Buildingnormalizingflowswithstochasticinterpolants

M.S.AlbergoandE.Vanden-Eijnden. Buildingnormalizingflowswithstochasticinterpolants. In The Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview. net/forum?id=li7qeBbCR1t

work page 2023
[4]

B. D. Anderson. Reverse-time diffusion equation models.Stochastic Processes and their Applications, 12(3):313–326, 1982

work page 1982
[5]

Ben-Hamu, O

H. Ben-Hamu, O. Puny, I. Gat, B. Karrer, U. Singer, and Y. Lipman. D-flow: Differentiating through flows for controlled generation.arXiv preprint arXiv:2402.14017, 2024

work page arXiv 2024
[6]

Training Diffusion Models with Reinforcement Learning

K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

J. Chen, C. Ge, E. Xie, Y. Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li. Pixart-\sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation.arXiv preprint arXiv:2403.04692, 2024

work page arXiv 2024
[9]

T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost.arXiv preprint arXiv:1604.06174, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[10]

Z. Chen, Y. Du, Z. Wen, Y. Zhou, C. Cui, Z. Weng, H. Tu, C. Wang, Z. Tong, Q. Huang, et al. Mj-bench: Is your multimodal reward model really a good judge for text-to-image generation?arXiv preprint arXiv:2407.04842, 2024

work page arXiv 2024
[11]

Clark, P

K. Clark, P. Vicol, K. Swersky, and D. J. Fleet. Directly fine-tuning diffusion models on differentiable rewards. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps: //openreview.net/forum?id=1vmSEVL19f

work page 2024
[12]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 17

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009
[14]

Domingo-Enrich, M

C. Domingo-Enrich, M. Drozdzal, B. Karrer, and R. T. Q. Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control.arXiv preprint arXiv:2409.08861, 2024

work page arXiv 2024
[15]

Esser, S

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learning, 2024

work page 2024
[16]

Eyring, S

L. Eyring, S. Karthik, K. Roth, A. Dosovitskiy, and Z. Akata. Reno: Enhancing one-step text-to-image models through reward-based noise optimization.arXiv preprint arXiv:2406.04312, 2024

work page arXiv 2024
[17]

Fan and K

Y. Fan and K. Lee. Optimizing DDPM sampling with shortcut fine-tuning. InICML, volume 202 of Proceedings of Machine Learning Research, pages 9623–9639. PMLR, 2023

work page 2023
[18]

Y. Fan, O. Watkins, Y. Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee. Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[19]

A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient.arXiv preprint cs/0408007, 2004

work page internal anchor Pith review Pith/arXiv arXiv 2004
[20]

Gandhi, D

K. Gandhi, D. Lee, G. Grand, M. Liu, W. Cheng, A. Sharma, and N. D. Goodman. Stream of search (sos): Learning to search in language.arXiv preprint arXiv:2404.03683, 2024

work page arXiv 2024
[21]

L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization. InInternational Conference on Machine Learning, pages 10835–10866. PMLR, 2023

work page 2023
[22]

Goodfellow, J

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets.Advances in neural information processing systems, 27, 2014

work page 2014
[23]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi. Clipscore: A reference-free evaluation metric for image captioning.arXiv preprint arXiv:2104.08718, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[24]

Heusel, H

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time- scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

work page 2017
[25]

Classifier-Free Diffusion Guidance

J. Ho and T. Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Denoisingdiffusionprobabilisticmodels

J.Ho,A.Jain,andP.Abbeel. Denoisingdiffusionprobabilisticmodels. AdvancesinNeuralInformation Processing Systems, 33:6840–6851, 2020

work page 2020
[27]

Training Compute-Optimal Large Language Models

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

X. Hu, T. He, and D. Wipf. New desiderata for direct preference optimization.arXiv preprint arXiv:2407.09072, 2024. 18

work page arXiv 2024
[29]

Y. Hu, B. Liu, J. Kasai, Y. Wang, M. Ostendorf, R. Krishna, and N. A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023

work page 2023
[30]

Huang, K

K. Huang, K. Sun, E. Xie, Z. Li, and X. Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation.Advances in Neural Information Processing Systems, 36:78723–78747, 2023

work page 2023
[31]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[32]

Karras, M

T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors,Advances in Neural Information Processing Systems, 2022. URLhttps://openreview.net/forum?id=k7FuTOWMOc7

work page 2022
[33]

Karras, M

T. Karras, M. Aittala, J. Lehtinen, J. Hellsten, T. Aila, and S. Laine. Analyzing and improving the training dynamics of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024

work page 2024
[34]

Karthik, K

S. Karthik, K. Roth, M. Mancini, and Z. Akata. If at first you don’t succeed, try, try again: Faithful diffusion-based text-to-image generation by selection.arXiv preprint arXiv:2305.13308, 2023

work page arXiv 2023
[35]

Karunratanakul, K

K. Karunratanakul, K. Preechakul, E. Aksan, T. Beeler, S. Suwajanakorn, and S. Tang. Optimizing diffusion noise can serve as universal motion priors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1334–1345, 2024

work page 2024
[36]

Khader, G

F. Khader, G. Mueller-Franzes, S. T. Arasteh, T. Han, C. Haarburger, M. Schulze-Hagen, P. Schad, S. Engelhardt, B. Baessler, S. Foersch, et al. Medical diffusion: denoising diffusion probabilistic models for 3d medical image generation.arXiv preprint arXiv:2211.03364, 2022

work page arXiv 2022
[37]

K. Kim, J. Jeong, M. An, M. Ghavamzadeh, K. Dvijotham, J. Shin, and K. Lee. Confidence-aware reward optimization for fine-tuning text-to-image models.arXiv preprint arXiv:2404.01863, 2024

work page arXiv 2024
[38]

D. P. Kingma. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[39]

Kirstain, A

Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:36652–36663, 2023

work page 2023
[40]

Kynkäänniemi, T

T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila. Improved precision and recall metric for assessing generative models.Advances in neural information processing systems, 32, 2019

work page 2019
[41]

B. F. Labs. Flux.1 [dev].https://blackforestlabs.ai/

work page
[42]

T. Lee, M. Yasunaga, C. Meng, Y. Mai, J. S. Park, A. Gupta, Y. Zhang, D. Narayanan, H. Teufel, M. Bellagente, et al. Holistic evaluation of text-to-image models.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[43]

J. Li, D. Li, C. Xiong, and S. Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022. 19

work page 2022
[44]

Process reward model with q- value rankings.arXiv preprint arXiv:2410.11287, 2024

W. Li and Y. Li. Process reward model with q-value rankings.arXiv preprint arXiv:2410.11287, 2024

work page arXiv 2024
[45]

Let's Verify Step by Step

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Lipman, R

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps: //openreview.net/forum?id=PqvMRDCJT9t

work page 2023
[47]

X. Liu, C. Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XVjTT1nw5z

work page 2023
[48]

Y. Liu, Y. Zhang, T. Jaakkola, and S. Chang. Correcting diffusion generation through resampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8713–8723, 2024
[49]

Y. Lu, X. Yang, X. Li, X. E. Wang, and W. Y. Wang. LLMScore: Unveiling the power of large language models in text-to-image synthesis evaluation. Advances in Neural Information Processing Systems, 36, 2024
[50]

N. Ma, M. Goldstein, M. S. Albergo, N. M. Boffi, E. Vanden-Eijnden, and S. Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740, 2024
[51]

B. Na, Y. Kim, M. Park, D. Shin, W. Kang, and I.-C. Moon. Diffusion rejection sampling. arXiv preprint arXiv:2405.17880, 2024
[52]

Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan. DITTO: Diffusion inference-time T-optimization for music generation. arXiv preprint arXiv:2401.12179, 2024
[53]

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
[54]

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022
[55]

A. Pan, K. Bhatia, and J. Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv preprint arXiv:2201.03544, 2022
[56]

A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y. Ma, C.-Y. Chuang, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024
[57]

Z. Qi, L. Bai, H. Xiong, et al. Not all noises are created equally: Diffusion noise selection and optimization. arXiv preprint arXiv:2407.14041, 2024
[58]

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021
[59]

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2023
[60]

A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021
[61]

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022
[62]

S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016
[63]

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022
[64]

T. Salimans and J. Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI
[65]

T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016
[66]

D. Samuel, R. Ben-Ari, S. Raviv, N. Darshan, and G. Chechik. Generating images of rare concepts using pre-trained diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4695–4703, 2024
[67]

F. Schneider. ArchiSound: Audio generation with diffusion. arXiv preprint arXiv:2301.13267, 2023
[68]

C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022
[69]

C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024
[70]

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015
[71]

J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP
[72]

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS
[73]

Y. Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023
[74]

D. Su, S. Sukhbaatar, M. Rabbat, Y. Tian, and Q. Zheng. Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces. arXiv preprint arXiv:2410.09918, 2024
[75]

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016
[76]

Z. Tan, X. Yang, L. Qin, M. Yang, C. Zhang, and H. Li. EvalAlign: Supervised fine-tuning multimodal LLMs with human-aligned data for evaluating text-to-image models. arXiv preprint arXiv:2406.16562, 2024
[77]

L. Tang, N. Ruiz, Q. Chu, Y. Li, A. Holynski, D. E. Jacobs, B. Hariharan, Y. Pritch, N. Wadhwa, K. Aberman, et al. RealFill: Reference-driven generation for authentic image completion. ACM Transactions on Graphics (TOG), 43(4):1–12, 2024
[78]

Gemini Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
[79]

K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In Advances in Neural Information Processing Systems, 2024
[80]

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
