Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
Pith reviewed 2026-05-20 11:39 UTC · model grok-4.3
The pith
Diffusion models improve image quality by searching for better starting noises with extra inference compute rather than more denoising steps alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating noise selection for the diffusion sampling process as a search problem along the two axes of verifiers and search algorithms, increasing inference-time compute leads to substantial improvements in the quality of generated samples on class-conditioned and text-conditioned image generation benchmarks.
What carries the argument
A search procedure over initial noise inputs that combines verifiers to score candidates with algorithms that explore the space to select superior noises for diffusion sampling.
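The search loop described here can be sketched in its simplest form as best-of-N random search. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_sampler` and `toy_verifier` are hypothetical stand-ins for a frozen diffusion sampler and an external scorer, and best-of-N is only one simple choice of search algorithm.

```python
import random


def sample_noise(dim, rng):
    """Draw one candidate initial noise vector from a standard Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_sampler(noise):
    """Hypothetical stand-in for a frozen, deterministic diffusion sampler."""
    return [0.5 * x for x in noise]


def toy_verifier(sample):
    """Hypothetical verifier: higher is better, peaks when every entry is 1.0."""
    return -sum((x - 1.0) ** 2 for x in sample)


def best_of_n(n_candidates, dim, seed=0):
    """Score n_candidates random noises and return the best noise and its score."""
    rng = random.Random(seed)
    best_noise, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = sample_noise(dim, rng)
        score = toy_verifier(toy_sampler(noise))
        if score > best_score:
            best_noise, best_score = noise, score
    return best_noise, best_score
```

With a fixed seed the candidates come from the same random stream, so raising `n_candidates` can only keep or raise the best verifier score in this sketch. Whether that score tracks real image quality is a separate question.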
If this is right
- Sample quality keeps rising as more compute is allocated to noise search after denoising steps plateau.
- Different verifier-algorithm pairs can be selected to match specific application needs.
- Gains appear consistently on both class-conditional and text-conditional image tasks.
- The method provides a way to scale generation performance without changing the trained model.
Where Pith is reading between the lines
- Verifier design may become an important research direction for aligning scores more closely with human judgment.
- The same search idea could extend to diffusion models used for video or 3D content.
- System designers might trade off model size against inference-time search budget in future scaling curves.
Load-bearing premise
The verifiers used to score noise candidates give feedback that aligns with actual improvements in image quality rather than rewarding artifacts or overfitting to the verifier.
What would settle it
A controlled human preference study or downstream-task evaluation in which samples chosen by the noise-search method are not rated higher than samples from standard sampling given the same total inference compute.
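"Same total inference compute" can be made concrete with a small budget calculation. The sketch below assumes compute is counted in denoiser forward passes (function evaluations), which is an assumption of this example and not something this page states. The helper name and the `verifier_cost_nfe` parameter are hypothetical.

```python
def candidates_for_budget(total_nfe, steps_per_sample, verifier_cost_nfe=0):
    """Return how many full candidates fit in a fixed compute budget.

    total_nfe: total budget, counted in denoiser forward passes (assumed unit).
    steps_per_sample: denoising steps spent on each candidate.
    verifier_cost_nfe: assumed cost of scoring one candidate, in the same unit.
    """
    if steps_per_sample <= 0:
        raise ValueError("steps_per_sample must be positive")
    if verifier_cost_nfe < 0:
        raise ValueError("verifier_cost_nfe must be non-negative")
    per_candidate = steps_per_sample + verifier_cost_nfe
    return total_nfe // per_candidate
```

For example, `candidates_for_budget(1000, 50)` gives 20 candidates, and charging 50 extra units per verifier call halves that to 10. A fair comparison would give the standard-sampling baseline the same `total_nfe`.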
Original abstract
Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and with the complicated nature of images, combinations of the components in the framework can be specifically chosen to conform with different application scenarios.
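The second axis, the search algorithm, need not be pure random sampling. As one hedged illustration, the sketch below refines a single pivot noise by scoring nearby perturbations and moving to the best one. This local, gradient-free search is an example chosen for this note, not a method the abstract names. `toy_sampler` and `toy_verifier` are hypothetical stand-ins for the frozen model and the verifier.

```python
import random


def toy_sampler(noise):
    """Hypothetical stand-in for a frozen, deterministic diffusion sampler."""
    return [0.5 * x for x in noise]


def toy_verifier(sample):
    """Hypothetical verifier: higher is better, peaks when every entry is 1.0."""
    return -sum((x - 1.0) ** 2 for x in sample)


def local_noise_search(dim, n_iters, n_neighbors, step, seed=0):
    """Gradient-free local search over the initial noise.

    Each iteration scores n_neighbors Gaussian perturbations of the current
    pivot and moves to the best one only if it beats the pivot.
    """
    rng = random.Random(seed)
    pivot = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    pivot_score = toy_verifier(toy_sampler(pivot))
    history = [pivot_score]
    for _ in range(n_iters):
        best_cand, best_score = pivot, pivot_score
        for _ in range(n_neighbors):
            cand = [x + step * rng.gauss(0.0, 1.0) for x in pivot]
            score = toy_verifier(toy_sampler(cand))
            if score > best_score:
                best_cand, best_score = cand, score
        pivot, pivot_score = best_cand, best_score
        history.append(pivot_score)
    return pivot, pivot_score, history
```

This sketch does not rescale the perturbed noise, so repeated moves can drift away from the Gaussian prior the sampler expects. A practical version would likely need to constrain or renormalize the candidates.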
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that diffusion models exhibit inference-time scaling beyond increasing denoising steps by framing initial noise selection as a search problem over candidates scored by external verifiers (such as CLIP similarity for text-to-image or class-conditional classifiers) and solved via various search algorithms; extensive experiments on class-conditioned and text-conditioned image generation benchmarks show that additional compute yields substantial quality gains, with component choices adaptable to different scenarios.
Significance. If the gains prove robust to independent quality measures, the work would usefully extend scaling ideas from LLMs to diffusion models and supply a practical modular framework for trading inference compute for sample quality. The modular design along verifiers and algorithms, together with the reported benchmark improvements, supplies a concrete starting point for follow-on research on test-time compute in generative vision.
major comments (2)
- [§4] §4 (Experiments): the reported quality improvements are obtained by ranking candidates with the same verifiers used for search (CLIP, classifiers); no correlation analysis against verifier-independent metrics (e.g., FID computed without the search verifier) or held-out human preference studies is described, leaving open whether the scaling reflects genuine perceptual improvement or verifier exploitation.
- [§3] §3 (Framework): the central claim that search yields better samples rests on the assumption that verifier scores reliably track downstream quality; the manuscript provides no robustness checks against known verifier failure modes such as favoring high-frequency artifacts or prompt shortcuts, which directly affects whether the observed compute-quality curve is load-bearing.
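The correlation analysis requested in the first major comment could be run as a rank correlation between search-verifier scores and a verifier-independent metric. The sketch below implements Spearman's rank correlation with the standard library only. It assumes one paired measurement per configuration (for example, one per search budget), since set-level metrics such as FID are not defined per image. No data from the paper is used, and the function names are illustrative.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(verifier_scores, independent_scores):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(verifier_scores), average_ranks(independent_scores))
```

A value near 1 would suggest the independent metric rises with the verifier score. For a lower-is-better metric such as FID, agreement shows up as a value near -1 instead.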
minor comments (2)
- [Abstract] Abstract: the quantitative magnitude of the reported gains (e.g., FID or IS deltas) is not stated, making it harder to gauge practical impact.
- [§4] Figure captions in §4: several comparison figures would benefit from explicit annotation of which search/verifier combination produced each row to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work exploring inference-time scaling in diffusion models. We address the major concerns regarding evaluation and verifier reliability below, and we will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§4] §4 (Experiments): the reported quality improvements are obtained by ranking candidates with the same verifiers used for search (CLIP, classifiers); no correlation analysis against verifier-independent metrics (e.g., FID computed without the search verifier) or held-out human preference studies is described, leaving open whether the scaling reflects genuine perceptual improvement or verifier exploitation.
  Authors: We appreciate this observation. Our experiments primarily rely on the verifiers for guiding the search, but the final quality is assessed using benchmark-standard metrics such as FID for class-conditional generation and CLIP similarity for text-to-image. To directly address the concern about verifier-independent evaluation, we will add a correlation analysis in the revised version, where we compute FID scores independently and examine their correlation with the verifier scores used in search. For human preference studies, while not included in the current manuscript due to resource constraints, we acknowledge their value and will include a discussion of this as a limitation along with preliminary results if feasible, or outline plans for future validation.
  Revision: partial
- Referee: [§3] §3 (Framework): the central claim that search yields better samples rests on the assumption that verifier scores reliably track downstream quality; the manuscript provides no robustness checks against known verifier failure modes such as favoring high-frequency artifacts or prompt shortcuts, which directly affects whether the observed compute-quality curve is load-bearing.
  Authors: We agree that verifying the robustness of the verifiers is important for the validity of the scaling claims. The manuscript selects verifiers based on their established use in prior work for similar tasks. However, we did not include explicit checks for failure modes like artifact favoring or shortcut exploitation. In the revision, we will add a new subsection discussing potential verifier limitations and include additional experiments, such as evaluating samples under verifier perturbations or comparing with alternative scoring methods, to demonstrate that the observed improvements hold under these considerations.
  Revision: yes
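The "comparing with alternative scoring methods" check can be illustrated with a held-out verifier: select a candidate with one scorer, then see how much of the gain survives under a second, independent scorer. Everything in this sketch is synthetic. The latent `quality` value and both noisy verifiers are hypothetical constructs for illustration and do not model CLIP, a classifier, or any result from the paper.

```python
import random


def select_and_cross_check(n_candidates, seed=0):
    """Pick the best candidate by a search verifier, then re-score it with a held-out one.

    Returns the gain of the chosen candidate over the pool mean under each verifier.
    """
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    rng = random.Random(seed)
    records = []
    for _ in range(n_candidates):
        quality = rng.gauss(0.0, 1.0)                  # synthetic latent quality
        search_score = quality + rng.gauss(0.0, 1.0)   # verifier used during search
        heldout_score = quality + rng.gauss(0.0, 1.0)  # independent verifier, unused in search
        records.append((search_score, heldout_score))
    chosen = max(records, key=lambda r: r[0])
    mean_search = sum(r[0] for r in records) / n_candidates
    mean_heldout = sum(r[1] for r in records) / n_candidates
    return {
        "search_gain": chosen[0] - mean_search,
        "heldout_gain": chosen[1] - mean_heldout,
    }
```

A search gain that is much larger than the held-out gain would be one sign that part of the improvement comes from exploiting the search verifier's noise, not from better samples.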
Circularity Check
No circularity: empirical search framework with external verifiers
Full rationale
The paper presents an empirical exploration of inference-time scaling via a search over noise candidates scored by external verifiers (e.g., CLIP or classifiers) and search algorithms. No equations, derivations, or self-referential definitions appear in the provided abstract or description that reduce the reported quality improvements to fitted parameters or inputs by construction. The central findings rest on benchmark experiments, which function as independent external validation rather than self-citation chains or ansatz smuggling. This is the most common honest outcome for an applied empirical study.
Forward citations
Cited by 23 Pith papers
- Do generative video models understand physical principles?
  Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.
- TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
  TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
- Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
  Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
- Personalizing Text-to-Image Generation to Individual Taste
  PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
- VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
  FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.
- PhaseFlow4D: Physically Constrained 4D Beam Reconstruction via Feedback-Guided Latent Diffusion
  PhaseFlow4D reconstructs and tracks 4D beam phase space from 2D projections via a latent diffusion model with built-in physics constraints, achieving 11000x speedup over full simulations while following time-varying c...
- Reflective Flow Sampling Enhancement
  RF-Sampling enhances flow matching models by implicitly performing gradient ascent on text-image alignment scores via linear textual combinations and flow inversion.
- It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
  Noise optimization during sampling recovers diversity in mode-collapsed diffusion models while preserving output fidelity.
- Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution
  IAFS is a training-free iterative inference-time scaling framework that uses adaptive frequency-aware particle fusion to resolve the perception-fidelity conflict in diffusion super-resolution models, outperforming pri...
- dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
  dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.
- Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement
  PG-DLM applies particle Gibbs sampling over full trajectories in diffusion language models to enable iterative refinement, yielding higher accuracy on reward-guided generation with theoretical convergence guarantees.
- In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
  ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.
- LatentBox: An Efficient Latent-First Storage System for AI-Generated Images
  LatentBox is a latent-first storage system that stores compressed latents as durable objects with on-demand GPU reconstruction and dynamic image/latent caching, achieving 78.7% storage reduction with competitive laten...
- Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos
  MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on...
- EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation
  EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.
- Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
  DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
- FASTER: Value-Guided Sampling for Fast RL
  FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
- FRIGID: Scaling Diffusion-Based Molecular Generation from Mass Spectra at Training and Inference Time
  FRIGID scales a diffusion-based model for de novo molecular structure generation from mass spectra, reaching over 18% top-1 accuracy on MassSpecGym and tripling prior bests on NPLIB1 via large unlabeled training and i...
- VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
  VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...
- CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
  CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...
- RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
  RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
- Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
  DeScore decouples explicit CoT reasoning from reward regression in video reward models via a two-stage cold-start plus dual-objective RL training pipeline.
- The Serial Scaling Hypothesis
  The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.