arxiv: 2403.03206 · v1 · submitted 2024-03-05 · 💻 cs.CV

Recognition: 3 theorem links

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser , Sumith Kulal , Andreas Blattmann , Rahim Entezari , Jonas M\"uller , Harry Saini , Yam Levi , Dominik Lorenz

show 9 more authors

Axel Sauer Frederic Boesel Dustin Podell Tim Dockhorn Zion English Kyle Lacey Alex Goodwin Yannik Marek Robin Rombach

Authors on Pith no claims yet

Pith reviewed 2026-05-12 08:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords rectified flowtext-to-image synthesistransformersdiffusion modelshigh-resolution imagesbidirectional architecturescaling laws

0 comments

The pith

Biased sampling in rectified flow combined with bidirectional transformers yields superior high-resolution text-to-image synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Rectified flow offers a simpler alternative to diffusion models by linking data to noise along straight paths. This paper improves it by biasing the choice of noise levels during training toward those most important for visual quality. It also proposes a transformer architecture with separate parameters for text and image tokens that permits information exchange in both directions. Large-scale tests show this combination beats standard diffusion approaches on high-resolution text-to-image tasks. The models exhibit reliable scaling where reduced validation loss translates to better metrics and human judgments.

Core claim

The central claim is that biasing noise sampling toward perceptually relevant scales in rectified flow training, together with a novel transformer architecture featuring separate weights for modalities and bidirectional token interactions, leads to better performance in high-resolution text-to-image synthesis than diffusion models, with clear scaling laws and correlation to human preferences.

What carries the argument

The biased noise sampling technique for rectified flow and the bidirectional transformer architecture with modality-specific weights.

Load-bearing premise

The gains result specifically from the biased sampling and bidirectional architecture, not from larger scale, better data, or other unstated optimizations.

What would settle it

Finding that standard diffusion models match or exceed the performance when using similar scale and data would indicate that the proposed changes are not the decisive factor.

read the original abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Biased perceptual noise sampling plus the separate-weight bidirectional transformer deliver measurable gains in high-res text-to-image, but the paper needs tighter matched ablations to isolate those gains from scale and data effects.

read the letter

The two concrete moves here are the bias toward perceptually relevant scales when sampling noise for rectified-flow training and the transformer that keeps separate weights for text and image tokens while letting information flow both ways. Both produce better high-resolution outputs than standard diffusion setups, and the biggest models beat current leaders on metrics and human preference scores. The scaling curves are clean enough to read, and the link between validation loss and downstream quality is shown across several sizes. Releasing the full code, data, and weights is the right move and will let the field check the numbers directly. That part is useful and straightforward. The architecture change is simple to implement and the sampling bias is a reasonable way to spend compute on the scales that matter for perception. The soft spots sit in the attribution. The stress-test concern holds: without ablations that hold model size, data distribution, optimizer, and total training compute fixed while varying only the bias and the bidirectional design, it is hard to know how much of the lift comes from the new pieces versus simply training larger or better-tuned models. The reported correlations between loss and preference are expected but do not prove the mechanism. The paper shows the overall trends and the final performance numbers, yet the controls are not exhaustive enough to rule out confounds at every scale they test. This work is for groups already running large generative-model experiments who want a practical recipe for rectified flows and a transformer variant that handles text better. A reader who cares about scaling behavior in these models will find the curves worth looking at. It is coherent on its own terms and the empirical claims are falsifiable once the release happens, so it deserves a serious referee even if the revisions will focus on tighter controls and more detailed ablations.

Referee Report

2 major / 2 minor

Summary. The paper proposes biasing noise sampling towards perceptually relevant scales in rectified flow models for training, combined with a novel bidirectional transformer architecture using separate weights for text and image modalities. Through a large-scale empirical study on high-resolution text-to-image synthesis, it claims superior performance over standard diffusion formulations, predictable scaling trends, correlation between lower validation loss and improved metrics/human preferences, and that the largest models outperform current state-of-the-art.

Significance. If the gains are shown to stem specifically from the biased sampling and bidirectional architecture (rather than scale, data, or tuning differences), the work would strengthen the case for rectified flows as a practical alternative to diffusion and demonstrate benefits of modality-specific bidirectional transformers for text alignment and typography. The planned public release of experimental data, code, and model weights is a clear strength for reproducibility.

major comments (2)

[Experiments] The central comparisons to diffusion baselines (e.g., in the large-scale study sections) do not report matched controls holding model capacity, data distribution, optimizer, and total compute fixed while varying only the sampling bias and architecture directionality. This leaves open whether observed metric and preference improvements are attributable to the proposed methods.
[Scaling and Evaluation] The reported correlation between validation loss and human preference ratings (and scaling trends) is presented as predictive, but without ablations testing this relationship across regimes or confirming it is not driven by unstated data quality differences, the claim that lower loss directly improves synthesis remains vulnerable.

minor comments (2)

[Abstract] The abstract states that code, data, and weights will be released, but the main text should include a dedicated reproducibility section with specific links or timelines.
[Architecture] Clarify the implementation details of bidirectional token flow versus standard cross-attention in the architecture description to improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for highlighting areas where our experimental design and claims could be clarified. We address the major comments point by point below, providing additional context from our study while acknowledging where further elaboration in the manuscript would strengthen the presentation.

read point-by-point responses

Referee: [Experiments] The central comparisons to diffusion baselines (e.g., in the large-scale study sections) do not report matched controls holding model capacity, data distribution, optimizer, and total compute fixed while varying only the sampling bias and architecture directionality. This leaves open whether observed metric and preference improvements are attributable to the proposed methods.

Authors: We agree that rigorous isolation of variables strengthens attribution. Our large-scale experiments (Sections 4–6) held model capacity, data distribution, optimizer, and total training compute fixed to the extent feasible at this scale while varying the noise sampling bias and architecture directionality. All models used the same training data pipeline, batch sizes, and learning rate schedules; differences were limited to the rectified flow formulation with biased sampling versus standard diffusion and the bidirectional versus standard transformer blocks. Scaling curves and multiple independent runs show consistent gains aligned with these changes. We will add an explicit table of controlled variables and a short discussion of residual differences in the revised manuscript. revision: partial
Referee: [Scaling and Evaluation] The reported correlation between validation loss and human preference ratings (and scaling trends) is presented as predictive, but without ablations testing this relationship across regimes or confirming it is not driven by unstated data quality differences, the claim that lower loss directly improves synthesis remains vulnerable.

Authors: The manuscript reports an observed correlation (not direct causation) between lower validation loss and improved metrics/human preferences across model scales, documented in Figure 8 and the scaling analysis. All models were trained on identical data with the same curation pipeline, reducing the likelihood of data-quality confounds driving the trend. The scaling behavior spans multiple orders of magnitude in compute and holds across independent runs. We will expand the text to explicitly state that the correlation is empirical within our controlled setup, note the absence of exhaustive cross-regime ablations on data quality, and qualify the predictive language accordingly. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical scaling study with observed outcomes

full rationale

The paper reports results from training and evaluating large rectified flow transformer models on text-to-image tasks. Its central claims rest on measured performance metrics, human preference ratings, and observed scaling trends across model sizes, not on any first-principles derivation, fitted parameter renamed as prediction, or self-referential definition. No equations are presented that reduce to their own inputs by construction, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. The architecture and sampling bias are described as design choices whose benefits are validated externally through controlled comparisons and public release of weights, making the work self-contained against benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Work is empirical; relies on standard assumptions of transformer training and rectified flow formulation rather than new theoretical axioms or invented entities.

free parameters (1)

perceptual relevance bias in noise sampling
Chosen via large-scale study to favor scales important for human perception; exact values or selection procedure not detailed in abstract.

axioms (1)

domain assumption Rectified flow connects data and noise via straight-line paths with better theoretical properties than diffusion
Invoked as established recent formulation in the abstract.

pith-pipeline@v0.9.0 · 5553 in / 1253 out tokens · 66052 ms · 2026-05-12T08:21:19.006606+00:00 · methodology

discussion (0)

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
cs.DC 2026-04 unverdicted novelty 8.0

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
cs.CV 2026-05 unverdicted novelty 7.0

LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.
Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing
cs.CR 2026-05 unverdicted novelty 7.0

Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
cs.CV 2026-05 unverdicted novelty 7.0

PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
Concurrence of Symmetry Breaking and Nonlocality Phase Transitions in Diffusion Models
cs.LG 2026-05 unverdicted novelty 7.0

Symmetry breaking and nonlocality phase transitions occur nearly simultaneously during diffusion model generation in modern transformers.
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
cs.LG 2026-04 unverdicted novelty 7.0

FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
cs.CV 2026-04 unverdicted novelty 7.0

Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
Thermodynamic Diffusion Inference with Minimal Digital Conditioning
cs.LG 2026-04 unverdicted novelty 7.0

Thermodynamic diffusion inference at production scale is shown using hierarchical bilinear coupling for U-Net skips and a 2,560-parameter digital bottleneck, attaining 0.9906 cosine similarity with theoretical 10^7x e...
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding
cs.CV 2026-04 unverdicted novelty 7.0

FlowGuard detects unsafe content during diffusion image generation via linear latent decoding and curriculum learning, outperforming prior methods by over 30% F1 while reducing GPU memory by 97% and projection time to...
Personalizing Text-to-Image Generation to Individual Taste
cs.CV 2026-04 unverdicted novelty 7.0

PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
cs.CV 2026-04 conditional novelty 7.0

SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

Process-driven image generation decomposes text-to-image synthesis into interleaved cycles of textual planning, visual drafting, textual reflection, and visual refinement with dense consistency supervision.
UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining
cs.CV 2026-04 unverdicted novelty 7.0

UENR-600K is a 600,000-frame synthetic dataset for nighttime video deraining that uses 3D rain particle simulation in Unreal Engine to enable better generalization to real scenes.
VOSR: A Vision-Only Generative Model for Image Super-Resolution
cs.CV 2026-04 conditional novelty 7.0

VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a r...
A plug-and-play generative framework for multi-satellite precipitation estimation
physics.ao-ph 2026-05 unverdicted novelty 6.0

PRISMA introduces a plug-and-play latent generative model that improves multi-sensor precipitation estimates by learning an unconditional prior from IMERG data and constraining it with independent sensor-specific branches.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
L2P: Unlocking Latent Potential for Pixel Generation
cs.CV 2026-05 unverdicted novelty 6.0

L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
cs.CV 2026-05 unverdicted novelty 6.0

Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
cs.CV 2026-05 unverdicted novelty 6.0

Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
cs.RO 2026-05 unverdicted novelty 6.0

VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
cs.CV 2026-05 unverdicted novelty 6.0

DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
FASTER: Value-Guided Sampling for Fast RL
cs.LG 2026-04 unverdicted novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
cs.LG 2026-04 unverdicted novelty 6.0

Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target.
WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models
cs.RO 2026-04 unverdicted novelty 6.0

WM-DAgger uses world models with corrective action synthesis and consistency-guided filtering to aggregate OOD recovery data for imitation learning, reporting 93.3% success in soft bag pushing with five demonstrations.
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
cs.LG 2026-04 unverdicted novelty 6.0

Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
cs.GR 2025-06 unverdicted novelty 6.0

FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
cs.AI 2024-08 unverdicted novelty 6.0

A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
eess.AS 2024-06 unverdicted novelty 6.0

Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
physics.ins-det 2026-05 unverdicted novelty 5.0

CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...
Deterministic Decomposition of Stochastic Generative Dynamics
cs.LG 2026-05 unverdicted novelty 5.0

Stochastic generative dynamics admit a transport-osmotic decomposition of the deterministic field, supporting Bridge Matching for interpretable and tunable generation.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
cs.CV 2026-04 unverdicted novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation
cs.RO 2026-05 unverdicted novelty 4.0

A conditional flow matching model generates realistic safety-critical traffic scenarios by turning nominal scenes into dangerous rollouts using combined simulation and real data.
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
cs.CV 2026-05 unverdicted novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
cs.CV 2024-06 unverdicted novelty 4.0

VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
cs.AI 2025-01 conditional novelty 3.0

Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

178 extracted references · 178 canonical work pages · cited by 36 Pith papers · 9 internal anchors

[1]

FirstName LastName , title =

work page
[2]

FirstName Alpher , title =

work page
[3]

Journal of Foo , volume = 13, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =

work page
[4]

Journal of Foo , volume = 14, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =

work page
[5]

FirstName Alpher and FirstName Gamow , title =

work page
[6]

Biometrika , volume=

Logistic-normal distributions: Some properties and uses , author=. Biometrika , volume=. 1980 , publisher=

work page 1980
[7]

2019 , publisher=

Applied Stochastic Differential Equations , author=. 2019 , publisher=

work page 2019
[8]

Entropy , year=

Interacting Particle Solutions of Fokker–Planck Equations Through Gradient–Log–Density Estimation , author=. Entropy , year=

work page
[9]

ArXiv , year=

Score-Based Generative Modeling through Stochastic Differential Equations , author=. ArXiv , year=

work page
[10]

Estimation of Non-Normalized Statistical Models by Score Matching , author=. J. Mach. Learn. Res. , year=

work page
[11]

Conference on Uncertainty in Artificial Intelligence , year=

Sliced Score Matching: A Scalable Approach to Density and Score Estimation , author=. Conference on Uncertainty in Artificial Intelligence , year=

work page
[12]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Understanding diffusion objectives as the ELBO with simple data augmentation , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page
[13]

Neural Computation , year=

A Connection Between Score Matching and Denoising Autoencoders , author=. Neural Computation , year=

work page
[14]

ArXiv , year=

Elucidating the Design Space of Diffusion-Based Generative Models , author=. ArXiv , year=

work page
[15]

ArXiv , year=

Fixing Weight Decay Regularization in Adam , author=. ArXiv , year=

work page
[16]

32nd USENIX Security Symposium (USENIX Security 23) , pages=

Extracting training data from diffusion models , author=. 32nd USENIX Security Symposium (USENIX Security 23) , pages=

work page
[17]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Diffusion art or digital forgery? investigating data replication in diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[18]

International Conference on Machine Learning , pages=

Deduplicating training data mitigates privacy risks in language models , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[21]

Nichol, Alex , title =

work page
[22]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

A self-supervised descriptor for image copy detection , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[23]

Advances in Neural Information Processing Systems , volume=

Laion-5b: An open large-scale dataset for training next generation image-text models , author=. Advances in Neural Information Processing Systems , volume=

work page
[24]

International Conference on Machine Learning , pages=

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[25]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. arXiv preprint arXiv:2301.12597 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Instructblip: Towards general-purpose vision-language models with instruction tuning. arxiv 2023 , author=. arXiv preprint arXiv:2305.06500 , volume=

work page internal anchor Pith review arXiv 2023
[29]

2023 , eprint=

Scaling Vision Transformers to 22 Billion Parameters , author=. 2023 , eprint=

work page 2023
[30]

2023 , eprint=

Small-scale proxies for large-scale Transformer training instabilities , author=. 2023 , eprint=

work page 2023
[31]

2019 , eprint=

Root Mean Square Layer Normalization , author=. 2019 , eprint=

work page 2019
[32]

BFloat16: The secret to high performance on Cloud TPUs , year=

Dehao Chen and Chiachen Chou and Yuanzhong Xu and Jonathan Hseu , url=. BFloat16: The secret to high performance on Cloud TPUs , year=

work page
[33]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Common diffusion noise schedules and sample steps are flawed , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

work page
[34]

Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea , journal=

work page
[36]

Wallace, Bram and Dang, Meihua and Rafailov, Rafael and Zhou, Linqi and Lou, Aaron and Purushwalkam, Senthil and Ermon, Stefano and Xiong, Caiming and Joty, Shafiq and Naik, Nikhil , journal=

work page
[37]

Kirstain, Yuval and Polyak, Adam and Singer, Uriel and Matiana, Shahbuland and Penna, Joe and Levy, Omer , journal=

work page
[38]

Yu, Jiahui and Xu, Yuanzhong and Koh, Jing Yu and Luong, Thang and Baid, Gunjan and Wang, Zirui and Vasudevan, Vijay and Ku, Alexander and Yang, Yinfei and Ayan, Burcu Karagol and others , journal=

work page
[39]

Neural Information Processing Systems , year=

Neural Ordinary Differential Equations , author=. Neural Information Processing Systems , year=

work page
[40]

2023 , eprint=

Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models , author=. 2023 , eprint=

work page 2023
[41]

Ideogram V0.2 announcement , year =

work page
[42]

Playground V2.5 announcement , year =

work page
[43]

Ideogram V1.0 announcement , year =

work page
[44]

ArXiv , year=

Deep Unsupervised Learning using Nonequilibrium Thermodynamics , author=. ArXiv , year=

work page
[45]

2021 , eprint=

Diffusion Models Beat GANs on Image Synthesis , author=. 2021 , eprint=

work page 2021
[46]

2020 , eprint=

Generative Modeling by Estimating Gradients of the Data Distribution , author=. 2020 , eprint=

work page 2020
[47]

2022 , eprint=

Classifier-Free Diffusion Guidance , author=. 2022 , eprint=

work page 2022
[48]

2022 , eprint=

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow , author=. 2022 , eprint=

work page 2022
[49]

2023 , eprint=

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models , author=. 2023 , eprint=

work page 2023
[50]

2022 , eprint=

Denoising Diffusion Implicit Models , author=. 2022 , eprint=

work page 2022
[51]

2023 , eprint=

Structure and Content-Guided Video Synthesis with Diffusion Models , author=. 2023 , eprint=

work page 2023
[52]

2022 , eprint=

Building Normalizing Flows with Stochastic Interpolants , author=. 2022 , eprint=

work page 2022
[53]

2022 , eprint=

Hierarchical Text-Conditional Image Generation with CLIP Latents , author=. 2022 , eprint=

work page 2022
[55]

2022 , eprint=

Make-A-Video: Text-to-Video Generation without Text-Video Data , author=. 2022 , eprint=

work page 2022
[56]

2023 , eprint=

Photorealistic Video Generation with Diffusion Models , author=. 2023 , eprint=

work page 2023
[57]

2024 , eprint=

Lumiere: A Space-Time Diffusion Model for Video Generation , author=. 2024 , eprint=

work page 2024
[58]

2022 , eprint=

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding , author=. 2022 , eprint=

work page 2022
[59]

2022 , eprint=

Imagen Video: High Definition Video Generation with Diffusion Models , author=. 2022 , eprint=

work page 2022
[60]

2021 , eprint=

Improved Denoising Diffusion Probabilistic Models , author=. 2021 , eprint=

work page 2021
[62]

2022 , eprint=

GENIE: Higher-Order Denoising Diffusion Solvers , author=. 2022 , eprint=

work page 2022
[63]

2023 , eprint=

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models , author=. 2023 , eprint=

work page 2023
[64]

2020 , eprint=

Denoising Diffusion Probabilistic Models , author=. 2020 , eprint=

work page 2020
[65]

2024 , eprint=

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers , author=. 2024 , eprint=

work page 2024
[66]

2022 , eprint=

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers , author=. 2022 , eprint=

work page 2022
[67]

2021 , eprint=

Learning Transferable Visual Models From Natural Language Supervision , author=. 2021 , eprint=

work page 2021
[69]

2019 , eprint=

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. 2019 , eprint=

work page 2019
[70]

2018 , eprint=

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , author=. 2018 , eprint=

work page 2018
[72]

2017 , eprint=

Attention Is All You Need , author=. 2017 , eprint=

work page 2017
[73]

2020 , eprint=

Scaling Laws for Neural Language Models , author=. 2020 , eprint=

work page 2020
[74]

Computer Science

Improving image generation with better captions , author=. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf , volume=

work page
[76]

2023 , eprint=

PixArt-a: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis , author=. 2023 , eprint=

work page 2023
[77]

2023 , eprint=

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis , author=. 2023 , eprint=

work page 2023
[78]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Instructpix2pix: Learning to follow image editing instructions , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[79]

International Journal of Computer Vision , year=

ImageNet Large Scale Visual Recognition Challenge , author=. International Journal of Computer Vision , year=

work page
[80]

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts , author=. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page 2021
[83]

2017 , eprint=

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , author=. 2017 , eprint=

work page 2017
[84]

1768 , publisher=

Institutionum calculi integralis , author=. 1768 , publisher=

work page
[85]

2023 , eprint=

Simple diffusion: End-to-end diffusion for high resolution images , author=. 2023 , eprint=

work page 2023
[86]

2023 , eprint=

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack , author=. 2023 , eprint=

work page 2023
[88]

An image is worth 16x16 words: Transformers for image recognition at scale , author=

work page
[89]

Scaling vision transformers , author=

work page
[90]

autofaiss , author =

work page
[93]

2023 , eprint=

Minimizing Trajectory Curvature of ODE-based Generative Models , author=. 2023 , eprint=

work page 2023
[94]

2022 , url =

NovelAI Improvements on Stable Diffusion , author=. 2022 , url =

work page 2022
[95]

2023 , eprint=

Improving and generalizing flow-based generative models with minibatch optimal transport , author=. 2023 , eprint=

work page 2023

Showing first 80 references.