Recognition: 3 theorem links
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Pith reviewed 2026-05-12 08:21 UTC · model grok-4.3
The pith
Biased sampling in rectified flow combined with bidirectional transformers yields superior high-resolution text-to-image synthesis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that biasing noise sampling toward perceptually relevant scales in rectified flow training, together with a novel transformer architecture featuring separate weights for modalities and bidirectional token interactions, leads to better performance in high-resolution text-to-image synthesis than diffusion models, with clear scaling laws and correlation to human preferences.
What carries the argument
The biased noise sampling technique for rectified flow and the bidirectional transformer architecture with modality-specific weights.
Load-bearing premise
The gains result specifically from the biased sampling and bidirectional architecture, not from larger scale, better data, or other unstated optimizations.
What would settle it
Finding that standard diffusion models match or exceed the performance when using similar scale and data would indicate that the proposed changes are not the decisive factor.
read the original abstract
Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes biasing noise sampling towards perceptually relevant scales in rectified flow models for training, combined with a novel bidirectional transformer architecture using separate weights for text and image modalities. Through a large-scale empirical study on high-resolution text-to-image synthesis, it claims superior performance over standard diffusion formulations, predictable scaling trends, correlation between lower validation loss and improved metrics/human preferences, and that the largest models outperform current state-of-the-art.
Significance. If the gains are shown to stem specifically from the biased sampling and bidirectional architecture (rather than scale, data, or tuning differences), the work would strengthen the case for rectified flows as a practical alternative to diffusion and demonstrate benefits of modality-specific bidirectional transformers for text alignment and typography. The planned public release of experimental data, code, and model weights is a clear strength for reproducibility.
major comments (2)
- [Experiments] The central comparisons to diffusion baselines (e.g., in the large-scale study sections) do not report matched controls holding model capacity, data distribution, optimizer, and total compute fixed while varying only the sampling bias and architecture directionality. This leaves open whether observed metric and preference improvements are attributable to the proposed methods.
- [Scaling and Evaluation] The reported correlation between validation loss and human preference ratings (and scaling trends) is presented as predictive, but without ablations testing this relationship across regimes or confirming it is not driven by unstated data quality differences, the claim that lower loss directly improves synthesis remains vulnerable.
minor comments (2)
- [Abstract] The abstract states that code, data, and weights will be released, but the main text should include a dedicated reproducibility section with specific links or timelines.
- [Architecture] Clarify the implementation details of bidirectional token flow versus standard cross-attention in the architecture description to improve clarity for readers.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for highlighting areas where our experimental design and claims could be clarified. We address the major comments point by point below, providing additional context from our study while acknowledging where further elaboration in the manuscript would strengthen the presentation.
read point-by-point responses
-
Referee: [Experiments] The central comparisons to diffusion baselines (e.g., in the large-scale study sections) do not report matched controls holding model capacity, data distribution, optimizer, and total compute fixed while varying only the sampling bias and architecture directionality. This leaves open whether observed metric and preference improvements are attributable to the proposed methods.
Authors: We agree that rigorous isolation of variables strengthens attribution. Our large-scale experiments (Sections 4–6) held model capacity, data distribution, optimizer, and total training compute fixed to the extent feasible at this scale while varying the noise sampling bias and architecture directionality. All models used the same training data pipeline, batch sizes, and learning rate schedules; differences were limited to the rectified flow formulation with biased sampling versus standard diffusion and the bidirectional versus standard transformer blocks. Scaling curves and multiple independent runs show consistent gains aligned with these changes. We will add an explicit table of controlled variables and a short discussion of residual differences in the revised manuscript. revision: partial
-
Referee: [Scaling and Evaluation] The reported correlation between validation loss and human preference ratings (and scaling trends) is presented as predictive, but without ablations testing this relationship across regimes or confirming it is not driven by unstated data quality differences, the claim that lower loss directly improves synthesis remains vulnerable.
Authors: The manuscript reports an observed correlation (not direct causation) between lower validation loss and improved metrics/human preferences across model scales, documented in Figure 8 and the scaling analysis. All models were trained on identical data with the same curation pipeline, reducing the likelihood of data-quality confounds driving the trend. The scaling behavior spans multiple orders of magnitude in compute and holds across independent runs. We will expand the text to explicitly state that the correlation is empirical within our controlled setup, note the absence of exhaustive cross-regime ablations on data quality, and qualify the predictive language accordingly. revision: partial
Circularity Check
No circularity: empirical scaling study with observed outcomes
full rationale
The paper reports results from training and evaluating large rectified flow transformer models on text-to-image tasks. Its central claims rest on measured performance metrics, human preference ratings, and observed scaling trends across model sizes, not on any first-principles derivation, fitted parameter renamed as prediction, or self-referential definition. No equations are presented that reduce to their own inputs by construction, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. The architecture and sampling bias are described as design choices whose benefits are validated externally through controlled comparisons and public release of weights, making the work self-contained against benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- perceptual relevance bias in noise sampling
axioms (1)
- domain assumption Rectified flow connects data and noise via straight-line paths with better theoretical properties than diffusion
Forward citations
Cited by 37 Pith papers
-
Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
-
LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.
-
Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing
Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.
-
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
-
Concurrence of Symmetry Breaking and Nonlocality Phase Transitions in Diffusion Models
Symmetry breaking and nonlocality phase transitions occur nearly simultaneously during diffusion model generation in modern transformers.
-
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
-
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
-
Thermodynamic Diffusion Inference with Minimal Digital Conditioning
Thermodynamic diffusion inference at production scale is shown using hierarchical bilinear coupling for U-Net skips and a 2,560-parameter digital bottleneck, attaining 0.9906 cosine similarity with theoretical 10^7x e...
-
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding
FlowGuard detects unsafe content during diffusion image generation via linear latent decoding and curriculum learning, outperforming prior methods by over 30% F1 while reducing GPU memory by 97% and projection time to...
-
Personalizing Text-to-Image Generation to Individual Taste
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
-
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
-
Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
Process-driven image generation decomposes text-to-image synthesis into interleaved cycles of textual planning, visual drafting, textual reflection, and visual refinement with dense consistency supervision.
-
UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining
UENR-600K is a 600,000-frame synthetic dataset for nighttime video deraining that uses 3D rain particle simulation in Unreal Engine to enable better generalization to real scenes.
-
VOSR: A Vision-Only Generative Model for Image Super-Resolution
VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a r...
-
A plug-and-play generative framework for multi-satellite precipitation estimation
PRISMA introduces a plug-and-play latent generative model that improves multi-sensor precipitation estimates by learning an unconditional prior from IMERG data and constraining it with independent sensor-specific branches.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
-
L2P: Unlocking Latent Potential for Pixel Generation
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
-
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.
-
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.
-
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
-
DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
-
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
-
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
-
FASTER: Value-Guided Sampling for Fast RL
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
-
Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target.
-
WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models
WM-DAgger uses world models with corrective action synthesis and consistency-guided filtering to aggregate OOD recovery data for imitation learning, reporting 93.3% success in soft bag pushing with five demonstrations.
-
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
-
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.
-
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
-
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.
-
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...
-
Deterministic Decomposition of Stochastic Generative Dynamics
Stochastic generative dynamics admit a transport-osmotic decomposition of the deterministic field, supporting Bridge Matching for interpretable and tunable generation.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation
A conditional flow matching model generates realistic safety-critical traffic scenarios by turning nominal scenes into dangerous rollouts using combined simulation and real data.
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
-
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
Reference graph
Works this paper leans on
-
[1]
FirstName LastName , title =
-
[2]
FirstName Alpher , title =
-
[3]
Journal of Foo , volume = 13, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =
-
[4]
Journal of Foo , volume = 14, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =
-
[5]
FirstName Alpher and FirstName Gamow , title =
-
[6]
Logistic-normal distributions: Some properties and uses , author=. Biometrika , volume=. 1980 , publisher=
work page 1980
-
[7]
Applied Stochastic Differential Equations , author=. 2019 , publisher=
work page 2019
-
[8]
Interacting Particle Solutions of Fokker–Planck Equations Through Gradient–Log–Density Estimation , author=. Entropy , year=
-
[9]
Score-Based Generative Modeling through Stochastic Differential Equations , author=. ArXiv , year=
-
[10]
Estimation of Non-Normalized Statistical Models by Score Matching , author=. J. Mach. Learn. Res. , year=
-
[11]
Conference on Uncertainty in Artificial Intelligence , year=
Sliced Score Matching: A Scalable Approach to Density and Score Estimation , author=. Conference on Uncertainty in Artificial Intelligence , year=
-
[12]
Thirty-seventh Conference on Neural Information Processing Systems , year=
Understanding diffusion objectives as the ELBO with simple data augmentation , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
-
[13]
A Connection Between Score Matching and Denoising Autoencoders , author=. Neural Computation , year=
-
[14]
Elucidating the Design Space of Diffusion-Based Generative Models , author=. ArXiv , year=
- [15]
-
[16]
32nd USENIX Security Symposium (USENIX Security 23) , pages=
Extracting training data from diffusion models , author=. 32nd USENIX Security Symposium (USENIX Security 23) , pages=
-
[17]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Diffusion art or digital forgery? investigating data replication in diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[18]
International Conference on Machine Learning , pages=
Deduplicating training data mitigates privacy risks in language models , author=. International Conference on Machine Learning , pages=. 2022 , organization=
work page 2022
-
[21]
Nichol, Alex , title =
-
[22]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
A self-supervised descriptor for image copy detection , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[23]
Advances in Neural Information Processing Systems , volume=
Laion-5b: An open large-scale dataset for training next generation image-text models , author=. Advances in Neural Information Processing Systems , volume=
-
[24]
International Conference on Machine Learning , pages=
Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. International Conference on Machine Learning , pages=. 2022 , organization=
work page 2022
-
[25]
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. arXiv preprint arXiv:2301.12597 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Instructblip: Towards general-purpose vision-language models with instruction tuning. arxiv 2023 , author=. arXiv preprint arXiv:2305.06500 , volume=
work page internal anchor Pith review arXiv 2023
-
[29]
Scaling Vision Transformers to 22 Billion Parameters , author=. 2023 , eprint=
work page 2023
-
[30]
Small-scale proxies for large-scale Transformer training instabilities , author=. 2023 , eprint=
work page 2023
- [31]
-
[32]
BFloat16: The secret to high performance on Cloud TPUs , year=
Dehao Chen and Chiachen Chou and Yuanzhong Xu and Jonathan Hseu , url=. BFloat16: The secret to high performance on Cloud TPUs , year=
-
[33]
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=
Common diffusion noise schedules and sample steps are flawed , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=
-
[34]
Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea , journal=
-
[36]
Wallace, Bram and Dang, Meihua and Rafailov, Rafael and Zhou, Linqi and Lou, Aaron and Purushwalkam, Senthil and Ermon, Stefano and Xiong, Caiming and Joty, Shafiq and Naik, Nikhil , journal=
-
[37]
Kirstain, Yuval and Polyak, Adam and Singer, Uriel and Matiana, Shahbuland and Penna, Joe and Levy, Omer , journal=
-
[38]
Yu, Jiahui and Xu, Yuanzhong and Koh, Jing Yu and Luong, Thang and Baid, Gunjan and Wang, Zirui and Vasudevan, Vijay and Ku, Alexander and Yang, Yinfei and Ayan, Burcu Karagol and others , journal=
-
[39]
Neural Information Processing Systems , year=
Neural Ordinary Differential Equations , author=. Neural Information Processing Systems , year=
-
[40]
Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models , author=. 2023 , eprint=
work page 2023
-
[41]
Ideogram V0.2 announcement , year =
-
[42]
Playground V2.5 announcement , year =
-
[43]
Ideogram V1.0 announcement , year =
-
[44]
Deep Unsupervised Learning using Nonequilibrium Thermodynamics , author=. ArXiv , year=
-
[45]
Diffusion Models Beat GANs on Image Synthesis , author=. 2021 , eprint=
work page 2021
-
[46]
Generative Modeling by Estimating Gradients of the Data Distribution , author=. 2020 , eprint=
work page 2020
- [47]
-
[48]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow , author=. 2022 , eprint=
work page 2022
-
[49]
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models , author=. 2023 , eprint=
work page 2023
- [50]
-
[51]
Structure and Content-Guided Video Synthesis with Diffusion Models , author=. 2023 , eprint=
work page 2023
-
[52]
Building Normalizing Flows with Stochastic Interpolants , author=. 2022 , eprint=
work page 2022
-
[53]
Hierarchical Text-Conditional Image Generation with CLIP Latents , author=. 2022 , eprint=
work page 2022
-
[55]
Make-A-Video: Text-to-Video Generation without Text-Video Data , author=. 2022 , eprint=
work page 2022
-
[56]
Photorealistic Video Generation with Diffusion Models , author=. 2023 , eprint=
work page 2023
-
[57]
Lumiere: A Space-Time Diffusion Model for Video Generation , author=. 2024 , eprint=
work page 2024
-
[58]
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding , author=. 2022 , eprint=
work page 2022
-
[59]
Imagen Video: High Definition Video Generation with Diffusion Models , author=. 2022 , eprint=
work page 2022
-
[60]
Improved Denoising Diffusion Probabilistic Models , author=. 2021 , eprint=
work page 2021
-
[62]
GENIE: Higher-Order Denoising Diffusion Solvers , author=. 2022 , eprint=
work page 2022
-
[63]
DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models , author=. 2023 , eprint=
work page 2023
- [64]
-
[65]
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers , author=. 2024 , eprint=
work page 2024
-
[66]
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers , author=. 2022 , eprint=
work page 2022
-
[67]
Learning Transferable Visual Models From Natural Language Supervision , author=. 2021 , eprint=
work page 2021
-
[69]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. 2019 , eprint=
work page 2019
-
[70]
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , author=. 2018 , eprint=
work page 2018
- [72]
- [73]
-
[74]
Improving image generation with better captions , author=. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf , volume=
-
[76]
PixArt-a: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis , author=. 2023 , eprint=
work page 2023
-
[77]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis , author=. 2023 , eprint=
work page 2023
-
[78]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Instructpix2pix: Learning to follow image editing instructions , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[79]
International Journal of Computer Vision , year=
ImageNet Large Scale Visual Recognition Challenge , author=. International Journal of Computer Vision , year=
-
[80]
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts , author=. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
work page 2021
-
[83]
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , author=. 2017 , eprint=
work page 2017
- [84]
-
[85]
Simple diffusion: End-to-end diffusion for high resolution images , author=. 2023 , eprint=
work page 2023
-
[86]
Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack , author=. 2023 , eprint=
work page 2023
-
[88]
An image is worth 16x16 words: Transformers for image recognition at scale , author=
-
[89]
Scaling vision transformers , author=
-
[90]
autofaiss , author =
-
[93]
Minimizing Trajectory Curvature of ODE-based Generative Models , author=. 2023 , eprint=
work page 2023
- [94]
-
[95]
Improving and generalizing flow-based generative models with minibatch optimal transport , author=. 2023 , eprint=
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.