GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Canonical reference. 81% of citing Pith papers cite this work as background.
abstract
Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.
representative citing papers
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.
Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
AsyncPatch Diffusion introduces asynchronous per-region noise levels in diffusion models, proves a valid ELBO, and uses a controlled sampler to support spatially adaptive generation and native inpainting.
DRIFT learns a structured invariance manifold from real images via one-class supervision on decomposed robust and fragile subspaces of a frozen VFM to detect AI-generated images through margin violations.
A joint latent diffusion model with cross-layer self-attention and disjoint sampling separates reflection and transmission layers from single images more effectively than prior methods on real-world benchmarks.
GLENS uses diffusion models on solver iterates to generate high-quality and diverse initial guesses for multimodal non-convex optimization, leading to faster solver convergence.
DRM turns a pre-trained diffusion model into a step-wise reward model and uses it for dense RL training (Step-wise GRPO) and guided sampling to improve final image quality.
VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
AdaMaG is a guidance rule for generative models derived from decomposing continuity-equation effects into divergence and score-parallel terms, with a proof that divergence diverges near the manifold and a time-dependent bound that improves realism at no extra cost.
A multi-exposure video model predicts bracketed linear SDR sequences from single nonlinear SDR input, which a merging model combines into HDR video preserving shadow and highlight detail.
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.
NEO is a probabilistic neural model that induces compositional programs as a learned Language of Thought from non-textual observations and executes them via a shared transition model to enable explanation-driven generalization.
Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.
GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
SVG360 lifts a single SVG to a view-conditioned representation, uses spatial memory to propagate consistent parts across views, and applies structure-aware vectorization to produce editable multiview SVGs.
BeyondMimic combines compact motion tracking with a unified guided latent diffusion model to master diverse agile behaviors from human demos and solve unseen downstream tasks via test-time classifier guidance.
FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.
COCO-Inpaint supplies a large-scale dataset and evaluation protocol focused on inpainting-based image forgeries to benchmark existing detection methods.
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
citing papers explorer
- Consistency Models
  Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.