Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Alex Nichol, Casey Chu, Mark Chen, Prafulla Dhariwal · 2022 · cs.CV · arXiv 2204.06125

133 Pith papers cite this work. Polarity classification is still indexing.

abstract

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
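The two-stage structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_prior`, `toy_decoder`, `generate`, and `EMBED_DIM` are hypothetical names, and both stages are replaced by seeded pseudo-random stand-ins rather than learned autoregressive or diffusion models. The sketch only shows the data flow: a caption is mapped to one embedding, and the decoder is sampled several times from that embedding, so the embedding-determined part stays fixed while the remaining details vary.

```python
import hashlib
import random

EMBED_DIM = 8  # hypothetical toy size, not the real CLIP embedding width


def toy_prior(caption: str, seed: int = 0) -> list[float]:
    """Stage 1 stand-in: caption -> pseudo 'image embedding'.

    A real prior is a learned model; here the vector is derived
    deterministically from a hash of the caption (an assumption made
    purely for illustration).
    """
    digest = hashlib.sha256(f"{seed}|{caption}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def toy_decoder(embedding: list[float], seed: int, size: int = 4) -> list[list[float]]:
    """Stage 2 stand-in: embedding -> size x size 'image'.

    The embedding fixes a base intensity; the seed adds small noise that
    plays the role of details not captured by the embedding.
    """
    rng = random.Random(seed)
    base = sum(embedding) / len(embedding)
    return [[base + 0.1 * rng.uniform(-1.0, 1.0) for _ in range(size)]
            for _ in range(size)]


def generate(caption: str, n_variations: int = 3) -> list[list[list[float]]]:
    """Run the prior once, then sample the decoder with different seeds."""
    embedding = toy_prior(caption)
    return [toy_decoder(embedding, seed=s) for s in range(n_variations)]


if __name__ == "__main__":
    images = generate("a corgi playing a trumpet")
    for i, image in enumerate(images):
        print(f"variation {i}: first row = {[round(p, 3) for p in image[0]]}")
```

In this sketch, all variations share the same base intensity because they share one embedding, which loosely mirrors the abstract's point that the decoder can vary details absent from the image representation.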

citation-role summary

background 3

citation-polarity summary

background 2 unclear 1

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

stat.ML · 2023-10-25 · unverdicted · novelty 8.0

Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

MusicLM: Generating Music From Text

cs.SD · 2023-01-26 · conditional · novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

Building Normalizing Flows with Stochastic Interpolants

cs.LG · 2022-09-30 · conditional · novelty 8.0 · 2 refs

Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

cs.LG · 2022-09-07 · unverdicted · novelty 8.0

Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

Prompt-to-Prompt Image Editing with Cross Attention Control

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

HIR-ALIGN augments limited target data for hyperspectral restoration by creating proxy clean images, synthesizing aligned HSIs with blur-robust diffusion and warp-based transfer, then finetuning models to lower target-domain risk.

ImageAttributionBench: How Far Are We from Generalizable Attribution?

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.

Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

CoDi decomposes the multi-agent diffusion score into pre-trained single-agent policies plus a gradient-free cost guidance term to generate coordinated behavior from single-agent data alone.

Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

math.OC · 2026-05-08 · unverdicted · novelty 7.0

Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.

Hyperbolic Concept Bottleneck Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.

A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.

A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.

LEGO: LoRA-Enabled Generator-Oriented Framework for Synthetic Image Detection

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

LEGO uses multiple generator-specific LoRA modules modulated by an MLP and fused with attention to detect synthetic images, achieving better performance than prior methods while using under 10% of the training data.

Generative Modeling with Orbit-Space Particle Flow Matching

cs.GR · 2026-05-04 · unverdicted · novelty 7.0

OGPP is a particle flow-matching method using orbit-space canonicalization and geometric paths that achieves lower error and fewer steps than prior approaches on 3D benchmarks.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

ResetEdit: Precise Text-guided Editing of Generated Image via Resettable Starting Latent

cs.CV · 2026-04-28 · unverdicted · novelty 7.0

ResetEdit embeds a recoverable discrepancy signal during image generation in diffusion models to reconstruct an approximate original latent for high-fidelity text-guided editing.

CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 over GAN baselines.

Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Prop-Chromeleon: Adaptive Haptic Props in Mixed Reality through Generative Artificial Intelligence

cs.HC · 2026-05-01 · unverdicted · none · ref 67 · internal anchor

A generative-AI pipeline dynamically generates and anchors virtual assets to match the shape of physical props, enabling adaptive passive haptics in MR that users rate higher in realism, immersion, and enjoyment than static baselines.

Creo: From One-Shot Image Generation to Progressive, Co-Creative Ideation

cs.HC · 2026-04-15 · unverdicted · none · ref 45 · internal anchor

Creo scaffolds text-to-image generation through progressive stages with editable abstractions and decision locking to improve controllability, agency, and output diversity.

Language-Guided Multimodal Texture Authoring via Generative Models

cs.HC · 2026-04-07 · unverdicted · none · ref 63 · internal anchor

A language-driven system generates semantically consistent multimodal textures from text prompts by linking autoregressive haptic models and diffusion-based visuals through a shared latent representation.

OOPrompt: Reifying Intents into Structured Artifacts for Modular and Iterative Prompting

cs.HC · 2026-04-21 · unverdicted · none · ref 35 · internal anchor

OOPrompt reifies user intents into structured manipulable artifacts to enable modular and iterative prompting in LLM-based interactive systems.
