super hub Canonical reference

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Alex Nichol, Casey Chu, Mark Chen, Prafulla Dhariwal · 2022 · cs.CV · arXiv 2204.06125

Canonical reference. 77% of citing Pith papers cite this work as background.

323 Pith papers citing it

Background 77% of classified citations

open full Pith review browse 323 citing papers more from Aditya Ramesh arXiv PDF

abstract

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 53 baseline 6 method 5 other 2

citation-polarity summary

background 51 baseline 6 use method 5 unclear 3 support 1

claims ledger

abstract Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve bo

authors

Aditya Ramesh Alex Nichol Casey Chu Mark Chen Prafulla Dhariwal

co-cited works

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

stat.ML · 2023-10-25 · unverdicted · novelty 8.0

Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

MusicLM: Generating Music From Text

cs.SD · 2023-01-26 · conditional · novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

Building Normalizing Flows with Stochastic Interpolants

cs.LG · 2022-09-30 · conditional · novelty 8.0 · 2 refs

Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

cs.LG · 2022-09-07 · unverdicted · novelty 8.0

Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

Prompt-to-Prompt Image Editing with Cross Attention Control

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

SD-MIA is a black-box membership inference attack that detects pre-training data in diffusion models via cross-modal perturbations on images and textual instructions.

Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

PURE builds forget and retain bases from per-layer cross-attention activations along a short denoising trajectory and applies a single linear projector to cross-attention weights, yielding the best forget-retain trade-off on a ten-concept benchmark.

Mosaic: Compositional Multi-Concept Erasure via Vector Field Blending

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

Mosaic is a framework for compositional multi-concept erasure in flow-based T2I models via spatial vector field blending without extra optimization, evaluated on the new CoME-Bench benchmark covering intra- and cross-category cases.

VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.

GeoDiff-SAR II: 3D-Driven Foundation Diffusion Models for SAR Generation via Decoupled Control

eess.IV · 2026-05-20 · unverdicted · novelty 7.0

GeoDiff-SAR II proposes a 3D-driven decoupled diffusion framework using GECM and ControlNet on a FLUX backbone for controllable SAR image generation across large viewpoint gaps.

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

cs.CR · 2026-05-19 · conditional · novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.

PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks

cs.LG · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

PMF-CL derives Pareto-minimal-forgetting algorithms for linear/basis-function regression and quadratic-bounded losses like logistic regression, achieving static O(d²) memory for d-parameter models.

Functionalization via Structure Completion and Motion Rectification

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.

Designing streetscapes from street-view imagery using diffusion models

cs.CV · 2026-05-17 · conditional · novelty 7.0

A multimodal diffusion model generates controllable alternative streetscapes from street-view imagery using visual metrics and text, shown on Chicago and Orlando data with gains in semantic consistency.

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

cs.CR · 2026-05-15 · unverdicted · novelty 7.0

CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.

Generating HDR Video from SDR Video

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A multi-exposure video model predicts bracketed linear SDR sequences from single nonlinear SDR input, which a merging model combines into HDR video preserving shadow and highlight detail.

HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

HIR-ALIGN augments limited target data for hyperspectral restoration by creating proxy clean images, synthesizing aligned HSIs with blur-robust diffusion and warp-based transfer, then finetuning models to lower target-domain risk.

ImageAttributionBench: How Far Are We from Generalizable Attribution?

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control cs.RO · 2026-05-02 · unverdicted · none · ref 19 · 4 links · internal anchor
HDP3 is a pocket-scale 3D diffusion policy with a Diffusion Mixer decoder that achieves state-of-the-art visuomotor control using two-step DDIM inference and under 1% of the parameters of prior 3D diffusion policies.
Learning Interactive Real-World Simulators cs.AI · 2023-10-09 · conditional · none · ref 48 · internal anchor
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Prop-Chromeleon: Adaptive Haptic Props in Mixed Reality through Generative Artificial Intelligence cs.HC · 2026-05-01 · unverdicted · none · ref 67 · internal anchor
A generative-AI pipeline dynamically generates and anchors virtual assets to match the shape of physical props, enabling adaptive passive haptics in MR that users rate higher in realism, immersion, and enjoyment than static baselines.
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding cs.LG · 2026-04-09 · unverdicted · none · ref 86 · internal anchor
A meta-optimized in-context learning approach enables training-free cross-subject semantic visual decoding from fMRI by inferring individual neural encoding patterns via hierarchical inference on a few examples.
DustNET: enabling machine learning and AI models of dusty plasmas physics.plasm-ph · 2026-03-18 · unverdicted · none · ref 19 · internal anchor
DustNET is proposed as a shared dataset to train machine learning models that complement traditional physics equations for predictive modeling of dusty plasmas across laboratory and natural scales.

Hierarchical Text-Conditional Image Generation with CLIP Latents

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer