Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Alex Nichol, Casey Chu, Mark Chen, Prafulla Dhariwal · 2022 · cs.CV · arXiv 2204.06125

Canonical reference. 77% of citing Pith papers cite this work as background.

381 Pith papers citing it

Background 77% of classified citations

abstract

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
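
The two-stage structure described in the abstract (a prior that maps a caption to an image embedding, then a decoder that maps the embedding to an image) can be sketched as below. This is a minimal toy illustration, not the paper's implementation: `toy_text_embedding`, `toy_prior`, `toy_decoder`, and `EMBED_DIM` are hypothetical stand-ins that use random noise in place of trained CLIP and diffusion models. It only shows the data flow, and how re-running the decoder on a fixed embedding gives variations.

```python
import math
import random

EMBED_DIM = 8  # hypothetical toy size, not the real CLIP embedding dimension


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def toy_text_embedding(caption):
    """Hypothetical stand-in for a text encoder: one fixed vector per caption."""
    rng = random.Random(caption)  # string seed makes this deterministic
    return normalize([rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)])


def toy_prior(caption, rng):
    """Stage 1 stand-in: sample an image embedding conditioned on the caption."""
    text_emb = toy_text_embedding(caption)
    noisy = [t + 0.3 * rng.gauss(0.0, 1.0) for t in text_emb]
    return normalize(noisy)


def toy_decoder(image_emb, rng, size=4):
    """Stage 2 stand-in: sample a size x size 'image' conditioned on the embedding."""
    mean = sum(image_emb) / len(image_emb)
    return [[mean + 0.1 * rng.gauss(0.0, 1.0) for _ in range(size)]
            for _ in range(size)]


def generate(caption, seed=0):
    """Caption -> image embedding (prior) -> image (decoder)."""
    rng = random.Random(seed)
    image_emb = toy_prior(caption, rng)
    return image_emb, toy_decoder(image_emb, rng)


def variations(image_emb, n, seed=0):
    """Hold the embedding fixed and re-run only the decoder n times."""
    rng = random.Random(seed)
    return [toy_decoder(image_emb, rng) for _ in range(n)]


if __name__ == "__main__":
    emb, image = generate("a corgi playing a trumpet", seed=1)
    print(len(emb), len(image), len(variations(emb, 3)))
```

Separating the stages is what makes the variation use case cheap in this sketch: the embedding is sampled once and the decoder is the only part that runs again.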

citation-role summary

background: 53 · baseline: 6 · method: 5 · other: 2

citation-polarity summary

background: 51 · baseline: 6 · use method: 5 · unclear: 3 · support: 1

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

stat.ML · 2023-10-25 · unverdicted · novelty 8.0

Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

MusicLM: Generating Music From Text

cs.SD · 2023-01-26 · conditional · novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

Building Normalizing Flows with Stochastic Interpolants

cs.LG · 2022-09-30 · conditional · novelty 8.0 · 2 refs

Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

cs.LG · 2022-09-07 · unverdicted · novelty 8.0

Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

Prompt-to-Prompt Image Editing with Cross Attention Control

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

QWERTY: Training-Free Motion Control via Query-Warped Video Diffusion Transformers

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

QWERTY enables training-free motion control in pretrained image-to-video DiTs by warping the frame-invariant semantic subspace of queries in 3D full attention and using the predicted noise as self-guidance for latent optimization.

Quantum Generative Diffusion Model for Real-World Time Series

cs.LG · 2026-06-25 · unverdicted · novelty 7.0

QDiffusion-TS is the first quantum generative diffusion model for time series, achieving ~44% lower Wasserstein distance on Apple and Amazon stock data and up to 71% better forecasting RMSE with ~1000x fewer parameters than classical diffusion.

PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

cs.CV · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.

Trustworthy Image Authentication using Forensic Knowledge Graphs

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

Forensic Knowledge Graphs integrate forensic traces, causal dependencies, and scene links via a new authentication network and Iterative Context Refinement to outperform standard detectors and VLMs on detection, localization, and justification.

Pulse: Training Acceleration for Large Diffusion Models with Automatic Pipeline Parallelism

cs.DC · 2026-06-17 · unverdicted · novelty 7.0

PULSE collocates skip-connected encoder-decoder layers and uses a skip-aware DP partitioner plus ILP scheduler to reduce communication 89% and raise throughput up to 2.3x versus prior pipeline strategies for diffusion models.

Learning a Maximum Entropy Model for Visual Textures using Diffusion

cs.CV · 2026-06-15 · unverdicted · novelty 7.0

A diffusion-trained maximum entropy model uses 512 learned statistics to synthesize visual textures at quality matching or exceeding prior models that rely on ~177k statistics.

Learning with Simulators: No Regret in a Computationally Bounded World

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

Simulator access for dependent data recovers i.i.d.-style VC bounds and enables a universal no-regret algorithm for time-bounded samplable processes.

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Adv-TGD is a text-guided diffusion attack that achieves 85.9% black-box ASR on four face recognition models while preserving PSNR 28.18 dB and SSIM 0.981.

SSR-Merge: Subspace Signal Routing for Training-Free LoRA Merging in Diffusion Models

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

SSR-Merge merges LoRAs via subspace construction, inverse correlation decorrelation, and directional steering, shown to match the OLS solution with a streaming implementation that outperforms prior merging methods.

H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

H2HMem is a multimodal memory benchmark evaluating LLM agents on recall, reasoning, and application in dyadic and multi-party human-human conversations with phenomena such as anaphora and deixis.

TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

TrioPose proposes a Triple-Stream Pose-Aware DiT with relational bias masks and spatial loss weighting to achieve SOTA pose-guided text-to-image results on multi-person benchmarks like Human-Art.

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

cs.CV · 2026-06-04 · conditional · novelty 7.0

Parallel Jacobi Decoding accelerates autoregressive image models 4.8x-6.4x by using 2D spatial draft expansion and adjusted attention masks while keeping generation quality competitive.

GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

GeM-NR performs multi-view consistent nonrigid editing by aligning depth-derived point clouds between edited and unedited scenes then refining projections conditioned on the original query view.

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

VPE inserts an internal autoregressive visual semantic token generation step to guide image token production in unified models, reporting faster convergence, higher quality, and superior editing preservation (PSNR 26.76 vs 19.92) versus external alternatives.

citing papers explorer

Showing 13 of 13 citing papers after filters.

Agile Deliberation: Concept Deliberation for Subjective Visual Classification

cs.AI · 2025-12-11 · conditional · none · ref 30 · internal anchor

Agile Deliberation improves F1 scores by 7.5% over automated baselines and 3% over manual deliberation in 18 user sessions by supporting iterative refinement of subjective visual concepts.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · none · ref 48 · internal anchor

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

cs.AI · 2026-06-17 · unverdicted · none · ref 21 · internal anchor

BeliefDiffusion combines diffusion models for multimodal belief distributions with MPC planning, outperforming RL and generative baselines in synthetic POMDP navigation tasks.

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

cs.AI · 2026-06-16 · unverdicted · none · ref 26 · internal anchor

STAR uses text-image attention to create dynamic spatial allocation maps that vary across denoising steps and applies the same advantage more strongly to relevant latent regions in RL post-training of diffusion models.

P-Guide: Parameter-Efficient Prior Steering for Single-Pass CFG Inference

cs.AI · 2026-05-07 · unverdicted · none · ref 32 · internal anchor

P-Guide achieves single-pass classifier-free guidance in flow matching by modulating the initial latent state and is equivalent to standard CFG under a first-order approximation while cutting latency by half.

Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

cs.AI · 2026-04-07 · unverdicted · none · ref 27 · internal anchor

Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.

Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration

cs.AI · 2026-02-03 · unverdicted · none · ref 45 · internal anchor

A diffusion model with dynamic modality gating and cross-modal mutual learning restores missing features in VLMs bi-directionally while preserving the original model's generalization.

PhyDrawGen: Physically Grounded Diagram Generation from Natural Language

cs.AI · 2026-05-28 · unverdicted · none · ref 27 · internal anchor

PhyDrawGen is a neuro-symbolic pipeline that extracts typed scene graphs via LLM, converts them to physically constrained PSLGs via deterministic solver, and refines via fine-tuned Qwen-VL, claiming superior performance over GPT-5-image and Gemini models on 1,449 physics problems.

Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models

cs.AI · 2026-04-22 · unverdicted · none · ref 29 · internal anchor

Target-based prompting lets users define fairness distributions for skin tones in generative AI, shifting outputs closer to chosen targets across 36 tested prompts for occupations and contexts.

On the Power of Foundation Models

cs.AI · 2022-11-29 · unverdicted · none · ref 61 · internal anchor

Category theory proves prompt-based learning on perfect foundation models works only for representable tasks, fine-tuning solves tasks in the pretext category, and models can represent unseen target-category objects using source-category structure.

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

cs.AI · 2025-01-29 · conditional · none · ref 35 · internal anchor

Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

A Survey of Hallucination in Large Foundation Models

cs.AI · 2023-09-12 · accept · none · ref 96 · internal anchor

A survey classifying hallucination phenomena specific to large foundation models, establishing evaluation criteria, examining mitigation strategies, and discussing future directions.

PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models

cs.AI · 2026-04-07 · unreviewed · ref 19 · internal anchor
