Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh; Alex Nichol; Casey Chu; Mark Chen; Prafulla Dhariwal

arxiv: 2204.06125 · v1 · submitted 2022-04-13 · 💻 cs.CV


Pith reviewed 2026-05-10 16:51 UTC · model grok-4.3


keywords: text-to-image generation · CLIP embeddings · diffusion models · hierarchical generation · image diversity · zero-shot image editing · contrastive representations


The pith

A two-stage model that first generates a CLIP image embedding from text and then decodes it into pixels yields more diverse images than direct text-to-image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes separating text-to-image generation into a prior step that produces a CLIP image embedding from the caption and a decoder step that turns the embedding into an image. This structure is meant to increase the variety of outputs for a given caption while keeping the images realistic and aligned with the text. The approach also supports creating multiple versions of an image that retain its core meaning and appearance but differ in details not captured by the embedding. Experiments compare diffusion and autoregressive models for the prior and find diffusion versions more efficient and effective.

Core claim

The paper claims that explicitly generating CLIP image embeddings via a prior conditioned on text, then decoding those embeddings with a diffusion model, improves image diversity with minimal loss in photorealism and caption similarity relative to direct generation methods. The joint CLIP space further enables zero-shot language-guided image manipulations and controlled variations that preserve semantics and style.

What carries the argument

A prior model that maps text captions to CLIP image embeddings, paired with a diffusion decoder that maps those embeddings to images.
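The two-stage structure can be illustrated with a toy sketch. Everything here is hypothetical: `toy_prior` and `toy_decoder` are random stand-ins rather than trained models, and the names, dimensions, and noise levels are assumptions made only to show the data flow (caption to embedding, embedding to image) and where the randomness enters.

```python
import math
import random
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length, as CLIP embeddings are usually compared."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_prior(caption: str, rng: random.Random, dim: int = 8) -> Vector:
    """Hypothetical stand-in for the prior: caption -> image embedding.

    A caption-seeded mean plus sampling noise mimics the idea that one
    caption maps to a distribution over embeddings, not a single point.
    """
    caption_rng = random.Random(caption)  # same caption -> same mean
    mean = [caption_rng.gauss(0.0, 1.0) for _ in range(dim)]
    return normalize([m + 0.1 * rng.gauss(0.0, 1.0) for m in mean])


def toy_decoder(embedding: Vector, rng: random.Random, n_pixels: int = 16) -> Vector:
    """Hypothetical stand-in for the decoder: embedding -> 'image'.

    The embedding fixes the coarse content; fresh noise supplies the
    details that the embedding does not encode.
    """
    dim = len(embedding)
    return [embedding[i % dim] + 0.05 * rng.gauss(0.0, 1.0) for i in range(n_pixels)]


def generate(caption: str, seed: int) -> Vector:
    """Two-stage sampling: prior first, then decoder."""
    rng = random.Random(seed)
    embedding = toy_prior(caption, rng)
    return toy_decoder(embedding, rng)


image_a = generate("a corgi playing a trumpet", seed=0)
image_b = generate("a corgi playing a trumpet", seed=1)
```

In this sketch the same caption with different seeds gives different outputs, because both stages sample. A real system would replace both stand-ins with learned models.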

If this is right

Decoders can produce multiple variations of an image that keep its semantics and style while changing details absent from the embedding.
The joint CLIP embedding space supports language-guided image manipulations without additional training.
Diffusion models for the prior are computationally more efficient and produce higher-quality samples than autoregressive alternatives.
Explicit generation of the image representation allows the system to vary non-essential details without altering core content.
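One way a language-guided edit in a joint embedding space could be computed is by rotating an image embedding toward a normalized text-difference direction with spherical interpolation. The sketch below is an illustration under that assumption; the function names (`slerp`, `text_diff_edit`), the `theta` parameter, and the toy three-dimensional vectors are hypothetical, and real embeddings would come from a CLIP encoder.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; reject the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def slerp(a: Vector, b: Vector, t: float) -> Vector:
    """Spherical interpolation between unit vectors a (t=0) and b (t=1)."""
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)
    s = math.sin(omega)
    if s < 1e-8:
        # Nearly parallel or antipodal: the arc is degenerate, so return a.
        return a
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


def text_diff_edit(image_emb: Vector, source_text_emb: Vector,
                   target_text_emb: Vector, theta: float) -> Vector:
    """Move an image embedding toward the direction (target - source) text."""
    diff = normalize([t - s for t, s in zip(target_text_emb, source_text_emb)])
    return slerp(image_emb, diff, theta)


# Toy vectors standing in for encoder outputs (assumed values).
edited = text_diff_edit(
    image_emb=[1.0, 0.0, 0.0],
    source_text_emb=[0.0, 1.0, 0.0],
    target_text_emb=[0.0, 1.0, 1.0],
    theta=0.25,
)
```

The edited embedding stays on the unit sphere, so it can be passed to the same decoder as an ordinary image embedding. Sweeping `theta` from 0 upward would give a sequence of progressively stronger edits.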

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of prior and decoder could be tested on other conditional generation tasks where intermediate representations might improve controllability.
Leveraging a fixed pre-trained embedding space may allow independent scaling or fine-tuning of the prior and decoder for specialized domains.
Similar hierarchical designs might reduce the parameter count needed in the final decoder by offloading semantic encoding to the prior.

Load-bearing premise

A CLIP image embedding contains enough semantic and stylistic information for a decoder to reconstruct varied high-quality images while safely omitting non-essential details.

What would settle it

Train a single-stage text-to-image model and the two-stage prior-plus-decoder model on identical data, then check whether the two-stage version shows measurably higher diversity scores without a corresponding drop in photorealism or caption-matching scores.
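The text above does not say which diversity score to use. As one hypothetical proxy, the mean pairwise cosine distance between embeddings of the images sampled for a single caption can be computed as follows; the function names and the toy vectors are assumptions, and in practice the embeddings would come from a feature extractor applied to generated images.

```python
import itertools
import math
from typing import List, Sequence

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine_distance(embeddings: Sequence[Vector]) -> float:
    """Average of (1 - cosine similarity) over all unordered pairs.

    0.0 means every sample points the same way (no diversity);
    larger values mean the samples are more spread out.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two samples to measure diversity")
    distances = [
        1.0 - cosine_similarity(a, b)
        for a, b in itertools.combinations(embeddings, 2)
    ]
    return sum(distances) / len(distances)


# Toy comparison: samples from a hypothetical single-stage model (tight
# cluster) versus a hypothetical two-stage model (more spread out).
single_stage = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
two_stage = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

diversity_single = mean_pairwise_cosine_distance(single_stage)
diversity_two = mean_pairwise_cosine_distance(two_stage)
```

A fair comparison would compute this score per caption with the same number of samples for both models, and report it next to the photorealism and caption-matching scores rather than in isolation.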

Original abstract

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's main move—predicting a CLIP image embedding first, then decoding to pixels—delivers measurable diversity gains with only small quality trade-offs.


The central result is that routing generation through an explicit CLIP image embedding improves output variety compared with direct text-to-pixel models, while photorealism and caption alignment stay close. They train a prior to map captions to CLIP latents and then run a diffusion decoder conditioned on those latents. This setup also supports simple zero-shot edits by changing the embedding with text and lets them produce controlled variations that keep semantics and style but alter details the embedding ignores. The diffusion prior beats their autoregressive version on both speed and sample quality, which is a useful practical finding.

The experiments directly compare the two-stage model against baselines and report the expected diversity-quality trade-off, so the claims rest on new measurements rather than circular definitions. The architecture exploits CLIP's joint space cleanly, and the decoder conditioning experiments give concrete evidence that the intermediate representation is doing real work. One soft spot is that the abstract and summary tables leave some ablation depth and exact baseline numbers for the full paper to clarify; without those it is hard to judge how consistent the “minimal loss” is across prompt types. The assumption that CLIP latents hold enough information for high-quality decoding is plausible but still empirical, and it may cap fine-grained control in edge cases.

This work is aimed at researchers building or extending text-to-image systems who want a modular latent approach rather than end-to-end pixel models. Anyone working on diffusion or representation learning for generation will find the prior comparison and manipulation results worth reading. The paper shows clear thinking and honest experimental engagement, so it deserves a serious referee even if some sections need tightening on metrics.

Referee Report

0 major / 3 minor

Summary. The paper claims that a hierarchical two-stage model for text-conditional image generation—a prior that produces CLIP image embeddings from text captions, followed by a decoder that generates images conditioned on those embeddings—improves output diversity relative to direct text-to-image baselines while incurring only minimal losses in photorealism and caption similarity. Diffusion models are used for the decoder and both autoregressive and diffusion models are tested for the prior (with the latter found more efficient and higher-quality); the approach also enables image variations that preserve semantics and style plus zero-shot language-guided manipulations via the shared CLIP space.

Significance. If the reported empirical comparisons hold, the result is significant for text-to-image synthesis: by factoring high-level semantics and style into the CLIP embedding and letting the decoder supply omitted pixel-level details, the method demonstrably trades off diversity against quality in a controllable way. The direct ablations comparing diffusion versus autoregressive priors and varying decoder conditioning supply concrete evidence for the stated trade-off and the practical utility of zero-shot editing.

minor comments (3)

[Abstract] The claim of 'empirical improvements' and 'minimal loss' would be easier to evaluate if the abstract itself included the key quantitative metrics (e.g., FID, CLIP similarity, diversity scores) and the primary baselines against which the gains are measured.
[§3, Method] The precise conditioning mechanism and noise schedule used in the diffusion prior are described at a high level; adding the exact hyper-parameter values or a reference to the supplementary material would improve reproducibility.
[Table 2 / Figure 4] The diversity and photorealism metrics for the hierarchical model versus the direct baseline are presented, but the caption does not explicitly state the number of samples used for each metric or whether the same random seeds were shared across conditions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation of minor revision. The referee summary accurately captures the core contributions of our hierarchical prior-decoder approach using CLIP latents for improved diversity in text-conditional image generation.

Circularity Check

0 steps flagged

No significant circularity


The paper presents an empirical two-stage architecture (text-to-CLIP-embedding prior + embedding-to-image decoder) whose central claims rest on reported experimental comparisons, ablations, and qualitative results rather than any closed-form derivation. No equations or steps reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations; the CLIP embedding is treated as an external pretrained representation, and diversity/photorealism trade-offs are measured against independent baselines. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim depends on the representational power of pre-trained CLIP embeddings and the ability of diffusion decoders to reconstruct images from them; both are imported from earlier work rather than derived here.

free parameters (1)

diffusion prior and decoder hyperparameters
Specific choices of step count, noise schedule, and conditioning strength that are tuned during training.
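To make "conditioning strength" concrete, the sketch below assumes it refers to a classifier-free guidance scale, which the ledger entry above does not confirm. Under that assumption, each denoising step blends an unconditional and a conditional noise prediction; the function name `guided_noise` and the toy values are hypothetical.

```python
from typing import List

Vector = List[float]


def guided_noise(eps_uncond: Vector, eps_cond: Vector, scale: float) -> Vector:
    """Classifier-free guidance blend of two noise predictions.

    scale = 0.0 ignores the conditioning, scale = 1.0 uses the conditional
    prediction unchanged, and scale > 1.0 extrapolates past it, which
    usually trades sample diversity for closer adherence to the condition.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions standing in for two forward passes of a denoising network.
eps_uncond = [0.0, 0.5]
eps_cond = [1.0, 0.5]

no_guidance = guided_noise(eps_uncond, eps_cond, scale=0.0)
plain_conditional = guided_noise(eps_uncond, eps_cond, scale=1.0)
strong_guidance = guided_noise(eps_uncond, eps_cond, scale=2.0)
```

Because the scale is chosen by hand rather than learned, it behaves like the tuned free parameter this ledger entry describes.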

axioms (2)

domain assumption: CLIP embeddings capture the semantics and style needed for high-quality image reconstruction and variation
Invoked when the prior is trained to predict embeddings and the decoder is conditioned on them.
domain assumption: Diffusion models can decode from CLIP latents without direct text conditioning
Required for the decoder stage to function as described.



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV 2025-05 unverdicted novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
stat.ML 2023-10 unverdicted novelty 8.0

Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL 2023-09 unverdicted novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Consistency Models
cs.LG 2023-03 conditional novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
MusicLM: Generating Music From Text
cs.SD 2023-01 conditional novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
Building Normalizing Flows with Stochastic Interpolants
cs.LG 2022-09 conditional novelty 8.0

Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
cs.LG 2022-09 unverdicted novelty 8.0

Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
Prompt-to-Prompt Image Editing with Cross Attention Control
cs.CV 2022-08 unverdicted novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
cs.CV 2022-08 unverdicted novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
cs.CV 2026-05 unverdicted novelty 7.0

VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
cs.CV 2026-05 unverdicted novelty 7.0

GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...
GeoDiff-SAR II: 3D-Driven Foundation Diffusion Models for SAR Generation via Decoupled Control
eess.IV 2026-05 unverdicted novelty 7.0

GeoDiff-SAR II proposes a 3D-driven decoupled diffusion framework using GECM and ControlNet on a FLUX backbone for controllable SAR image generation across large viewpoint gaps.
Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models
cs.CR 2026-05 conditional novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
Functionalization via Structure Completion and Motion Rectification
cs.CV 2026-05 unverdicted novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture wi...
Designing streetscapes from street-view imagery using diffusion models
cs.CV 2026-05 conditional novelty 7.0

A multimodal diffusion model generates controllable alternative streetscapes from street-view imagery using visual metrics and text, shown on Chicago and Orlando data with gains in semantic consistency.
A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
cs.CR 2026-05 unverdicted novelty 7.0

CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
Generating HDR Video from SDR Video
cs.CV 2026-05 unverdicted novelty 7.0

A multi-exposure video model predicts bracketed linear SDR sequences from single nonlinear SDR input, which a merging model combines into HDR video preserving shadow and highlight detail.
HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation
cs.CV 2026-05 unverdicted novelty 7.0

HIR-ALIGN augments limited target data for hyperspectral restoration by creating proxy clean images, synthesizing aligned HSIs with blur-robust diffusion and warp-based transfer, then finetuning models to lower target...
ImageAttributionBench: How Far Are We from Generalizable Attribution?
cs.CV 2026-05 unverdicted novelty 7.0

ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
cs.CV 2026-05 unverdicted novelty 7.0

Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
cs.LG 2026-05 unverdicted novelty 7.0

SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.
Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations
cs.RO 2026-05 unverdicted novelty 7.0

CoDi decomposes the multi-agent diffusion score into pre-trained single-agent policies plus a gradient-free cost guidance term to generate coordinated behavior from single-agent data alone.
Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits
math.OC 2026-05 unverdicted novelty 7.0

Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
cs.LG 2026-05 unverdicted novelty 7.0

Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
Hyperbolic Concept Bottleneck Models
cs.LG 2026-05 unverdicted novelty 7.0

HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.
Hyperbolic Concept Bottleneck Models
cs.LG 2026-05 unverdicted novelty 7.0

Hyperbolic Concept Bottleneck Models reformulate concept activations as test-time geometric containment in hyperbolic entailment cones to produce sparse, hierarchy-aware signals without extra supervision.
A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions
cs.LG 2026-05 unverdicted novelty 7.0

FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...
A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping
cs.CV 2026-05 unverdicted novelty 7.0

Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...
LEGO: LoRA-Enabled Generator-Oriented Framework for Synthetic Image Detection
cs.CV 2026-05 unverdicted novelty 7.0

LEGO uses multiple generator-specific LoRA modules modulated by an MLP and fused with attention to detect synthetic images, achieving better performance than prior methods while using under 10% of the training data.
Generative Modeling with Orbit-Space Particle Flow Matching
cs.GR 2026-05 unverdicted novelty 7.0

OGPP is a particle flow-matching method using orbit-space canonicalization and geometric paths that achieves lower error and fewer steps than prior approaches on 3D benchmarks.
Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
cs.RO 2026-05 conditional novelty 7.0

Frequency analysis of smooth robot actions bounds denoising error to low-frequency modes, enabling a sub-1% parameter 3D diffusion policy with two-step inference that reaches SOTA on manipulation benchmarks.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
cs.CV 2026-05 unverdicted novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
ResetEdit: Precise Text-guided Editing of Generated Image via Resettable Starting Latent
cs.CV 2026-04 unverdicted novelty 7.0

ResetEdit embeds a recoverable discrepancy signal during image generation in diffusion models to reconstruct an approximate original latent for high-fidelity text-guided editing.
CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping
cs.CV 2026-04 unverdicted novelty 7.0

CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 ...
Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes
cs.CV 2026-04 unverdicted novelty 7.0

Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.
Long-Text-to-Image Generation via Compositional Prompt Decomposition
cs.CV 2026-04 unverdicted novelty 7.0

PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...
Grokking of Diffusion Models: Case Study on Modular Addition
cs.LG 2026-04 unverdicted novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
Marrying Text-to-Motion Generation with Skeleton-Based Action Recognition
cs.CV 2026-04 unverdicted novelty 7.0

CoAMD unifies skeleton-based action recognition and text-to-motion generation through autoregressive diffusion guided by a multi-modal recognizer, reporting SOTA results on 13 benchmarks for four tasks.
Quality-Aware Calibration for AI-Generated Image Detection in the Wild
cs.CV 2026-04 conditional novelty 7.0

QuAD aggregates quality-weighted detection scores from near-duplicates of an image to raise balanced accuracy by about 8% over simple averaging on state-of-the-art detectors.
Step-level Denoising-time Diffusion Alignment with Multiple Objectives
cs.LG 2026-04 unverdicted novelty 7.0

MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.
Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling
cs.CR 2026-04 unverdicted novelty 7.0

SET detects input-level backdoors in T2I diffusion models by learning a benign cross-attention response space from clean samples and flagging deviations under multi-scale perturbations.
HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement
cs.CV 2026-04 unverdicted novelty 7.0

A diffusion-based pipeline creates a 27M-annotation dataset of object placements that outperforms human annotations and baselines on image editing tasks, then distills it into a fast model.
NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
cs.LG 2026-04 unverdicted novelty 7.0

NeuroFlow is the first unified flow model for bidirectional visual encoding and decoding from neural activity using NeuroVAE and cross-modal flow matching.
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
cs.CV 2026-04 conditional novelty 7.0

SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
cs.CV 2026-03 unverdicted novelty 7.0

Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.
Is the Modality Gap a Bug or a Feature? A Robustness Perspective
cs.CV 2026-03 unverdicted novelty 7.0

Minimizing contrastive loss produces an orthogonal modality gap vector whose size is monotonically tied to robustness, so post-processing that reduces the gap improves robustness with no loss in clean accuracy.
Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation
cs.CV 2026-03 unverdicted novelty 7.0

DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
MultiAnimate: Pose-Guided Image Animation Made Extensible
cs.CV 2026-02 unverdicted novelty 7.0

MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.
Information Filtering via Variational Regularization for Robot Manipulation
cs.RO 2026-01 unverdicted novelty 7.0

Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld whil...
A Unified and Controllable Framework for Layered Image Generation with Visual Effects
cs.CV 2026-01 unverdicted novelty 7.0

LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
ATATA: One Algorithm to Align Them All
cs.CV 2026-01 unverdicted novelty 7.0

ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.
CompNO: A Novel Foundation Model approach for solving Partial Differential Equations
cs.LG 2026-01 unverdicted novelty 7.0

CompNO composes specialized Fourier neural operator blocks for fundamental differential operators into task-specific solvers that achieve lower L2 error than baselines on linear parametric PDEs and remain competitive ...
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
cs.CV 2026-01 unverdicted novelty 7.0

LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
cs.CV 2026-01 unverdicted novelty 7.0

LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...
Screen, Cache, and Match: A Training-Free Causality-Consistent Reference Frame Framework for Human Animation
cs.GR 2025-12 unverdicted novelty 7.0

FrameCache uses a Screen-Cache-Match strategy and Trajectory-Aware Autoregressive Generation to convert past frames into causal guidance for temporally coherent human animation videos.
Agile Deliberation: Concept Deliberation for Subjective Visual Classification
cs.AI 2025-12 conditional novelty 7.0

Agile Deliberation improves F1 scores by 7.5% over automated baselines and 3% over manual deliberation in 18 user sessions by supporting iterative refinement of subjective visual concepts.
One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
cs.CV 2025-11 unverdicted novelty 7.0

One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and t...
SVG360: Editable Multiview Vector Graphics from a Single SVG
cs.CV 2025-11 unverdicted novelty 7.0

SVG360 lifts a single SVG to a view-conditioned representation, uses spatial memory to propagate consistent parts across views, and applies structure-aware vectorization to produce editable multiview SVGs.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 293 Pith papers · 16 internal anchors

[1]

arXiv preprint arXiv:2201.07520 , year=

Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. CM3: A Causal Masked Multimodal Model of the Internet. arXiv:2201.07520, 2022

work page arXiv 2022
[2]

Analytic-dpm: an an- alytic estimate of the optimal reverse variance in diffusion probabilistic models

Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models. CoRR, abs/2201.06503, 2022. URL https: //arxiv.org/abs/2201.06503

work page arXiv 2022
[3]

High Fidelity Visualization of What Your Self-Supervised Representation Knows About

Florian Bordes, Randall Balestriero, and Pascal Vincent. High Fidelity Visualization of What Your Self-Supervised Representation Knows About. arXiv:2112.09164, 2021

work page arXiv 2021
[4]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page internal anchor Pith review Pith/arXiv arXiv 2005
[5]

Very deep vaes generalize autoregressive models and can outperform them on images,

Rewon Child. Very Deep V AEs Generalize Autoregressive Models and Can Outperform Them on Images. arXiv:2011.10650, 2021

work page arXiv 2011
[6]

AVA Linear Probe

Katherine Crowson. AVA Linear Probe. https://twitter.com/RiversHaveWings/status/1472346186728173568?s=20&t=T-HRr3Gw5HRGjQaMDtRe3A, 2021

work page 2021
[7]

CLIP guided diffusion HQ 256x256

Katherine Crowson. CLIP guided diffusion HQ 256x256. https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj, 2021

work page 2021
[8]

CLIP Guided Diffusion 512x512, Secondary Model Method

Katherine Crowson. CLIP Guided Diffusion 512x512, Secondary Model Method. https://twitter.com/RiversHaveWings/status/1462859669454536711, 2021

work page arXiv 2021
[9]

v-diffusion

Katherine Crowson. v-diffusion. https://github.com/crowsonkb/v-diffusion-pytorch, 2021

work page 2021
[10]

VirTex: Learning Visual Representations from Textual Annotations

Karan Desai and Justin Johnson. VirTex: Learning Visual Representations from Textual Annotations. arXiv:2006.06666, 2020

work page arXiv 2006
[11]

Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, 2021

work page internal anchor Pith review arXiv 2021
[12]

Cogview: Mastering text-to-image generation via transformers

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering Text-to-Image Generation via Transformers. arXiv:2105.13290, 2021

work page arXiv 2021
[13]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[14]

Taming Transformers for High-Resolution Image Synthesis

Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis. arXiv:2012.09841, 2020

work page arXiv 2012
[15]

Sharpness-Aware Minimization for Efficiently Improving Generalization

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-Aware Minimization for Efficiently Improving Generalization. arXiv:2010.01412, 2020.

work page internal anchor Pith review arXiv 2010
[16]

CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

Andreas Fürst, Elisabeth Rumetshofer, Viet Thuong Tran, Hubert Ramsauer, Fei Tang, Johannes Lehner, D P Kreil, Michael K Kopp, Günter Klambauer, Angela Bitto-Nemling, and Sepp Hochreiter. CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP, 2022. URL https://openreview.net/forum?id=qw674L9PfQE

work page 2022
[17]

Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. arXiv:2203.13131, 2022

work page arXiv 2022
[18]

Stylegan-nada: Clip-guided domain adaptation of image generators

Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators. arXiv:2108.00946, 2021

work page arXiv 2021
[19]

Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search

Federico A. Galatolo, Mario G. C. A. Cimino, and Gigliola Vaglini. Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search. arXiv:2102.01645, 2021

work page arXiv 2021
[20]

Multimodal Neurons in Artificial Neural Networks

Gabriel Goh, Nick Cammarata†, Chelsea Voss†, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal Neurons in Artificial Neural Networks. Distill, 2021. doi: 10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons

work page doi:10.23915/distill.00030 2021
[21]

Generative Adversarial Networks

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. arXiv:1406.2661, 2014

work page internal anchor Pith review arXiv 2014
[22]

Vector quantized diffusion model for text-to-image synthesis

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector Quantized Diffusion Model for Text-to-Image Synthesis. arXiv:2111.14822, 2021

work page arXiv 2021
[23]

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017

work page 2017
[24]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8AKxfYbI

work page 2021
[25]

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. arXiv:2006.11239, 2020

work page internal anchor Pith review arXiv 2006
[26]

Cascaded diffusion models for high fidelity image generation

Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded Diffusion Models for High Fidelity Image Generation. arXiv:2106.15282, 2021

work page arXiv 2021
[27]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[28]

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2014

work page internal anchor Pith review arXiv 2014
[29]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

DALL·E 2 Preview - Risks and Limitations

Pamela Mishkin, Lama Ahmad, Miles Brundage, Gretchen Krueger, and Girish Sastry. DALL·E 2 Preview - Risks and Limitations. 2022. URL https://github.com/openai/dalle-2-preview/blob/main/system-card.md

work page 2022
[31]

SLIP: Self-supervision meets Language-Image Pre-training

Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. SLIP: Self-supervision meets Language-Image Pre-training. arXiv:2112.12750, 2021

work page arXiv 2021
[32]

The Big Sleep

Ryan Murdock. The Big Sleep. https://twitter.com/advadnoun/status/1351038053033406468, 2021.

work page 2021
[33]

AVA: A large-scale database for aesthetic visual analysis

Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2408–2415,

work page 2012
[34]

doi: 10.1109/CVPR.2012.6247954

work page doi:10.1109/cvpr.2012.6247954 2012
[35]

Improved Denoising Diffusion Probabilistic Models

Alex Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. arXiv:2102.09672, 2021

work page internal anchor Pith review arXiv 2021
[36]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741, 2021

work page internal anchor Pith review arXiv 2021
[37]

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. arXiv:2103.17249, 2021

work page arXiv 2021
[38]

Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space, November 1901. URL https://doi.org/10.1080/14786440109462720

work page doi:10.1080/14786440109462720 1901
[39]

Diffusion Autoencoders: Toward a Meaningful and Decodable Representation

Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. arXiv:2111.15640, 2021

work page arXiv 2021
[40]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[41]

Zero-Shot Text-to-Image Generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. arXiv:2102.12092, 2021

work page internal anchor Pith review arXiv 2021
[42]

Generating Diverse High-Fidelity Images with VQ-VAE-2

Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating Diverse High-Fidelity Images with VQ-VAE-2. arXiv:1906.00446, 2019

work page Pith review arXiv 1906
[43]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752, 2021

work page Pith review arXiv 2021
[44]

Image Super-Resolution via Iterative Refinement

Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image Super-Resolution via Iterative Refinement. arXiv:2104.07636, 2021

work page arXiv 2021
[45]

Learning Visual Representations with Caption Annotations

Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus. Learning Visual Representations with Caption Annotations. arXiv:2008.01392, 2020

work page arXiv 2008
[46]

How Much Can CLIP Benefit Vision-and-Language Tasks?

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How Much Can CLIP Benefit Vision-and-Language Tasks? arXiv:2107.06383, 2021

work page arXiv 2021
[47]

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv:1503.03585, 2015

work page internal anchor Pith review arXiv 2015
[48]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[49]

Improved techniques for training Score-Based generative models

Yang Song and Stefano Ermon. Improved Techniques for Training Score-Based Generative Models. arXiv:2006.09011, 2020

work page arXiv 2006
[50]

Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis

Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:2008.05865, 2020

work page arXiv 2008
[51]

NVAE: A Deep Hierarchical Variational Autoencoder

Arash Vahdat and Jan Kautz. NVAE: A Deep Hierarchical Variational Autoencoder. arXiv:2007.03898, 2020.

work page arXiv 2007
[52]

Score-based Generative Modeling in Latent Space

Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based Generative Modeling in Latent Space. In Neural Information Processing Systems (NeurIPS), 2021

work page 2021
[53]

Neural Discrete Representation Learning

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. arXiv:1711.00937, 2017

work page Pith review arXiv 2017
[54]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[55]

CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

Zihao Wang, Wei Liu, Qian He, Xinglong Wu, and Zili Yi. CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP. arXiv:2203.00386, 2022

work page arXiv 2022
[56]

GAN Inversion: A Survey

Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. GAN Inversion: A Survey. arXiv:2101.05278, 2021

work page arXiv 2021
[57]

Attngan: Fine-grained text to image generation with attentional generative adversarial networks

Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. arXiv:1711.10485, 2017

work page arXiv 2017
[58]

Improving text-to-image synthesis using contrastive learning

Hui Ye, Xiulong Yang, Martin Takac, Rajshekhar Sunderraman, and Shihao Ji. Improving Text-to-Image Synthesis Using Contrastive Learning. arXiv:2107.02423, 2021

work page arXiv 2021
[59]

Cross-Modal Contrastive Learning for Text-to-Image Generation

Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-Modal Contrastive Learning for Text-to-Image Generation. arXiv:2101.04702, 2021

work page arXiv 2021
[60]

Designing a Practical Degradation Model for Deep Blind Image Super-Resolution

Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a Practical Degradation Model for Deep Blind Image Super-Resolution. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2021. doi: 10.1109/iccv48922.2021.00475. URL http://dx.doi.org/10.1109/ICCV48922.2021.00475

work page doi:10.1109/iccv48922.2021.00475 2021
[61]

Contrastive Learning of Medical Visual Representations from Paired Images and Text

Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv:2010.00747, 2020

work page arXiv 2010
[62]

Lafite: Towards language-free training for text-to-image generation

Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. LAFITE: Towards Language-Free Training for Text-to-Image Generation. arXiv:2111.13792, 2021

work page arXiv 2021
[63]

Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative Visual Manipulation on the Natural Image Manifold. arXiv:1609.03552, 2016

work page arXiv 2016
[64]

Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis

Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:1904.01310, 2019. A Linear Probes for Evaluations For our evaluations, we leverage two new linear probes on top of a CLIP ViT-L/14 [13] model. To automate aesthetic quality evaluations, we follow the procedure used b...

work page arXiv 1904
