Hierarchical Text-Conditional Image Generation with CLIP Latents
Pith reviewed 2026-05-10 16:51 UTC · model grok-4.3
The pith
A two-stage model that first generates a CLIP image embedding from text and then decodes it into pixels yields more diverse images than direct text-to-image generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that explicitly generating CLIP image embeddings via a prior conditioned on text, then decoding those embeddings with a diffusion model, improves image diversity with minimal loss in photorealism and caption similarity relative to direct generation methods. The joint CLIP space further enables zero-shot language-guided image manipulations and controlled variations that preserve semantics and style.
What carries the argument
A prior model that maps text captions to CLIP image embeddings, paired with a diffusion decoder that maps those embeddings to images.
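The two-stage structure can be sketched in a few lines. This is a toy illustration only: `toy_prior`, `toy_decoder`, `generate`, and the 8-dimensional embedding are hypothetical stand-ins invented here, not the paper's models, which are learned networks operating on real CLIP embeddings.

```python
import numpy as np

EMBED_DIM = 8  # toy size chosen for illustration; not the real CLIP width


def l2_normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def toy_prior(text_embedding, rng, noise_scale=0.3):
    """Hypothetical stand-in for the prior: text embedding -> image embedding.

    A real prior is a learned generative model. Here we only perturb the
    text embedding, so repeated calls give different image embeddings
    for the same caption.
    """
    noise = rng.normal(size=text_embedding.shape)
    return l2_normalize(text_embedding + noise_scale * noise)


def toy_decoder(image_embedding, rng, height=4, width=4):
    """Hypothetical stand-in for the decoder: image embedding -> pixels.

    The embedding fixes the coarse content; fresh noise supplies the
    details that the embedding does not specify.
    """
    base = np.resize(image_embedding, (height, width))
    detail = 0.05 * rng.normal(size=(height, width))
    return base + detail


def generate(text_embedding, rng):
    """Run both stages: sample an image embedding, then decode it."""
    z_image = toy_prior(text_embedding, rng)
    return z_image, toy_decoder(z_image, rng)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    caption_embedding = l2_normalize(rng.normal(size=EMBED_DIM))
    z, image = generate(caption_embedding, rng)
    print(z.shape, image.shape)
```

Calling `toy_decoder` twice on the same embedding gives two outputs that share the coarse content but differ in detail, which is the toy analogue of the image variations described in the core claim.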
If this is right
- Decoders can produce multiple variations of an image that keep its semantics and style while changing details absent from the embedding.
- The joint CLIP embedding space supports language-guided image manipulations without additional training.
- Diffusion models for the prior are computationally more efficient and produce higher-quality samples than autoregressive alternatives.
- Explicit generation of the image representation allows the system to vary non-essential details without altering core content.
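One way such language-guided manipulation can work is to rotate an image embedding toward a direction defined by the difference of two text embeddings, using spherical interpolation. The sketch below assumes that scheme; the function names and the toy vectors in the usage note are illustrative, and the summary above does not specify the exact procedure.

```python
import numpy as np


def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def slerp(a, b, t):
    """Spherical interpolation between unit vectors a and b, with t in [0, 1].

    Note: the formula is ill-conditioned when a and b are nearly
    antipodal; that case is not handled in this sketch.
    """
    cos_omega = np.clip(np.dot(a, b), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if omega < 1e-8:
        # The vectors already coincide, so there is nothing to interpolate.
        return a.copy()
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)


def text_diff_edit(z_image, z_text_target, z_text_source, theta):
    """Rotate an image embedding toward a normalized text-difference direction.

    z_text_source is assumed to describe the current image and
    z_text_target the desired edit; theta controls the edit strength.
    """
    z_diff = normalize(z_text_target - z_text_source)
    return slerp(normalize(z_image), z_diff, theta)
```

For example, `text_diff_edit(z_img, z_new_caption, z_old_caption, 0.25)` would return a unit-length embedding a quarter of the way along the arc from the image embedding to the text-difference direction; decoding it would then give the edited image.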
Where Pith is reading between the lines
- The separation of prior and decoder could be tested on other conditional generation tasks where intermediate representations might improve controllability.
- Leveraging a fixed pre-trained embedding space may allow independent scaling or fine-tuning of the prior and decoder for specialized domains.
- Similar hierarchical designs might reduce the parameter count needed in the final decoder by offloading semantic encoding to the prior.
Load-bearing premise
A CLIP image embedding contains enough semantic and stylistic information for a decoder to reconstruct varied high-quality images while safely omitting non-essential details.
What would settle it
Train a single-stage text-to-image model and the two-stage prior-plus-decoder model on identical data, then check whether the two-stage version shows measurably higher diversity scores without a corresponding drop in photorealism or caption-matching scores.
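The comparison above needs a concrete diversity score. One possible automatic proxy, offered here as an assumption rather than as the paper's evaluation protocol, is the mean pairwise cosine distance between feature embeddings of images generated for the same caption. The function name is hypothetical.

```python
import numpy as np


def mean_pairwise_cosine_distance(embeddings):
    """Average (1 - cosine similarity) over all distinct pairs of rows.

    `embeddings` is an (n, d) array with one feature vector per generated
    image. Higher values mean the samples are more spread out.
    """
    x = np.asarray(embeddings, dtype=float)
    if x.ndim != 2 or x.shape[0] < 2:
        raise ValueError("need at least two embedding vectors")
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    n = x.shape[0]
    # Exclude the diagonal, which holds each vector's similarity to itself.
    off_diagonal_sum = sims.sum() - np.trace(sims)
    return 1.0 - off_diagonal_sum / (n * (n - 1))
```

Computing this score per caption for both models, on the same captions and with the same number of samples, would give one diversity number per condition to set beside the photorealism and caption-matching scores.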
Original abstract
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
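The two-stage design corresponds to a factorization of the text-to-image distribution. As a sketch, write x for an image, y for its caption, and z_i for the CLIP image embedding of x (this notation is assumed here and is not given in the abstract):

```latex
P(x \mid y) = P(x, z_i \mid y) = P(x \mid z_i, y)\, P(z_i \mid y)
```

The first equality holds because z_i is a deterministic function of x; the second is the chain rule. The factor P(z_i | y) plays the role of the prior and P(x | z_i, y) the role of the decoder.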
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a hierarchical two-stage model for text-conditional image generation—a prior that produces CLIP image embeddings from text captions, followed by a decoder that generates images conditioned on those embeddings—improves output diversity relative to direct text-to-image baselines while incurring only minimal losses in photorealism and caption similarity. Diffusion models are used for the decoder and both autoregressive and diffusion models are tested for the prior (with the latter found more efficient and higher-quality); the approach also enables image variations that preserve semantics and style plus zero-shot language-guided manipulations via the shared CLIP space.
Significance. If the reported empirical comparisons hold, the result is significant for text-to-image synthesis: by factoring high-level semantics and style into the CLIP embedding and letting the decoder supply omitted pixel-level details, the method demonstrably trades off diversity against quality in a controllable way. The direct ablations comparing diffusion versus autoregressive priors and varying decoder conditioning supply concrete evidence for the stated trade-off and the practical utility of zero-shot editing.
minor comments (3)
- [Abstract] The claim of 'empirical improvements' and 'minimal loss' would be easier to evaluate if the abstract itself included the key quantitative metrics (e.g., FID, CLIP similarity, diversity scores) and the primary baselines against which the gains are measured.
- [§3 (Method)] The precise conditioning mechanism and noise schedule used in the diffusion prior are described only at a high level; adding the exact hyperparameter values or a reference to the supplementary material would improve reproducibility.
- [Table 2 / Figure 4] The diversity and photorealism metrics for the hierarchical model versus the direct baseline are presented, but the caption does not explicitly state the number of samples used for each metric or whether the same random seeds were shared across conditions.
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation of minor revision. The referee summary accurately captures the core contributions of our hierarchical prior-decoder approach using CLIP latents for improved diversity in text-conditional image generation.
Circularity Check
No significant circularity
full rationale
The paper presents an empirical two-stage architecture (a text-to-CLIP-embedding prior plus an embedding-to-image decoder) whose central claims rest on reported experimental comparisons, ablations, and qualitative results rather than on any closed-form derivation. No equations or steps reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations; the CLIP embedding is treated as an external pretrained representation, and the diversity/photorealism trade-offs are measured against independent baselines. The chain of evidence is therefore anchored in external benchmarks rather than in the paper's own definitions.
Axiom & Free-Parameter Ledger
free parameters (1)
- diffusion prior and decoder hyperparameters
axioms (2)
- domain assumption: CLIP embeddings capture the semantics and style needed for high-quality image reconstruction and variation
- domain assumption: Diffusion models can decode from CLIP latents without direct text conditioning
Forward citations
Cited by 60 Pith papers
- Flow-GRPO: Training Flow Matching Models via Online RL
  Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
  Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.
- Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
  Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
  Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
- Consistency Models
  Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
- MusicLM: Generating Music From Text
  MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
- Building Normalizing Flows with Stochastic Interpolants
  Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.
- Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
  Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
- Prompt-to-Prompt Image Editing with Cross Attention Control
  Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
  Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
- VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
  VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
- GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
  GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...
- GeoDiff-SAR II: 3D-Driven Foundation Diffusion Models for SAR Generation via Decoupled Control
  GeoDiff-SAR II proposes a 3D-driven decoupled diffusion framework using GECM and ControlNet on a FLUX backbone for controllable SAR image generation across large viewpoint gaps.
- Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models
  ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
- Functionalization via Structure Completion and Motion Rectification
  Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture wi...
- Designing streetscapes from street-view imagery using diffusion models
  A multimodal diffusion model generates controllable alternative streetscapes from street-view imagery using visual metrics and text, shown on Chicago and Orlando data with gains in semantic consistency.
- A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
  CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
- Generating HDR Video from SDR Video
  A multi-exposure video model predicts bracketed linear SDR sequences from single nonlinear SDR input, which a merging model combines into HDR video preserving shadow and highlight detail.
- HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation
  HIR-ALIGN augments limited target data for hyperspectral restoration by creating proxy clean images, synthesizing aligned HSIs with blur-robust diffusion and warp-based transfer, then finetuning models to lower target...
- ImageAttributionBench: How Far Are We from Generalizable Attribution?
  ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
- Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
  Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
- Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
  SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.
- Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations
  CoDi decomposes the multi-agent diffusion score into pre-trained single-agent policies plus a gradient-free cost guidance term to generate coordinated behavior from single-agent data alone.
- Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits
  Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
- Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
  Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
- Hyperbolic Concept Bottleneck Models
  HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.
- Hyperbolic Concept Bottleneck Models
  Hyperbolic Concept Bottleneck Models reformulate concept activations as test-time geometric containment in hyperbolic entailment cones to produce sparse, hierarchy-aware signals without extra supervision.
- A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions
  FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.
- D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
  D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...
- A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping
  Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...
- LEGO: LoRA-Enabled Generator-Oriented Framework for Synthetic Image Detection
  LEGO uses multiple generator-specific LoRA modules modulated by an MLP and fused with attention to detect synthetic images, achieving better performance than prior methods while using under 10% of the training data.
- Generative Modeling with Orbit-Space Particle Flow Matching
  OGPP is a particle flow-matching method using orbit-space canonicalization and geometric paths that achieves lower error and fewer steps than prior approaches on 3D benchmarks.
- Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
  Frequency analysis of smooth robot actions bounds denoising error to low-frequency modes, enabling a sub-1% parameter 3D diffusion policy with two-step inference that reaches SOTA on manipulation benchmarks.
- VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
  VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
- ResetEdit: Precise Text-guided Editing of Generated Image via Resettable Starting Latent
  ResetEdit embeds a recoverable discrepancy signal during image generation in diffusion models to reconstruct an approximate original latent for high-fidelity text-guided editing.
- CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping
  CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 ...
- Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes
  Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.
- Long-Text-to-Image Generation via Compositional Prompt Decomposition
  PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...
- Grokking of Diffusion Models: Case Study on Modular Addition
  Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
- Marrying Text-to-Motion Generation with Skeleton-Based Action Recognition
  CoAMD unifies skeleton-based action recognition and text-to-motion generation through autoregressive diffusion guided by a multi-modal recognizer, reporting SOTA results on 13 benchmarks for four tasks.
- Quality-Aware Calibration for AI-Generated Image Detection in the Wild
  QuAD aggregates quality-weighted detection scores from near-duplicates of an image to raise balanced accuracy by about 8% over simple averaging on state-of-the-art detectors.
- Step-level Denoising-time Diffusion Alignment with Multiple Objectives
  MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.
- Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling
  SET detects input-level backdoors in T2I diffusion models by learning a benign cross-attention response space from clean samples and flagging deviations under multi-scale perturbations.
- HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement
  A diffusion-based pipeline creates a 27M-annotation dataset of object placements that outperforms human annotations and baselines on image editing tasks, then distills it into a fast model.
- NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
  NeuroFlow is the first unified flow model for bidirectional visual encoding and decoding from neural activity using NeuroVAE and cross-modal flow matching.
- SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
  SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
- Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
  Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.
- Is the Modality Gap a Bug or a Feature? A Robustness Perspective
  Minimizing contrastive loss produces an orthogonal modality gap vector whose size is monotonically tied to robustness, so post-processing that reduces the gap improves robustness with no loss in clean accuracy.
- Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation
  DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
- MultiAnimate: Pose-Guided Image Animation Made Extensible
  MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.
- Information Filtering via Variational Regularization for Robot Manipulation
  Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld whil...
- A Unified and Controllable Framework for Layered Image Generation with Visual Effects
  LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
- ATATA: One Algorithm to Align Them All
  ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.
- CompNO: A Novel Foundation Model approach for solving Partial Differential Equations
  CompNO composes specialized Fourier neural operator blocks for fundamental differential operators into task-specific solvers that achieve lower L2 error than baselines on linear parametric PDEs and remain competitive ...
- Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
  LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.
- Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
  LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...
- Screen, Cache, and Match: A Training-Free Causality-Consistent Reference Frame Framework for Human Animation
  FrameCache uses a Screen-Cache-Match strategy and Trajectory-Aware Autoregressive Generation to convert past frames into causal guidance for temporally coherent human animation videos.
- Agile Deliberation: Concept Deliberation for Subjective Visual Classification
  Agile Deliberation improves F1 scores by 7.5% over automated baselines and 3% over manual deliberation in 18 user sessions by supporting iterative refinement of subjective visual concepts.
- One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
  One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and t...
- SVG360: Editable Multiview Vector Graphics from a Single SVG
  SVG360 lifts a single SVG to a view-conditioned representation, uses spatial memory to propagate consistent parts across views, and applies structure-aware vectorization to produce editable multiview SVGs.
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2201.07520 , year=
Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. CM3: A Causal Masked Multimodal Model of the Internet. arXiv:2201.07520, 2022
-
[2]
Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models. CoRR, abs/2201.06503, 2022. URL https: //arxiv.org/abs/2201.06503
-
[3]
High Fidelity Visualization of What Your Self-Supervised Representation Knows About
Florian Bordes, Randall Balestriero, and Pascal Vincent. High Fidelity Visualization of What Your Self-Supervised Representation Knows About. arXiv:2112.09164, 2021
-
[4]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[5]
Very deep vaes generalize autoregressive models and can outperform them on images,
Rewon Child. Very Deep V AEs Generalize Autoregressive Models and Can Outperform Them on Images. arXiv:2011.10650, 2021
-
[6]
Katherine Crowson. A V A Linear Probe. https://twitter.com/RiversHaveWings/status/ 1472346186728173568?s=20&t=T-HRr3Gw5HRGjQaMDtRe3A, 2021
work page 2021
-
[7]
CLIP guided diffusion HQ 256x256
Katherine Crowson. CLIP guided diffusion HQ 256x256. https://colab.research.google.com/ drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj, 2021
work page 2021
-
[8]
CLIP Guided Diffusion 512x512, Secondary Model Method
Katherine Crowson. CLIP Guided Diffusion 512x512, Secondary Model Method. https://twitter. com/RiversHaveWings/status/1462859669454536711, 2021
-
[9]
Katherine Crowson. v-diffusion. https://github.com/crowsonkb/v-diffusion-pytorch, 2021
work page 2021
-
[10]
arXiv preprint arXiv:2006.06666 , eprint =
Karan Desai and Justin Johnson. VirTex: Learning Visual Representations from Textual Annotations. arXiv:2006.06666, 2020
-
[11]
Diffusion Models Beat GANs on Image Synthesis
Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, 2021
work page internal anchor Pith review arXiv 2021
-
[12]
Cogview: Mastering text-to-image generation via transformers
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering Text-to-Image Generation via Transformers. arXiv:2105.13290, 2021
-
[13]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
- [14]
-
[15]
Sharpness-Aware Minimization for Efficiently Improving Generalization
Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-Aware Minimization for Efficiently Improving Generalization. arXiv:2010.01412, 2020. 19
work page internal anchor Pith review arXiv 2010
-
[16]
CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP, 2022
Andreas Fürst, Elisabeth Rumetshofer, Viet Thuong Tran, Hubert Ramsauer, Fei Tang, Johannes Lehner, D P Kreil, Michael K Kopp, Günter Klambauer, Angela Bitto-Nemling, and Sepp Hochreiter. CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP, 2022. URL https://openreview. net/forum?id=qw674L9PfQE
work page 2022
-
[17]
Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A- Scene: Scene-Based Text-to-Image Generation with Human Priors. arXiv:2203.13131, 2022
-
[18]
Stylegan-nada: Clip-guided domain adaptation of image generators
Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators. arXiv:2108.00946, 2021
-
[19]
Federico A. Galatolo, Mario G. C. A. Cimino, and Gigliola Vaglini. Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search. arXiv:2102.01645, 2021
-
[20]
Multimodal Neurons in Artificial Neural Networks , year =
Gabriel Goh, Nick Cammarata † , Chelsea V oss† , Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal Neurons in Artificial Neural Networks. Distill, 2021. doi: 10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons
-
[21]
Generative Adversarial Networks
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. arXiv:1406.2661, 2014
work page internal anchor Pith review arXiv 2014
-
[22]
Vector quantized diffusion model for text-to-image synthesis, 2022
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector Quantized Diffusion Model for Text-to-Image Synthesis. arXiv:2111.14822, 2021
-
[23]
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Advances in Neural Information Processing Systems 30 (NIPS 2017) , 2017
work page 2017
-
[24]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications , 2021. URL https://openreview.net/ forum?id=qw8AKxfYbI
work page 2021
-
[25]
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models.arXiv:2006.11239, 2020
work page internal anchor Pith review arXiv 2006
-
[26]
Cascaded diffusion models for high fidelity image generation
Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded Diffusion Models for High Fidelity Image Generation. arXiv:2106.15282, 2021
-
[27]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[28]
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2014
work page internal anchor Pith review arXiv 2014
-
[29]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. arXiv:1711.05101, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[30]
DALL ·E 2 Preview - Risks and Limitations
Pamela Mishkin, Lama Ahmad, Miles Brundage, Gretchen Krueger, and Girish Sastry. DALL ·E 2 Preview - Risks and Limitations. 2022. URL https://github.com/openai/dalle-2-preview/ blob/main/system-card.md
work page 2022
-
[31]
Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. SLIP: Self-supervision meets Language-Image Pre-training. arXiv:2112.12750, 2021
-
[32]
Ryan Murdock. The Big Sleep. https://twitter.com/advadnoun/status/ 1351038053033406468, 2021. 20
work page 2021
-
[33]
A V A: A large-scale database for aesthetic visual analysis
Naila Murray, Luca Marchesotti, and Florent Perronnin. A V A: A large-scale database for aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition , pages 2408–2415,
work page 2012
-
[34]
doi: 10.1109/CVPR.2012.6247954
-
[35]
Improved Denoising Diffusion Probabilistic Models
Alex Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. arXiv:2102.09672, 2021
work page internal anchor Pith review arXiv 2021
-
[36]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741, 2021
-
[37]
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. arXiv:2103.17249, 2021
-
[38]
Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space, November 1901. URL https://doi.org/10.1080/14786440109462720
-
[39]
Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. arXiv:2111.15640, 2021
-
[40]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, 2021
-
[41]
Zero-Shot Text-to-Image Generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. arXiv:2102.12092, 2021
-
[42]
Generating Diverse High-Fidelity Images with VQ-VAE-2
Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating Diverse High-Fidelity Images with VQ-VAE-2. arXiv:1906.00446, 2019
-
[43]
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752, 2021
-
[44]
Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image Super-Resolution via Iterative Refinement. arXiv:2104.07636, 2021
-
[45]
Learning Visual Representations with Caption Annotations
Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus. Learning Visual Representations with Caption Annotations. arXiv:2008.01392, 2020
-
[46]
How Much Can CLIP Benefit Vision-and-Language Tasks?
Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How Much Can CLIP Benefit Vision-and-Language Tasks? arXiv:2107.06383, 2021
-
[47]
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv:1503.03585, 2015
-
[48]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. arXiv:2010.02502, 2020
-
[49]
Improved techniques for training Score-Based generative models
Yang Song and Stefano Ermon. Improved Techniques for Training Score-Based Generative Models. arXiv:2006.09011, 2020
-
[50]
Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis
Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:2008.05865, 2020
-
[51]
NVAE: A Deep Hierarchical Variational Autoencoder
Arash Vahdat and Jan Kautz. NVAE: A Deep Hierarchical Variational Autoencoder. arXiv:2007.03898, 2020
-
[52]
Score-based Generative Modeling in Latent Space
Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based Generative Modeling in Latent Space. In Neural Information Processing Systems (NeurIPS), 2021
-
[53]
Neural Discrete Representation Learning
Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. arXiv:1711.00937, 2017
-
[54]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762, 2017
-
[55]
CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP
Zihao Wang, Wei Liu, Qian He, Xinglong Wu, and Zili Yi. CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP. arXiv:2203.00386, 2022
-
[56]
Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. GAN Inversion: A Survey. arXiv:2101.05278, 2021
-
[57]
AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks
Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. arXiv:1711.10485, 2017
-
[58]
Improving text-to-image synthesis using contrastive learning
Hui Ye, Xiulong Yang, Martin Takac, Rajshekhar Sunderraman, and Shihao Ji. Improving Text-to-Image Synthesis Using Contrastive Learning. arXiv:2107.02423, 2021
-
[59]
Cross-Modal Contrastive Learning for Text-to-Image Generation
Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-Modal Contrastive Learning for Text-to-Image Generation. arXiv:2101.04702, 2021
-
[60]
Designing a Practical Degradation Model for Deep Blind Image Super-Resolution
Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a Practical Degradation Model for Deep Blind Image Super-Resolution. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2021. doi: 10.1109/iccv48922.2021.00475. URL http://dx.doi.org/10.1109/ICCV48922.2021.00475
-
[61]
Contrastive Learning of Medical Visual Representations from Paired Images and Text
Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv:2010.00747, 2020
-
[62]
LAFITE: Towards language-free training for text-to-image generation
Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. LAFITE: Towards Language-Free Training for Text-to-Image Generation. arXiv:2111.13792, 2021
-
[64]
DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis
Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:1904.01310, 2019
A Linear Probes for Evaluations
For our evaluations, we leverage two new linear probes on top of a CLIP ViT-L/14 [13] model. To automate aesthetic quality evaluations, we follow the procedure used b...