pith. machine review for the scientific record.

arxiv: 2204.06125 · v1 · submitted 2022-04-13 · 💻 cs.CV

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Alex Nichol, Casey Chu, Mark Chen, Prafulla Dhariwal

Pith reviewed 2026-05-10 16:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image generation · CLIP embeddings · diffusion models · hierarchical generation · image diversity · zero-shot image editing · contrastive representations

The pith

A two-stage model that first generates a CLIP image embedding from text and then decodes it into pixels yields more diverse images than direct text-to-image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes separating text-to-image generation into a prior step that produces a CLIP image embedding from the caption and a decoder step that turns the embedding into an image. This structure is meant to increase the variety of outputs for a given caption while keeping the images realistic and aligned with the text. The approach also supports creating multiple versions of an image that retain its core meaning and appearance but differ in details not captured by the embedding. Experiments compare diffusion and autoregressive models for the prior and find diffusion versions more efficient and effective.

Core claim

The paper claims that explicitly generating CLIP image embeddings via a prior conditioned on text, then decoding those embeddings with a diffusion model, improves image diversity with minimal loss in photorealism and caption similarity relative to direct generation methods. The joint CLIP space further enables zero-shot language-guided image manipulations and controlled variations that preserve semantics and style.

What carries the argument

A prior model that maps text captions to CLIP image embeddings, paired with a diffusion decoder that maps those embeddings to images.
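The prior-plus-decoder factorization can be sketched with toy stand-ins; the dimensions, weight matrices, and noise scales below are illustrative placeholders, not the paper's actual networks.

```python
import numpy as np

# Toy sketch of the two-stage pipeline (placeholders, not the real models).
rng = np.random.default_rng(0)
EMB = 8                                  # stand-in for CLIP's embedding width
W_DEC = rng.normal(size=(EMB, 16))       # fixed toy "decoder" weights

def clip_text_embed(caption):
    # Stand-in for CLIP's text encoder: a caption-seeded unit vector.
    g = np.random.default_rng(sum(caption.encode()))
    v = g.normal(size=EMB)
    return v / np.linalg.norm(v)

def prior_sample(text_emb, n):
    # Stand-in for the diffusion prior: image embeddings near the text embedding.
    z = text_emb + 0.1 * rng.normal(size=(n, EMB))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def decoder_sample(img_emb):
    # Stand-in for the diffusion decoder: embedding -> 4x4 "image", with noise
    # supplying the non-essential details the embedding omits.
    return (img_emb @ W_DEC + 0.05 * rng.normal(size=16)).reshape(4, 4)

text_emb = clip_text_embed("a corgi playing a trumpet")
img_embs = prior_sample(text_emb, n=3)            # stage 1: text -> embeddings
images = [decoder_sample(e) for e in img_embs]    # stage 2: embeddings -> pixels
```

The point of the sketch is the interface, not the internals: the only channel between text and pixels is the sampled image embedding, so everything the embedding omits is free to vary in the decoder.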

If this is right

  • Decoders can produce multiple variations of an image that keep its semantics and style while changing details absent from the embedding.
  • The joint CLIP embedding space supports language-guided image manipulations without additional training.
  • Diffusion models for the prior are computationally more efficient and produce higher-quality samples than autoregressive alternatives.
  • Explicit generation of the image representation allows the system to vary non-essential details without altering core content.
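The variations mechanism in the first bullet can be illustrated the same way, with hypothetical stand-in encoders and decoders: one fixed embedding, decoded repeatedly with fresh noise, yields samples that agree on the embedding but differ in detail.

```python
import numpy as np

# Toy illustration of image variations; all model internals are stand-ins.
rng = np.random.default_rng(1)
EMB = 8
W_ENC = rng.normal(size=(16, EMB))   # toy "CLIP image encoder" weights
W_DEC = rng.normal(size=(EMB, 16))   # toy "decoder" weights

def clip_image_embed(img):
    z = img.reshape(-1) @ W_ENC
    return z / np.linalg.norm(z)

def decoder_sample(emb, seed):
    # Same embedding, different decoder noise -> different non-essential details.
    noise = np.random.default_rng(seed).normal(scale=0.05, size=16)
    return (emb @ W_DEC + noise).reshape(4, 4)

source = rng.normal(size=(4, 4))
emb = clip_image_embed(source)
variations = [decoder_sample(emb, seed) for seed in range(3)]
# Every variation is decoded from the same embedding; only the noise differs.
```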

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The separation of prior and decoder could be tested on other conditional generation tasks where intermediate representations might improve controllability.
  • Leveraging a fixed pre-trained embedding space may allow independent scaling or fine-tuning of the prior and decoder for specialized domains.
  • Similar hierarchical designs might reduce the parameter count needed in the final decoder by offloading semantic encoding to the prior.

Load-bearing premise

A CLIP image embedding contains enough semantic and stylistic information for a decoder to reconstruct varied high-quality images while safely omitting non-essential details.

What would settle it

Train a single-stage text-to-image model and the two-stage prior-plus-decoder model on identical data, then check whether the two-stage version shows measurably higher diversity scores without a corresponding drop in photorealism or caption-matching scores.
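A minimal sketch of such a comparison, using toy feature vectors in place of real generated images; the metrics below are simple stand-ins for the paper's human evaluations and CLIP scores.

```python
import numpy as np

def pairwise_diversity(feats):
    # Mean pairwise Euclidean distance between per-caption sample features;
    # higher means more diverse. A stand-in for human diversity comparisons.
    n = len(feats)
    dists = [np.linalg.norm(feats[i] - feats[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

def caption_similarity(img_feats, text_feat):
    # Mean cosine similarity between image features and the caption feature.
    sims = img_feats @ text_feat / (
        np.linalg.norm(img_feats, axis=1) * np.linalg.norm(text_feat))
    return float(np.mean(sims))

# Toy data standing in for generated-image features: the "two-stage" samples
# are deliberately more spread out around the caption feature.
rng = np.random.default_rng(0)
text_feat = rng.normal(size=8)
one_stage = text_feat + 0.05 * rng.normal(size=(4, 8))
two_stage = text_feat + 0.30 * rng.normal(size=(4, 8))

div_gap = pairwise_diversity(two_stage) - pairwise_diversity(one_stage)
sim_gap = (caption_similarity(one_stage, text_feat)
           - caption_similarity(two_stage, text_feat))
# The experiment would ask whether div_gap is large while sim_gap stays small.
```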

read the original abstract

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that a hierarchical two-stage model for text-conditional image generation—a prior that produces CLIP image embeddings from text captions, followed by a decoder that generates images conditioned on those embeddings—improves output diversity relative to direct text-to-image baselines while incurring only minimal losses in photorealism and caption similarity. Diffusion models are used for the decoder and both autoregressive and diffusion models are tested for the prior (with the latter found more efficient and higher-quality); the approach also enables image variations that preserve semantics and style plus zero-shot language-guided manipulations via the shared CLIP space.

Significance. If the reported empirical comparisons hold, the result is significant for text-to-image synthesis: by factoring high-level semantics and style into the CLIP embedding and letting the decoder supply omitted pixel-level details, the method demonstrably trades off diversity against quality in a controllable way. The direct ablations comparing diffusion versus autoregressive priors and varying decoder conditioning supply concrete evidence for the stated trade-off and the practical utility of zero-shot editing.

minor comments (3)
  1. [Abstract] The claim of 'empirical improvements' and 'minimal loss' would be easier to evaluate if the abstract itself included the key quantitative metrics (e.g., FID, CLIP similarity, diversity scores) and the primary baselines against which the gains are measured.
  2. [§3 (Method)] The precise conditioning mechanism and noise schedule used in the diffusion prior are described at a high level; adding the exact hyper-parameter values or a reference to the supplementary material would improve reproducibility.
  3. [Table 2 / Figure 4] The diversity and photorealism metrics for the hierarchical model versus the direct baseline are presented, but the caption does not explicitly state the number of samples used for each metric or whether the same random seeds were shared across conditions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation of minor revision. The referee summary accurately captures the core contributions of our hierarchical prior-decoder approach using CLIP latents for improved diversity in text-conditional image generation.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical two-stage architecture (text-to-CLIP-embedding prior + embedding-to-image decoder) whose central claims rest on reported experimental comparisons, ablations, and qualitative results rather than any closed-form derivation. No equations or steps reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations; the CLIP embedding is treated as an external pretrained representation, and diversity/photorealism trade-offs are measured against independent baselines. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim depends on the representational power of pre-trained CLIP embeddings and the ability of diffusion decoders to reconstruct images from them; both are imported from earlier work rather than derived here.

free parameters (1)
  • diffusion prior and decoder hyperparameters
    Specific choices of step count, noise schedule, and conditioning strength that are tuned during training.
axioms (2)
  • domain assumption CLIP embeddings capture the semantics and style needed for high-quality image reconstruction and variation
    Invoked when the prior is trained to predict embeddings and the decoder is conditioned on them.
  • domain assumption Diffusion models can decode from CLIP latents without direct text conditioning
    Required for the decoder stage to function as described.
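As one concrete instance of the noise-schedule free parameter, the cosine schedule of Nichol & Dhariwal (2021) is a common choice in this line of work; whether this paper uses exactly this schedule is not established here, so treat it as an illustrative example.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    # Cumulative signal level alpha-bar(t) under the cosine noise schedule
    # (Nichol & Dhariwal, 2021); the paper's actual schedule may differ.
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

# alpha-bar falls monotonically from 1 at t=0 toward 0 at t=T.
schedule = [cosine_alpha_bar(t, 1000) for t in range(0, 1001, 250)]
```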

pith-pipeline@v0.9.0 · 5452 in / 1318 out tokens · 70215 ms · 2026-05-10T16:51:09.590636+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  2. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    stat.ML 2023-10 unverdicted novelty 8.0

    Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.

  3. Consistency Models

    cs.LG 2023-03 conditional novelty 8.0

    Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

  4. MusicLM: Generating Music From Text

    cs.SD 2023-01 conditional novelty 8.0

    MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

  5. Building Normalizing Flows with Stochastic Interpolants

    cs.LG 2022-09 conditional novelty 8.0

    Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.

  6. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  7. Prompt-to-Prompt Image Editing with Cross Attention Control

    cs.CV 2022-08 unverdicted novelty 8.0

    Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

  8. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    cs.CV 2022-08 unverdicted novelty 8.0

    Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

  9. Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...

  10. Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

  11. Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations

    cs.RO 2026-05 unverdicted novelty 7.0

    CoDi decomposes the multi-agent diffusion score into pre-trained single-agent policies plus a gradient-free cost guidance term to generate coordinated behavior from single-agent data alone.

  12. Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

    math.OC 2026-05 unverdicted novelty 7.0

    Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

  13. Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

    cs.LG 2026-05 unverdicted novelty 7.0

    Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.

  14. Hyperbolic Concept Bottleneck Models

    cs.LG 2026-05 unverdicted novelty 7.0

    HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.

  15. Hyperbolic Concept Bottleneck Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Hyperbolic Concept Bottleneck Models reformulate concept activations as test-time geometric containment in hyperbolic entailment cones to produce sparse, hierarchy-aware signals without extra supervision.

  16. A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions

    cs.LG 2026-05 unverdicted novelty 7.0

    FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.

  17. A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...

  18. LEGO: LoRA-Enabled Generator-Oriented Framework for Synthetic Image Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    LEGO uses multiple generator-specific LoRA modules modulated by an MLP and fused with attention to detect synthetic images, achieving better performance than prior methods while using under 10% of the training data.

  19. Generative Modeling with Orbit-Space Particle Flow Matching

    cs.GR 2026-05 unverdicted novelty 7.0

    OGPP is a particle flow-matching method using orbit-space canonicalization and geometric paths that achieves lower error and fewer steps than prior approaches on 3D benchmarks.

  20. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 conditional novelty 7.0

    Frequency analysis of smooth robot actions bounds denoising error to low-frequency modes, enabling a sub-1% parameter 3D diffusion policy with two-step inference that reaches SOTA on manipulation benchmarks.

  21. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  22. ResetEdit: Precise Text-guided Editing of Generated Image via Resettable Starting Latent

    cs.CV 2026-04 unverdicted novelty 7.0

    ResetEdit embeds a recoverable discrepancy signal during image generation in diffusion models to reconstruct an approximate original latent for high-fidelity text-guided editing.

  23. CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping

    cs.CV 2026-04 unverdicted novelty 7.0

    CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 ...

  24. Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes

    cs.CV 2026-04 unverdicted novelty 7.0

    Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.

  25. Long-Text-to-Image Generation via Compositional Prompt Decomposition

    cs.CV 2026-04 unverdicted novelty 7.0

    PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...

  26. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  27. Marrying Text-to-Motion Generation with Skeleton-Based Action Recognition

    cs.CV 2026-04 unverdicted novelty 7.0

    CoAMD unifies skeleton-based action recognition and text-to-motion generation through autoregressive diffusion guided by a multi-modal recognizer, reporting SOTA results on 13 benchmarks for four tasks.

  28. Quality-Aware Calibration for AI-Generated Image Detection in the Wild

    cs.CV 2026-04 conditional novelty 7.0

    QuAD aggregates quality-weighted detection scores from near-duplicates of an image to raise balanced accuracy by about 8% over simple averaging on state-of-the-art detectors.

  29. Step-level Denoising-time Diffusion Alignment with Multiple Objectives

    cs.LG 2026-04 unverdicted novelty 7.0

    MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.

  30. Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling

    cs.CR 2026-04 unverdicted novelty 7.0

    SET detects input-level backdoors in T2I diffusion models by learning a benign cross-attention response space from clean samples and flagging deviations under multi-scale perturbations.

  31. HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement

    cs.CV 2026-04 unverdicted novelty 7.0

    A diffusion-based pipeline creates a 27M-annotation dataset of object placements that outperforms human annotations and baselines on image editing tasks, then distills it into a fast model.

  32. NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity

    cs.LG 2026-04 unverdicted novelty 7.0

    NeuroFlow is the first unified flow model for bidirectional visual encoding and decoding from neural activity using NeuroVAE and cross-modal flow matching.

  33. SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

    cs.CV 2026-04 conditional novelty 7.0

    SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.

  34. Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.

  35. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  36. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  37. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  38. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    cs.CV 2023-10 unverdicted novelty 7.0

    Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.

  39. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  40. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  41. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  42. DreamFusion: Text-to-3D using 2D Diffusion

    cs.CV 2022-09 accept novelty 7.0

    Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.

  43. Diffusion Posterior Sampling for General Noisy Inverse Problems

    stat.ML 2022-09 unverdicted novelty 7.0

    Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.

  44. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  45. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV 2026-05 unverdicted novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  46. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  47. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.

  48. From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

    cs.CV 2026-05 unverdicted novelty 6.0

    The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...

  49. APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment

    cs.CV 2026-05 unverdicted novelty 6.0

    APEX is an assumption-free image quality metric based on sliced Wasserstein distance applied to open-vocabulary embeddings from CLIP and DINOv2, showing better robustness and stability than FID and similar baselines.

  50. P-Guide: Parameter-Efficient Prior Steering for Single-Pass CFG Inference

    cs.AI 2026-05 unverdicted novelty 6.0

    P-Guide achieves single-pass classifier-free guidance in flow matching by modulating the initial latent state and is equivalent to standard CFG under a first-order approximation while cutting latency by half.

  51. Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping

    cs.CV 2026-05 conditional novelty 6.0

    Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.

  52. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  53. Statistical Consistency and Generalization of Contrastive Representation Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Contrastive representation learning is statistically consistent for optimal retrieval and admits generalization bounds of order O(1/m + 1/sqrt(n)) supervised and O(1/sqrt(m) + 1/sqrt(n)) self-supervised that benefit f...

  54. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 unverdicted novelty 6.0

    Hydra-DP3 achieves SOTA visuomotor performance with under 1% of prior 3D diffusion policy parameters by using frequency analysis to justify a lightweight decoder and two-step DDIM inference.

  55. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 unverdicted novelty 6.0

    Hydra-DP3 is a lightweight 3D diffusion policy that uses frequency analysis of smooth action trajectories to enable two-step DDIM inference and achieves state-of-the-art results with under 1% of prior parameters.

  56. Prop-Chromeleon: Adaptive Haptic Props in Mixed Reality through Generative Artificial Intelligence

    cs.HC 2026-05 unverdicted novelty 6.0

    A generative-AI pipeline dynamically generates and anchors virtual assets to match the shape of physical props, enabling adaptive passive haptics in MR that users rate higher in realism, immersion, and enjoyment than ...

  57. Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

    cs.CV 2026-04 unverdicted novelty 6.0

    DynamiCS dynamically scales semantic clusters per training epoch to reduce VLM pre-training compute while improving accuracy on long-tail concepts compared to static or flattening baselines.

  58. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  59. Improving Graph Few-shot Learning with Hyperbolic Space and Denoising Diffusion

    cs.LG 2026-04 unverdicted novelty 6.0

    IMPRESS improves graph few-shot learning by learning representations in hyperbolic space and using denoising diffusion to better approximate target distributions from few support samples.

  60. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 126 Pith papers · 14 internal anchors

  1. [1]

    Cm3: A causal masked multimodal model of the internet

    Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. CM3: A Causal Masked Multimodal Model of the Internet. arXiv:2201.07520, 2022

  2. [2]

    Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models, 2022

    Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models. CoRR, abs/2201.06503, 2022. URL https://arxiv.org/abs/2201.06503

  3. [3]

    High Fidelity Visualization of What Your Self-Supervised Representation Knows About

    Florian Bordes, Randall Balestriero, and Pascal Vincent. High Fidelity Visualization of What Your Self-Supervised Representation Knows About. arXiv:2112.09164, 2021

  4. [4]

    Language models are few-shot learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  5. [5]

    Very deep vaes generalize autoregressive models and can outperform them on images

    Rewon Child. Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images. arXiv:2011.10650, 2021

  6. [6]

    AVA Linear Probe

    Katherine Crowson. AVA Linear Probe. https://twitter.com/RiversHaveWings/status/1472346186728173568?s=20&t=T-HRr3Gw5HRGjQaMDtRe3A, 2021

  7. [7]

    CLIP guided diffusion HQ 256x256

    Katherine Crowson. CLIP guided diffusion HQ 256x256. https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj, 2021

  8. [8]

    CLIP Guided Diffusion 512x512, Secondary Model Method

    Katherine Crowson. CLIP Guided Diffusion 512x512, Secondary Model Method. https://twitter.com/RiversHaveWings/status/1462859669454536711, 2021

  9. [9]

    v-diffusion

    Katherine Crowson. v-diffusion. https://github.com/crowsonkb/v-diffusion-pytorch, 2021

  10. [10]

    VirTex: Learning Visual Representations from Textual Annotations

    Karan Desai and Justin Johnson. VirTex: Learning Visual Representations from Textual Annotations. arXiv:2006.06666, 2020

  11. [11]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, 2021

  12. [12]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering Text-to-Image Generation via Transformers. arXiv:2105.13290, 2021

  13. [13]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929, 2020

  14. [14]

    Esser, R

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis. arXiv:2012.09841, 2020

  15. [15]

    Sharpness-aware minimization for efficiently improving generalization.arXiv preprint arXiv:2010.01412,

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-Aware Minimization for Efficiently Improving Generalization. arXiv:2010.01412, 2020. 19

  16. [16]

    CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP, 2022

    Andreas Fürst, Elisabeth Rumetshofer, Viet Thuong Tran, Hubert Ramsauer, Fei Tang, Johannes Lehner, D P Kreil, Michael K Kopp, Günter Klambauer, Angela Bitto-Nemling, and Sepp Hochreiter. CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP, 2022. URL https://openreview. net/forum?id=qw674L9PfQE

  17. [17]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A- Scene: Scene-Based Text-to-Image Generation with Human Priors. arXiv:2203.13131, 2022

  18. [18]

    Stylegan-nada: Clip-guided domain adaptation of image generators

    Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators. arXiv:2108.00946, 2021

  19. [19]

    Galatolo, Mario G

    Federico A. Galatolo, Mario G. C. A. Cimino, and Gigliola Vaglini. Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search. arXiv:2102.01645, 2021

  20. [20]

    Multimodal neurons in artificial neural networks

    Gabriel Goh, Nick Cammarata † , Chelsea V oss† , Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal Neurons in Artificial Neural Networks. Distill, 2021. doi: 10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons

  21. [21]

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. arXiv:1406.2661, 2014

  22. [22]

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector Quantized Diffusion Model for Text-to-Image Synthesis. arXiv:2111.14822, 2021

  23. [23]

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017

  24. [24]

    Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8AKxfYbI

  25. [25]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. arXiv:2006.11239, 2020

  26. [26]

    Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded Diffusion Models for High Fidelity Image Generation. arXiv:2106.15282, 2021

  27. [27]

    Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014

  28. [28]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2014

  29. [29]

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. arXiv:1711.05101, 2017

  30. [30]

    Pamela Mishkin, Lama Ahmad, Miles Brundage, Gretchen Krueger, and Girish Sastry. DALL·E 2 Preview - Risks and Limitations. 2022. URL https://github.com/openai/dalle-2-preview/blob/main/system-card.md

  31. [31]

    Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. SLIP: Self-supervision meets Language-Image Pre-training. arXiv:2112.12750, 2021

  32. [32]

    Ryan Murdock. The Big Sleep. https://twitter.com/advadnoun/status/1351038053033406468, 2021

  33. [33]

    Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2408–2415, 2012. doi: 10.1109/CVPR.2012.6247954

  35. [35]

    Alex Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. arXiv:2102.09672, 2021

  36. [36]

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741, 2021

  37. [37]

    Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. arXiv:2103.17249, 2021

  38. [38]

    Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559–572, 1901. URL https://doi.org/10.1080/14786440109462720

  39. [39]

    Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. arXiv:2111.15640, 2021

  40. [40]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, 2021

  41. [41]

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. arXiv:2102.12092, 2021

  42. [42]

    Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating Diverse High-Fidelity Images with VQ-VAE-2. arXiv:1906.00446, 2019

  43. [43]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752, 2021

  44. [44]

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image Super-Resolution via Iterative Refinement. arXiv:2104.07636, 2021

  45. [45]

    Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus. Learning Visual Representations with Caption Annotations. arXiv:2008.01392, 2020

  46. [46]

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How Much Can CLIP Benefit Vision-and-Language Tasks? arXiv:2107.06383, 2021

  47. [47]

    Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv:1503.03585, 2015

  48. [48]

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. arXiv:2010.02502, 2020

  49. [49]

    Yang Song and Stefano Ermon. Improved Techniques for Training Score-Based Generative Models. arXiv:2006.09011, 2020

  50. [50]

    Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:2008.05865, 2020

  51. [51]

    Arash Vahdat and Jan Kautz. NVAE: A Deep Hierarchical Variational Autoencoder. arXiv:2007.03898, 2020

  52. [52]

    Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based Generative Modeling in Latent Space. In Neural Information Processing Systems (NeurIPS), 2021

  53. [53]

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. arXiv:1711.00937, 2017

  54. [54]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762, 2017

  55. [55]

    Zihao Wang, Wei Liu, Qian He, Xinglong Wu, and Zili Yi. CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP. arXiv:2203.00386, 2022

  56. [56]

    Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. GAN Inversion: A Survey. arXiv:2101.05278, 2021

  57. [57]

    Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. arXiv:1711.10485, 2017

  58. [58]

    Hui Ye, Xiulong Yang, Martin Takac, Rajshekhar Sunderraman, and Shihao Ji. Improving Text-to-Image Synthesis Using Contrastive Learning. arXiv:2107.02423, 2021

  59. [59]

    Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-Modal Contrastive Learning for Text-to-Image Generation. arXiv:2101.04702, 2021

  60. [60]

    Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a Practical Degradation Model for Deep Blind Image Super-Resolution. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. doi: 10.1109/iccv48922.2021.00475. URL http://dx.doi.org/10.1109/ICCV48922.2021.00475

  61. [61]

    Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv:2010.00747, 2020

  62. [62]

    Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. LAFITE: Towards Language-Free Training for Text-to-Image Generation. arXiv:2111.13792, 2021

  63. [63]

    Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative Visual Manipulation on the Natural Image Manifold. arXiv:1609.03552, 2016

  64. [64]

    Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:1904.01310, 2019

A Linear Probes for Evaluations

For our evaluations, we leverage two new linear probes on top of a CLIP ViT-L/14 [13] model. To automate aesthetic quality evaluations, we follow the procedure used b...
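A linear probe of the kind mentioned in the appendix is a single linear head fit on frozen embeddings. The sketch below is illustrative only: it uses random synthetic features as a stand-in for CLIP ViT-L/14 image embeddings, and closed-form ridge regression as one plausible fitting method (the paper's truncated text does not specify how the probe is fit, and every name here is hypothetical).

```python
import numpy as np

def fit_linear_probe(features, targets, l2=1e-3):
    """Fit a linear head w on frozen features via ridge regression:
    w = (X^T X + l2 I)^{-1} X^T y, with a bias column appended to X."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ targets)

def predict(w, features):
    """Apply the fitted probe to new frozen features."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ w

# Synthetic stand-in for (embedding, score) pairs; in the paper's
# setting the features would come from a frozen CLIP ViT-L/14.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(512, 64))
true_w = rng.normal(size=64)                      # hypothetical ground truth
y_train = X_train @ true_w + 0.1 * rng.normal(size=512)

w = fit_linear_probe(X_train, y_train)
X_test = rng.normal(size=(100, 64))
y_test = X_test @ true_w
```

Because the backbone stays frozen, only the (d+1)-dimensional head is learned, which is why such probes are cheap to train and commonly used as automated quality evaluators.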