pith. machine review for the scientific record.

arxiv: 2204.06125 · v1 · submitted 2022-04-13 · 💻 cs.CV

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Alex Nichol, Casey Chu, Mark Chen, Prafulla Dhariwal

Pith reviewed 2026-05-10 16:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image generation · CLIP embeddings · diffusion models · hierarchical generation · image diversity · zero-shot image editing · contrastive representations

The pith

A two-stage model that first generates a CLIP image embedding from text and then decodes it into pixels yields more diverse images than direct text-to-image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes separating text-to-image generation into a prior step that produces a CLIP image embedding from the caption and a decoder step that turns the embedding into an image. This structure is meant to increase the variety of outputs for a given caption while keeping the images realistic and aligned with the text. The approach also supports creating multiple versions of an image that retain its core meaning and appearance but differ in details not captured by the embedding. Experiments compare diffusion and autoregressive models for the prior and find diffusion versions more efficient and effective.

Core claim

The paper claims that explicitly generating CLIP image embeddings via a prior conditioned on text, then decoding those embeddings with a diffusion model, improves image diversity with minimal loss in photorealism and caption similarity relative to direct generation methods. The joint CLIP space further enables zero-shot language-guided image manipulations and controlled variations that preserve semantics and style.

What carries the argument

A prior model that maps text captions to CLIP image embeddings, paired with a diffusion decoder that maps those embeddings to images.
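The prior-plus-decoder factorization can be sketched with toy stand-ins; the dimensions, weight matrices, and noise scales below are illustrative placeholders, not the paper's actual networks.

```python
import numpy as np

# Toy sketch of the two-stage pipeline (placeholders, not the real models).
rng = np.random.default_rng(0)
EMB = 8                                  # stand-in for CLIP's embedding width
W_DEC = rng.normal(size=(EMB, 16))       # fixed toy "decoder" weights

def clip_text_embed(caption):
    # Stand-in for CLIP's text encoder: a caption-seeded unit vector.
    g = np.random.default_rng(sum(caption.encode()))
    v = g.normal(size=EMB)
    return v / np.linalg.norm(v)

def prior_sample(text_emb, n):
    # Stand-in for the diffusion prior: image embeddings near the text embedding.
    z = text_emb + 0.1 * rng.normal(size=(n, EMB))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def decoder_sample(img_emb):
    # Stand-in for the diffusion decoder: embedding -> 4x4 "image", with noise
    # supplying the non-essential details the embedding omits.
    return (img_emb @ W_DEC + 0.05 * rng.normal(size=16)).reshape(4, 4)

text_emb = clip_text_embed("a corgi playing a trumpet")
img_embs = prior_sample(text_emb, n=3)            # stage 1: text -> embeddings
images = [decoder_sample(e) for e in img_embs]    # stage 2: embeddings -> pixels
```

The point of the sketch is the interface, not the internals: the only channel between text and pixels is the sampled image embedding, so everything the embedding omits is free to vary in the decoder.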

If this is right

  • Decoders can produce multiple variations of an image that keep its semantics and style while changing details absent from the embedding.
  • The joint CLIP embedding space supports language-guided image manipulations without additional training.
  • Diffusion models for the prior are computationally more efficient and produce higher-quality samples than autoregressive alternatives.
  • Explicit generation of the image representation allows the system to vary non-essential details without altering core content.
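The variations mechanism in the first bullet can be illustrated the same way, with hypothetical stand-in encoders and decoders: one fixed embedding, decoded repeatedly with fresh noise, yields samples that agree on the embedding but differ in detail.

```python
import numpy as np

# Toy illustration of image variations; all model internals are stand-ins.
rng = np.random.default_rng(1)
EMB = 8
W_ENC = rng.normal(size=(16, EMB))   # toy "CLIP image encoder" weights
W_DEC = rng.normal(size=(EMB, 16))   # toy "decoder" weights

def clip_image_embed(img):
    z = img.reshape(-1) @ W_ENC
    return z / np.linalg.norm(z)

def decoder_sample(emb, seed):
    # Same embedding, different decoder noise -> different non-essential details.
    noise = np.random.default_rng(seed).normal(scale=0.05, size=16)
    return (emb @ W_DEC + noise).reshape(4, 4)

source = rng.normal(size=(4, 4))
emb = clip_image_embed(source)
variations = [decoder_sample(emb, seed) for seed in range(3)]
# Every variation is decoded from the same embedding; only the noise differs.
```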

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The separation of prior and decoder could be tested on other conditional generation tasks where intermediate representations might improve controllability.
  • Leveraging a fixed pre-trained embedding space may allow independent scaling or fine-tuning of the prior and decoder for specialized domains.
  • Similar hierarchical designs might reduce the parameter count needed in the final decoder by offloading semantic encoding to the prior.

Load-bearing premise

A CLIP image embedding contains enough semantic and stylistic information for a decoder to reconstruct varied high-quality images while safely omitting non-essential details.

What would settle it

Train a single-stage text-to-image model and the two-stage prior-plus-decoder model on identical data, then check whether the two-stage version shows measurably higher diversity scores without a corresponding drop in photorealism or caption-matching scores.
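A minimal sketch of such a comparison, using toy feature vectors in place of real generated images; the metrics below are simple stand-ins for the paper's human evaluations and CLIP scores.

```python
import numpy as np

def pairwise_diversity(feats):
    # Mean pairwise Euclidean distance between per-caption sample features;
    # higher means more diverse. A stand-in for human diversity comparisons.
    n = len(feats)
    dists = [np.linalg.norm(feats[i] - feats[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

def caption_similarity(img_feats, text_feat):
    # Mean cosine similarity between image features and the caption feature.
    sims = img_feats @ text_feat / (
        np.linalg.norm(img_feats, axis=1) * np.linalg.norm(text_feat))
    return float(np.mean(sims))

# Toy data standing in for generated-image features: the "two-stage" samples
# are deliberately more spread out around the caption feature.
rng = np.random.default_rng(0)
text_feat = rng.normal(size=8)
one_stage = text_feat + 0.05 * rng.normal(size=(4, 8))
two_stage = text_feat + 0.30 * rng.normal(size=(4, 8))

div_gap = pairwise_diversity(two_stage) - pairwise_diversity(one_stage)
sim_gap = (caption_similarity(one_stage, text_feat)
           - caption_similarity(two_stage, text_feat))
# The experiment would ask whether div_gap is large while sim_gap stays small.
```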

read the original abstract

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that a hierarchical two-stage model for text-conditional image generation—a prior that produces CLIP image embeddings from text captions, followed by a decoder that generates images conditioned on those embeddings—improves output diversity relative to direct text-to-image baselines while incurring only minimal losses in photorealism and caption similarity. Diffusion models are used for the decoder and both autoregressive and diffusion models are tested for the prior (with the latter found more efficient and higher-quality); the approach also enables image variations that preserve semantics and style plus zero-shot language-guided manipulations via the shared CLIP space.

Significance. If the reported empirical comparisons hold, the result is significant for text-to-image synthesis: by factoring high-level semantics and style into the CLIP embedding and letting the decoder supply omitted pixel-level details, the method demonstrably trades off diversity against quality in a controllable way. The direct ablations comparing diffusion versus autoregressive priors and varying decoder conditioning supply concrete evidence for the stated trade-off and the practical utility of zero-shot editing.

minor comments (3)
  1. [Abstract] The claim of 'empirical improvements' and 'minimal loss' would be easier to evaluate if the abstract itself included the key quantitative metrics (e.g., FID, CLIP similarity, diversity scores) and the primary baselines against which the gains are measured.
  2. [§3 (Method)] The precise conditioning mechanism and noise schedule used in the diffusion prior are described at a high level; adding the exact hyper-parameter values or a reference to the supplementary material would improve reproducibility.
  3. [Table 2 / Figure 4] The diversity and photorealism metrics for the hierarchical model versus the direct baseline are presented, but the caption does not explicitly state the number of samples used for each metric or whether the same random seeds were shared across conditions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation of minor revision. The referee summary accurately captures the core contributions of our hierarchical prior-decoder approach using CLIP latents for improved diversity in text-conditional image generation.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical two-stage architecture (text-to-CLIP-embedding prior + embedding-to-image decoder) whose central claims rest on reported experimental comparisons, ablations, and qualitative results rather than any closed-form derivation. No equations or steps reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations; the CLIP embedding is treated as an external pretrained representation, and diversity/photorealism trade-offs are measured against independent baselines. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim depends on the representational power of pre-trained CLIP embeddings and the ability of diffusion decoders to reconstruct images from them; both are imported from earlier work rather than derived here.

free parameters (1)
  • diffusion prior and decoder hyperparameters
    Specific choices of step count, noise schedule, and conditioning strength that are tuned during training.
axioms (2)
  • domain assumption CLIP embeddings capture the semantics and style needed for high-quality image reconstruction and variation
    Invoked when the prior is trained to predict embeddings and the decoder is conditioned on them.
  • domain assumption Diffusion models can decode from CLIP latents without direct text conditioning
    Required for the decoder stage to function as described.
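As one concrete instance of the noise-schedule free parameter, the cosine schedule of Nichol & Dhariwal (2021) is a common choice in this line of work; whether this paper uses exactly this schedule is not established here, so treat it as an illustrative example.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    # Cumulative signal level alpha-bar(t) under the cosine noise schedule
    # (Nichol & Dhariwal, 2021); the paper's actual schedule may differ.
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

# alpha-bar falls monotonically from 1 at t=0 toward 0 at t=T.
schedule = [cosine_alpha_bar(t, 1000) for t in range(0, 1001, 250)]
```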

pith-pipeline@v0.9.0 · 5452 in / 1318 out tokens · 70215 ms · 2026-05-10T16:51:09.590636+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  2. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    stat.ML 2023-10 unverdicted novelty 8.0

    Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.

  3. Consistency Models

    cs.LG 2023-03 conditional novelty 8.0

    Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

  4. MusicLM: Generating Music From Text

    cs.SD 2023-01 conditional novelty 8.0

    MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

  5. Building Normalizing Flows with Stochastic Interpolants

    cs.LG 2022-09 conditional novelty 8.0

    Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.

  6. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  7. Prompt-to-Prompt Image Editing with Cross Attention Control

    cs.CV 2022-08 unverdicted novelty 8.0

    Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

  8. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    cs.CV 2022-08 unverdicted novelty 8.0

    Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

  9. Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...

  10. Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

  11. Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations

    cs.RO 2026-05 unverdicted novelty 7.0

    CoDi decomposes the multi-agent diffusion score into pre-trained single-agent policies plus a gradient-free cost guidance term to generate coordinated behavior from single-agent data alone.

  12. Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

    math.OC 2026-05 unverdicted novelty 7.0

    Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

  13. Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

    cs.LG 2026-05 unverdicted novelty 7.0

    Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.

  14. Hyperbolic Concept Bottleneck Models

    cs.LG 2026-05 unverdicted novelty 7.0

    HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.

  15. Hyperbolic Concept Bottleneck Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Hyperbolic Concept Bottleneck Models reformulate concept activations as test-time geometric containment in hyperbolic entailment cones to produce sparse, hierarchy-aware signals without extra supervision.

  16. A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions

    cs.LG 2026-05 unverdicted novelty 7.0

    FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.

  17. A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...

  18. LEGO: LoRA-Enabled Generator-Oriented Framework for Synthetic Image Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    LEGO uses multiple generator-specific LoRA modules modulated by an MLP and fused with attention to detect synthetic images, achieving better performance than prior methods while using under 10% of the training data.

  19. Generative Modeling with Orbit-Space Particle Flow Matching

    cs.GR 2026-05 unverdicted novelty 7.0

    OGPP is a particle flow-matching method using orbit-space canonicalization and geometric paths that achieves lower error and fewer steps than prior approaches on 3D benchmarks.

  20. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 conditional novelty 7.0

    Frequency analysis of smooth robot actions bounds denoising error to low-frequency modes, enabling a sub-1% parameter 3D diffusion policy with two-step inference that reaches SOTA on manipulation benchmarks.

  21. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  22. ResetEdit: Precise Text-guided Editing of Generated Image via Resettable Starting Latent

    cs.CV 2026-04 unverdicted novelty 7.0

    ResetEdit embeds a recoverable discrepancy signal during image generation in diffusion models to reconstruct an approximate original latent for high-fidelity text-guided editing.

  23. CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping

    cs.CV 2026-04 unverdicted novelty 7.0

    CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 ...

  24. Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes

    cs.CV 2026-04 unverdicted novelty 7.0

    Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.

  25. Long-Text-to-Image Generation via Compositional Prompt Decomposition

    cs.CV 2026-04 unverdicted novelty 7.0

    PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...

  26. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  27. Marrying Text-to-Motion Generation with Skeleton-Based Action Recognition

    cs.CV 2026-04 unverdicted novelty 7.0

    CoAMD unifies skeleton-based action recognition and text-to-motion generation through autoregressive diffusion guided by a multi-modal recognizer, reporting SOTA results on 13 benchmarks for four tasks.

  28. Quality-Aware Calibration for AI-Generated Image Detection in the Wild

    cs.CV 2026-04 conditional novelty 7.0

    QuAD aggregates quality-weighted detection scores from near-duplicates of an image to raise balanced accuracy by about 8% over simple averaging on state-of-the-art detectors.

  29. Step-level Denoising-time Diffusion Alignment with Multiple Objectives

    cs.LG 2026-04 unverdicted novelty 7.0

    MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.

  30. Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling

    cs.CR 2026-04 unverdicted novelty 7.0

    SET detects input-level backdoors in T2I diffusion models by learning a benign cross-attention response space from clean samples and flagging deviations under multi-scale perturbations.

  31. HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement

    cs.CV 2026-04 unverdicted novelty 7.0

    A diffusion-based pipeline creates a 27M-annotation dataset of object placements that outperforms human annotations and baselines on image editing tasks, then distills it into a fast model.

  32. NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity

    cs.LG 2026-04 unverdicted novelty 7.0

    NeuroFlow is the first unified flow model for bidirectional visual encoding and decoding from neural activity using NeuroVAE and cross-modal flow matching.

  33. SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

    cs.CV 2026-04 conditional novelty 7.0

    SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.

  34. Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.

  35. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  36. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  37. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  38. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    cs.CV 2023-10 unverdicted novelty 7.0

    Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.

  39. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  40. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  41. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  42. DreamFusion: Text-to-3D using 2D Diffusion

    cs.CV 2022-09 accept novelty 7.0

    Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.

  43. Diffusion Posterior Sampling for General Noisy Inverse Problems

    stat.ML 2022-09 unverdicted novelty 7.0

    Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.

  44. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  45. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV 2026-05 unverdicted novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  46. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  47. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.

  48. From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

    cs.CV 2026-05 unverdicted novelty 6.0

    The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...

  49. APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment

    cs.CV 2026-05 unverdicted novelty 6.0

    APEX is an assumption-free image quality metric based on sliced Wasserstein distance applied to open-vocabulary embeddings from CLIP and DINOv2, showing better robustness and stability than FID and similar baselines.

  50. P-Guide: Parameter-Efficient Prior Steering for Single-Pass CFG Inference

    cs.AI 2026-05 unverdicted novelty 6.0

    P-Guide achieves single-pass classifier-free guidance in flow matching by modulating the initial latent state and is equivalent to standard CFG under a first-order approximation while cutting latency by half.

  51. Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping

    cs.CV 2026-05 conditional novelty 6.0

    Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.

  52. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  53. Statistical Consistency and Generalization of Contrastive Representation Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Contrastive representation learning is statistically consistent for optimal retrieval and admits generalization bounds of order O(1/m + 1/sqrt(n)) supervised and O(1/sqrt(m) + 1/sqrt(n)) self-supervised that benefit f...

  54. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 unverdicted novelty 6.0

    Hydra-DP3 achieves SOTA visuomotor performance with under 1% of prior 3D diffusion policy parameters by using frequency analysis to justify a lightweight decoder and two-step DDIM inference.

  55. Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

    cs.RO 2026-05 unverdicted novelty 6.0

    Hydra-DP3 is a lightweight 3D diffusion policy that uses frequency analysis of smooth action trajectories to enable two-step DDIM inference and achieves state-of-the-art results with under 1% of prior parameters.

  56. Prop-Chromeleon: Adaptive Haptic Props in Mixed Reality through Generative Artificial Intelligence

    cs.HC 2026-05 unverdicted novelty 6.0

    A generative-AI pipeline dynamically generates and anchors virtual assets to match the shape of physical props, enabling adaptive passive haptics in MR that users rate higher in realism, immersion, and enjoyment than ...

  57. Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

    cs.CV 2026-04 unverdicted novelty 6.0

    DynamiCS dynamically scales semantic clusters per training epoch to reduce VLM pre-training compute while improving accuracy on long-tail concepts compared to static or flattening baselines.

  58. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  59. Improving Graph Few-shot Learning with Hyperbolic Space and Denoising Diffusion

    cs.LG 2026-04 unverdicted novelty 6.0

    IMPRESS improves graph few-shot learning by learning representations in hyperbolic space and using denoising diffusion to better approximate target distributions from few support samples.

  60. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 126 Pith papers · 14 internal anchors

  1. [1]

    Cm3: A causal masked multimodal model of the internet

    Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. CM3: A Causal Masked Multimodal Model of the Internet. arXiv:2201.07520, 2022

  2. [2]

    Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models, 2022

    Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models. CoRR, abs/2201.06503, 2022. URL https://arxiv.org/abs/2201.06503

  3. [3]

    High Fidelity Visualization of What Your Self-Supervised Representation Knows About

    Florian Bordes, Randall Balestriero, and Pascal Vincent. High Fidelity Visualization of What Your Self-Supervised Representation Knows About. arXiv:2112.09164, 2021

  4. [4]

    Language models are few-shot learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  5. [5]

    Very deep vaes generalize autoregressive models and can outperform them on images

    Rewon Child. Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images. arXiv:2011.10650, 2021

  6. [6]

    AVA Linear Probe

    Katherine Crowson. AVA Linear Probe. https://twitter.com/RiversHaveWings/status/1472346186728173568?s=20&t=T-HRr3Gw5HRGjQaMDtRe3A, 2021

  7. [7]

    CLIP guided diffusion HQ 256x256

    Katherine Crowson. CLIP guided diffusion HQ 256x256. https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj, 2021

  8. [8]

    CLIP Guided Diffusion 512x512, Secondary Model Method

    Katherine Crowson. CLIP Guided Diffusion 512x512, Secondary Model Method. https://twitter.com/RiversHaveWings/status/1462859669454536711, 2021

  9. [9]

    v-diffusion

    Katherine Crowson. v-diffusion. https://github.com/crowsonkb/v-diffusion-pytorch, 2021

  10. [10]

    VirTex: Learning Visual Representations from Textual Annotations

    Karan Desai and Justin Johnson. VirTex: Learning Visual Representations from Textual Annotations. arXiv:2006.06666, 2020

  11. [11]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, 2021

  12. [12]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering Text-to-Image Generation via Transformers. arXiv:2105.13290, 2021

  13. [13]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929, 2020

  14. [14]

    Esser, R

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis. arXiv:2012.09841, 2020

  15. [15]

    Sharpness-aware minimization for efficiently improving generalization.arXiv preprint arXiv:2010.01412,

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-Aware Minimization for Efficiently Improving Generalization. arXiv:2010.01412, 2020. 19

  16. [16]

    CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP, 2022

    Andreas Fürst, Elisabeth Rumetshofer, Viet Thuong Tran, Hubert Ramsauer, Fei Tang, Johannes Lehner, D P Kreil, Michael K Kopp, Günter Klambauer, Angela Bitto-Nemling, and Sepp Hochreiter. CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP, 2022. URL https://openreview. net/forum?id=qw674L9PfQE

  17. [17]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A- Scene: Scene-Based Text-to-Image Generation with Human Priors. arXiv:2203.13131, 2022

  18. [18]

    Stylegan-nada: Clip-guided domain adaptation of image generators

    Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators. arXiv:2108.00946, 2021

  19. [19]

    Galatolo, Mario G

    Federico A. Galatolo, Mario G. C. A. Cimino, and Gigliola Vaglini. Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search. arXiv:2102.01645, 2021

  20. [20]

    Multimodal neurons in artificial neural networks

    Gabriel Goh, Nick Cammarata † , Chelsea V oss† , Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal Neurons in Artificial Neural Networks. Distill, 2021. doi: 10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons

  21. [21]

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. arXiv:1406.2661, 2014

  22. [22]

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector Quantized Diffusion Model for Text-to-Image Synthesis. arXiv:2111.14822, 2021

  23. [23]

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017

  24. [24]

    Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8AKxfYbI

  25. [25]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. arXiv:2006.11239, 2020

  26. [26]

    Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded Diffusion Models for High Fidelity Image Generation. arXiv:2106.15282, 2021

  27. [27]

    Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014

  28. [28]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2014

  29. [29]

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. arXiv:1711.05101, 2017

  30. [30]

    Pamela Mishkin, Lama Ahmad, Miles Brundage, Gretchen Krueger, and Girish Sastry. DALL·E 2 Preview - Risks and Limitations. 2022. URL https://github.com/openai/dalle-2-preview/blob/main/system-card.md

  31. [31]

    Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. SLIP: Self-supervision meets Language-Image Pre-training. arXiv:2112.12750, 2021

  32. [32]

    Ryan Murdock. The Big Sleep. https://twitter.com/advadnoun/status/1351038053033406468, 2021

  33. [33]

    Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2408–2415, 2012. doi: 10.1109/CVPR.2012.6247954

  35. [35]

    Alex Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. arXiv:2102.09672, 2021

  36. [36]

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741, 2021

  37. [37]

    Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. arXiv:2103.17249, 2021

  38. [38]

    Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559–572, 1901. URL https://doi.org/10.1080/14786440109462720

  39. [39]

    Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. arXiv:2111.15640, 2021

  40. [40]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, 2021

  41. [41]

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. arXiv:2102.12092, 2021

  42. [42]

    Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating Diverse High-Fidelity Images with VQ-VAE-2. arXiv:1906.00446, 2019

  43. [43]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752, 2021

  44. [44]

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image Super-Resolution via Iterative Refinement. arXiv:2104.07636, 2021

  45. [45]

    Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus. Learning Visual Representations with Caption Annotations. arXiv:2008.01392, 2020

  46. [46]

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How Much Can CLIP Benefit Vision-and-Language Tasks? arXiv:2107.06383, 2021

  47. [47]

    Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv:1503.03585, 2015

  48. [48]

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. arXiv:2010.02502, 2020

  49. [49]

    Yang Song and Stefano Ermon. Improved Techniques for Training Score-Based Generative Models. arXiv:2006.09011, 2020

  50. [50]

    Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:2008.05865, 2020

  51. [51]

    Arash Vahdat and Jan Kautz. NVAE: A Deep Hierarchical Variational Autoencoder. arXiv:2007.03898, 2020

  52. [52]

    Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based Generative Modeling in Latent Space. In Neural Information Processing Systems (NeurIPS), 2021

  53. [53]

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. arXiv:1711.00937, 2017

  54. [54]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762, 2017

  55. [55]

    Zihao Wang, Wei Liu, Qian He, Xinglong Wu, and Zili Yi. CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP. arXiv:2203.00386, 2022

  56. [56]

    Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. GAN Inversion: A Survey. arXiv:2101.05278, 2021

  57. [57]

    Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. arXiv:1711.10485, 2017

  58. [58]

    Hui Ye, Xiulong Yang, Martin Takac, Rajshekhar Sunderraman, and Shihao Ji. Improving Text-to-Image Synthesis Using Contrastive Learning. arXiv:2107.02423, 2021

  59. [59]

    Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-Modal Contrastive Learning for Text-to-Image Generation. arXiv:2101.04702, 2021

  60. [60]

    Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a Practical Degradation Model for Deep Blind Image Super-Resolution. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. doi: 10.1109/iccv48922.2021.00475. URL http://dx.doi.org/10.1109/ICCV48922.2021.00475

  61. [61]

    Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv:2010.00747, 2020

  62. [62]

    Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. LAFITE: Towards Language-Free Training for Text-to-Image Generation. arXiv:2111.13792, 2021

  63. [63]

    Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative Visual Manipulation on the Natural Image Manifold. arXiv:1609.03552, 2016

  64. [64]

    Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:1904.01310, 2019

A Linear Probes for Evaluations

For our evaluations, we leverage two new linear probes on top of a CLIP ViT-L/14 [13] model. To automate aesthetic quality evaluations, we follow the procedure used b...
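A linear probe of the kind mentioned in the appendix is a single linear head fit on frozen embeddings. The sketch below is illustrative only: it uses random synthetic features as a stand-in for CLIP ViT-L/14 image embeddings, and closed-form ridge regression as one plausible fitting method (the paper's truncated text does not specify how the probe is fit, and every name here is hypothetical).

```python
import numpy as np

def fit_linear_probe(features, targets, l2=1e-3):
    """Fit a linear head w on frozen features via ridge regression:
    w = (X^T X + l2 I)^{-1} X^T y, with a bias column appended to X."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ targets)

def predict(w, features):
    """Apply the fitted probe to new frozen features."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ w

# Synthetic stand-in for (embedding, score) pairs; in the paper's
# setting the features would come from a frozen CLIP ViT-L/14.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(512, 64))
true_w = rng.normal(size=64)                      # hypothetical ground truth
y_train = X_train @ true_w + 0.1 * rng.normal(size=512)

w = fit_linear_probe(X_train, y_train)
X_test = rng.normal(size=(100, 64))
y_test = X_test @ true_w
```

Because the backbone stays frozen, only the (d+1)-dimensional head is learned, which is why such probes are cheap to train and commonly used as automated quality evaluators.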