
arxiv: 2112.10752 · v2 · submitted 2021-12-20 · 💻 cs.CV

High-Resolution Image Synthesis with Latent Diffusion Models

Pith reviewed 2026-05-11 21:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords latent diffusion models · image synthesis · denoising diffusion · autoencoders · conditional generation · image inpainting · super-resolution · cross-attention

The pith

Diffusion models trained in the latent space of pretrained autoencoders generate high-resolution images with substantially lower computational cost than pixel-space versions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that diffusion models can be moved from raw pixel space into the compressed latent space of a fixed pretrained autoencoder. This shift preserves enough visual structure for high-fidelity synthesis while substantially cutting the cost of training and sampling. Readers care because the same denoising process now accepts conditioning through cross-attention layers, so one architecture serves as a flexible generator conditioned on text, bounding boxes, or other inputs. The result is practical high-resolution synthesis on limited hardware and state-of-the-art results on inpainting.

Core claim

By applying the diffusion process to the latent representations of a pretrained autoencoder rather than to pixels, and by inserting cross-attention layers to accept arbitrary conditioning inputs, latent diffusion models reach a favorable trade-off between model capacity and perceptual fidelity while requiring far fewer resources than pixel-based diffusion models.

What carries the argument

The latent diffusion model (LDM), which runs the forward and reverse diffusion processes on the lower-dimensional latent codes produced by a fixed pretrained autoencoder and uses cross-attention to incorporate conditioning signals such as text or spatial layouts.
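To make the mechanism concrete, here is a minimal sketch of one noising step and the noise-prediction loss computed on latent codes rather than pixels. It is an illustration under stated assumptions, not the paper's implementation: `toy_encode` is a hypothetical average-pooling stand-in for the pretrained encoder, the linear beta schedule and its endpoints are assumed, and the "network" is a placeholder that predicts zeros.

```python
import numpy as np

def toy_encode(image, f):
    """Hypothetical stand-in for a pretrained encoder: average-pool by factor f."""
    h, w, c = image.shape
    return image.reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal coefficients for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_latent(z0, t, alpha_bar, rng):
    """Sample z_t from q(z_t | z_0); return it with the noise that was added."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def denoising_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))           # toy "pixel-space" input
z0 = toy_encode(image, f=8)                 # latent of shape (32, 32, 3)
alpha_bar = make_alpha_bar(1000)
zt, eps = noise_latent(z0, t=500, alpha_bar=alpha_bar, rng=rng)

# Placeholder for the denoising network: it predicts zero noise everywhere.
loss = denoising_loss(np.zeros_like(eps), eps)
```

The point of the sketch is only that every tensor the denoiser touches has the latent shape, which is where the cost saving comes from; a real system would replace the pooling encoder and the zero predictor with trained networks.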

If this is right

  • Training and inference of powerful diffusion models become feasible on limited hardware while retaining visual quality.
  • High-resolution synthesis is performed directly in a convolutional manner without patch-wise processing.
  • Image inpainting reaches state-of-the-art results.
  • Unconditional generation, semantic scene synthesis, and super-resolution remain competitive with prior pixel-space methods.
  • Conditioning on text, bounding boxes, or other inputs is handled by cross-attention layers within a single architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of perceptual compression from the generative diffusion stage suggests similar latent-space training could be tested on other modalities once suitable autoencoders exist.
  • If the autoencoder is kept fixed, future improvements in autoencoder quality would immediately lift the upper bound on LDM fidelity without changing the diffusion architecture.
  • The approach implies that many existing pixel-based diffusion pipelines could be accelerated by first training a domain-specific autoencoder rather than scaling the diffusion model itself.

Load-bearing premise

The latent codes from the pretrained autoencoder already contain enough perceptual detail and spatial structure that the diffusion model can recover high-fidelity images without uncorrectable artifacts.

What would settle it

High-resolution outputs that consistently exhibit uncorrectable artifacts or visible loss of fine detail relative to pixel-based diffusion models of comparable training effort would show the assumption does not hold.

Original abstract

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that applying diffusion models in the latent space of pretrained autoencoders enables efficient high-resolution image synthesis. Latent diffusion models (LDMs) reduce spatial dimensions via a pretrained autoencoder with KL or VQ regularization (with downsampling factors such as f=4, 8, and 16) while preserving detail, incorporate cross-attention for conditioning on text or bounding boxes, and achieve new state-of-the-art inpainting results along with competitive performance on unconditional generation, semantic synthesis, and super-resolution, all at substantially lower computational cost than pixel-space DMs. Public code is released.

Significance. If the results hold, this has high significance for making diffusion-based synthesis practical at high resolutions with limited resources. Strengths include the public code release, direct ablations on autoencoder factors, and quantitative FID/LPIPS tables on ImageNet, Places2, and ADE20K that support the efficiency and quality claims. The stress-test concern on latent representation fidelity does not land as a load-bearing issue, since the f=8 model empirically recovers high-frequency detail without uncorrectable artifacts and matches or exceeds pixel DM quality.

major comments (2)
  1. [Ablations on autoencoder downsampling factors] Ablations on autoencoder downsampling factors: the claim of reaching a 'near-optimal point' between complexity reduction and detail preservation for f=8 rests on FID comparisons, but the exact spatial cost reduction (stated as ~1/64) should be derived explicitly from the UNet channel dimensions and latent resolution to allow verification of the efficiency gain.
  2. [Cross-attention layers] Cross-attention for conditioning: while cross-attention enables flexible conditioning, the manuscript does not include an ablation against simpler conditioning mechanisms (e.g., concatenation or FiLM), which would isolate whether this architecture choice is necessary for the flexibility and high-resolution claims.
minor comments (2)
  1. [Abstract] The abstract's reference to 'hundreds of GPU days' for pixel-space DM optimization would be strengthened by citing the specific prior works being compared.
  2. [Methods] Notation for the latent variable z and the diffusion forward/reverse processes in latent space could be clarified with an explicit equation reference or diagram early in the methods section.
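The spatial part of the ~1/64 figure raised in major comment 1 can be checked with a few lines of arithmetic. The sketch below counts spatial positions only; the function names are hypothetical, and it deliberately ignores channel widths and attention layers, so it is not a FLOP count for any particular UNet.

```python
def latent_shape(height, width, f):
    """Spatial size of the latent grid for downsampling factor f."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by the downsampling factor")
    return height // f, width // f

def spatial_reduction(height, width, f):
    """Ratio of pixel-space positions to latent-space positions."""
    latent_h, latent_w = latent_shape(height, width, f)
    return (height * width) / (latent_h * latent_w)

for f in (4, 8, 16):
    print(f, latent_shape(256, 256, f), spatial_reduction(256, 256, f))
# 4 (64, 64) 16.0
# 8 (32, 32) 64.0
# 16 (16, 16) 256.0
```

For a convolutional layer with fixed channel widths, cost grows roughly in proportion to the number of spatial positions, so f=8 gives the factor of 64 quoted in the report. Self-attention layers scale differently with the number of positions, which is one reason an explicit derivation from the actual architecture would still be useful.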

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comments on our efficiency claims and conditioning design. We address each major comment below and have incorporated revisions to improve clarity.

Point-by-point responses
  1. Referee: Ablations on autoencoder downsampling factors: the claim of reaching a 'near-optimal point' between complexity reduction and detail preservation for f=8 rests on FID comparisons, but the exact spatial cost reduction (stated as ~1/64) should be derived explicitly from the UNet channel dimensions and latent resolution to allow verification of the efficiency gain.

    Authors: We agree that an explicit derivation would strengthen the presentation. The ~1/64 factor follows directly from reducing the spatial resolution of the UNet input by f=8 in each dimension (latent size H/8 × W/8), which quadratically reduces the number of spatial operations. Accounting for the UNet channel schedule (starting at 320 channels with doubling in down-blocks), the overall computational cost of the diffusion process scales by this factor relative to pixel-space models. In the revised manuscript we will add a short derivation in Section 3.1 (or an appendix table) that computes the reduction from the exact latent resolution and channel dimensions, enabling straightforward verification. revision: yes

  2. Referee: Cross-attention for conditioning: while cross-attention enables flexible conditioning, the manuscript does not include an ablation against simpler conditioning mechanisms (e.g., concatenation or FiLM), which would isolate whether this architecture choice is necessary for the flexibility and high-resolution claims.

    Authors: We appreciate the suggestion. Cross-attention is chosen because it supports conditioning inputs of arbitrary length and structure (e.g., variable-length text token sequences or unordered sets of bounding-box embeddings) without requiring fixed-dimensional inputs, which concatenation or FiLM layers would necessitate. This flexibility is central to the high-resolution text-to-image and layout-to-image results. A full retraining ablation is outside the scope of a minor revision, but we will add a concise discussion paragraph in Section 3.2 explaining the architectural rationale and contrasting it with simpler alternatives, thereby addressing the concern without misrepresenting the design. revision: partial
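The variable-length argument in the second response can be illustrated with a small NumPy sketch of single-head cross-attention. All dimensions, weight matrices, and the random "prompt" embeddings below are assumptions chosen for illustration; a real model would learn the projections and obtain the context from a trained conditioning encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(features, context, w_q, w_k, w_v):
    """features: (n_positions, d_feat); context: (n_tokens, d_ctx)."""
    q = features @ w_q                      # queries from the latent feature map
    k = context @ w_k                       # keys from the conditioning tokens
    v = context @ w_v                       # values from the conditioning tokens
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v                      # (n_positions, d_attn)

rng = np.random.default_rng(0)
d_feat, d_ctx, d_attn = 64, 48, 32          # illustrative sizes only
w_q = rng.standard_normal((d_feat, d_attn))
w_k = rng.standard_normal((d_ctx, d_attn))
w_v = rng.standard_normal((d_ctx, d_attn))
features = rng.standard_normal((32 * 32, d_feat))   # flattened 32x32 latent grid

short_prompt = rng.standard_normal((5, d_ctx))      # 5 conditioning tokens
long_prompt = rng.standard_normal((77, d_ctx))      # 77 conditioning tokens
out_short = cross_attention(features, short_prompt, w_q, w_k, w_v)
out_long = cross_attention(features, long_prompt, w_q, w_k, w_v)
```

The same projection weights handle both prompts, and the output shape depends only on the number of latent positions, not on the number of conditioning tokens. This shows the mechanism only; it does not substitute for the ablation against concatenation or FiLM that the referee requests.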

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper proposes applying diffusion models in the latent space of a separately pretrained autoencoder, with the central claims of state-of-the-art inpainting and competitive performance on generation tasks supported by direct empirical ablations (e.g., downsampling factors f=4/8/16) and quantitative comparisons to pixel-space baselines on ImageNet, Places2, and ADE20K. No load-bearing step reduces a result or prediction to its own inputs by construction, fitted parameters renamed as outputs, or a self-citation chain; the autoencoder training and latent diffusion training are independent stages, and all performance assertions rest on measured metrics rather than theoretical closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that a pretrained autoencoder can compress images into a latent space that retains sufficient detail for diffusion-based generation; no free parameters are introduced in the abstract description, and no new entities are postulated.

axioms (1)
  • domain assumption Pretrained autoencoders produce latent representations that preserve perceptual details necessary for high-fidelity image synthesis.
    Invoked to justify operating diffusion in latent space rather than pixels.

pith-pipeline@v0.9.0 · 5539 in / 1297 out tokens · 47273 ms · 2026-05-11T21:56:11.949352+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    Joint KL yields horizon-free approximation but an information-theoretic lower bound of order Omega(H) for estimation error in autoregressive learning, with matching computationally efficient upper bounds.

  2. What Time Is It? How Data Geometry Makes Time Conditioning Optional for Flow Matching

    cs.LG 2026-05 unverdicted novelty 8.0

    Data geometry makes time identifiable from noisy interpolants at rate O(1/sqrt(d-k)), rendering the time-blindness gap asymptotically negligible relative to coupling variance.

  3. Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    cs.CL 2023-09 unverdicted novelty 8.0

    Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

  4. Constrained Code Generation with Discrete Diffusion

    cs.CL 2026-05 unverdicted novelty 7.0

    Constrained Diffusion for Code (CDC) integrates constraint satisfaction into the reverse denoising process of discrete diffusion models via constraint-aware operators that use optimization and program analysis to stee...

  5. Seeking the Unfamiliar but Memorable: Conceptual Creativity as Meta-Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Creativity is defined as meta-learning where a frozen diffusion creator optimizes candidates for rapid improvement by an adapting appraiser such as an autoencoder or CLIP adapter.

  6. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  7. AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters

    cs.CV 2026-05 conditional novelty 7.0

    AuraMask produces 40 aesthetic anti-facial recognition filters that match or exceed prior adversarial effectiveness and achieve significantly higher user acceptance in a 630-person study.

  8. Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LVO applies optimization-based feature visualization to latent diffusion models after disentangling their representations with sparse autoencoders, yielding recognizable concept images on a fine-tuned Stable Diffusion...

  9. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

  10. Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

  11. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  12. Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.

  13. Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Flow of Truth is the first proactive temporal forensics framework for image-to-video generation that uses a learnable forensic template following pixel motion and a template-guided flow module to decouple motion from content.

  14. Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

    cs.CV 2026-04 unverdicted novelty 7.0

    Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.

  15. Advantage-Guided Diffusion for Model-Based Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

  16. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 7.0

    FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.

  17. Drifting Fields are not Conservative

    cs.LG 2026-04 conditional novelty 7.0

    Drift fields in single-pass generative models are not conservative except for Gaussian kernels; a sharp kernel normalization makes them conservative for any radial kernel while noting that non-conservative fields offe...

  18. SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.

  19. Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models

    cs.CV 2026-03 unverdicted novelty 7.0

    Matched benchmarking reveals FID misleads in few-step regimes under CFG, prompting CLIP-scaled and PickScore-scaled FID and IS variants for better semantic evaluation of one-step image generators.

  20. Latent Generative Solvers for Generalizable Long-Term Physics Simulation

    cs.AI 2026-02 unverdicted novelty 7.0

    LGS pretrained on 2.5M trajectories across 16 systems matches deterministic baselines at one step and halves 20-step error while using far less compute and adapting to held-out higher-resolution flows.

  21. Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models

    cs.LG 2026-02 unverdicted novelty 7.0

    Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.

  22. Visual Diffusion Models are Geometric Solvers

    cs.CV 2025-10 unverdicted novelty 7.0

    Standard visual diffusion models operating in pixel space can approximate solutions to the inscribed square, Steiner tree, and simple polygon problems.

  23. VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference

    cs.CV 2024-11 unverdicted novelty 7.0

    VIPaint uses hierarchical variational inference to optimize a non-Gaussian Markov approximation of the diffusion posterior, enabling better inpainting and inverse problems with pre-trained and latent diffusion models.

  24. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  25. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  26. Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations

    cs.CV 2026-05 unverdicted novelty 6.0

    Memorization in diffusion models is detected via latent update norm instability and mitigated on-the-fly, yielding AUC over 0.999 and zero memorization rate on Stable Diffusion 1.4.

  27. UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.

  28. PaintCopilot: Modeling Painting as Autonomous Artistic Continuation

    cs.CV 2026-05 unverdicted novelty 6.0

    PaintCopilot models painting as an open-ended autoregressive process that predicts coherent brushstrokes from partial canvas observations using a ViT target predictor, flow-matching stroke generator, and VAE region sampler.

  29. Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment

    cs.LG 2026-05 unverdicted novelty 6.0

    REPA-P aligns intermediate representations in diffusion models with physical states using first-principles PDE residuals to accelerate convergence and boost out-of-distribution robustness on PDE tasks.

  30. Going PLACES: Participatory Localized Red Teaming for Text-to-Image Safety in the Global South

    cs.CY 2026-05 unverdicted novelty 6.0

    A participatory red-teaming project in the Global South created the PLACES dataset of 26k T2I failure examples that reveal unique cultural and linguistic harms missed by existing safety frameworks.

  31. A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle

    cs.CV 2026-05 unverdicted novelty 6.0

    The work introduces a distributional view of visual mechanistic interpretability that casts the task as KL-minimal optimization and realizes it through a soft-constraint principle implemented with energy-guided diffus...

  32. MIRAGE: Robust multi-modal architectures translate fMRI-to-image models from vision to mental imagery

    q-bio.NC 2026-05 unverdicted novelty 6.0

    MIRAGE achieves state-of-the-art mental image reconstruction from fMRI on the NSD-Imagery benchmark by using a linear backbone with multi-modal text and image features fed to a diffusion model.

  33. Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing

    cs.LG 2026-05 unverdicted novelty 6.0

    Recasts sampling-based nonconvex optimization as smoothed gradient descent to obtain non-asymptotic convergence guarantees and introduces the DIDA annealed algorithm that converges to the global optimum.

  34. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  35. Network-Efficient World Model Token Streaming

    cs.RO 2026-05 unverdicted novelty 6.0

    An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bit...

  36. Probability-Flow Distillation: Exact Wasserstein Gradient Flow for High-Fidelity 3D Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Probability-Flow Distillation exactly matches the Wasserstein gradient flow of the target distribution when distilling 2D diffusion priors into 3D models, yielding higher-fidelity results than SDS or SDI.

  37. AIMIP Phase 1: systematic evaluations of AI weather and climate models

    physics.ao-ph 2026-05 unverdicted novelty 6.0

    AIMIP Phase 1 shows AI models simulate historical climate and El Niño responses as well as traditional models, though some underestimate trends and diverge in generalization tests, with a public dataset released for f...

  38. AIMIP Phase 1: systematic evaluations of AI weather and climate models

    physics.ao-ph 2026-05 unverdicted novelty 6.0

    AIMIP Phase 1 sets up a common experiment and five evaluation criteria for AI atmosphere models forced by historical sea surface temperatures, finding they match conventional models on most metrics but underestimate s...

  39. GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model

    cs.AI 2026-05 unverdicted novelty 6.0

    GCCM prevents shortcut collapse in consistency models for graph prediction by using contrastive negative pairs and input feature perturbation, leading to better performance than deterministic baselines.

  40. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV 2026-05 unverdicted novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  41. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  42. Scale-Aware Adversarial Analysis: A Diagnostic for Generative AI in Multiscale Complex Systems

    cs.LG 2026-05 unverdicted novelty 6.0

    A new scale-aware diagnostic framework shows that unconstrained diffusion generative models exhibit structural freezing and instability instead of smooth physical responses under multiscale perturbations.

  43. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 6.0

    FMRG is a training-free single-trajectory guidance framework for flow-based models that matches or exceeds baselines on reward-guided tasks and inverse problems using as few as 3 NFEs.

  44. Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.

  45. Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    L2P trains per-timestep linear weights on feature trajectories in about 20 seconds to enable aggressive caching in DiT models, delivering up to 4.55x FLOPs reduction with maintained visual quality.

  46. Deepfake Detection Generalization with Diffusion Noise

    cs.CV 2026-04 unverdicted novelty 6.0

    ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.

  47. PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

    cs.CV 2026-04 unverdicted novelty 6.0

    PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...

  48. ELT: Elastic Looped Transformers for Visual Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

  49. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 6.0

    VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...

  50. Drifting Fields are not Conservative

    cs.LG 2026-04 unverdicted novelty 6.0

    Drift fields are not conservative except for Gaussian kernels; sharp normalization makes them conservative for any radial kernel by equating them to score differences of kernel density estimates.

  51. Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.

  52. LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems

    cs.LG 2026-04 conditional novelty 6.0

    A decoupled offline-online framework uses LLMs and latent diffusion models to generate fault scenarios for testing edge-based lane-following models, revealing large robustness drops under conditions like fog.

  53. Diffusion Models Memorize in Training -- and Generalize in Inference

    cs.LG 2026-03 unverdicted novelty 6.0

    Diffusion models overfit denoising loss at intermediate noise but generalize in inference as model error smooths the flow field and sampling paths avoid memorized noisy training data.

  54. Meltdown: Circuits and Bifurcations in Point-Cloud-Conditioned 3D Diffusion Transformers

    cs.LG 2026-02 unverdicted novelty 6.0

    Tiny on-surface point perturbations trigger a bifurcation in the reverse diffusion process of 3D transformers, localized to a low-rank cross-attention write that can be reshaped at test time to suppress the failure.

  55. Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models

    cs.CV 2026-02 conditional novelty 6.0

    Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.

  56. InfiniteDiffusion: Bridging Learned Fidelity and Procedural Utility for Open-World Terrain Generation

    cs.CV 2025-12 unverdicted novelty 6.0

    InfiniteDiffusion adapts diffusion models to produce infinite, seed-consistent, high-fidelity terrain with procedural-noise-like access and 9x speed over prior methods.

  57. FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

    cs.CV 2025-09 conditional novelty 6.0

    FlashEdit delivers real-time localized text-guided image editing under 0.2 seconds via cycle-consistent one-step inversion, background shield, and sparsified spatial cross-attention, achieving over 150x speedup on PIE-Bench.

  58. Flow marching for a generative PDE foundation model

    cs.LG 2025-09 unverdicted novelty 6.0

    Flow Marching jointly samples noise and physical time to learn a velocity field for generative PDE modeling, paired with a latent autoencoder and efficient transformer for large-scale pretraining on 2.5M trajectories.

  59. Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute

    cs.CV 2025-04 unverdicted novelty 6.0

    A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on C...

  60. Pretrained Event Classification Model for High Energy Physics Analysis

    hep-ph 2024-12 unverdicted novelty 6.0

    A GNN pretrained on 120M simulated HEP events generalizes to unseen processes and ATLAS data; fine-tuning boosts accuracy especially with small datasets, with CKA showing preserved encoders but altered intermediate layers.

Reference graph

Works this paper leans on

109 extracted references · 109 canonical work pages · cited by 76 Pith papers · 16 internal anchors

  1. [1]

    NTIRE 2017 challenge on single image super-resolution: Dataset and study

    Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1122–1131. IEEE Computer Society, 2017. 1

  2. [2]

    Wasserstein gan, 2017

    Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan, 2017. 3

  3. [3]

    Large scale GAN training for high fidelity natural image synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Int. Conf. Learn. Represent., 2019. 1, 2, 7, 8, 22, 28

  4. [4]

    Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1209–1218. Computer Vision Foundation / IEEE Computer Society, 2018. 7, 20, 22

  5. [5]

    Extracting training data from large language models

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021. 9

  6. [6]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 1691–1703. PMLR,

  7. [7]

    Wavegrad: Estimating gradients for waveform generation

    Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In ICLR. OpenReview.net, 2021. 1

  8. [8]

    Fast fourier convolution

    Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution. In NeurIPS, 2020. 8

  9. [9]

    Very deep vaes generalize autoregressive models and can outperform them on images

    Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. CoRR, abs/2011.10650, 2020. 3

  10. [10]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019. 3

  11. [11]

    Bin Dai and David P. Wipf. Diagnosing and enhancing VAE models. In ICLR (Poster). OpenReview.net, 2019. 2, 3

  12. [12]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE Computer Society, 2009. 1, 5, 7, 22

  13. [13]

    Ethical considerations of generative ai

    Emily Denton. Ethical considerations of generative ai. AI for Content Creation Workshop, CVPR, 2021. 9

  14. [14]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. 7

  15. [15]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. CoRR, abs/2105.05233, 2021. 1, 2, 3, 4, 6, 7, 8, 18, 22, 25, 26, 28

  16. [16]

    Musings on typicality, 2020

    Sander Dieleman. Musings on typicality, 2020. 1, 3

  17. [17]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to-image generation via transformers. CoRR, abs/2105.13290,

  18. [18]

    Nice: Non-linear independent components estimation, 2015

    Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation, 2015. 3

  19. [19]

    Density estimation using real NVP

    Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. 1, 3

  20. [20]

    Generating images with perceptual similarity metrics based on deep networks

    Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Adv. Neural Inform. Process. Syst., pages 658–666, 2016. 3

  21. [21]

    Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis

    Patrick Esser, Robin Rombach, Andreas Blattmann, and Björn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. CoRR, abs/2108.08827, 2021. 6, 7, 22

  22. [22]

    A note on data biases in generative models

    Patrick Esser, Robin Rombach, and Björn Ommer. A note on data biases in generative models. arXiv preprint arXiv:2012.02516, 2020. 9

  23. [23]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. CoRR, abs/2012.09841, 2020. 2, 3, 4, 6, 7, 21, 22, 29, 34, 36

  24. [24]

    Sex, lies, and videotape: Deep fakes and free speech delusions

    Mary Anne Franks and Ari Ezra Waldman. Sex, lies, and videotape: Deep fakes and free speech delusions. Md. L. Rev., 78:892, 2018. 9

  25. [25]

    Clipdraw: Exploring text-to-drawing synthesis through language-image encoders

    Kevin Frans, Lisa B. Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. ArXiv, abs/2106.14843, 2021. 3

  26. [26]

    Make-a-Scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. CoRR, abs/2203.13131, 2022. 6, 7, 16

  27. [27]

    Generative adversarial networks

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, 2014. 1, 2

  28. [28]

    Improved training of wasserstein gans, 2017

    Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans, 2017. 3

  29. [29]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Adv. Neural Inform. Process. Syst., pages 6626–6637, 2017. 1, 5, 26

  30. [30]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 1, 2, 3, 4, 6, 17

  31. [31]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. CoRR, abs/2106.15282, 2021. 1, 3, 22

  32. [32]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 6, 7, 16, 22, 28, 37, 38

  33. [33]

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 5967–5976. IEEE Computer Society, 2017. 3, 4

  34. [34]

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976,

  35. [35]

    Perceiver IO: A General Architecture for Structured Inputs & Outputs

    Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: A general architecture for structured inputs & outputs. CoRR, abs/2107.14795, 2021. 4

  36. [36]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and João Carreira. Perceiver: General perception with iterative attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Researc...

  37. [37]

    High-resolution complex scene synthesis with transformers

    Manuel Jahn, Robin Rombach, and Björn Ommer. High-resolution complex scene synthesis with transformers. CoRR, abs/2105.06458, 2021. 20, 22, 27

  38. [38]

    Imperfect imaganation: Implications of gans exacerbating biases on facial data augmentation and snapchat selfie lenses

    Niharika Jain, Alberto Olmo, Sailik Sengupta, Lydia Manikonda, and Subbarao Kambhampati. Imperfect imaganation: Implications of gans exacerbating biases on facial data augmentation and snapchat selfie lenses. arXiv preprint arXiv:2001.09528, 2020. 9

  39. [39]

    Progressive Growing of GANs for Improved Quality, Stability, and Variation

    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017. 5, 6

  40. [40]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4401–4410, 2019. 1

  41. [41]

    A style-based generator architecture for generative adversarial networks

    T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 5, 6

  42. [42]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. CoRR, abs/1912.04958,

  43. [43]

    Soft truncation: A universal training technique of score-based diffusion model for high precision score estimation

    Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon. Score matching model for unbounded data score. CoRR, abs/2106.05527, 2021. 6

  44. [44]

    Glow: Generative flow with invertible 1x1 convolutions

    Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, 2018. 3

  45. [45]

    Variational diffusion models

    Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. CoRR, abs/2107.00630, 2021. 1, 3, 16

  46. [46]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR, 2014. 1, 3, 4, 29

  47. [47]

    On fast sampling of diffusion probabilistic models

    Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. CoRR, abs/2106.00132, 2021. 3

  48. [48]

    Diffwave: A versatile diffusion model for audio synthesis

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In ICLR. OpenReview.net, 2021. 1

  49. [49]

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982, 2018. 7, 20, 22

  50. [50]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. CoRR, abs/1904.06991, 2019. 5, 26

  51. [51]

    Microsoft COCO: Common Objects in Context

    Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. 6, 7, 27

  52. [52]

    Region-wise generative adversarial image inpainting for large missing areas

    Yuqing Ma, Xianglong Liu, Shihao Bai, Le-Yi Wang, Aishan Liu, Dacheng Tao, and Edwin Hancock. Region-wise generative adversarial image inpainting for large missing areas. ArXiv, abs/1909.12507, 2019. 9

  53. [53]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021. 1

  54. [54]

    Which Training Methods for GANs do actually Converge?

    Lars M. Mescheder. On the convergence properties of GAN training. CoRR, abs/1801.04406, 2018. 3

  55. [55]

    Unrolled generative adversarial networks

    Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. 3

  56. [56]

    Conditional Generative Adversarial Nets

    Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014. 4

  57. [57]

    Symbolic music generation with diffusion models

    Gautam Mittal, Jesse H. Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation with diffusion models. CoRR, abs/2103.16091, 2021. 1

  58. [58]

    EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning

    Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, and Mehran Ebrahimi. Edgeconnect: Generative image inpainting with adversarial edge learning. ArXiv, abs/1901.00212, 2019. 9

  59. [59]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. CoRR, abs/2112.10741, 2021. 6, 7, 16

  60. [60]

    High-fidelity performance metrics for generative models in pytorch

    Anton Obukhov, Maximilian Seitzer, Po-Wei Wu, Semen Zhydenko, Jonathan Kyl, and Elvis Yu-Jing Lin. High-fidelity performance metrics for generative models in pytorch, 2020. Version: 0.3.0, DOI: 10.5281/zenodo.4957738. 26, 27

  61. [61]

    Semantic image synthesis with spatially-adaptive normalization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 4, 7

  62. [62]

    Semantic image synthesis with spatially-adaptive normalization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 22

  63. [63]

    Dual contradistinctive generative autoencoder

    Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 823–832. Computer Vision Foundation / IEEE, 2021. 6

  64. [64]

    On buggy resizing libraries and surprising subtleties in fid calculation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:2104.11222, 2021. 26

  65. [65]

    Carbon Emissions and Large Neural Network Training

    David A. Patterson, Joseph Gonzalez, Quoc V. Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. CoRR, abs/2104.10350,

  66. [66]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021. 1, 2, 3, 4, 7, 21, 27

  67. [67]

    Generating diverse high-fidelity images with VQ-VAE-2

    Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, pages 14837–14847, 2019. 1, 2, 3, 22

  68. [68]

    Generative adversarial text to image synthesis

    Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016. 4

  69. [69]

    Stochastic backpropagation and approximate inference in deep generative models

    Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML, 2014. 1, 4, 29

  70. [70]

    Network-to-network translation with conditional invertible neural networks

    Robin Rombach, Patrick Esser, and Björn Ommer. Network-to-network translation with conditional invertible neural networks. In NeurIPS, 2020. 3

  71. [71]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234–241. Springer, 2015. 2, 3, 4

  72. [72]

    Image super-resolution via iterative refinement

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. CoRR, abs/2104.07636, 2021. 1, 4, 8, 16, 22, 23, 27

  73. [73]

    Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. CoRR, abs/1701.05517, 2017. 1, 3

  74. [74]

    NVIDIA Developer Blog

    Dave Salvator. NVIDIA Developer Blog. https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32, 2020. 28

  75. [75]

    Noise estimation for generative diffusion models

    Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. CoRR, abs/2104.02600, 2021. 3

  76. [76]

    Projected gans converge faster

    Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. CoRR, abs/2111.01007, 2021. 6

  77. [77]

    A u-net based discriminator for generative adversarial networks

    Edgar Schönfeld, Bernt Schiele, and Anna Khoreva. A u-net based discriminator for generative adversarial networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8204–8213. Computer Vision Foundation / IEEE, 2020. 6

  78. [78]

    Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021. 6, 7

  79. [79]

    Very deep convolutional networks for large-scale image recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, Int. Conf. Learn. Represent., 2015. 29, 43, 44, 45

  80. [80]

    D2C: diffusion-denoising models for few-shot conditional generation

    Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2C: diffusion-denoising models for few-shot conditional generation. CoRR, abs/2106.06819, 2021. 3

Showing first 80 references.