super hub Canonical reference

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Andreas Blattmann, Bjorn Ommer, Dominik Lorenz, Patrick Esser, Robin Rombach · 2022 · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · DOI 10.1109/cvpr52688.2022.01042

Canonical reference. 82% of citing Pith papers cite this work as background.

46 Pith papers citing it

13.8k external citations · external index

Background 82% of classified citations

open at publisher browse 46 citing papers more from Andreas Blattmann

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 9 method 2

citation-polarity summary

background 9 use method 2

claims ledger

background globally consistent 3D scenes that remain stable under large viewpoint changes. fundamentally an ill-posed problem: Simple texts or image inputs fail to provide a comprehensive representation of the entire 3D space. Consequently, inferring massive amounts of missing information for unseen areas while maintaining ge- ometric consistency remains a significant challenge. Deep generative models, particularly diffusion models [13,17,34,35,37], ad- dress this by leveraging strong 2D visual priors. How
background language model to guide the evolution process. Importantly, it works on black-box generation models by requiring only image outputs. Finally, we evaluate PromptEvolver across multiple prompt inversion benchmarks and show that it consistently outperforms competing methods. Keywords:Prompt inversion·Text to image generation 1 Introduction Text-to-image (T2I) diffusion models [21,35,48] have transformed visual con- tent creation, enabling users to generate photorealistic images from natural- langua
background Thisphysics-basedreference ˆImv v guaranteesglobalilluminationconsistencyacross views but lacks photorealistic high-frequency details (e.g. specularities, sky tex- tures), so we use it as a structural guidance signal for the generative stage. Generative Refinement via IC-Light.We refineˆImv v with IC-Light [48], a re- lighting diffusion model adapted from Stable Diffusion [25]. While IC-Light pro- duces photorealistic lighting effects, applying it independently per view breaks multi-view consist
background Subsequent works further enhance controllability and semantic alignment, including Prompt-to-Prompt [11], DiffEdit [7], Imagic [18], Plug-and-Play Diffusion Features [43], and ControlNet [59]. More recent approaches explore richer instruction interfaces and multimodal reasoning, such as MGIE [9] and GenArtist [46], while subject-driven and compositional editing are studied in DreamBooth [35], Blended Diffusion [1], SDEdit [25], and image translation methods such as Detail Fusion GAN [ 20]. Comme
background To validate the effectiveness of our proposed Neural Simulation in recovering real-world data distributions from simulation, we consider the following set of diverse comparative approaches: 1) Classical Simulation(Sim), denoting the canonical raw simulation pipeline without neural-driven refinement; 2) Baseline, a video-to-video generation model built on Stable Diffusion 1.5 [39] with temporal continuity post-processing [54]; 3) Zero-Shot, referring to the backbone model deployed without any sim
background Several methods explicitly incorporate inpainting modules to hallucinate missing details in saturated re- gions [23,60,111]. However, when using limited-capacity generative models, the synthesized content often lacks realism or fine details. 2.3 Generative HDR Advancesingenerativemodeling,includingGANs[4,9,10,22,40,48-50,79,83,106] and diffusion models [3,16,31,34,39,67,74,88-90,96,102,105,107,108,112,113], have shown strong priors for image and video generation. Some approaches learn themapping

authors

Andreas Blattmann Bjorn Ommer Dominik Lorenz Patrick Esser Robin Rombach

co-cited works

representative citing papers

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

cs.CV · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.

SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability

cs.CV · 2026-05-15 · conditional · novelty 7.0

SeamCam quantifies camouflage by computing one minus the highest IoU recoverable from category-conditioned detection proposals against a ground-truth mask, achieving 78.82% agreement with human judgments.

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

Fake3DGS: A Benchmark for 3D Manipulation Detection in Neural Rendering

cs.CV · 2026-04-30 · unverdicted · novelty 7.0

Fake3DGS benchmark shows state-of-the-art 2D fake detectors fail on 3D-manipulated Gaussian Splatting images while a new multi-view coherence method improves detection.

Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench for Text-to-SVG and Image-to-SVG.

Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

Towards In-Context Tone Style Transfer with A Large-Scale Triplet Dataset

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

A new 100k triplet dataset and in-context diffusion framework ICTone enable state-of-the-art tone style transfer by jointly conditioning on content and reference images with scorer-based reward learning.

Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.

Novel View Synthesis as Video Completion

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

Training-Free Refinement of Flow Matching with Divergence-based Sampling

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.

PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.

HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.

MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane

cs.CV · 2026-03-20 · unverdicted · novelty 7.0

MoCA3D formulates monocular 3D box prediction as dense pixel-space tasks using corner heatmaps and depth maps, with a new PAG metric for image-plane evaluation.

DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation

cs.CV · 2026-03-09 · unverdicted · novelty 7.0

DSH-Bench is a benchmark for subject-driven T2I generation that uses hierarchical taxonomy sampling, difficulty/scenario classification, and a new SICS metric showing 9.4% higher human correlation than prior measures.

StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

StreamGVE enables high-quality training-free video editing by converting the task to noise-to-data streaming generation with dual-branch fast sampling, self-attention bridges, cross-attention grounding, source-oriented guidance, and visual prompting.

Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

DS-DiT decouples low-resolution and reference interactions in a siamese diffusion transformer and adds a patch-level weights module plus autoguidance to improve reference-based super-resolution for remote sensing images.

The Learnability Gap in Medical Latent Diffusion

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

Pretrained autoencoders in medical latent diffusion encode discriminative features well for reconstruction but structure their latent spaces in ways that hinder classifier learning, a gap that persists across architectures and is not closed by domain fine-tuning.

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

Diffusion Model as a Generalist Segmentation Learner

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.

Optimizing Diffusion Priors in Image Reconstruction from a Single Observation

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

Combining diffusion priors as a product-of-experts and optimizing exponents via Bayesian evidence maximization enables prior tuning from one observation in inverse imaging problems.

FluSplat: Sparse-View 3D Editing without Test-Time Optimization

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

FluSplat trains a model with geometric alignment constraints on multi-view edits to produce consistent 3D scene edits from sparse views in a single forward pass without test-time optimization.

Allo{SR}$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Allo{SR}^2 rectifies one-step super-resolution trajectories with allomorphic generative flows via SNR initialization, velocity supervision, and self-adversarial matching to deliver state-of-the-art fidelity and realism.

Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

citing papers explorer

Showing 2 of 2 citing papers after filters.

HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits cs.CV · 2026-04-03 · unverdicted · none · ref 23
HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.
ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment cs.CV · 2026-04-12 · unverdicted · none · ref 43
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer