An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Canonical reference. 82% of citing Pith papers cite this work as background.
claims ledger
- background globally consistent 3D scenes that remain stable under large viewpoint changes. fundamentally an ill-posed problem: Simple texts or image inputs fail to provide a comprehensive representation of the entire 3D space. Consequently, inferring massive amounts of missing information for unseen areas while maintaining geometric consistency remains a significant challenge. Deep generative models, particularly diffusion models [13,17,34,35,37], address this by leveraging strong 2D visual priors. How
- background language model to guide the evolution process. Importantly, it works on black-box generation models by requiring only image outputs. Finally, we evaluate PromptEvolver across multiple prompt inversion benchmarks and show that it consistently outperforms competing methods. Keywords: Prompt inversion · Text to image generation 1 Introduction Text-to-image (T2I) diffusion models [21,35,48] have transformed visual content creation, enabling users to generate photorealistic images from natural-langua
- background This physics-based reference $\hat{I}^{mv}_{v}$ guarantees global illumination consistency across views but lacks photorealistic high-frequency details (e.g. specularities, sky textures), so we use it as a structural guidance signal for the generative stage. Generative Refinement via IC-Light. We refine $\hat{I}^{mv}_{v}$ with IC-Light [48], a relighting diffusion model adapted from Stable Diffusion [25]. While IC-Light produces photorealistic lighting effects, applying it independently per view breaks multi-view consist
- background Subsequent works further enhance controllability and semantic alignment, including Prompt-to-Prompt [11], DiffEdit [7], Imagic [18], Plug-and-Play Diffusion Features [43], and ControlNet [59]. More recent approaches explore richer instruction interfaces and multimodal reasoning, such as MGIE [9] and GenArtist [46], while subject-driven and compositional editing are studied in DreamBooth [35], Blended Diffusion [1], SDEdit [25], and image translation methods such as Detail Fusion GAN [20]. Comme
- background To validate the effectiveness of our proposed Neural Simulation in recovering real-world data distributions from simulation, we consider the following set of diverse comparative approaches: 1) Classical Simulation (Sim), denoting the canonical raw simulation pipeline without neural-driven refinement; 2) Baseline, a video-to-video generation model built on Stable Diffusion 1.5 [39] with temporal continuity post-processing [54]; 3) Zero-Shot, referring to the backbone model deployed without any sim
- background Several methods explicitly incorporate inpainting modules to hallucinate missing details in saturated regions [23,60,111]. However, when using limited-capacity generative models, the synthesized content often lacks realism or fine details. 2.3 Generative HDR Advances in generative modeling, including GANs [4,9,10,22,40,48-50,79,83,106] and diffusion models [3,16,31,34,39,67,74,88-90,96,102,105,107,108,112,113], have shown strong priors for image and video generation. Some approaches learn the mapping
representative citing papers
DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.
SeamCam quantifies camouflage by computing one minus the highest IoU recoverable from category-conditioned detection proposals against a ground-truth mask, achieving 78.82% agreement with human judgments.
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
Fake3DGS benchmark shows state-of-the-art 2D fake detectors fail on 3D-manipulated Gaussian Splatting images while a new multi-view coherence method improves detection.
Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench for Text-to-SVG and Image-to-SVG.
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
A new 100k triplet dataset and in-context diffusion framework ICTone enable state-of-the-art tone style transfer by jointly conditioning on content and reference images with scorer-based reward learning.
A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.
PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.
MoCA3D formulates monocular 3D box prediction as dense pixel-space tasks using corner heatmaps and depth maps, with a new PAG metric for image-plane evaluation.
DSH-Bench is a benchmark for subject-driven T2I generation that uses hierarchical taxonomy sampling, difficulty/scenario classification, and a new SICS metric showing 9.4% higher human correlation than prior measures.
StreamGVE enables high-quality training-free video editing by converting the task to noise-to-data streaming generation with dual-branch fast sampling, self-attention bridges, cross-attention grounding, source-oriented guidance, and visual prompting.
DS-DiT decouples low-resolution and reference interactions in a siamese diffusion transformer and adds a patch-level weights module plus autoguidance to improve reference-based super-resolution for remote sensing images.
Pretrained autoencoders in medical latent diffusion encode discriminative features well for reconstruction but structure their latent spaces in ways that hinder classifier learning, a gap that persists across architectures and is not closed by domain fine-tuning.
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
Combining diffusion priors as a product-of-experts and optimizing exponents via Bayesian evidence maximization enables prior tuning from one observation in inverse imaging problems.
FluSplat trains a model with geometric alignment constraints on multi-view edits to produce consistent 3D scene edits from sparse views in a single forward pass without test-time optimization.
AlloSR² rectifies one-step super-resolution trajectories with allomorphic generative flows via SNR initialization, velocity supervision, and self-adversarial matching to deliver state-of-the-art fidelity and realism.
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
citing papers explorer
- HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits
HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.
- ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.