OA-VAT improves visual active tracking by combining instance-level prototype discrimination with occlusion-aware diffusion planning, reporting gains over prior SOTA on simulated and real drone benchmarks.
hub
Improved denoising diffusion probabilistic models
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
fields
cs.CV 13representative citing papers
HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
A diffusion model trained on synthetically damaged teeth from public datasets completes crowns with 81.8% IoU and 0.00034 Chamfer distance, and produces real-world restorations with minimal opposing-tooth interference.
MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.
PhysiGen reduces interpenetration in text-driven 3D human interaction generation by simplifying meshes to geometric primitives for fast collision detection and guiding optimization with collision regions.
Patch Forcing enables diffusion models to denoise image patches at varying rates based on predicted difficulty, advancing easier regions first to improve context and achieve better generation quality on ImageNet while scaling to text-to-image tasks.
Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.
EGLOCE erases target concepts in diffusion models at inference time by optimizing latents with dual energy guidance that repels unwanted concepts while retaining prompt alignment.
Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
KSDiff generates convolutional kernels in kernel space using low-rank core tensor and factor generators with multi-head attention for fast, high-quality pansharpening.
GSAL combines diffusion-based visual difficulty scoring with hierarchical semantic coverage to improve active learning retrieval of subtle and rare visual anomalies over standard uncertainty and diversity methods.
SocialMirror reconstructs 3D meshes of closely interacting humans from monocular videos using semantic guidance from vision-language models and geometric constraints in a diffusion model to handle occlusions and maintain temporal and spatial consistency.
Two new lightweight modules for diffusion-based real-world image super-resolution deliver competitive perceptual quality and better structure preservation on DIV2K and RealSR datasets.
citing papers explorer
-
Instance-level Visual Active Tracking with Occlusion-Aware Planning
OA-VAT improves visual active tracking by combining instance-level prototype discrimination with occlusion-aware diffusion planning, reporting gains over prior SOTA on simulated and real drone benchmarks.
-
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
-
From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion
A diffusion model trained on synthetically damaged teeth from public datasets completes crowns with 81.8% IoU and 0.00034 Chamfer distance, and produces real-world restorations with minimal opposing-tooth interference.
-
MultiAnimate: Pose-Guided Image Animation Made Extensible
MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.
-
PhysiGen: Integrating Collision-Aware Physical Constraints for High-Fidelity Human-Human Interaction Generation
PhysiGen reduces interpenetration in text-driven 3D human interaction generation by simplifying meshes to geometric primitives for fast collision detection and guiding optimization with collision regions.
-
Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
Patch Forcing enables diffusion models to denoise image patches at varying rates based on predicted difficulty, advancing easier regions first to improve context and achieve better generation quality on ImageNet while scaling to text-to-image tasks.
-
Bias at the End of the Score
Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.
-
EGLOCE: Training-Free Energy-Guided Latent Optimization for Concept Erasure
EGLOCE erases target concepts in diffusion models at inference time by optimizing latents with dual energy guidance that repels unwanted concepts while retaining prompt alignment.
-
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
-
Fast Kernel-Space Diffusion for Remote Sensing Pansharpening
KSDiff generates convolutional kernels in kernel space using low-rank core tensor and factor generators with multi-head attention for fast, high-quality pansharpening.
-
Hard to See, Hard to Label: Generative and Symbolic Acquisition for Subtle Visual Phenomena
GSAL combines diffusion-based visual difficulty scoring with hierarchical semantic coverage to improve active learning retrieval of subtle and rare visual anomalies over standard uncertainty and diversity methods.
-
SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance
SocialMirror reconstructs 3D meshes of closely interacting humans from monocular videos using semantic guidance from vision-language models and geometric constraints in a diffusion model to handle occlusions and maintain temporal and spatial consistency.
-
Degradation-Aware and Structure-Preserving Diffusion for Real-World Image Super-Resolution
Two new lightweight modules for diffusion-based real-world image super-resolution deliver competitive perceptual quality and better structure preservation on DIV2K and RealSR datasets.