The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
35 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
- A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
- What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
- Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes
Ground4D resolves temporal conflicts in feedforward 4D Gaussian reconstruction for off-road scenes via voxel-grounded temporal aggregation with intra-voxel softmax and surface normal regularization, outperforming prior methods on ORAD-3D and RELLIS-3D while generalizing zero-shot.
- Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection
A replay method for continual face forgery detection condenses real-fake distribution discrepancies into compact maps and synthesizes compatible samples from current real faces to reduce forgetting under tight memory budgets without storing historical images.
- IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation
IAD-Unify unifies industrial anomaly segmentation, region-grounded language understanding, and mask-guided generation in one framework using DINOv2 token injection into Qwen3.5, supported by the new Anomaly-56K dataset of 59,916 images.
- Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels
Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.
- DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning
DiV-INR integrates implicit neural representations as conditioning signals for diffusion models to achieve better perceptual quality than HEVC, VVC, and prior neural codecs at extremely low bitrates under 0.05 bpp.
- Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
- When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
- PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks
PlanViz is a new benchmark with three sub-tasks and a PlanScore metric for evaluating planning-oriented image generation and editing by unified multimodal models on computer-use tasks.
- Uncertainty-Aware 4D Gaussian Splatting for Monocular Occluded Human Rendering
U-4DGS reformulates occluded dynamic human rendering as MAP estimation under heteroscedastic noise, using a Probabilistic Deformation Network and uncertainty-modulated joint rasterization plus confidence-aware regularizations to deliver SOTA fidelity and robustness on ZJU-MoCap and OcMotion.
- SR-Ground: Image Quality Grounding for Super-Resolved Content
The paper releases SR-Ground, a crowdsourced dataset for pixel-level segmentation of six artifact types in super-resolved images, and shows its use for training grounded IQA models and artifact-reducing fine-tuning.
- SandSim: Curve-Guided Gaussian Splatting for Reconstructing Sand Painting Processes
SandSim reconstructs temporally coherent sand painting processes from single images using curve-guided Gaussian splatting, subtractive compositing for accumulation, and semantic-guided stroke planning.
- EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence
EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync and coherence over prior methods.
- Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing
Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.
- LBFTI: Layer-Based Facial Template Inversion for Identity-Preserving Fine-Grained Face Reconstruction
LBFTI decomposes faces into three layers with dedicated generators and a three-stage training process to invert templates into fine-grained, identity-preserving images, claiming 25.3% better TAR than prior methods.
- Cross-Modal Generation: From Commodity WiFi to High-Fidelity mmWave and RFID Sensing
RF-CMG synthesizes high-quality mmWave and RFID signals from WiFi using a diffusion model with Modality-Guided Embedding for high-frequency details and Low-Frequency Modality Consistency to preserve physical structure.
- DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration
DVFace uses a spatio-temporal dual-codebook and asymmetric fusion in a one-step diffusion model to deliver better video face restoration quality, temporal consistency, and identity preservation than recent methods.
- ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models
ArtifactWorld restores artifacts in 3D Gaussian Splatting by training a video diffusion backbone on 107.5K paired clips with an isomorphic predictor for artifact heatmaps and an Artifact-Aware Triplet Fusion mechanism to achieve better sparse-view novel synthesis.
- VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis
VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.
- Improving Random Testing via LLM-powered UI Tarpit Escaping for Mobile Apps
LLM-powered monitoring of UI similarity allows random testing tools to escape tarpits, yielding 45-55% higher coverage and more unique bugs across 12 apps.
- RHVI-FDD: A Hierarchical Decoupling Framework for Low-Light Image Enhancement
RHVI-FDD hierarchically decouples luminance-chrominance and then frequency components in low-light images to correct color, suppress noise, and preserve details better than prior methods.
- Rethinking Exposure Correction for Spatially Non-uniform Degradation
The paper introduces spatially adaptive modulation with a signal encoder and an uncertainty-inspired loss for correcting non-uniform exposure degradations in images.
- TIQA: Human-Aligned Perceptual Text Quality Assessment in Generated Images
TIQA introduces datasets and a model that predict human perceptual quality of rendered text in AI images, achieving PLCC 0.942 on crops and improving selected image text quality by 0.36 MOS.
- SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
SmokeSVD reconstructs dynamic smoke from a single video via diffusion-based side-view synthesis, progressive multi-view refinement, and Navier-Stokes-guided density-velocity estimation.
- Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline
A self-supervised Degradation Estimation Network estimates parameters for physics-informed noise distributions to generate realistic synthetic low-light data, showing gains on noise replication, enhancement, and detection tasks.
- Generator-Refiner-Examiner: A Tri-Module Data Augmentation Framework for 3D Human Avatar Learning from Monocular Videos
TrioMan is a tri-module data augmentation framework using a Generator for pose/camera perturbations, a Refiner with one-step diffusion, and an Examiner with dual-branch attention to improve 3D avatar learning from monocular videos, claiming better results than prior methods on two benchmarks.
- DealMaTe: Multi-Dimensional Material Transfer via Diffusion Transformer
DealMaTe proposes a simplified diffusion framework for material transfer that injects multi-dimensional 3D conditions via Multi-Dim 3D Shader LoRA and Shader Causal Mutual Attention with KV caching.
- SAMIC: A Lightweight Semantic-Aware Mamba for Efficient Perceptual Image Compression
SAMIC introduces semantic-aware Mamba blocks and SVD-based redundancy reduction to achieve efficient perceptual image compression with improved rate-distortion-perception tradeoffs.
- Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?
Pixel-level protective perturbations for portrait privacy are ineffective against common image transformations, and a low-cost purification framework can strip them out.
- Identity-Decoupled Anonymization for Visual Evidence in Multi-modal Retrieval-Augmented Generation
The paper proposes a three-part generative anonymization pipeline using disentangled variational encoding, manifold-aware identity replacement, and distilled latent diffusion to protect face identities in MRAG while preserving non-identity attributes.
- Discrete Preference Learning for Personalized Multimodal Generation
DPPMG learns discrete modal-specific preferences via a dedicated GNN from multimodal user data, quantizes them into tokens, and feeds them into generators with a consistency reward to produce personalized text and images.
- Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation
PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.
- Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency