DiffIML applies score-based generative modeling to image manipulation localization, recovering coherent masks iteratively from noise to improve generalization on unseen manipulation types.
Canonical reference
100% of citing Pith papers cite this work as background.
Citation roles: background (5)
Citation polarities: background (5)
Representative citing papers
LEGO uses multiple generator-specific LoRA modules modulated by an MLP and fused with attention to detect synthetic images, achieving better performance than prior methods while using under 10% of the training data.
ResetEdit embeds a recoverable discrepancy signal during image generation in diffusion models to reconstruct an approximate original latent for high-fidelity text-guided editing.
ZODIAC enables zero-shot inference of conflict-inducing conditions in O-RAN xApps from marginal offline data alone via uncertainty-penalized compositional diffusion reasoning.
OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
DFAlign uses diffusion-based denoising to generate foreground knowledge prompts that improve cross-modal alignment for detecting unseen actions in untrimmed videos, reporting state-of-the-art results on OV-TAD benchmarks.
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
SynHAT uses a novel two-stage spatio-temporal diffusion framework with Latent Spatio-Temporal U-Net to synthesize realistic human activity traces, outperforming baselines by 52% on spatial and 33% on temporal metrics across four cities.
GeoLink uses offline 3D scene reconstruction to guide 2D feature refinement and relation distillation for improved generalization in cross-view geo-localization.
A replay method for continual face forgery detection condenses real-fake distribution discrepancies into compact maps and synthesizes compatible samples from current real faces to reduce forgetting under tight memory budgets without storing historical images.
IAD-Unify unifies industrial anomaly segmentation, region-grounded language understanding, and mask-guided generation in one framework using DINOv2 token injection into Qwen3.5, supported by the new Anomaly-56K dataset of 59,916 images.
SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
SENSE is a controllable diffusion model that jointly generates realistic urban satellite imagery and aligned building energy consumption and height maps from road networks and density inputs, improving downstream tasks with under 20% labeled data.
A new interests burn-down diffusion process models decaying user interests for personalized collaborative filtering, and its StageCF implementation outperforms prior generative methods.
SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync and coherence over prior methods.
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.
A two-stage semi-supervised flow matching framework with random voting and conflict-free guidance fuses mosaiced hyperspectral and panchromatic images to generate superior high-resolution hyperspectral results on benchmarks.
Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.
RF-CMG synthesizes high-quality mmWave and RFID signals from WiFi using a diffusion model with Modality-Guided Embedding for high-frequency details and Low-Frequency Modality Consistency to preserve physical structure.
ARGen generates high-fidelity dynamic facial expression videos using affective semantic injection and adaptive reinforcement diffusion to improve emotion recognition models facing data scarcity and long-tail distributions.
VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.