UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
Title resolution pending
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 12roles
background 3polarities
background 3representative citing papers
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
An MLLM agent reformulates image editing tasks into executable operation sequences to improve reliability on challenging cases across existing generative backbones.
FASA bridges low-level forensic frequency signals and high-level semantic consistency to achieve state-of-the-art localization of both conventional and diffusion-generated image manipulations.
CAGE uses LLM-generated code for label-correct diagrams followed by ControlNet-conditioned diffusion refinement to produce both accurate and visually engaging educational graphics, backed by the new EduDiagram-2K dataset.
InsTraj generates realistic, instruction-faithful GPS trajectories by using an LLM to parse natural-language travel intent and a multimodal diffusion transformer to produce the paths.
SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.
TIGFlow-GRPO uses a Trajectory-Interaction-Graph in conditional flow matching plus Flow-GRPO optimization to produce more accurate, socially compliant, and physically feasible trajectory forecasts on ETH/UCY and SDD datasets.
Face-D²CL fuses spatial and frequency features and uses dual continual learning to reduce forgetting while adapting to new DeepFakes, cutting average error rates by 60.7% and raising unseen-domain AUC by 7.9% over prior SOTA.
citing papers explorer
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
-
SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
-
Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing
Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.
-
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
-
Making Image Editing Easier via Adaptive Task Reformulation with Agentic Executions
An MLLM agent reformulates image editing tasks into executable operation sequences to improve reliability on challenging cases across existing generative backbones.
-
Bridging the Micro--Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization
FASA bridges low-level forensic frequency signals and high-level semantic consistency to achieve state-of-the-art localization of both conventional and diffusion-generated image manipulations.
-
CAGE: Bridging the Accuracy-Aesthetics Gap in Educational Diagrams via Code-Anchored Generative Enhancement
CAGE uses LLM-generated code for label-correct diagrams followed by ControlNet-conditioned diffusion refinement to produce both accurate and visually engaging educational graphics, backed by the new EduDiagram-2K dataset.
-
InsTraj: Instructing Diffusion Models with Travel Intentions to Generate Real-world Trajectories
InsTraj generates realistic, instruction-faithful GPS trajectories by using an LLM to parse natural-language travel intent and a multimodal diffusion transformer to produce the paths.
-
SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation
SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.
-
TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Guided Optimization
TIGFlow-GRPO uses a Trajectory-Interaction-Graph in conditional flow matching plus Flow-GRPO optimization to produce more accurate, socially compliant, and physically feasible trajectory forecasts on ETH/UCY and SDD datasets.
-
Face-D(^2)CL: Multi-Domain Synergistic Representation with Dual Continual Learning for Facial DeepFake Detection
Face-D²CL fuses spatial and frequency features and uses dual continual learning to reduce forgetting while adapting to new DeepFakes, cutting average error rates by 60.7% and raising unseen-domain AUC by 7.9% over prior SOTA.