Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
arXiv preprint arXiv:2512.07584
18 Pith papers cite this work.
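
For intuition, here is a minimal PyTorch sketch of the distribution-matching-distillation update that CDM extends, with the noise level t drawn continuously from U(0, 1) instead of from a discrete step grid. Everything in it is an illustrative assumption rather than the authors' implementation: the 2-D toy setup, the ScoreNet and diffuse helpers, and the randomly initialized teacher standing in for a pretrained score network. It also omits the paper's dynamic random-length schedules and off-trajectory latent alignment.

```python
# Minimal sketch of continuous-time distribution matching distillation
# (DMD-style). Assumptions: toy 2-D data; "teacher" stands in for a
# pretrained real-score network; names are illustrative, not the paper's.
import torch
import torch.nn as nn

def mlp(din, dout):
    return nn.Sequential(nn.Linear(din, 64), nn.SiLU(),
                         nn.Linear(64, 64), nn.SiLU(),
                         nn.Linear(64, dout))

class ScoreNet(nn.Module):
    """Predicts the score of x_t at continuous time t in [0, 1]."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = mlp(dim + 1, dim)

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

generator = mlp(2, 2)    # one-step generator: noise -> sample
teacher = ScoreNet()     # placeholder for a pretrained "real" score
fake_score = ScoreNet()  # fit online to the generator's distribution

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
f_opt = torch.optim.Adam(fake_score.parameters(), lr=1e-4)

def diffuse(x0, t):
    # Variance-preserving forward process at continuous time t.
    alpha, sigma = torch.cos(t * torch.pi / 2), torch.sin(t * torch.pi / 2)
    eps = torch.randn_like(x0)
    return alpha * x0 + sigma * eps, eps, sigma.clamp(min=1e-3)

for step in range(1000):
    z = torch.randn(128, 2)
    t = torch.rand(128, 1)  # continuous time, not a discrete step grid

    # (1) Denoising score matching fits fake_score to the generator's
    # current output distribution (target score is -eps / sigma).
    x0 = generator(z).detach()
    xt, eps, sigma = diffuse(x0, t)
    f_loss = ((fake_score(xt, t) + eps / sigma) ** 2).mean()
    f_opt.zero_grad(); f_loss.backward(); f_opt.step()

    # (2) Generator update: the gradient of this surrogate matches the
    # DMD gradient E[(s_fake - s_real)(x_t) * dx_t/dtheta].
    xt, _, _ = diffuse(generator(z), t)
    with torch.no_grad():
        direction = fake_score(xt, t) - teacher(xt, t)
    g_loss = (direction * xt).sum(dim=-1).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```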

Citing papers
- Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
  Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
- RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
  RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
- Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation
  Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
- UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
  A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
- DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
  DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
- Open-Source Image Editing Models Are Zero-Shot Vision Learners
  Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.
- DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing
  DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.
- LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
  LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
- FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding
  FashionStylist is an expert-annotated benchmark dataset that unifies outfit-to-item grounding, completion, and evaluation tasks for multimodal large language models in fashion.
- SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
  SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model outperforming prior methods on spatial editing.
- Gen-Searcher: Reinforcing Agentic Search for Image Generation
  Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
  SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
  The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.
- Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
  Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
- FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
  FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models on new and existing benchmarks.
- Qwen-Image-2.0 Technical Report
  Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
- Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
  JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.