hub

arXiv preprint arXiv:2512.07584 , year=

Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al · 2025 · arXiv 2512.07584

18 Pith papers cite this work. Polarity classification is still indexing.

18 Pith papers citing it

read on arXiv browse 18 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.

Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.

Open-Source Image Editing Models Are Zero-Shot Vision Learners

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding

cs.CV · 2026-04-10 · unverdicted · novelty 6.0

FashionStylist is an expert-annotated benchmark dataset that unifies outfit-to-item grounding, completion, and evaluation tasks for multimodal large language models in fashion.

SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.

Gen-Searcher: Reinforcing Agentic Search for Image Generation

cs.CV · 2026-03-30 · unverdicted · novelty 6.0

Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 5.0

The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

cs.CV · 2026-04-27 · unverdicted · novelty 5.0

Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

cs.CV · 2026-04-13 · unverdicted · novelty 5.0

FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models on new and existing benchmarks.

Qwen-Image-2.0 Technical Report

cs.CV · 2026-05-11 · unverdicted · novelty 4.0

Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

cs.GR · 2026-05-05 · unverdicted · novelty 4.0

JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

citing papers explorer

Showing 18 of 18 citing papers.

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation cs.CV · 2026-05-07 · unverdicted · none · ref 53
CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling cs.CV · 2026-05-13 · unverdicted · none · ref 39
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition cs.CV · 2026-05-12 · unverdicted · none · ref 46
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation cs.CV · 2026-05-13 · unverdicted · none · ref 36
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation cs.CV · 2026-05-12 · unverdicted · none · ref 29 · 2 links
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models cs.CV · 2026-05-07 · unverdicted · none · ref 34
DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
Open-Source Image Editing Models Are Zero-Shot Vision Learners cs.CV · 2026-05-06 · unverdicted · none · ref 23
Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.
DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing cs.CV · 2026-04-28 · unverdicted · none · ref 3
DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model cs.CV · 2026-04-22 · unverdicted · none · ref 35
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding cs.CV · 2026-04-10 · unverdicted · none · ref 45
FashionStylist is an expert-annotated benchmark dataset that unifies outfit-to-item grounding, completion, and evaluation tasks for multimodal large language models in fashion.
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing cs.CV · 2026-04-06 · unverdicted · none · ref 46
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
Gen-Searcher: Reinforcing Agentic Search for Image Generation cs.CV · 2026-03-30 · unverdicted · none · ref 23
Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 124
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 73
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation cs.CV · 2026-04-27 · unverdicted · none · ref 36
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance cs.CV · 2026-04-13 · unverdicted · none · ref 48
FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models on new and existing benchmarks.
Qwen-Image-2.0 Technical Report cs.CV · 2026-05-11 · unverdicted · none · ref 23
Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation cs.GR · 2026-05-05 · unverdicted · none · ref 73
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

arXiv preprint arXiv:2512.07584 , year=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer