{"total":97,"items":[{"citing_arxiv_id":"2605.13062","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-13T06:33:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12724","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Inline Critic Steers Image Editing","primary_cat":"cs.CV","submitted_at":"2026-05-12T20:29:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12624","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:09:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning 
metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12622","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Action Emergence from Streaming Intent","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:09:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Streaming Intent lets a VLA model derive driving intent via streamed chain-of-thought reasoning and use it to steer a flow-matching action head, yielding competitive Waymo scores plus intent-based trajectory control without pre-built banks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12500","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"works [17, 20, 69, 158] validate that direct pixel-space modeling can rival or even surpass latent diffusion, pointing toward a fundamentally new direction via fully end-to-end optimization from raw pixels. 2.2 Native Multimodal Unified Models Early efforts to unify multimodal understanding and generation have largely converged on shared backbones, as exemplified by Show-o [145, 146], Janus [18, 92, 140], OmniGen [141, 144], and BAGEL [28]. 
While these systems demonstrate that perception and synthesis can coexist within a single model, they remain split across fundamentally different tokenizers, diffusion heads, or decoupled pathways, reflecting a deeper mismatch between understanding and generation. A complementary line of work shifts the focus to the visual interface itself, including shared discrete"},{"citing_arxiv_id":"2605.12309","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:56:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12305","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:54:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12271","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Text Prompts: Visual-to-Visual Generation as A Unified 
Paradigm","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:35:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute binding and structural control.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12088","ref_index":7,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T13:10:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11567","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dynamic Execution Commitment of Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T05:52:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11541","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GeoR-Bench: Evaluating Geoscience Visual 
Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-12T05:13:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11400","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning","primary_cat":"cs.MM","submitted_at":"2026-05-12T01:43:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11363","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PresentAgent-2: Towards Generalist Multimodal Presentation Agents","primary_cat":"cs.CV","submitted_at":"2026-05-12T00:32:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering 
modes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11061","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:59:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10780","ref_index":5,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization","primary_cat":"cs.CV","submitted_at":"2026-05-11T16:14:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revealing a log-linear scaling law with fusion capacity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10180","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What Concepts Lie Within? 
Detecting and Suppressing Risky Content in Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-11T08:31:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09693","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do multimodal models imagine electric sheep?","primary_cat":"cs.CV","submitted_at":"2026-05-10T18:25:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09233","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Robust Sequential Decomposition for Complex Image Editing","primary_cat":"cs.CV","submitted_at":"2026-05-10T00:23:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sequential decomposition trained on synthetic editing tasks improves robustness for complex image instructions and transfers to real images via co-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08724","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task 
Alignment","primary_cat":"cs.CV","submitted_at":"2026-05-09T06:13:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SynerMedGen introduces generation-aligned understanding tasks and a two-stage training strategy that enables strong zero-shot medical image synthesis performance and outperforms specialized models when generation training is added.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08043","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-08T17:32:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08029","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation","primary_cat":"cs.CV","submitted_at":"2026-05-08T17:14:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent 
space.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06548","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Continuous Latent Diffusion Language Model","primary_cat":"cs.CL","submitted_at":"2026-05-07T16:44:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05781","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Steering Visual Generation in Unified Multimodal Models with Understanding Supervision","primary_cat":"cs.CV","submitted_at":"2026-05-07T07:20:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05646","ref_index":95,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality","primary_cat":"cs.CV","submitted_at":"2026-05-07T03:53:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality 
and improved semantic performance over its teacher model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04128","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation","primary_cat":"cs.GR","submitted_at":"2026-05-05T15:49:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02892","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion","primary_cat":"cs.CV","submitted_at":"2026-05-04T17:59:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08163","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing","primary_cat":"cs.CV","submitted_at":"2026-05-04T16:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and 
script fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02641","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE","primary_cat":"cs.CV","submitted_at":"2026-05-04T14:26:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01135","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ScribbleEdit: Synthetic Data for Image Editing with Scribbles and Text","primary_cat":"cs.CV","submitted_at":"2026-05-01T22:26:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ScribbleEdit is a synthetic dataset combining scribbles and text for training image editing models that produce spatially aligned and semantically consistent results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00707","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-01T14:51:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base 
model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02937","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Proteo-R1: Reasoning Foundation Models for De Novo Protein Design","primary_cat":"cs.LG","submitted_at":"2026-05-01T06:52:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proteo-R1 decouples an MLLM-based understanding expert that selects functional residues from a diffusion-based generation expert that builds protein structures under those explicit constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.28185","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-30T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27505","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Leveraging Verifier-Based Reinforcement Learning in Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-30T06:54:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like 
FLUX.1-kontext via GRPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26341","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness","primary_cat":"cs.CV","submitted_at":"2026-04-29T06:46:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25636","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-04-28T13:36:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than editing-based methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25477","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-28T10:30:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DDA-Thinker decouples planning from generation and applies 
dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25072","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-04-27T23:57:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24763","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2026-04-27T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24625","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Meta-CoT: Enhancing Granularity and Generalization in Image 
Editing","primary_cat":"cs.CV","submitted_at":"2026-04-27T15:52:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23996","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs","primary_cat":"cs.CV","submitted_at":"2026-04-27T03:23:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23763","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-26T15:28:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22875","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SketchVLM: Vision language models can annotate images to explain thoughts and guide 
users","primary_cat":"cs.CV","submitted_at":"2026-04-23T22:33:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22868","ref_index":74,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Probing Visual Planning in Image Editing Models","primary_cat":"cs.CV","submitted_at":"2026-04-23T19:00:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21926","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs","primary_cat":"cs.CV","submitted_at":"2026-04-23T17:59:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21921","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Context Unrolling in Omni 
Models","primary_cat":"cs.CV","submitted_at":"2026-04-23T17:58:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21904","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection","primary_cat":"cs.CV","submitted_at":"2026-04-23T17:49:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20570","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Exploring Spatial Intelligence from a Generative Perspective","primary_cat":"cs.CV","submitted_at":"2026-04-22T13:50:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20267","ref_index":5,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ATIR: Towards Audio-Text Interleaved 
Contextual Retrieval","primary_cat":"cs.SD","submitted_at":"2026-04-22T07:11:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19954","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens","primary_cat":"cs.CV","submitted_at":"2026-04-21T20:01:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19902","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings","primary_cat":"cs.CV","submitted_at":"2026-04-21T18:25:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}