{"total":26,"items":[{"citing_arxiv_id":"2606.31326","ref_index":87,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bridging Video Understanding and Generation in a Unified Framework","primary_cat":"cs.CV","submitted_at":"2026-06-30T08:29:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Vega unifies video understanding and generation via shared vocabulary and hybrid autoregressive-diffusion architecture, reporting strong results on VBench and VideoMME.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29384","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model","primary_cat":"cs.CV","submitted_at":"2026-06-28T13:19:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Event-VLA integrates event streams into VLA models through action-conditioned gated cross-attention to maintain performance in normal light while improving success rates under low-light and near-dark conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27745","ref_index":130,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Panoramic Scene Analysis: A Survey from Distortion-Aware Engineering to Sphere-Native Foundation Modeling","primary_cat":"cs.CV","submitted_at":"2026-06-26T05:54:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Survey organizing panoramic scene analysis literature by architectural design and training paradigm, identifying the absence of methods achieving both strict spherical equivariance and full reuse of perspective-pretrained weights, plus 
five evaluation protocol gaps and a six-point roadmap.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23041","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-06-22T08:48:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPAR introduces a semantic-pixel self-alignment tokenizer and dynamic token routing to create a unified multimodal model that performs both understanding and generation at claimed state-of-the-art levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06194","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ActiveMimic: Egocentric Video Pretraining with Active Perception","primary_cat":"cs.RO","submitted_at":"2026-06-04T14:01:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ActiveMimic pretrains on egocentric human video by recovering and modeling active camera motion as viewpoint actions, matching robot-data pretraining performance on real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04939","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UAT: Unified Audio-Text Diffusion for Audio Generation, Editing, and Captioning","primary_cat":"eess.AS","submitted_at":"2026-06-03T14:29:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UAT presents a diffusion-centric framework coupling continuous latent diffusion for audio with masked discrete diffusion for text 
in a shared dual-stream backbone to enable unified generation, editing, and captioning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00188","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PaintBench: Deterministic Evaluation of Precise Visual Editing","primary_cat":"cs.GR","submitted_at":"2026-05-29T16:01:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PaintBench provides a scalable deterministic benchmark for precise visual editing operations, revealing that even the best of 11 models achieves only 17.1% mIoU and that scores correlate strongly with applied data visualization editing performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25343","ref_index":214,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Native Multimodal Modeling: A Roadmap","primary_cat":"cs.CV","submitted_at":"2026-05-25T01:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"More recent query-aware approaches, including Q-Zoom [213], further make the resolution decision conditional on the user instruction: the model first reasons over a coarse view, then spends high-resolution tokens only on regions likely to affect the answer. 
6.2 Addressing the Dual Challenges of Heterogeneity and Scale in MLLMs In the progression toward artificial general intelligence, MLLMs must reconcile the dual challenges of heterogeneity and scale [214, 215]. Heterogeneity manifests as a fundamental representational chasm, reflecting the reality that human language is abstract, discrete, and symbolically structured, whereas visual, auditory, and sensory signals remain high-dimensional, continuous, and grounded in physical observables. This disparity extends beyond mere modality-specific"},{"citing_arxiv_id":"2605.18678","ref_index":147,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"Llama-adapter: Efficient fine-tuning of language models with zero-init attention.arXiv preprint arXiv:2303.16199, 2023. [147] Shanshan Zhao, Xinjie Zhang, Jintao Guo, Jiakui Hu, Lunhao Duan, Minghao Fu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities.arXiv preprint arXiv:2505.02567, 2025. [148] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025. [149] Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. 
Diffusionnft: Online diffusion reinforcement with forward process."},{"citing_arxiv_id":"2605.17766","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LatentUMM: Dual Latent Alignment for Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T02:35:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15198","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11400","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning","primary_cat":"cs.MM","submitted_at":"2026-05-12T01:43:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed 
strategies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11363","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PresentAgent-2: Towards Generalist Multimodal Presentation Agents","primary_cat":"cs.CV","submitted_at":"2026-05-12T00:32:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18826-18836, 2025. [22] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. [23] Shanshan Zhao, Xinjie Zhang, Jintao Guo, Jiakui Hu, Lunhao Duan, Minghao Fu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities.arXiv preprint arXiv:2505.02567, 2025. [24] Qwen Team. Qwen3. 
5-omni technical report.arXiv preprint arXiv:2604.15804, 2026."},{"citing_arxiv_id":"2605.03403","ref_index":37,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-05T06:23:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GRPO-TTA reformulates test-time adaptation for vision-language models as group-wise policy optimization via top-K sampling from CLIP distributions and alignment/dispersion rewards to tune the visual encoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01563","ref_index":92,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-Dataset Cross-Domain Knowledge Distillation for Unified Medical Image Segmentation, Classification, and Detection","primary_cat":"cs.CV","submitted_at":"2026-05-02T18:23:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A multi-dataset cross-domain knowledge distillation approach improves unified performance on medical image segmentation, classification, and detection by transferring domain-invariant features from a joint teacher model to task-specific students.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25072","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-04-27T23:57:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"XTC-Bench reveals that strong performance on generation or understanding 
tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"validate this diagnostic property through a human study in Sec. 4.4. 4 Experiments 4.1 Experimental Setup We evaluate eight open-sourced uMMs with XTC-Bench, including BAGEL 7B [10], BLIP3-o-8B [4], Janus-Pro-7B [5], MMaDA-8B [41], OmniGen-2 [36], Show-o [37], Show-o2-7B [38], and Tar-7B [15]. We chose those models to cover the taxonomy of uMM architectures proposed in recent work [46]. MMaDA leverages the diffusion paradigm for both visual and text modality. Tar, OmniGen2, BLIP3-o-8B, and Janus-Pro-7B adopt an autoregressive next-token prediction (NTP) backbone structure with sophisticated encoding and decoding strategies for both modalities. Show-o, Show-o2-7B, and BAGEL-7B predict text tokens with NTP and use diffusion for visual tokens."},{"citing_arxiv_id":"2604.21904","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection","primary_cat":"cs.CV","submitted_at":"2026-04-23T17:49:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"dominated by autoregressive models [3, 29], while generation relied on separate diffusion [16, 21, 28, 49] or autoregressive [15, 51, 55] frameworks. 
Recent advancements, spurred by systems like GPT-4o [42], have shifted focus toward unified encoder-decoder architectures that frame both tasks as a sequence modeling problem. Current approaches primarily fall into three categories [71]: diffusion-based models that employ dual processes for joint generation [33, 68], autoregressive models that leverage multi-scale visual tokenizers [14, 34, 66], and hybrid architectures that merge autoregressive reasoning with diffusion-based synthesis [8, 54, 59]. Recent works utilize techniques such as Mixture-of-Transformer [14], shared masked au-"},{"citing_arxiv_id":"2604.16879","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adaptive Forensic Feature Refinement via Intrinsic Importance Perception","primary_cat":"cs.CV","submitted_at":"2026-04-18T07:07:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harming generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13540","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-15T06:41:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniRect-CoT is a training-free rectification chain-of-thought framework that treats diffusion denoising as visual reasoning and uses the model's inherent understanding to align and correct intermediate generation 
results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10784","ref_index":33,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training","primary_cat":"cs.AI","submitted_at":"2026-04-12T19:19:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TorchUMM is the first unified codebase and benchmark suite for multimodal understanding, generation, and editing across varied UMM models and datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08209","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering","primary_cat":"cs.CV","submitted_at":"2026-04-09T13:09:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ing alongside integrated architectures like Qwen-Omni [41,42] driven by the surging demand for holistic perception in embodied AI [17,51], current audio-visual enhancements predominantly rely on computationally intensive supervised training with meticulously annotated data (e.g., Video-CoT [52], CoTasks [38], VIDEOP2R [11]), complex auxiliary objectives leveraging external reward models [55] (e.g., VideoWorld 2 [26], Dual-IPO [48]), or elaborate multi-stage RL pipelines like Omni-R1 [56]. 
To bridge this gap without necessitating costly manual annotation or architectural complexity, our OmniJigsaw framework introduces a lightweight and verifiable self-supervised proxy task that strategically orchestrates synchronized video and audio streams to concurrently bolster in-"},{"citing_arxiv_id":"2604.07973","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace","primary_cat":"cs.AI","submitted_at":"2026-04-09T08:37:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[60] Zirui Zhao, Wee Sun Lee, and David Hsu. 2024. Large language models as commonsense knowledge for large-scale task planning. Advances in Neural Information Processing Systems 36 (2024). [61] Gengze Zhou, Yicong Hong, and Qi Wu. 2023. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. arXiv preprint arXiv:2305.16986 (2023). [62] Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, and Xin Eric Wang. 2023. Esc: Exploration with soft commonsense constraints for zero-shot object navigation. In International Conference on Machine Learning. PMLR, 42829-42842. [63] Fengda Zhu, Xiwen Liang, Yi Zhu, Qizhi Yu, Xiaojun Chang, and Xiaodan Liang. 2021. 
Soon: Scenario oriented object navigation with graph-based exploration."},{"citing_arxiv_id":"2604.04707","ref_index":161,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenWorldLib: A Unified Codebase and Definition of Advanced World Models","primary_cat":"cs.CV","submitted_at":"2026-04-06T14:19:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Many current world model architectures focus on next-frame prediction. This approach aligns with how humans process high-density sensory inputs, as humans are essentially \"pre-trained\" in the physical world, whereas large models are pre-trained on massive internet text corpora [78, 79]. However, based on existing architectures, VLMs might offer a practical solution. For example, Bagel [161] successfully achieves both multimodal reasoning and multimodal generation using the Qwen architecture. This demonstrates that Large Language Models (LLMs) pre-trained on internet data can possess all the capabilities required for a world model, showing their potential to serve as the foundational base. 
Therefore, before focusing entirely on the specific structural"},{"citing_arxiv_id":"2602.12286","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mind the Gap No More: Achieving Zero-Gap Multimodal Integration via One Tokenizer","primary_cat":"q-bio.GN","submitted_at":"2026-01-21T07:46:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"One Tokenizer achieves zero-gap multimodal integration by mapping all inputs to a unified token vocabulary, allowing native LLMs to perform deep cross-modal reasoning without modular encoders or fusion layers, and outperforming encoder-based baselines on DNA-text tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.10941","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mull-Tokens: Modality-Agnostic Latent Thinking","primary_cat":"cs.CV","submitted_at":"2025-12-11T18:59:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.24211","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Speculative Coupled Decoding for Training-Free Lossless Acceleration of Autoregressive Visual Generation","primary_cat":"cs.CV","submitted_at":"2025-10-28T09:26:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Speculative Coupled Decoding stabilizes draft sampling in Speculative Jacobi Decoding via an information-theoretic coupling step, delivering up to 4.2x image and 13.6x video 
speedups with no quality loss or training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}