{"total":19,"items":[{"citing_arxiv_id":"2606.28049","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration","primary_cat":"cs.CV","submitted_at":"2026-06-26T12:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31148","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes","primary_cat":"cs.CV","submitted_at":"2026-05-29T10:59:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpatialAct benchmark shows VLMs handle isolated spatial reasoning but fail to maintain coherent spatial beliefs and produce reliable actions in multi-turn 3D interactions, underperforming humans.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23897","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ETCHR: Editing To Clarify and Harness Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:58:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22558","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-21T14:40:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21642","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?","primary_cat":"cs.CV","submitted_at":"2026-05-20T18:55:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20165","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:50:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct 
QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18162","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency","primary_cat":"cs.CV","submitted_at":"2026-05-18T10:05:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SAGE adds duality consistency as an auxiliary reward in GRPO training with a dynamic operation pool to improve spatial reasoning robustness and generalization in VLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12500","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"OCRBench [83] 82.10 81.90 89.20 91.90 83.90 91.00 86.30 86.50 Hallucination HallusionBench [49] 67.75 65.40 69.30 68.95 66.00 67.90 - - Visual Reasoning BabyVision [15] 25.00 17.78 25.80 31.70 18.60 29.60 11.34 - TiR [68] 28.15 22.30 31.90 29.30 22.50 42.30 24.19 - Spatial Intelligence VSI-Bench [150] 62.66 56.61* 55.67* 56.90 51.56* 58.10* 32.91 - ViewSpatial [65] 56.19 47.25 48.19 58.52 47.37 50.78 41.68 - MindCube-Tiny [156] 62.01 43.17 57.59 70.86 40.86 63.46 48.84 - 3DSR-Bench [91] 64.88 54.48 56.77 62.96 55.55 66.60 53.61 - 
Table 3 Quantitative evaluation results on multimodal understanding benchmarks. For spatial intelligence, we adopt EASI [ 9] as the standard evaluation, using 32 input frames on VSI-Bench for all models."},{"citing_arxiv_id":"2605.12413","ref_index":29,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:11:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains are possible via RL reward shaping.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10106","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T07:20:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"setting of VSI-Bench videos and new questions, experimental results sufficiently prove that ViSRA generalizes better to OOD spatial question types than post-training methods. 
4.4 Evaluation on Other Benchmarks To better demonstrate the generalization ability of our method, we expanded our evaluation to a broader set of spatial reasoning benchmarks. Specifically, these benchmarks include ViewSpatial-Bench [22], a benchmark for cross-viewpoint spatial reasoning from human-centered perspectives; OST-Bench [28], a benchmark for evaluating online spatio-temporal understanding during agent-centric scene exploration; and MMSI-Video-Bench [27], a comprehensive, fully human-annotated benchmark for video-based spatial intelligence in MLLMs. We slightly adjusted the benchmark"},{"citing_arxiv_id":"2604.22409","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments","primary_cat":"cs.CV","submitted_at":"2026-04-24T10:06:41+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"excel at symbolic belief revision but fail at perceptual revision using RGB images [32, 45]. A prevailing issue in current literature is the conflation of perceptual errors and memory failures [44, 45]. While an agent's inability to locate an object is often attributed to deficient long-term memory, recent evidence suggests that the primary bottleneck frequently resides in the initial perceptual grounding and geometric alignment [18, 25, 34, 38, 43]. This perspective is also consistent with egocentric 3D localization / interaction datasets that require stable geometry and object identity over time [17, 22, 24]. 
Related long-horizon navigation settings that explicitly test map-like memory (e.g., multi-goal navigation) also report sharp degradation with increased episode complexity [12, 26, 36]."},{"citing_arxiv_id":"2604.17385","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-19T11:21:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Consequently, recent efforts emphasize large scale data curation and advanced training paradigms. SpatialVLM [7] pioneered spatial dataset synthesis, followed by SPAR [54] and Cambrian-S [51], which aggregate diverse 3D scenes for comprehensive spatial cognition. Beyond data scaling, models like VST [50], SpaceR [35], MindCube [53], and SpatialLadder [26] explore reinforcement learning and verifiable rewards to optimize structured spatial reasoning. Notably, the comprehensive scaling study SenseNova-SI [5] reveals that simply scaling text based CoT yields diminishing returns for spatial tasks. 
Despite these advances, existing methods predominantly rely on textual reasoning, facing a fundamental representation mismatch between discrete linguistic"},{"citing_arxiv_id":"2604.09037","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos","primary_cat":"cs.CV","submitted_at":"2026-04-10T06:58:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09712","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-08T06:28:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.","context_count":1,"top_context_role":"method","top_context_polarity":"extend","context_text":"depth perception, metric measurement, instance counting, local detail inspection, and 3D coordinate estimation. The skill set is designed to be extensible-new skills can be integrated by defining the corresponding atomic operation compositions and I/O formats. Detailed API specifications for each skill are provided in Appendix A. 
3.2 Progressive Training Strategy Inspired by prior work [18], we adopt a progressive tool-scheduling training strategy that progressively teaches the model how to invoke tools and how to reason over tool outputs and intermediate results in a shallow-to-deep manner. Our training pipeline consists of three stages. An overall schematic diagram of the framework is shown in Figure 3. Stage 1: Warm-up. The objective of the warm-up stage is to fa-"},{"citing_arxiv_id":"2604.05695","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2026-04-07T10:45:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"implicit enhancement methods based on a pure RGB setting. These studies [4, 29, 32, 36, 40, 44, 46] attempt to implicitly elicit the spatial understanding capabilities of the models without introducing additional 3D encoders, either by massively scaling up synthetic datasets or by introducing advanced training strategies (e.g., the reinforcement learning-based SpatialLadder [25]). 
However, although the scaling of data and the optimization of training strategies indeed bring improvements in generalization performance, these methods still struggle to break through their performance bottlenecks when facing rigorous geometric measurement and complex spatial reasoning tasks, primarily due to the inherent lack of real 3D physical"},{"citing_arxiv_id":"2604.02870","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Token Warping Helps MLLMs Look from Nearby Viewpoints","primary_cat":"cs.CV","submitted_at":"2026-04-03T08:37:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, and Minhyuk Sung. Perspective-aware reasoning in vision-language models via mental imagery simulation. In ICCV, 2025. 1, 3 [49] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. TMLR, 2025. 2 [50] Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models. arXiv preprint arXiv:2505.21500, 2025. 
3 [51] Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu,"},{"citing_arxiv_id":"2512.23365","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialMosaic: A Multiview VLM Dataset for Partial Visibility","primary_cat":"cs.CV","submitted_at":"2025-12-29T10:48:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpatialMosaic introduces a 2M-pair multi-view QA dataset and 1M-pair benchmark for MLLMs on spatial reasoning under partial visibility, plus a hybrid baseline that integrates 3D reconstruction models as geometry encoders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21471","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition","primary_cat":"cs.AI","submitted_at":"2025-11-26T15:04:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.04978","ref_index":96,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI","primary_cat":"cs.AI","submitted_at":"2025-10-06T16:16:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in 
symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"rely on large-scale datasets like COCO [89] or ImageNet [90], which focus on object identification and simple relationships but lack detailed annotations for more sophisticated spatial reasoning tasks. To overcome these limitations, recent research has begun to explore incorporation of spatially aware neural networks [91], [92], [93], [94] and multi-view learning [95], [96], [97], [98] to handle complex spatial relationships. 5 C.3 Identifying Intrinsic Property Understanding the physical world from vision requires not only recognizing objects but also inferring their intrinsic properties and dynamic behaviors based on these properties. Intrinsic properties such as mass, viscosity and rigidity are inherent characteristics of objects that remain constant regard-"}],"limit":50,"offset":0}