{"total":13,"items":[{"citing_arxiv_id":"2606.28049","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration","primary_cat":"cs.CV","submitted_at":"2026-06-26T12:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30231","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GASP injects geometric priors into VLMs via a deep-supervised correspondence head trained on video point correspondences and depth consistency, raising internal matching accuracy and delivering gains on spatial benchmarks without any 3D VQA data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27318","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-26T17:26:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Q-GeoMem uses question-guided scoring to maintain a Fine-Grained Context Bank and Semantic-Geometric 
Evidence Bank, achieving SOTA on VSI-Bench and VSTI-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21625","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly","primary_cat":"cs.CV","submitted_at":"2026-05-20T18:36:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20165","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:50:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10106","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T07:20:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that 
extracts explicit spatial cues from expert models without any post-training.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"MLLMs in recent years, accompanied by the emergence of benchmarks that systematically evaluated video-based spatial intelligence [41, 24, 40, 27]. To our best knowledge, VSI-Bench was first proposed as a video-based visual-spatial intelligence benchmark to probe MLLMs' perceptual, linguistic, and temporal capabilities on spatial reasoning tasks [41]. Following benchmarks, such as STI-Bench [24] and MMSI-Video-Bench [27], were also designed to evaluate MLLMs' spatio-temporal understanding through challenging tasks, revealing MLLMs' limitations in real-world spatio-temporal understanding, ranging from spatial construction and motion understanding to planning, estimation, prediction, and cross-video reasoning. To endow MLLMs with stronger spatial intelligence, a"},{"citing_arxiv_id":"2604.09712","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-08T06:28:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"puter vision tools to perform physical operations [5, 12]. Recent research interests have begun to shift towards spatial reasoning in path planning [28, 40]. Tool-augmented Reasoning. A major research trend enhances LLMs by equipping them with external modules that supply complementary information. 
Typical examples integrate calculators [25, 44], code executors [10, 26] and symbolic solvers [20, 31, 32, 50], leveraging their reliability to handle complex reasoning beyond the native capacity of language models [30]. In the multimodal setting [11, 14], tools are extended to visual operations such as cropping, masking, or adjusting image attributes [46, 48], sometimes coordinated through reinforcement learning for tool selection and sequencing [22, 49]."},{"citing_arxiv_id":"2603.27437","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-03-28T22:49:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.03944","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SCP: Spatial Causal Prediction in Video","primary_cat":"cs.CV","submitted_at":"2026-03-04T11:09:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.10719","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous 
Driving","primary_cat":"cs.CV","submitted_at":"2025-12-11T14:59:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21471","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition","primary_cat":"cs.AI","submitted_at":"2025-11-26T15:04:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.23747","ref_index":53,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence","primary_cat":"cs.CV","submitted_at":"2025-05-29T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.01805","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpaceR: Reinforcing MLLMs in 
Video Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2025-04-02T15:12:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}