Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651

Fangyu Liu, Guy Emerson, Nigel Collier · 2023

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

browse 8 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

PhotoFlow: Agentic 3D Virtual Photography Missions

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

PhotoFlow is a closed-loop agent framework that searches for camera parameters in 3D scenes according to language intent and outperforms one-shot, reflection, and random baselines on the new VPhotoBench of 47 scenes and 141 missions.

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

VLMs fail to ground numerical values in spatial perception on new bidirectional tasks, relying on shallow cues instead of coordinate-aware representations.

Do Vision--Language Models Understand 3D Scenes or Just Catalogue Objects?

cs.CV · 2026-05-19 · accept · novelty 6.0

VLMs achieve 53-97% on volumetric rearrangement planning but only 6-45% on occlusion and under 7% on reflections in a new 3,034-sample benchmark, with white-box analysis localizing the failure to visual-token merger in Qwen3-VL-8B-Thinking.

Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Spatial Prediction pretext task learns spatial structure in self-supervised learning by regressing relative position and scale between image views, yielding more structured representations and better generalization.

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

cs.CV · 2025-05-23 · unverdicted · novelty 6.0

Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.

LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

cs.CV · 2025-05-21 · unverdicted · novelty 5.0

LENS is a new multi-level benchmark dataset for evaluating MLLMs on perception-to-reasoning tasks using the same images across all levels with recent social media content.

citing papers explorer

Showing 8 of 8 citing papers.

PhotoFlow: Agentic 3D Virtual Photography Missions cs.CV · 2026-05-22 · unverdicted · none · ref 20
PhotoFlow is a closed-loop agent framework that searches for camera parameters in 3D scenes according to language intent and outperforms one-shot, reflection, and random baselines on the new VPhotoBench of 47 scenes and 141 missions.
DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving cs.CV · 2026-05-22 · unverdicted · none · ref 49
DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis cs.CV · 2026-05-21 · unverdicted · none · ref 51
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs cs.AI · 2026-05-22 · unverdicted · none · ref 17
VLMs fail to ground numerical values in spatial perception on new bidirectional tasks, relying on shallow cues instead of coordinate-aware representations.
Do Vision--Language Models Understand 3D Scenes or Just Catalogue Objects? cs.CV · 2026-05-19 · accept · none · ref 18
VLMs achieve 53-97% on volumetric rearrangement planning but only 6-45% on occlusion and under 7% on reflections in a new 3,034-sample benchmark, with white-box analysis localizing the failure to visual-token merger in Qwen3-VL-8B-Thinking.
Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning cs.CV · 2026-05-11 · unverdicted · none · ref 26
Spatial Prediction pretext task learns spatial structure in self-supervised learning by regressing relative position and scale between image views, yielding more structured representations and better generalization.
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM cs.CV · 2025-05-23 · unverdicted · none · ref 41
Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.
LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models cs.CV · 2025-05-21 · unverdicted · none · ref 55
LENS is a new multi-level benchmark dataset for evaluating MLLMs on perception-to-reasoning tasks using the same images across all levels with recent social media content.

Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer