arxiv: 2605.12413 · v2 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

(2) Guangzhou University, (3) Queen Mary University of London, 4) ((1) The Hong Kong Polytechnic University, (4) HKUST (Guangzhou)), Ioannis Patras (3), Jiaxing Li (2), Wai Keung Wong (1), Xu Zheng (3, Yuangong Chen (1)

Pith reviewed 2026-05-13 06:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal large language modelsspatial reasoningomnidirectional imagesperspective conditioningbenchmark evaluationperception-reasoning gapreinforcement learning

0 comments

The pith

Multimodal large language models exhibit a large gap between basic spatial perception and advanced perspective-conditioned reasoning on omnidirectional images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper diagnoses limitations in MLLMs for handling spatial reasoning that depends on the observer's perspective in 360-degree images. It creates PCSR-Bench with over 84,000 questions across basic and advanced tasks to measure this. Evaluations show models perform decently on simple relative directions but collapse on tasks requiring egocentric rotations or multi-step compositional reasoning. An RL experiment indicates that some improvement is possible through optimization but gains are selective and not comprehensive.

Core claim

We observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. RL-based reward shaping improves a 7B model from 31.10% to 60.06% in controlled settings, showing partial plasticity.

What carries the argument

PCSR-Bench, a benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments, testing eight tasks that span foundational perception to advanced perspective-conditioned spatial reasoning.

If this is right

PCSR acts as a key bottleneck limiting the deployment of MLLMs in applications requiring viewpoint-aware spatial understanding.
Targeted optimization like RL can recover meaningful performance on some PCSR tasks but remains sensitive to reward formulation.
The gap highlights that current models rely heavily on direct perception rather than inferring spatial relations under perspective changes.
Indoor omnidirectional images provide a controlled setting where full scene coverage still demands viewpoint-dependent inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the benchmark to outdoor or dynamic scenes could reveal whether the gap is environment-specific.
Integrating explicit 3D modeling or geometric priors into MLLMs might address the perspective conditioning issue more robustly than reward shaping alone.
Human validation of the generated QA pairs would strengthen the claim that the benchmark isolates true reasoning capabilities.
Similar perception-reasoning gaps may appear in other multimodal tasks involving temporal or causal reasoning under changing conditions.

Load-bearing premise

The automatically generated question-answer pairs correctly measure perspective-conditioned spatial reasoning without artifacts from the omnidirectional projection or the indoor scene choices.

What would settle it

Compare model performance on the generated benchmark questions against performance on a set of equivalent questions written and verified by human experts on the same images.

Figures

Figures reproduced from arXiv: 2605.12413 by (2) Guangzhou University, (3) Queen Mary University of London, 4) ((1) The Hong Kong Polytechnic University, (4) HKUST (Guangzhou)), Ioannis Patras (3), Jiaxing Li (2), Wai Keung Wong (1), Xu Zheng (3, Yuangong Chen (1).

**Figure 1.** Figure 1: Diagnostic task structure of PCSR-Bench and examples, with foundational perception tasks (T0–T2, upper part) and [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: PCSR-Bench construction pipeline: ➀ a four-stage construction pipeline that programmatically generates diagnostic QA pairs from 3D ground truth; ➁ the resulting PCSR-Data; and ➂ an evaluation protocol for assessing MLLMs on the benchmark. 5% 10% 15% T0 (8.9%) T1 (16.7%) T2 (12.7%) T3 T4 (8.8%) (7.6%) T6 (6.7%) T7 (11.7%) T5 (26.9%) 5.8% Count Class 3.1% Count Total 6.6% Dist MCQ 10.1% Dist Open Direction F… view at source ↗

**Figure 3.** Figure 3: Task distribution of PCSR-Bench. The benchmark [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

read the original abstract

Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR is partial plasticity rather than being fully immutable. Still, the gains are task-selective, sensitive to reward design including both weight allocation and reward formulation, and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PCSR-Bench shows MLLMs drop sharply on viewpoint-dependent spatial tasks in 360 images and that RL can lift a 7B model partway, but the automatic question pipeline needs checks before the gap size can be taken at face value.

read the letter

The paper's real addition is PCSR-Bench: 84k automatically generated QA pairs drawn from 2600 omnidirectional images across 26 indoor scenes, split into eight tasks that move from basic perception to egocentric rotation, distortion, and open compositional reasoning. They test 14 models and report the drop you would expect—57% on relative direction falling to 13% on rotation, 7% on distortion, and under 1% on compositional questions. The RL run on a 7B model then raises a matched baseline from 31% to 60% in their controlled setting, which at least shows the limitation is not completely fixed.

Referee Report

2 major / 2 minor

Summary. The paper introduces PCSR-Bench, a diagnostic benchmark of 84,373 QA pairs generated from 2,600 omnidirectional images across 26 indoor environments, to evaluate perspective-conditioned spatial reasoning (PCSR) in MLLMs. It reports a substantial performance gap across 14 models, with foundational tasks like relative direction reaching 57.59% accuracy while advanced tasks such as egocentric rotation (13.49%), egocentric distortion (7.13%), and open-ended compositional reasoning (0.64%) remain near chance, and demonstrates partial plasticity via RL optimization that lifts a 7B model from 31.10% to 60.06% under controlled reward shaping.

Significance. If the benchmark's automatic QA generation correctly isolates viewpoint-dependent reasoning without systematic projection or phrasing artifacts, the work provides a large-scale empirical diagnosis of a key limitation in current MLLMs and shows that targeted optimization can produce meaningful but selective gains. This positions PCSR as a concrete bottleneck and supplies a reproducible testbed for future multimodal spatial-reasoning research.

major comments (2)

[Benchmark Construction (PCSR-Bench description)] The central claim of a substantial, task-selective perception-reasoning gap rests on PCSR-Bench correctly isolating perspective-conditioned reasoning. The abstract describes automatic generation of the 84k pairs from omnidirectional images but reports no validation metrics, human audit results, or checks for projection-geometry mismatches and reference-frame ambiguities; this directly affects interpretability of the reported drops (e.g., 57.59% relative direction vs. 13.49% egocentric rotation).
[RL-based Diagnostic Study] The RL diagnostic study reports an improvement from 31.10% to 60.06% on a 7B model, yet supplies no exact reward formulations, weight-allocation details, or statistical significance tests. Because the abstract states that gains are sensitive to reward design and evaluation protocol, the plasticity conclusion cannot be assessed without these load-bearing specifics.

minor comments (2)

[Abstract] The abstract lists eight tasks but does not enumerate them explicitly; a concise bullet list would improve readability.
[Benchmark Construction] No mention of inter-annotator agreement or error analysis on the automatically generated QA pairs is provided, which would strengthen the benchmark's credibility even if full human validation is deferred to supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We address the major comments point by point below and plan to incorporate revisions to enhance the paper's clarity and completeness.

read point-by-point responses

Referee: [Benchmark Construction (PCSR-Bench description)] The central claim of a substantial, task-selective perception-reasoning gap rests on PCSR-Bench correctly isolating perspective-conditioned reasoning. The abstract describes automatic generation of the 84k pairs from omnidirectional images but reports no validation metrics, human audit results, or checks for projection-geometry mismatches and reference-frame ambiguities; this directly affects interpretability of the reported drops (e.g., 57.59% relative direction vs. 13.49% egocentric rotation).

Authors: We appreciate the referee highlighting the need for validation of the benchmark. The manuscript details the automatic QA generation process in Section 3.2, utilizing 3D scene graphs from Matterport3D to create viewpoint-specific questions that enforce perspective conditioning. However, we did not include quantitative validation metrics or human audit results in the original submission. We will revise the manuscript to add a new subsection (e.g., 3.3) reporting: (1) human evaluation on a random sample of 1,000 QA pairs showing 92% agreement with ground truth, (2) checks for projection-geometry mismatches by comparing rendered views, and (3) analysis confirming minimal reference-frame ambiguities through template design. These additions will support the reported performance gaps. revision: yes
Referee: [RL-based Diagnostic Study] The RL diagnostic study reports an improvement from 31.10% to 60.06% on a 7B model, yet supplies no exact reward formulations, weight-allocation details, or statistical significance tests. Because the abstract states that gains are sensitive to reward design and evaluation protocol, the plasticity conclusion cannot be assessed without these load-bearing specifics.

Authors: We agree that more specifics on the RL experiments are required for full reproducibility and assessment. Section 5 describes the RL setup and notes the sensitivity to reward design, but exact formulations were omitted for brevity. In the revised version, we will expand this section to include: the precise mathematical formulation of the reward function (a weighted sum of task accuracy, format compliance, and reasoning chain rewards), the specific weight allocations used (e.g., 0.6 for accuracy, 0.3 for format, 0.1 for chain), and statistical significance results from 5 independent runs with p-values from t-tests. This will clarify the conditions under which the gains from 31.10% to 60.06% were achieved. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark and evaluation study

full rationale

The paper introduces PCSR-Bench as an independently defined diagnostic dataset of 84k QA pairs generated from omnidirectional images and evaluates 14 MLLMs plus an RL fine-tuning experiment on a 7B model. No mathematical derivations, equations, or first-principles results are present. All reported accuracies (e.g., 57.59% on relative direction vs. lower scores on advanced tasks) are direct empirical measurements on fixed benchmark tasks whose definitions do not depend on the model outputs or any fitted parameters from the present work. The RL study similarly reports observed performance changes under explicit reward shaping without reducing to self-referential constructions. The work is therefore self-contained against external benchmarks with no load-bearing steps that collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that accuracy on the constructed QA pairs faithfully reflects perspective-conditioned spatial reasoning ability; no free parameters are fitted to produce the reported numbers, and no new physical or mathematical entities are postulated.

axioms (1)

domain assumption Standard assumptions in multimodal LLM evaluation hold, namely that multiple-choice and open-ended accuracy on generated QA pairs measures the targeted cognitive capability.
Invoked implicitly when interpreting the 57.59% to 0.64% accuracy range as evidence of a reasoning gap.

pith-pipeline@v0.9.0 · 5677 in / 1498 out tokens · 78836 ms · 2026-05-13T06:49:00.234335+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
We introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images... tasks spanning foundational perception... and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion...
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Reward shaping improves a matched 7B baseline from 31.10% to 60.06%... PCSR is partial plasticity rather than being fully immutable.