arxiv: 2605.09883 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

Brian Potetz, Chun-Ta Lu, Howard Zhou, Leonidas Guibas, Xia Hu, Zhenrui Yue, Zhicheng Wang

Pith reviewed 2026-05-12 04:27 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords visual reasoningmultimodal large language modelsCartesian shortcutpolar coordinatesbenchmarkstopology-invariant reasoningMLLMs

0 comments

The pith

Current multimodal models achieve high visual reasoning scores by exploiting grid-based coordinates rather than understanding spatial relationships directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that many visual reasoning benchmarks allow models to bypass genuine visual processing by converting scenes into explicit textual coordinates on orthogonal grids. To test this, the authors create paired versions of 53 tasks where the same logic and semantics are expressed in polar coordinates instead. Frontier models that score 70 to 83 percent on the original Cartesian versions drop to 31 to 39 percent on the polar versions, and previously observed reasoning improvements largely disappear. This matters because it shows that strong benchmark results can reflect exploitation of a coordinate shortcut rather than robust, geometry-invariant visual understanding.

Core claim

Multimodal large language models systematically exploit the Cartesian Shortcut in existing visual reasoning benchmarks, where orthogonal grid layouts can be discretized into explicit textual coordinates, allowing heavy reliance on text-based deduction. When the same 53 tasks are reformulated in polar coordinate space while keeping logical constraints and task semantics identical, performance collapses even for the strongest models, revealing that current systems lack topology-invariant visual reasoning.

What carries the argument

Polaris-Bench, which reformulates 53 visual reasoning tasks in polar coordinate space with paired Cartesian versions to remove the orthogonal grid prior that models exploit.

If this is right

Reasoning gains observed on Cartesian layouts largely vanish on polar equivalents.
Existing benchmarks systematically overestimate genuine visual understanding in multimodal models.
High performance on current tests does not guarantee robustness when spatial representations change.
Models depend more on text-based deductive shortcuts than on direct spatial topology processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-world scenes rarely align with perfect Cartesian grids, so the shortcut may limit reliability in practical applications.
Training regimes could incorporate polar or other non-grid representations to reduce coordinate exploitation.
Benchmark design should routinely include coordinate-system variants to test for invariant reasoning.

Load-bearing premise

Re-formulating the tasks in polar coordinates keeps the exact same logical constraints, task semantics, and difficulty without adding new visual or reasoning problems.

What would settle it

Demonstrating that state-of-the-art models achieve comparable accuracy on the polar and Cartesian versions of the tasks when logical equivalence is strictly maintained would falsify the shortcut claim.

read the original abstract

As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the \textbf{Cartesian Shortcut}: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce \textbf{Polaris-Bench}, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics -- thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across $14$ state-of-the-art MLLMs reveals that frontier models achieving $70$--$83\%$ on Cartesian layouts collapse to $31$--$39\%$ on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Models drop sharply on polar reformulations of 53 tasks, but the gap may partly reflect new visual distortions rather than pure removal of a Cartesian shortcut.

read the letter

The paper's main point is that MLLMs exploit orthogonal grid layouts in visual reasoning benchmarks by turning them into text coordinates, and switching to polar space exposes this. They created paired versions of 53 tasks in Polaris-Bench and saw frontier models fall from 70-83% on Cartesian to 31-39% on polar, with reasoning improvements also disappearing. The scale of the evaluation across 14 models and the consistent pattern are the strongest parts of the work. It gives a concrete empirical probe that wasn't in the prior literature they cite, and it flags a real issue with how current benchmarks might overstate visual understanding. The construction of the paired set is a useful addition for testing topology invariance claims. The soft spot is the equivalence assumption. Polar conversion curves straight lines, changes local densities, and introduces rasterization effects like aliasing or uneven sampling that could add unrelated perceptual load. The abstract states that logical constraints and semantics stay the same, but without reported human parity checks, formal isomorphism tests, or ablations on rendering parameters, it's difficult to isolate the shortcut effect from these incidental changes. The stress-test concern holds until the methods section shows otherwise. This is for people working on multimodal benchmarks and robustness who want to move past saturated Cartesian setups. It has enough substance and a clear empirical hook to merit peer review, though referees will probably press on the controls for task difficulty.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that multimodal large language models (MLLMs) exploit a pervasive 'Cartesian Shortcut' in visual reasoning benchmarks, where orthogonal grid layouts allow models to discretize scenes into explicit textual coordinates and rely on text-based deduction rather than robust visual understanding. To dismantle this, the authors introduce Polaris-Bench, which reformulates 53 visual reasoning tasks into polar coordinate space with paired Cartesian counterparts while preserving logical constraints and task semantics. Comprehensive evaluations on 14 state-of-the-art MLLMs show frontier models dropping from 70--83% accuracy on Cartesian versions to 31--39% on polar equivalents, with diminished reasoning gains, exposing a lack of topology-invariant visual reasoning.

Significance. If the polar reformulations isolate coordinate representation without confounding factors, the result would be significant for the field: it would demonstrate that high benchmark scores often reflect exploitation of grid-based shortcuts rather than genuine visual reasoning, motivating new evaluation paradigms and model architectures focused on topology invariance. The paired direct-comparison design avoids circularity and provides falsifiable evidence against current MLLM capabilities.

major comments (3)

[Polaris-Bench description and task reformulation] Polaris-Bench construction: The central claim that polar reformulations preserve identical logical constraints, task semantics, and perceptual difficulty (without introducing unrelated visual challenges) is load-bearing but unsupported by quantitative checks such as human parity tests, formal isomorphism verification, or ablation on rendering parameters. Polar conversion warps lines to curves and alters pixel densities, which could drive part of the 70--83% to 31--39% drop independently of the Cartesian shortcut.
[Experimental setup and results] Evaluation protocol: The abstract reports consistent large drops across 14 models, yet the manuscript supplies no details on the exact reformulation methods for the 53 tasks, controls for visual complexity (e.g., aliasing, sampling uniformity), or statistical tests. This leaves open whether the degradation is attributable solely to removal of the orthogonal prior.
[Analysis of reasoning behavior] Reasoning gains analysis: The assertion that reasoning gains observed on Cartesian layouts are severely diminished on polar equivalents lacks specific supporting tables or breakdowns; without them, it is unclear how much of the performance collapse stems from reduced reasoning versus incidental visual artifacts.

minor comments (2)

The abstract would be clearer if it briefly enumerated the categories of the 53 tasks (e.g., counting, spatial relations) to contextualize the scope.
Ensure all example figures include side-by-side Cartesian/polar pairs with explicit annotations for any rendering differences.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for strengthening the presentation of Polaris-Bench and the supporting analyses. We address each major comment below and have revised the manuscript accordingly to provide additional empirical validation, methodological details, and breakdowns.

read point-by-point responses

Referee: Polaris-Bench construction: The central claim that polar reformulations preserve identical logical constraints, task semantics, and perceptual difficulty (without introducing unrelated visual challenges) is load-bearing but unsupported by quantitative checks such as human parity tests, formal isomorphism verification, or ablation on rendering parameters. Polar conversion warps lines to curves and alters pixel densities, which could drive part of the 70--83% to 31--39% drop independently of the Cartesian shortcut.

Authors: We agree that explicit quantitative validation strengthens the core claim. In the revised manuscript we have added: (1) human parity results on a stratified 20% subset of tasks (n=10 tasks, 50 participants) showing human accuracy differs by at most 4.2% between paired Cartesian and polar versions; (2) a formal isomorphism argument in Section 3.1 that maps each logical constraint (e.g., relative ordering, containment) to an equivalent polar predicate while preserving satisfiability; and (3) ablation tables varying sampling density (0.5x–2x) and anti-aliasing levels, where the model performance gap remains statistically stable (average drop 42–48%). These additions indicate the observed degradation is not explained by rendering artifacts alone. revision: yes
Referee: Evaluation protocol: The abstract reports consistent large drops across 14 models, yet the manuscript supplies no details on the exact reformulation methods for the 53 tasks, controls for visual complexity (e.g., aliasing, sampling uniformity), or statistical tests. This leaves open whether the degradation is attributable solely to removal of the orthogonal prior.

Authors: We have substantially expanded the Methods section (now 3.2–3.4) with: a complete task-by-task reformulation protocol including pseudocode for polar projection and semantic-preserving adaptations; explicit controls that match edge density, element count, and use uniform angular sampling plus anti-aliased rasterization; and paired t-test results (all p < 0.001) together with 95% confidence intervals on the accuracy differences. These controls and tests are now reported for every model and task category. revision: yes
Referee: Reasoning gains analysis: The assertion that reasoning gains observed on Cartesian layouts are severely diminished on polar equivalents lacks specific supporting tables or breakdowns; without them, it is unclear how much of the performance collapse stems from reduced reasoning versus incidental visual artifacts.

Authors: We have added Supplementary Table S3 that reports per-task accuracy with and without chain-of-thought prompting. On Cartesian versions the average reasoning gain is +11.8%; on polar versions the same prompting yields only +2.9%. The table also includes error-type breakdowns (coordinate mis-mapping vs. logical inference failures), showing that the majority of additional errors on polar inputs are logical rather than low-level visual. This evidence supports that the collapse is driven primarily by loss of the Cartesian shortcut rather than rendering artifacts. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark comparison contains no circular derivation

full rationale

The paper's central result is an empirical measurement: frontier MLLMs score 70-83% on Cartesian task versions and 31-39% on their Polar reformulations. This difference is obtained by direct evaluation on explicitly constructed paired datasets rather than any equation, fitted parameter, or self-referential derivation. The statement that logical constraints and semantics are preserved is an explicit modeling assumption, not a step that reduces the reported performance gap to a tautology. No self-citation chain, uniqueness theorem, or ansatz is invoked to force the outcome. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that polar reformulations are semantically and logically equivalent to their Cartesian counterparts.

axioms (1)

domain assumption Polar reformulations of the 53 tasks preserve consistent logical constraints and task semantics.
Explicitly stated in the abstract as the basis for fair comparison.

pith-pipeline@v0.9.0 · 5513 in / 1099 out tokens · 85487 ms · 2026-05-12T04:27:29.877728+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
Polaris-Bench... re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts... preserving consistent logical constraints and task semantics
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
frontier models achieving 70--83% on Cartesian layouts collapse to 31--39% on Polar equivalents

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 5 internal anchors

[1]

M. AI. Mistral-small-3.1-24b-instruct-2503, March 2025. URLhttps://mistral.ai/news/ mistral-small-3-1

work page 2025
[2]

Anthropic

A. Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

work page 2024
[3]

G. B. Arfken, H. J. Weber, and F. E. Harris.Mathematical methods for physicists: a comprehensive guide. Academic press, 2011

work page 2011
[4]

Asadi, J

M. Asadi, J. W. O’Sullivan, F. Cao, T. Nedaee, K. Rajabalifardi, F.-F. Li, E. Adeli, and E. Ashley. Mirage: The illusion of visual understanding, 2026. URLhttps://arxiv.org/abs/2603. 21687. 11 The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

work page 2026
[5]

M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges.arXiv preprint arXiv:2104.13478, 2021

work page internal anchor Pith review arXiv 2021
[6]

J. Chen, T. Liang, S. Siu, Z. Wang, K. Wang, Y. Wang, Y. Ni, W. Zhu, Z. Jiang, B. Lyu, D. Jiang, X. He, Y. Liu, H. Hu, X. Yue, and W. Chen. Mega-bench: Scaling multimodal evaluation to over 500 real-world tasks. 2025

work page 2025
[7]

L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, J. Wang, Y. Qiao, D. Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

work page 2024
[8]

L. Chen, W. Xie, Y. Liang, H. He, H. Zhao, Z. Yang, Z. Huang, H. Wu, H. Lu, Y. charles, Y. Bao, Y. Fan, G. Li, H. Shen, X. Chen, W. Xu, S. Si, Z. Cai, W. Chai, Z. Huang, F. Liu, T. Liu, B. Chang, X. Hu, K. Chen, Y. Ren, Y. Liu, Y. Gong, and K. Li. Babyvision: Visual reasoning beyond language,

work page
[9]

URLhttps://arxiv.org/abs/2601.06521

work page arXiv
[10]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

G.Comanici, E.Bieber, M.Schaekermann, I.Pasupat, N.Sachdeva, I.Dhillon, M.Blistein, O.Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal- ity, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

DeepMind

G. DeepMind. Gemma 4: Our most intelligent open models to date. https://blog. google/innovation-and-ai/technology/developers-tools/gemma-4/, 2026. Ac- cessed: 2026-05-01

work page 2026
[12]

Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, T. Liu, B. Chang, X. Sun, L. Li, and Z. Sui. A survey on in-context learning, 2024. URLhttps://arxiv.org/abs/2301. 00234

work page 2024
[13]

Blink: Multimodal large language models can see but not perceive

X.Fu, Y.Hu, B.Li, Y.Feng, H.Wang, X.Lin, D.Roth, N.A.Smith, W.-C.Ma, andR.Krishna. Blink: Multimodal large language models can see but not perceive.arXiv preprint arXiv:2404.12390, 2024

work page arXiv 2024
[14]

Geirhos, J.-H

R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

work page 2020
[15]

Gemini 3 flash: Frontier intelligence built for speed

Gemini Team and Google DeepMind. Gemini 3 flash: Frontier intelligence built for speed. Technical blog / model release, Google, 12 2025. URL https://blog.google/ products-and-platforms/products/gemini/gemini-3-flash/

work page 2025
[16]

Gemini 3.1 flash-lite: Built for intelli- gence at scale

Gemini Team and Google DeepMind. Gemini 3.1 flash-lite: Built for intelli- gence at scale. Technical blog / model release, Google, 3 2026. URL https: //blog.google/innovation-and-ai/models-and-research/gemini-models/ gemini-3-1-flash-lite/

work page 2026
[17]

Gemini 3 pro technical report

Gemini Team, Google. Gemini 3 pro technical report. 2025. URL https://deepmind. google/models/gemini/pro/

work page 2025
[18]

J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, et al. A survey on llm-as-a-judge.The Innovation, 2024

work page 2024
[19]

Y. Hao, J. Gu, H. W. Wang, L. Li, Z. Yang, L. Wang, and Y. Cheng. Can MLLMs reason in multimodality? EMMA: An enhanced multimodal reasoning benchmark. InForty-second In- ternational Conference on Machine Learning, 2025. URLhttps://openreview.net/forum? id=v26vwjxOEz. 12 The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

work page 2025
[20]

M. Jia, Z. Qi, S. Zhang, W. Zhang, X. Yu, J. He, H. Wang, and L. Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/ forum?id=6nZKT2rL0H

work page 2026
[21]

A. S. Kanade and T. Ganu. Do you see me : A multidimensional benchmark for evaluating visual perception in multimodal LLMs. In V. Demberg, K. Inui, and L. Marquez, editors,Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7285–7326, Rabat, Morocco, Mar. 2026. Associat...

work page doi:10.18653/v1/2026.eacl-long.343 2026
[22]

P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[23]

F. Meng, J. Wang, C. Li, Q. Lu, H. Tian, J. Liao, X. Zhu, J. Dai, Y. Qiao, P. Luo, K. Zhang, and W. Shao. Mmiu: Multimodal multi-image understanding for evaluating large vision-language models, 2024. URLhttps://arxiv.org/abs/2408.02718

work page arXiv 2024
[24]

S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 conference on empirical methods in natural language processing, pages 11048–11064, 2022

work page 2022
[25]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

work page 2026
[26]

Y. Ren, K. Tertikas, S. Maiti, J. Han, T. Zhang, S. Süsstrunk, and F. Kokkinos. Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models.arXiv preprint arXiv:2503.23064, 2025

work page arXiv 2025
[27]

Saxena, A

R. Saxena, A. P. Gema, and P. Minervini. Lost in time: Clock and calendar understanding challenges in multimodal llms, 2025. URLhttps://arxiv.org/abs/2502.05092

work page arXiv 2025
[28]

OpenAI GPT-5 System Card

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Tang and M

Z. Tang and M. Kejriwal. Grasp: A grid-based benchmark for evaluating commonsense spatial reasoning, 2025. URLhttps://arxiv.org/abs/2407.01892

work page arXiv 2025
[30]

K. Team, T. Bai, Y. Bai, Y. Bao, S. Cai, Y. Cao, Y. Charles, H. Che, C. Chen, G. Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

S. Tong, E. L. B. II, P. Wu, S. Woo, A. J. IYER, S. C. Akula, S. Yang, J. Yang, M. Middepogu, Z. Wang, X. Pan, R. Fergus, Y. LeCun, and S. Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=Vi8AepAXGy

work page 2024
[32]

S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9568–9578, 2024

work page 2024
[33]

F. Wang, X. Fu, J. Y. Huang, Z. Li, Q. Liu, X. Liu, M. D. Ma, N. Xu, W. Zhou, K. Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding.arXiv preprint arXiv:2406.09411, 2024. 13 The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

work page arXiv 2024
[34]

Z. Wang, M. Xia, L. He, H. Chen, Y. Liu, R. Zhu, K. Liang, X. Wu, H. Liu, S. Malladi, A. Chevalier, S. Arora, and D. Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.arXiv preprint arXiv:2406.18521, 2024

work page arXiv 2024
[35]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[36]

Grok-4 model card

xAI. Grok-4 model card. 2025. URLhttps://x.ai/news/grok-4

work page 2025
[37]

J. Xia, Y. Zang, P. Gao, S. Li, and K. Zhou. Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2505.14677

work page arXiv 2025
[38]

W. Xu, J. Wang, W. Wang, Z. Chen, W. Zhou, A. Yang, L. Lu, H. Li, X. Wang, X. Zhu, W. Wang, J. Dai, and J. Zhu. Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models.arXiv preprint arXiv:2504.15279, 2025. URLhttps://arxiv.org/abs/ 2504.15279

work page arXiv 2025
[39]

Z. Xu, C. Liu, Q. Wei, J. Wu, J. Zou, X. E. Wang, Y. Zhou, and S. Liu. More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[40]

J. Ye, D. Jiang, J. He, B. Zhou, Z. Huang, Z. Yan, H. Li, C. He, and W. Li. BLINK-twice: You see, but do you observe? a reasoning benchmark on visual perception. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URL https://openreview.net/forum?id=g0AMmWiHCq

work page 2026
[41]

X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

work page 2024
[42]

X. Yue, T. Zheng, Y. Ni, Y. Wang, K. Zhang, S. Tong, Y. Sun, B. Yu, G. Zhang, H. Sun, et al. Mmmu- pro: A more robust multi-discipline multimodal understanding benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025

work page 2025
[43]

Zetzsche, G

C. Zetzsche, G. Krieger, and B. Wegmann. The atoms of vision: Cartesian or polar?Journal of the Optical Society of America A, 16(7):1554–1565, 1999

work page 1999
[44]

Zheng, W.-L

L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

work page 2023
[45]

K. Zou, Z. Huang, Y. Dong, S. Tian, D. Zheng, H. Liu, J. He, B. Liu, Y. Qiao, and Z. Liu. Uni-mmmu: A massive multi-discipline multimodal unified benchmark, 2026. URLhttps: //arxiv.org/abs/2510.13759. 14 The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space A. Benchmark Details A.1. Task Taxonomy The Polaris-Bench comprises 53 pro...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[46]

Visual clarity:Is the rendered image visually clear and legible? Can all relevant elements (labels, shapes, paths, grid lines) be unambiguously perceived?

work page
[47]

Logical correctness:Is the task logic (rules, constraints, problem formulation) consistent and well-defined?

work page
[48]

row 2”, “column 3

Answer correctness:Is the provided ground-truth answer correct? (Options:Correct,Incorrect,I don’t know.) C. Evaluation Setup Details C.1. Model Query Details All models are evaluated via their publicly available APIs using default generation configurations to ensure optimal performance. We set the maximum output token limit to each model’s maximum suppor...

work page
[49]

Use above examples as reference

Think step-by-step about the spatial layout and rules. Use above examples as reference

work page
[50]

Output your final answer clearly on the very last line. Ensure your response strictly follows this structure: <Analysis> (Your step-by-step reasoning here) </Analysis> (Your final exact answer here, on the absolute last line) [Question] {Question} <img> Figure 11|Prompt template used in the few-shot in context learning. Random Baseline.We additionally rep...

work page