pith. machine review for the scientific record.

arxiv: 2511.20814 · v2 · submitted 2025-11-25 · 💻 cs.CV · cs.AI · cs.LG

Recognition: no theorem link

SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 04:15 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords synthetic benchmark · visual perception · visual reasoning · large vision-language models · procedural generation · reinforcement learning · multimodal reasoning · cognitive primitives

The pith

The Sphinx benchmark shows that state-of-the-art vision-language models reach only 51.1 percent accuracy on its core visual reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sphinx, a synthetic environment that procedurally generates visual puzzles from basic elements such as motifs, tiles, charts, icons, and geometric primitives. These puzzles form 25 task types covering symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction, each supplied with exact ground-truth answers. Tests on recent large vision-language models find even the strongest performer, GPT-5, reaches only 51.1 percent accuracy, well below human levels. The authors then apply reinforcement learning with verifiable rewards derived directly from the correct answers and report substantial accuracy gains both on Sphinx itself and on separate external visual reasoning benchmarks. This combination supplies a scalable route to measure gaps in multimodal reasoning and to train models using automatically generated, precisely scored data.
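
To make the mechanism concrete, here is a minimal sketch, in Python, of procedural generation with exact ground truth. The "Sequence Rotation" task name comes from the paper's Figure 4; everything else here, the parameterization, the prompt wording, and the function names, is a hypothetical illustration, not the paper's implementation.

    import random
    from dataclasses import dataclass

    @dataclass
    class Puzzle:
        prompt: str   # question shown to the model alongside the rendered image
        params: dict  # generator parameters, enough to re-render the scene
        answer: str   # exact, verifiable ground truth

    def make_sequence_rotation_puzzle(rng: random.Random) -> Puzzle:
        # A motif rotates by a constant step across four panels;
        # the model must predict the angle of the missing fifth panel.
        step = rng.choice([30, 45, 60, 90])      # degrees per panel
        start = rng.randrange(0, 360, 15)
        angles = [(start + i * step) % 360 for i in range(4)]
        missing = (start + 4 * step) % 360       # computed, never annotated
        return Puzzle(
            prompt=(f"Four panels show a motif at angles {angles}, rotated by a "
                    "constant step. What is the angle of the fifth panel, in degrees?"),
            params={"motif": rng.choice(["crescent", "glyph", "pinwheel"]),
                    "step": step, "start": start},
            answer=str(missing),
        )

    # Dataset construction scales to any size with no manual labeling.
    rng = random.Random(0)
    dataset = [make_sequence_rotation_puzzle(rng) for _ in range(10_000)]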

Core claim

Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, reinforcement learning with verifiable rewards substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks.

What carries the argument

Procedural generation of puzzles with verifiable ground-truth solutions that supports both direct evaluation of models and reinforcement learning with verifiable rewards.

If this is right

  • Reinforcement learning with verifiable rewards can substantially raise accuracy on the synthetic visual reasoning tasks (a minimal reward sketch follows this list).
  • Gains achieved through this training transfer to performance on external visual reasoning benchmarks.
  • The environment enables construction of large-scale training datasets with automatic, exact scoring.
  • Current large vision-language models have a clear performance gap relative to human levels on these tasks.
  • Precise evaluation of core visual perception primitives becomes possible at scale without manual annotation.
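
Because every puzzle carries its exact answer, the reward in RLVR reduces to a pure string check; no learned reward model is involved. A minimal sketch, assuming an "Answer: ..." output convention that is our illustration rather than the paper's actual format:

    import re

    def extract_answer(response: str) -> str | None:
        # Pull the final answer from a free-form model response.
        m = re.search(r"Answer:\s*(.+)", response)
        return m.group(1).strip() if m else None

    def verifiable_reward(response: str, ground_truth: str) -> float:
        # Binary, automatically computed reward: exact match against the
        # generator's stored solution.
        pred = extract_answer(response)
        return 1.0 if pred is not None and pred == ground_truth else 0.0

    assert verifiable_reward("The panels rotate by 45. Answer: 180", "180") == 1.0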

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same procedural approach could be extended to generate training curricula that progressively increase task difficulty for multimodal models.
  • If the observed improvements persist, training pipelines may shift emphasis toward reward signals derived from verifiable answers rather than purely supervised data.
  • The benchmark offers a controlled testbed for studying whether gains in synthetic settings reliably predict better performance in unstructured real-world visual scenes.

Load-bearing premise

The 25 procedurally generated task types using motifs, tiles, charts, icons, and geometric primitives accurately capture core cognitive primitives for visual perception and reasoning.

What would settle it

If models trained with reinforcement learning on Sphinx show no accuracy gains on established external visual reasoning benchmarks, this would indicate the synthetic tasks do not measure transferable skills.
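
That test can be stated as a one-line computation. In the sketch below, model_answer stands for any callable that queries a model; it and the benchmark format are hypothetical placeholders, not real APIs.

    def accuracy(model_answer, benchmark: list[tuple[str, str]]) -> float:
        # benchmark: (question, ground-truth answer) pairs with exact scoring
        return sum(model_answer(q) == a for q, a in benchmark) / len(benchmark)

    def transfer_gain(base_model, rl_model, external_benchmark) -> float:
        # A gain near zero (or negative) on external benchmarks would be the
        # falsifying outcome described above.
        return accuracy(rl_model, external_benchmark) - accuracy(base_model, external_benchmark)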

Figures

Figures reproduced from arXiv: 2511.20814 by Justin Yang Chae, Md Tanvirul Alam, Nidhi Rastogi, Saksham Aggarwal.

Figure 1: Radar plot shows accuracies (%) achieved by LVLMs
Figure 2: Example Motifs (from left): Crescent, Glyph, Pinwheel, Polygon, Polyomino and Icons
Figure 3: Example Tilings (from left): circles, square, triangular, hexagonal, rhombille.
Figure 4: SPHINX task illustrations: 16. Transform Pair Infer: Given two tiles, determine the transformation that maps the source to the target. 17. Transform Similarity Identify: Identify which option is similar or dissimilar to a base shape under Euclidean similarity transformations (uniform scaling, rotation, reflection). 18. Sequence Rotation: Predict the missing panel in a sequence of rotated motifs. 19. Sequen…
Figure 5: Familiarity vs Accuracy - Human Evaluators
Figure 7: Example Tiles Composition task. Correct answer: (b)
Figure 8: Per-task accuracy comparison between GPT-5 and
Figure 9: Per-task accuracy comparison between GPT-5 and
Figure 11: Average prediction token lengths between Base and
Figure 12: Randomly sampled example from each Motif family.
Figure 13: Examples of Geometric Reasoning and Chart tasks.
Figure 14: Examples of Counting tasks.
Figure 15: Examples of Symmetry tasks.
Figure 16: Examples of Transformation and Sequence tasks.
Figure 17: Examples of Topological and Tiling tasks.
Figure 18: Web application interface used for the human evaluation. Participants were shown a visual prompt (image and/or text) and
Figure 19: Human evaluation results. (a) Plot of participant perceived difficulty versus accuracy (b) Task-level time and accuracy.
Figure 20: Example responses from GPT-5 on Shape Counting Task, with incorrect reasoning highlighted in red.
Figure 21: Example responses from GPT-5 on Tiles Composition Task, with incorrect reasoning highlighted in red.
Figure 22: Example responses from GPT-5 on Tiles Line Length Task, with incorrect reasoning highlighted in red.
Figure 23: Example responses from GPT-5 on Tiles Recoloring Task, with incorrect reasoning highlighted in red.
Figure 24: Example responses from GPT-5 on Frieze Group Task, with incorrect reasoning highlighted in red.
Figure 25: Example responses from GPT-5 on Wallpaper Group Task, with incorrect reasoning highlighted in red.
Figure 26: GPT-5 vs. GPT-5 Mini response on Stack Count task, with incorrect reasoning highlighted in red.
Figure 27: GPT-5 vs. GPT-5 Mini response on Transform Result Identify task, with incorrect reasoning highlighted in red.
Figure 28: GPT-5 vs. GPT-5 Mini response on Transform Similarity Identify task, with incorrect reasoning highlighted in red.
Figure 29: GPT-5 vs. GPT-5 Mini response on Sequence Arithmetic task, with incorrect reasoning highlighted in red.
Figure 30: GPT-5 vs. GPT-5 Mini response on Sequence Multi-Column Arithmetic task, with incorrect reasoning highlighted in red.
Figure 31: Training reward curves for the four models during
Figure 33: Average response length and accuracy for each task
Figure 34: Qwen3-VL-4B vs. Qwen3-VL-4B-RL response on
Figure 35: Qwen3-VL-8B vs. Qwen3-VL-8B-RL response on
Figure 36: Qwen2.5-VL-3B vs. Qwen2.5-VL-3B-RL response on
Figure 37: Qwen2.5-VL-7B vs. Qwen2.5-VL-7B-RL response on
read the original abstract

We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces SPHINX, a synthetic environment for visual perception and reasoning. SPHINX procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives across 25 task types with verifiable ground-truth solutions. Evaluations show that state-of-the-art LVLMs like GPT-5 achieve only 51.1% accuracy, below human performance. The authors demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves accuracy on these tasks and yields gains on external visual reasoning benchmarks.

Significance. This work offers a scalable, controllable benchmark for visual reasoning that addresses limitations in existing datasets by providing procedural generation and exact ground truth. The finding of a significant performance gap in current models and the positive results from RLVR training highlight potential directions for improving multimodal models. The transfer to external benchmarks suggests broader applicability.

minor comments (3)
  1. The paper would benefit from including a table that summarizes the 25 task types, their descriptions, and example inputs/outputs for clarity.
  2. Details on the human baseline evaluation, including number of participants and task presentation method, should be expanded in the main text or appendix.
  3. The RLVR implementation details, such as the specific reward formulation and training hyperparameters, could be stated more explicitly to aid reproducibility (a generic sketch of the missing detail follows).
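
For concreteness, this is the kind of detail the report is asking for: a group-relative advantage computation in the GRPO family, a common choice for RLVR. Whether SPHINX's training uses this exact estimator is an assumption; the sketch illustrates the requested level of specificity, not the paper's recipe.

    from statistics import mean, pstdev

    def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
        # Normalize verifiable rewards within a group of rollouts sampled
        # for the same puzzle: advantage_i = (r_i - mean) / (std + eps).
        mu, sigma = mean(rewards), pstdev(rewards)
        return [(r - mu) / (sigma + eps) for r in rewards]

    # e.g. eight rollouts for one puzzle, scored by a binary verifiable reward
    print(group_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]))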

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation of minor revision. We appreciate the recognition of SPHINX as a scalable benchmark with procedural generation and exact ground truth, as well as the value of the RLVR results and their transfer to external tasks.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces a procedurally generated synthetic benchmark with 25 explicitly enumerated task families using motifs, tiles, charts, icons, and geometric primitives, each with verifiable ground-truth solutions. Central claims rest on direct empirical evaluation (e.g., GPT-5 at 51.1% accuracy) and RLVR training results that transfer to external benchmarks. No equations, parameter fits, or derivations are presented as predictions; task validity is framed as an assumption open to external testing rather than self-referential. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is self-contained and open to reproduction and falsification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central contribution is the new benchmark environment itself; the main unverified premises are that the generated tasks measure core cognitive primitives and that RLVR produces generalizable visual reasoning gains.

axioms (1)
  • domain assumption Procedurally generated visual puzzles using motifs, tiles, charts, icons, and geometric primitives can target core cognitive primitives
    Invoked in the description of the benchmark construction and task coverage.
invented entities (1)
  • SPHINX synthetic environment (no independent evidence)
    purpose: To enable precise evaluation and large-scale dataset construction for visual perception and reasoning
    Newly defined benchmark introduced in the paper.

pith-pipeline@v0.9.0 · 5437 in / 1328 out tokens · 59382 ms · 2026-05-17T04:15:47.207494+00:00 · methodology

discussion (0)

