arxiv: 2606.00148 · v1 · pith:Q3RSAZ6Knew · submitted 2026-05-29 · 💻 cs.CV · cs.AI

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

Pith reviewed 2026-06-28 23:14 UTC · model grok-4.3

keywords abstract visual reasoning · multimodal large language models · diagnostic benchmark · rule-to-instance mapping · Sternberg reasoning stages · perception rule full tasks · binding gap

The pith

MLLMs that correctly perceive an abstract visual puzzle and name its rule still pick the wrong answer more than half the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates StemBind, a benchmark that asks three separate questions about the same visual stem: what is visible, what rule governs the pattern, and which option completes it. This separation shows that models often describe the image and state the rule correctly but then fail to apply that rule when choosing among candidates. The dominant error occurs during the step that links the abstract rule back to the concrete visual elements rather than during perception or rule discovery. Existing AVR benchmarks hide this breakdown because they score only the final choice. Results across 24 model configurations indicate that neither larger models nor explicit reasoning steps reliably close the gap.

Core claim

On abstract visual reasoning tasks, multimodal large language models can correctly perceive the image and state the governing rule yet still select the wrong completion because they fail to map the rule onto the specific instances present. StemBind isolates this by running aligned Perception, Rule, and Full questions on identical stems and annotating errors with Sternberg's four stages; it finds that rule accuracy exceeds full accuracy on 22 of 24 models and that even correct perception plus correct rule still yields an incorrect full answer 51.2 percent of the time, with process diagnostics localizing the main failure to stage S3 rule-to-instance mapping.

What carries the argument

StemBind, a shared-stem diagnostic benchmark that asks three aligned questions (Perception, Rule, Full) about each visual stem and annotates each Full item with Sternberg's four reasoning stages, so that a final-answer error can be attributed to one sub-step.

If this is right

  • Rule accuracy exceeds full-item accuracy on 22 of 24 models, so most failures occur after the rule has been identified.
  • Even when perception and rule answers are both correct on the same stem, the full answer is still wrong 51.2 percent of the time.
  • Stage-wise diagnostics and stimulus augmentation both point to S3 rule-to-instance mapping as the dominant bottleneck.
  • Neither larger model size nor explicit thinking mode reliably narrows the binding gap, and thinking can lower both rule and full-item accuracy.
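The two headline quantities in the list above, the R–F gap and the conditional binding gap, can be sketched from per-stem outcomes. This is a minimal illustration, not the paper's evaluation code: the `StemResult` record and its field names are hypothetical, and the toy data are invented for the example. The conditioning rule (every P probe and the R probe correct on the same stem) follows the description in the figure captions.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StemResult:
    """Per-stem outcomes for one model (hypothetical field names)."""
    p_correct: List[bool]  # one flag per Perception probe on this stem
    r_correct: bool        # Rule probe outcome
    f_correct: bool        # Full-item outcome


def accuracy(flags: Iterable[bool]) -> float:
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def rf_gap(stems: List[StemResult]) -> float:
    """Rule accuracy minus full-item accuracy over the same stems."""
    return accuracy(s.r_correct for s in stems) - accuracy(s.f_correct for s in stems)


def strict_binding_gap(stems: List[StemResult]) -> float:
    """Share of wrong F answers among stems where all P probes and R are correct."""
    eligible = [s for s in stems if all(s.p_correct) and s.r_correct]
    if not eligible:
        return float("nan")
    return sum(not s.f_correct for s in eligible) / len(eligible)


# Toy data for illustration only; these are not results from the paper.
toy = [
    StemResult([True, True], True, True),
    StemResult([True, True], True, False),
    StemResult([True, False], True, False),
    StemResult([True, True], False, False),
]
print("R-F gap:", rf_gap(toy))
print("strict binding gap:", strict_binding_gap(toy))
```

Conditioning on the same stem is what separates the binding gap from a plain difference of aggregate accuracies: only stems that pass both earlier probes enter the denominator.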

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes that explicitly supervise the mapping between extracted rules and visual instances could be tested on the same stems to measure gap reduction.
  • The same binding failure may limit performance on other tasks that require applying an abstract relation to a new visual scene.
  • Extending the benchmark to include explicit mapping supervision examples would allow direct measurement of whether the S3 step is trainable.

Load-bearing premise

The 2,298 stems are knowledge-light and the three aligned questions allow unambiguous attribution of final-answer errors to a single sub-step on identical visual evidence.
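The attribution this premise relies on can be sketched as a first-failure rule over the three probes. The labels below mirror the can't-see, can't-reason, and can't-bind categories named in the figure captions, but the precedence order (perception first, then rule, then binding) and the function names are assumptions made for this illustration, not the paper's documented procedure.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def first_failure(p_all_correct: bool, r_correct: bool, f_correct: bool) -> str:
    """Assign a wrong Full answer to the earliest failed probe (assumed precedence)."""
    if f_correct:
        return "solved"
    if not p_all_correct:
        return "cant_see"      # at least one Perception probe was wrong
    if not r_correct:
        return "cant_reason"   # perception fine, Rule probe wrong
    return "cant_bind"         # P and R both correct, Full still wrong


def decompose(records: Iterable[Tuple[bool, bool, bool]]) -> Dict[str, float]:
    """Return each failure label's share of all wrong Full answers."""
    counts = Counter(first_failure(*rec) for rec in records)
    failures = sum(n for label, n in counts.items() if label != "solved")
    if failures == 0:
        return {}
    return {label: n / failures for label, n in counts.items() if label != "solved"}


# Toy records (p_all_correct, r_correct, f_correct); illustration only.
toy_records = [
    (True, True, True),
    (False, True, False),
    (True, False, False),
    (True, True, False),
    (True, True, False),
]
print(decompose(toy_records))
```

The scheme is only as unambiguous as the premise: if a model's rule under the Full prompt differs from the one it gave under the Rule prompt, a stem labelled `cant_bind` here could be a rule-induction failure in disguise.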

What would settle it

A controlled test in which a model explicitly trained on rule-to-instance mapping, or a prompt that explicitly targets it, reduces the 51.2 percent conditional full-answer error rate on the same StemBind stems while perception and rule accuracies remain high.

Figures

Figures reproduced from arXiv: 2606.00148 by Ao Cheng, Baiqi Wu, Qingyong Hu, Qiyao Sun, Xingming Li, Xixiang He, Xuanyu Ji.

Figure 1
STEMBIND overview: 9 RI/VP operations, shared-stem P/R/F probes, and S1–S4 process stages. P, R, and F probes share the same visual stem; S1–S4 annotates the F item’s solution path.
… a corner case but a dominant mode of how today’s multimodal large language models (MLLMs) fail at abstract visual reasoning, and no existing AVR benchmark can isolate it on the same visual stem. Abstract visual reasoning (AVR),…
Figure 2
STEMBIND construction pipeline. Source pools are standardized and deduplicated into 2,298 stems, expanded into 19,533 shared-stem P/R/F tasks through machine drafting and human adjudication, and released after quality, leakage, and split checks.
9 knowledge-light operations
Figure 3
Aggregate P/R/F performance. Many models preserve stronger P or R accuracy while dropping on F.
3.3 Evaluation protocol. Evaluation levels and metrics. STEMBIND reports L1 exact-match ACC, L2 StepAcc from a free-trace judge, and L3 AttrTag from first-failure stage crossed with perception-load tag. The main analyses use P/R/F accuracy, R–F gaps, stem-level failure decomposition, strict Binding Gap, L2 StepAc…
Figure 4
Binding evidence. Top: R–F chasm. Middle: failure decomposition. Bottom: strict conditional Binding Gap.
… Claude-Opus-4.7 36.7% can’t-bind). Gemini-3.1-Pro is the non-MoE outlier (34.5% can’t-reason, 20.4% can’t-see, only 9.2% can’t-bind), corroborating its rule-collapse profile above. Finding 1b. On stronger models, F errors are dominated by stems where perception and rule are both correct; binding is the …
Figure 5
Qwen3.5 S1–S4/SSA localization: S3 is weakest and gains concentrate at H2/H3.
Boundary. This localization is behavioral, not mechanistic [38, 41]. The full L2 and SSA evidence covers Qwen3.5 with a deterministic GPT-4o trace judge as the primary scorer (n_repeat=1). A 180-item agreement check between the trace judge and human annotators clears κ=0.70 on all four stages, with S3 lowest (κ=0.71; Appendix C.3)…
Figure 6
Family scaling: Qwen3.5 peaks pre-MoE; Gemma 4 improves on F; R–F gaps remain.
Finding 4. Explicit thinking does not repair the F-side binding gap. Across paired direct/thinking rows, P rises on nine of ten rows but R and F fall on every row. Paired direct-vs-thinking deltas. We report THINKGAIN@X = ACC_{think,X} − ACC_{non-think,X} for matched direct and thinking rows. The signed pattern is uniform: P rises on …
Figure 7
Paired direct versus thinking modes. Thinking lifts P on most rows but lowers R and F on all rows.
5 Conclusion and Limitations. Takeaway. STEMBIND evaluates MLLMs with shared-stem perception, rule, and full-item probes, four-stage process annotations, and paired thinking controls. Rule accuracy exceeds full-item accuracy on 22 of 24 direct-mode rows, and a strict conditional Binding Gap of 0.51 remains whe…
Figure 8
Shared-stem P/R/F diagnostic card. P probes, R probes, and F tasks are derived from the same visual stem, so they share visual evidence while requiring different outputs: local visual attributes, rule selection, and the final answer option. This shared evidence makes the within-stem Binding Gap well-defined.
A.3 S1–S4 Annotation Schema and L3 Note. The S1–S4 schema defines the expected solution path for an …
Figure 9
Operation distribution across the 2,298 result-bearing stems. The inner ring aggregates RI/VP families; the outer ring shows the nine operation labels used throughout STEMBIND.
[Plot residue from an adjacent chart: content family (V1–V5) against stems, extracted values 778, 578, 105, 131, 706; perception-load tag (P-heavy, R-heavy, Mixed), extracted values 918, 1,087, 293.]
Figure 10
Visual-content and perception-load distributions. Probe level (P/R/F) is distinct from the item-level perception-load tag used for L3 metadata.
Figure 11
Answer-option balance and released P/R/F task counts. This audit checks that answer letters are approximately balanced and that the public release exposes 14,937 P probes, 2,298 R probes, and 2,298 F items.
[Plot residue from an adjacent chart: task instances in the released full split (P 14,937; 2,298 stems; 19,533 tasks; P/stem 6.50); Easy/Medium/Hard share (%), extracted values 91 (4.0%), 1,770 (77.0%), 437 (19.0%). F di…]
Figure 12
Released full-split statistics and easy/medium/hard distributions for F, R, and P across the 2,298 stems and 19,533 P/R/F tasks.
C Evaluation Protocol and Judge Calibration. C.1 Direct and Thinking Prompt Library. All benchmark rows use English stems, full-image input, temperature 0, and fixed max-token budgets. Direct prompts ask the model to solve the item and place the final choice in <ANSWER>X</ANSWER>.…
Figure 13
GPT-4o L2 judge vs. human Cohen’s κ across S1–S4 on the 180-item calibration set. All four stages clear the κ=0.70 reliability threshold; S3 Map is the lowest, consistent with the boundary cases noted in Sec. A.3.
Figure 14
Operation-family R–F chasm. Bars summarize macro R–F gaps separately over RI and VP operations for representative rows, showing that the chasm is not driven by a single operation family.
D.2 Strict Binding Gap Denominators and Confidence Intervals. The strict Binding Gap conditions on stems where every P probe and the R probe are correct on the same stem. This denominator is intentionally stricter than the…
Figure 15
Gemma 4 SSA replication. The cross-family profile is appendix-only and is used to check whether verified intermediate structure helps beyond the Qwen3.5 family.
D.4 Paired Direct vs. Thinking Raw Values. The matched thinking comparison uses ten direct/thinking pairs. Thinking improves P on nine of ten rows, while R and F drop on all ten under this protocol. This supports the bounded claim that longer trace…
Figure 16
Compact visualization of paired direct-vs-thinking deltas from …
Figure 17
Solved RI-Pos reference case. The figure shows a fully resolved position-reasoning example with the F, R, and P probes, the selected correct option, and the corresponding S1–S4 process annotation.
Figure 18
Can’t-see P-probe case. The figure illustrates VP-View perception errors on option B: the model outputs hallucinate extra circles, misidentify the nested structure, or misplace the two inner circles. The error is localized to S1 visual encoding rather than rule induction or rule-to-instance binding.
Figure 19
Can’t-reason R-probe case. The figure illustrates RI-Attr rule-induction errors: the models refer to visible symbol properties, but replace the ground-truth enclosed-region rule with spurious grouping rules such as contour style, symmetry, or display layout. The error is localized to S2 rule inference rather than S1 visual encoding or S3 rule-to-instance binding.
E.4 Case-Error: Can’t-Bind (E.4, main appe…
Figure 20
Can’t-bind RI-Quantity case. The figure illustrates a quantity-binding failure: the models identify the additive face-count rule and the required target count of 27 faces, but then bind that count to the wrong answer option. The error is localized to S3 rule-to-option mapping rather than S1 visual encoding or S2 rule induction.
E.5 Hard S1–S4 Worked Trace (E.5). Worked traces show how a full solution is de…
Figure 21
Hard VP-View worked trace. The case illustrates how a three-view matching problem decomposes into S1 visual encoding, S2 view-consistency rule inference, S3 projection mapping, and S4 option elimination.
E.6 Case-Error: Direct vs. Thinking Pairs (E.6). Thinking cases visualize the paired THINKGAIN effect from a single stem perspective: longer reasoning can change the stated rule and does not provide a stab…
Figure 22
Illustrative direct-vs-thinking RI-Style case. The direct response correctly applies the row-wise overlay rule and selects option C. The thinking response over-analyzes local black and hollow triangle positions, drifts to an incorrect transformation rule, and selects option D. This paired case shows that longer reasoning can destabilize the correct rule rather than repair the final answer.
F Extended Rela…
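The paired comparison quoted in the captions above defines THINKGAIN@X as thinking-mode accuracy minus direct-mode accuracy at probe level X. A minimal sketch of that delta and of the per-level sign count follows; the function names and the toy accuracies are hypothetical and are not values from the paper.

```python
from typing import Dict, List, Tuple

LEVELS = ("P", "R", "F")
Row = Dict[str, float]  # accuracy per probe level for one model configuration


def think_gain(direct: Row, thinking: Row) -> Row:
    """THINKGAIN@X = ACC_think,X - ACC_non-think,X for each probe level X."""
    return {x: thinking[x] - direct[x] for x in LEVELS}


def rows_improved(pairs: List[Tuple[Row, Row]]) -> Dict[str, int]:
    """Count, per level, how many matched pairs improve under thinking."""
    counts = {x: 0 for x in LEVELS}
    for direct, thinking in pairs:
        for x, delta in think_gain(direct, thinking).items():
            if delta > 0:
                counts[x] += 1
    return counts


# Toy (direct, thinking) pairs for illustration only.
toy_pairs = [
    ({"P": 0.70, "R": 0.60, "F": 0.40}, {"P": 0.75, "R": 0.55, "F": 0.35}),
    ({"P": 0.80, "R": 0.65, "F": 0.45}, {"P": 0.78, "R": 0.60, "F": 0.44}),
]
print(rows_improved(toy_pairs))
```

Reporting the sign count alongside the raw deltas keeps a claim such as "P rises on nine of ten rows" separate from the question of how large each change is.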
Original abstract

Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. Existing AVR benchmarks cannot detect this because they collapse perception, rule induction, and answer selection into a single right-or-wrong signal. We introduce StemBind, a shared-stem diagnostic benchmark that probes the same visual stem with three aligned questions: Perception (what is in the image), Rule (what pattern governs it), and Full (which option completes it), so a final-answer error can be attributed to a specific sub-step on the same evidence. StemBind contains 2,298 curated knowledge-light stems across nine auditable visual operations, totaling 19,533 P/R/F tasks, with each full item annotated by Sternberg's four reasoning stages (S1 Encode, S2 Infer, S3 Map, S4 Apply). Evaluating 24 frontier MLLM configurations yields four findings. (i) The R-F chasm: rule accuracy exceeds full-item accuracy on 22 of 24 models, so most failures happen after the rule is identified. (ii) A persistent binding gap: even when P and R are both correct on the same stem, models still answer F incorrectly 51.2% of the time. (iii) The bottleneck is S3: process diagnostics and Stage-wise Stimulus Augmentation localize the dominant failure to rule-to-instance mapping. (iv) Scaling and thinking do not help: neither larger models nor explicit thinking mode reliably closes the gap, and thinking even lowers rule and full-item accuracy. StemBind reframes AVR evaluation from final-answer ranking to locating where abstract visual reasoning breaks down, identifying rule-to-instance binding as a concrete next target for vision-grounded reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces StemBind, a shared-stem diagnostic benchmark containing 2,298 knowledge-light AVR stems across nine visual operations (19,533 P/R/F tasks total). Each stem is probed with three aligned questions—Perception, Rule, and Full—plus Sternberg stage annotations, allowing final-answer errors to be attributed to specific sub-steps. Evaluation of 24 MLLM configurations reports an R-F chasm (rule accuracy > full accuracy on 22/24 models), a 51.2% binding gap (F incorrect despite correct P and R on the same stem), localization of the dominant failure to S3 (rule-to-instance mapping) via process diagnostics and Stage-wise Stimulus Augmentation, and that neither scale nor explicit thinking reliably closes the gap.

Significance. If the localization holds, StemBind supplies a reproducible, stage-annotated benchmark that shifts AVR evaluation from end-to-end accuracy to concrete bottleneck identification, with the 2,298-stem scale and auditable operations constituting a clear methodological contribution. The empirical measurement of the binding gap on external models (zero free parameters or self-referential definitions) is a strength that could guide targeted improvements in vision-grounded rule application.

major comments (2)
  1. [§4] §4 (Evaluation) and the binding-gap definition: the central claim that the 51.2% gap localizes failure to S3 assumes that a correct R response on an independent forward pass supplies the same rule representation the model attempts to bind under the F prompt. The manuscript provides no stability check (e.g., rule verbalization consistency or controlled re-prompting on identical stems) to rule out inconsistent retrieval, which directly undermines attribution of F errors to mapping rather than S2/S3 inconsistency.
  2. [§3.2] §3.2 (Stage-wise Stimulus Augmentation) and process diagnostics: the claim that diagnostics isolate S3 as the bottleneck rests on the premise that P+R correct implies S1 and S2 success on the identical visual evidence; because the three questions are posed separately, the paper must demonstrate that the extracted rule is the one active during F, yet no such verification (e.g., cross-prompt rule equivalence metrics) is reported.
minor comments (2)
  1. [§3.1] Table 1 or §3.1: the nine visual operations are listed but lack a compact summary table of stem counts per operation and example images; this would improve auditability without altering the central results.
  2. [Abstract] Abstract and §5: the phrase "knowledge-light" is used repeatedly; a short operational definition or inter-annotator agreement statistic for this property would clarify the claim.
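The agreement statistic requested in minor comment 2 could be reported with Cohen's κ, the same statistic the figure captions cite for the trace judge. The sketch below is the standard two-rater κ on categorical labels; the annotator labels are invented, and treating "knowledge-light" as a binary per-stem label is an assumption made for this example.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = stem judged knowledge-light, 0 = not.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohens_kappa(annotator_1, annotator_2))
```

κ corrects raw agreement for agreement expected by chance, which matters when one label dominates, as it presumably would for a curated knowledge-light pool.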

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments regarding the attribution of the binding gap to S3. These points correctly identify the need for explicit verification that the rule representation remains consistent across the independent R and F prompts. We address each comment below and will incorporate the suggested checks in the revised manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Evaluation) and the binding-gap definition: the central claim that the 51.2% gap localizes failure to S3 assumes that a correct R response on an independent forward pass supplies the same rule representation the model attempts to bind under the F prompt. The manuscript provides no stability check (e.g., rule verbalization consistency or controlled re-prompting on identical stems) to rule out inconsistent retrieval, which directly undermines attribution of F errors to mapping rather than S2/S3 inconsistency.

    Authors: We agree that the absence of an explicit stability analysis leaves open the possibility of retrieval inconsistency between the R and F passes. Our current design conditions the binding gap strictly on stems where both P and R are answered correctly and uses identical visual stems with aligned prompt structures, but this does not fully rule out variability in the internal rule representation. In the revision we will add a stability check: for a subset of stems we will re-prompt the rule question multiple times (with temperature 0 where supported) and report the rate at which the verbalized rule remains semantically equivalent. This analysis will be placed in §4 and will directly support or qualify the S3 localization. revision: yes

  2. Referee: [§3.2] §3.2 (Stage-wise Stimulus Augmentation) and process diagnostics: the claim that diagnostics isolate S3 as the bottleneck rests on the premise that P+R correct implies S1 and S2 success on the identical visual evidence; because the three questions are posed separately, the paper must demonstrate that the extracted rule is the one active during F, yet no such verification (e.g., cross-prompt rule equivalence metrics) is reported.

    Authors: The referee is correct that separate prompting requires additional evidence that the rule extracted under the R prompt is the same representation engaged during the F prompt. The Stage-wise Stimulus Augmentation experiments provide indirect support by showing that targeted interventions at S3 improve performance while earlier-stage interventions do not, but they do not include a direct equivalence metric. We will add, in the revised §3.2, a cross-prompt rule equivalence analysis: for stems where F is answered correctly we will extract the rule from the F response (via a follow-up probe) and compute semantic similarity to the R response; we will also report the rate at which the model can restate the same rule after completing F. These metrics will be used to strengthen the claim that the dominant failure occurs at the mapping stage. revision: yes
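The stability check promised in the first response (re-prompt the Rule question several times and report how often the verbalized rule stays equivalent) can be sketched as follows. The equivalence test here is a deliberately crude stand-in, normalized string equality, and is an assumption of this example; the rebuttal speaks of semantic equivalence, which would need a stronger comparator such as a judge model or human annotation. All names and toy responses are hypothetical.

```python
from typing import Callable, Dict, List


def normalize(rule: str) -> str:
    """Lowercase and collapse whitespace; a placeholder for real semantic matching."""
    return " ".join(rule.lower().split())


def same_rule(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


def stability_rate(
    repeats: Dict[str, List[str]],
    equivalent: Callable[[str, str], bool] = same_rule,
) -> float:
    """Fraction of stems whose repeated rule answers all match the first answer."""
    usable = [answers for answers in repeats.values() if answers]
    if not usable:
        return float("nan")
    stable = sum(
        all(equivalent(answers[0], other) for other in answers[1:])
        for answers in usable
    )
    return stable / len(usable)


# Toy repeated Rule responses keyed by stem id; illustration only.
toy_repeats = {
    "stem_a": ["Rotate 90 degrees clockwise", "rotate 90  degrees clockwise"],
    "stem_b": ["Add one dot per row", "Mirror across the vertical axis"],
}
print(stability_rate(toy_repeats))
```

Passing the comparator as a parameter keeps the rate computation unchanged when the placeholder is swapped for a semantic-equivalence judge.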

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on external models via new benchmark

full rationale

The paper constructs StemBind (2,298 stems, 19,533 tasks) and reports accuracy statistics (e.g., 51.2% F-error rate conditional on P+R correct) from evaluations of 24 external MLLM configurations. No equations, fitted parameters, or derivations appear; all claims are direct counts on held-out model outputs. No self-citations are load-bearing for the central attribution, and the benchmark is presented as an independent diagnostic tool rather than a self-referential definition. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical construction and annotation of the StemBind dataset rather than on mathematical derivation; no free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5897 in / 1181 out tokens · 21000 ms · 2026-06-28T23:14:20.902295+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 28 canonical work pages · 10 internal anchors

  1. [1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.

  2. [2] Anthropic. Introducing claude opus 4.7, 2026. URL https://www.anthropic.com/news/claude-opus-4-7

  3. [3] David Barrett, Felix Hill, Adam Santoro, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In ICML, 2018.

  4. [4] Huanqia Cai, Yijun Yang, and Winston Hu. Mm-iq: Benchmarking human-like abstraction and reasoning in multimodal models. arXiv preprint arXiv:2502.00698, 2025.

  5. [5] Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, et al. Morse-500: A programmatically controllable video benchmark to stress-test multimodal reasoning. arXiv preprint arXiv:2506.05523, 2025.

  6. [6] Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In ACL, 2024.

  7. [7] Qiguang Chen, Chengyu Luan, Jiajun Wu, Qiming Yu, Yi Yang, Yizhuo Li, Jingqi Tong, Xiachong Feng, Libo Qin, and Wanxiang Che. Omibench: Benchmarking olympiad-level multi-image reasoning in large vision-language model. arXiv preprint arXiv:2604.20806, 2026.

  8. [8] Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Ning Zhang, Yi Sun, Yi Yang, and Hangjie Yuan. Cogflow: Bridging perception and reasoning through knowledge internalization for visual mathematical problem solving. arXiv preprint arXiv:2601.01874, 2026.

  9. [9] Ao Cheng, Xingming Li, Xuanyu Ji, Xixiang He, Qiyao Sun, Chunping Qiu, Runke Huang, and Qingyong Hu. Enc-bench: A benchmark for evaluating multimodal large language models in electronic navigational chart understanding. arXiv preprint arXiv:2603.22763, 2026.

  10. [10] Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, et al. Visual thoughts: A unified perspective of understanding multimodal chain-of-thought. arXiv preprint arXiv:2505.15510, 2025.

  11. [11] Ziming Cheng, Binrui Xu, Lisheng Gong, Zuhe Song, Tianshuo Zhou, Shiqi Zhong, Siyu Ren, Mingxiang Chen, Xiangchao Meng, Yuxin Zhang, et al. Evaluating mllms with multimodal multi-image reasoning benchmark. arXiv preprint arXiv:2506.04280, 2025.

  12. [12] Yew Ken Chia, Vernon Toh, Deepanway Ghosal, Lidong Bing, and Soujanya Poria. Puzzlevqa: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns. In Findings of ACL, 2024.

  13. [13] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.

  14. [14] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: multimodal large language models can see but not perceive. In ECCV, 2024.

  15. [15] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2020.

  16. [16] Google DeepMind. Gemini 3.1 pro model card, 2026. URL https://deepmind.google/models/model-cards/gemini-3-1-pro/

  17. [17] Google DeepMind. Gemma 4 model card, 2026. URL https://ai.google.dev/gemma/docs/core/model_card_4

  18. [18] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In ACL, 2024.

  19. [19] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.

  20. [20] Sheng Hu, Yuqing Ma, Xianglong Liu, Yanlu Wei, and Shihao Bai. Stratified rule-aware network for abstract visual reasoning. In AAAI, 2021.

  21. [21] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. In NeurIPS, 2024.

  22. [22] Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu, and Haodong Duan. Human cognitive benchmarks reveal foundational visual gaps in mllms. arXiv preprint arXiv:2502.16435, 2025.

  23. [23] Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. Transactions on Machine Learning Research, 2024.

  24. [24] Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, et al. Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency. In ICML, 2025.

  25. [25] Yanbei Jiang, Yihao Ding, Chao Lei, Jiayang Ao, Jey Han Lau, and Krista A Ehinger. Beyond perception: Evaluating abstract visual reasoning through multi-stage task. In Findings of ACL, 2025.

  26. [26] Yifan Jiang, Jiarui Zhang, Kexuan Sun, Zhivar Sourati, Kian Ahrabian, Kaixin Ma, Filip Ilievski, and Jay Pujara. MARVEL: multidimensional abstraction and reasoning through visual evaluation and learning. In NeurIPS, 2024.

  27. [27] Mehran Kazemi, Nishanth Dikkala, Ankit Anand, Petar Devic, Ishita Dasgupta, Fangyu Liu, Bahare Fatemi, Pranjal Awasthi, Sreenivas Gollapudi, Dee Guo, and Ahmed Qureshi. Remi: A dataset for reasoning with multiple images. In NeurIPS, 2024.

  28. [28] Tina Khezresmaeilzadeh, Jike Zhong, and Konstantinos Psounis. Vriq: Benchmarking and analyzing visual-reasoning iq of vlms. arXiv preprint arXiv:2602.05382, 2026.

  29. [29] Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, et al. Mibench: Evaluating multimodal large language models over multiple images. In EMNLP, 2024.

  30. [30] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022.

  31. [31]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InICLR, 2024

  32. [32]

    Kevin S. McGrew. Chc theory and the human cognitive abilities project: Standing on the shoulders of the giants of psychometric intelligence research.Intelligence, 37(1):1–10, 2009. doi: https://doi.org/10.1016/j.intell.2008.08.004

  33. [33]

    MMIU: multimodal multi-image understanding for evaluating large vision-language models

    Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Tianshuo Yang, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. MMIU: multimodal multi-image understanding for evaluating large vision-language models. In ICLR, 2025

  34. [34]

    The conceptarc benchmark: Evaluating understanding and generalization in the arc domain

    Arseny Moskvichev, Victor Vikram Odouard, and Melanie Mitchell. The conceptarc benchmark: Evaluating understanding and generalization in the arc domain. arXiv preprint arXiv:2305.07141, 2023

  35. [35]

    Bongard-logo: A new benchmark for human-level concept learning and reasoning

    Weili Nie, Zhiding Yu, Lei Mao, Ankit B. Patel, Yuke Zhu, and Anima Anandkumar. Bongard-logo: A new benchmark for human-level concept learning and reasoning. In NeurIPS, 2020

  36. [36]

    GPT-4o System Card

    OpenAI. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  37. [37]

    Gpt-5.4 thinking system card

    OpenAI. Gpt-5.4 thinking system card, 2026. URL https://openai.com/index/gpt-5-4-thinking-system-card/. Accessed: 2026-05-06

  38. [38]

    Causality

    Judea Pearl. Causality. Cambridge University Press, 2009

  39. [39]

    Iqbench: How "smart" are vision-language models? a study with human iq tests

    Tan-Hanh Pham, Phu-Vinh Nguyen, Dang The Hung, Bui Trong Duong, Vu Nguyen Thanh, Chris Ngo, Tri Quang Truong, and Truong-Son Hy. Iqbench: How "smart" are vision-language models? a study with human iq tests. arXiv preprint arXiv:2505.12000, 2025

  40. [40]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents, 2026. URL https://qwen.ai/blog?id=qwen3.5

  41. [41]

    Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 1974

  42. [42]

    The cattell-horn-carroll model of intelligence

    W Joel Schneider and Kevin S McGrew. The cattell-horn-carroll model of intelligence. 2012

  43. [43]

    M3gia: A cognition inspired multilingual and multimodal general intelligence ability benchmark

    Wei Song, Yadong Li, Jianhua Xu, Guowei Wu, Lingfeng Ming, Kexin Yi, Weihua Luo, Houyi Li, Yi Du, Fangda Guo, et al. M3gia: A cognition inspired multilingual and multimodal general intelligence ability benchmark. arXiv preprint arXiv:2406.05343, 2024

  44. [44]

    Visualpuzzles: Decoupling multimodal reasoning evaluation from domain knowledge

    Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, and Xiang Yue. Visualpuzzles: Decoupling multimodal reasoning evaluation from domain knowledge. arXiv preprint arXiv:2504.10342, 2025

  45. [45]

    Intelligence, information processing, and analogical reasoning: The componential analysis of human abilities

    RJ Sternberg. Intelligence, information processing, and analogical reasoning: The componential analysis of human abilities, 1977

  46. [46]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025

  47. [47]

    Mm-math: Advancing multimodal math evaluation with process evaluation and fine-grained classification

    Kai Sun, Yushi Bai, Ji Qi, Lei Hou, and Juanzi Li. Mm-math: Advancing multimodal math evaluation with process evaluation and fine-grained classification. In Findings of EMNLP, 2024

  48. [48]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  49. [49]

    Visres bench: On evaluating the visual reasoning capabilities of vlms

    Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac, Ankit Singh, Sofian Chaybouti, and Sanath Narayan. Visres bench: On evaluating the visual reasoning capabilities of vlms. arXiv preprint arXiv:2512.21194, 2025

  50. [50]

    Muirbench: A comprehensive benchmark for robust multi-image understanding

    Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding. In ICLR, 2025

  51. [51]

    Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models

    Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In ACL, 2023

  52. [52]

    Spatialviz-bench: A cognitively-grounded benchmark for diagnosing spatial visualization in mllms

    Siting Wang, Minnan Pei, Luoyang Sun, Cheng Deng, Yuchen Li, Kun Shao, Zheng Tian, Haifeng Zhang, and Jun Wang. Spatialviz-bench: A cognitively-grounded benchmark for diagnosing spatial visualization in mllms. In ICLR, 2025

  53. [53]

    Visualprm: An effective process reward model for multimodal reasoning

    Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, et al. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291, 2025

  54. [54]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  55. [55]

    Mementos: A comprehensive benchmark for multimodal large language model reasoning over image sequences

    Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Fuxiao Liu, Gedas Bertasius, et al. Mementos: A comprehensive benchmark for multimodal large language model reasoning over image sequences. In ACL, 2024

  56. [56]

    Slow perception: Let’s perceive geometric figures step-by-step

    Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, and Daxin Jiang. Slow perception: Let’s perceive geometric figures step-by-step. arXiv preprint arXiv:2412.20631, 2024

  57. [57]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022

  58. [58]

    Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models

    Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, et al. Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models. arXiv preprint arXiv:2504.15279, 2025

  59. [59]

    Mc-bench: A benchmark for multi-context visual grounding in the era of mllms

    Yunqiu Xu, Linchao Zhu, and Yi Yang. Mc-bench: A benchmark for multi-context visual grounding in the era of mllms. In ICCV, 2025

  60. [60]

    Visuriddles: Fine-grained perception is a primary bottleneck for multimodal large language models in abstract visual reasoning

    Hao Yan, Xingchen Liu, Hao Wang, Zhenbiao Cao, Handong Zheng, Liang Yin, Xinxing Su, Zihao Chen, Jihao Wu, Minghui Liao, et al. Visuriddles: Fine-grained perception is a primary bottleneck for multimodal large language models in abstract visual reasoning. arXiv preprint arXiv:2506.02537, 2025

  61. [61]

    LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

    Ruilin Yao, Bo Zhang, Jirui Huang, Xinwei Long, Yifang Zhang, Tianyu Zou, Yufei Wu, Shichao Su, Yifan Xu, Wenxi Zeng, et al. Lens: Multi-level evaluation of multimodal reasoning with large language models. arXiv preprint arXiv:2505.15616, 2025

  62. [62]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024

  63. [63]

    Raven: A dataset for relational and analogical visual reasoning

    Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. Raven: A dataset for relational and analogical visual reasoning. In CVPR, 2019

  64. [64]

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In ECCV, 2024

  65. [65]

    AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

    Xintong Zhang, Xiaowen Zhang, Jongrong Wu, Zhi Gao, Shilin Yan, Zhenxin Diao, Kunpeng Gao, Xuanyan Chen, Yuwei Wu, Yunde Jia, and Qing Li. Adaptmmbench: Benchmarking adaptive multimodal reasoning for mode selection and reasoning process. arXiv preprint arXiv:2602.02676, 2026

  66. [66]

    Benchmarking multi-image understanding in vision and language models: Perception, knowledge, reasoning, and multi-hop reasoning

    Bingchen Zhao, Yongshuo Zong, Letian Zhang, and Timothy Hospedales. Benchmarking multi-image understanding in vision and language models: Perception, knowledge, reasoning, and multi-hop reasoning. arXiv preprint arXiv:2406.12742, 2024

  67. [67]

    Least-to-most prompting enables complex reasoning in large language models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. In ICLR, 2023