pith. machine review for the scientific record.

arxiv: 2604.08541 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI · cs.CL

Recognition: unknown

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:58 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL
keywords multimodal mixture of experts · routing distraction · visual reasoning · domain experts · vision-language models · seeing but not thinking

The pith

Multimodal MoE models see images correctly but fail to reason because routing skips domain experts in middle layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that multimodal Mixture-of-Experts models accurately perceive visual content yet produce weak reasoning on the same problems that text inputs solve easily. Systematic checks confirm cross-modal semantic sharing, so the gap is not simple misalignment. Instead, visual and domain experts separate across layers, and image inputs create routing patterns that diverge from text inputs precisely in the middle layers where domain experts cluster. This routing distraction leaves task-relevant reasoning experts under-activated. A targeted intervention that steers routing toward those experts raises accuracy by as much as 3.17 percent on complex visual reasoning benchmarks across three models.

Core claim

Visual inputs trigger routing divergence from text inputs in middle layers of multimodal MoE models, where domain experts concentrate; as a result the routing mechanism under-activates task-relevant reasoning experts even though perception succeeds, producing a seeing-but-not-thinking failure that a routing-guided intervention corrects.

What carries the argument

Layer-wise separation of visual experts from domain experts, which produces routing divergence on image inputs and thereby under-activates reasoning capacity.
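The expert-identification step behind this claim (per the paper's overview: compare activation frequencies on domain-specific versus general data) can be pictured with a minimal sketch. This is an editorial illustration, not the paper's code; the function name, array layout, and the 2.0 threshold are all assumptions.

```python
import numpy as np

def identify_domain_experts(domain_acts, general_acts, ratio_threshold=2.0):
    """Flag experts whose activation frequency on domain-specific data
    substantially exceeds their frequency on general data.

    domain_acts, general_acts: (num_layers, num_experts) arrays of
    per-expert activation frequencies (fraction of tokens routed to each
    expert). The 2.0 ratio threshold is illustrative, not from the paper.
    """
    eps = 1e-9  # avoid division by zero for never-activated experts
    ratio = domain_acts / (general_acts + eps)
    return ratio > ratio_threshold  # boolean mask, (num_layers, num_experts)

# Toy example: 2 layers, 4 experts, uniform routing on general data.
domain = np.array([[0.40, 0.05, 0.05, 0.50],
                   [0.10, 0.60, 0.10, 0.20]])
general = np.full((2, 4), 0.25)
mask = identify_domain_experts(domain, general)
```

In this toy setup only layer 1's expert 1 clears the threshold; a layer-wise histogram of such masks is the kind of evidence behind the "domain experts concentrate in middle layers" observation.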

If this is right

  • Enhancing domain-expert activation via routing guidance yields consistent gains on complex visual reasoning tasks.
  • Domain experts identified by the method capture transferable cognitive functions rather than task-specific solutions.
  • Cross-modal semantic sharing rules out alignment failure as the sole explanation for the performance gap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing-steering approach could be applied to other MoE vision-language models without full retraining.
  • Routing design choices may need explicit separation of perception and reasoning pathways in future multimodal architectures.
  • Testing whether forcing text-like routing on visual inputs closes the entire performance gap would directly probe the hypothesis.

Load-bearing premise

The layer-wise separation of visual and domain experts together with the observed routing divergence is the primary driver of the reasoning shortfall rather than training data composition or optimization dynamics.

What would settle it

If the routing-guided intervention produces no accuracy gains on the six benchmarks or if middle-layer routing divergence shows no correlation with reasoning errors, the Routing Distraction hypothesis would be falsified.
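The routing-guided intervention at the center of that test can be sketched as an additive boost λ to the gate logits of identified domain experts before top-k selection. This is a hedged editorial reconstruction under assumed details (additive form, names, values); the paper's exact formulation may differ, though it reports peak gains around λ ∈ [0.4, 0.6] for Top-k models.

```python
import numpy as np

def guided_routing(gate_logits, domain_mask, lam=0.5, top_k=2):
    """Boost domain-expert logits by lam, then softmax and select top-k.

    gate_logits: (num_experts,) router logits for one token at one layer.
    domain_mask: boolean (num_experts,) marking identified domain experts.
    """
    boosted = gate_logits + lam * domain_mask.astype(gate_logits.dtype)
    probs = np.exp(boosted - boosted.max())  # numerically stable softmax
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:top_k]    # indices of selected experts
    return top, probs

logits = np.array([1.0, 0.9, 0.6, 0.1])
mask = np.array([False, False, True, False])  # expert 2 is a domain expert
baseline, _ = guided_routing(logits, mask, lam=0.0)  # no intervention
guided, _ = guided_routing(logits, mask, lam=0.5)    # domain expert boosted
```

With λ = 0, the router picks experts {0, 1}; with λ = 0.5 the under-activated domain expert 2 displaces expert 1, which is exactly the kind of rerouting the falsification test asks to correlate with accuracy.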

Figures

Figures reproduced from arXiv: 2604.08541 by Haiwen Hong, Haolei Xu, Hongxing Li, Hui Xue, Longtao Huang, Rui Zhou, Weiming Lu, Yang Zhang, Yongliang Shen, Yueting Zhuang.

Figure 1
Figure 1. Illustration of the Seeing but Not Thinking phenomenon. See Appendix B for details. view at source ↗
Figure 2
Figure 2. Overview of our work. We first conduct cross-modal concept intervention to verify semantic sharing in MoE architectures (left, §3.1), then identify domain experts by comparing activation frequencies on domain-specific versus general data (middle, §3.2), and finally analyze routing divergence across modalities and apply routing guidance to enhance domain expert activation (right, §3.3–§4). view at source ↗
Figure 3
Figure 3. Analysis of routing mechanisms in multimodal MoE models. (a) Cross-modal semantic sharing verification… view at source ↗
Figure 4
Figure 4. Layer-wise distribution of domain experts and visual experts. Left: heatmap showing activation frequency… view at source ↗
Figure 5
Figure 5. Effect of enhancement coefficient λ on reasoning accuracy gains across three models. Peak performance occurs with λ ∈ [0.4, 0.6]; excessive values degrade accuracy by overriding input-specific routing decisions. Llama4 requires weaker intervention (λ = 0.2) due to its Top-1 routing mechanism, where activating only one expert per layer makes routing decisions more sensitive to logit changes. view at source ↗
Figure 6
Figure 6. Examples of the three semantically equivalent… view at source ↗
Figure 7
Figure 7. Sample image from MATH-Vision dataset. view at source ↗
read the original abstract

Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies a 'Seeing but Not Thinking' phenomenon in multimodal MoE models, where visual inputs are perceived accurately but lead to reasoning failures unlike equivalent text problems. Analysis reveals cross-modal semantic sharing but also layer-wise separation of visual and domain experts, with image inputs causing routing divergence in middle layers. The authors propose the Routing Distraction hypothesis and validate it via a routing-guided intervention that increases domain-expert activation, reporting consistent gains up to 3.17% on complex visual reasoning tasks across three models and six benchmarks. They further claim that identified domain experts capture general cognitive functions transferable across tasks.

Significance. If the causal mechanism holds, the work provides a useful diagnostic for routing behavior in multimodal MoE architectures and a lightweight intervention to improve reasoning performance without retraining. The multi-model, multi-benchmark consistency and the observation that domain-expert identification supports cross-task transfer are notable strengths that could inform future MoE interpretability and design.

major comments (2)
  1. [§4] §4 (Routing-Guided Intervention): The validation experiments apply a targeted intervention to enhance domain-expert activation but omit control conditions that apply comparable routing perturbations (e.g., forcing an equal number of non-domain or random experts in the same middle layers) while preserving the original visual inputs. Without these controls, the reported gains cannot be unambiguously attributed to reversal of the specific layer-wise separation rather than general rebalancing, increased capacity, or optimization artifacts.
  2. [§5] §5 (Experimental Results): The performance improvements (up to 3.17%) are presented without statistical significance tests, standard deviations from multiple random seeds, or ablations that isolate the intervention from baseline routing variance. This weakens the claim of consistent, robust gains supporting the hypothesis, especially given that intervention parameters are selected based on the same routing observations used to define the problem.
minor comments (2)
  1. [Abstract and §3] The abstract and §3 would benefit from explicit quantitative metrics (e.g., KL-divergence or routing probability differences) for the reported layer-wise divergence between text and image inputs.
  2. [§2 and §3] Notation for expert types (visual vs. domain) and routing scores could be standardized with a single table or equation early in the paper to improve readability.
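The quantitative metric the first minor comment asks for could take roughly this shape: a per-layer KL divergence between the mean routing distributions of matched text and image inputs. A minimal sketch, with hypothetical names and toy numbers, not the paper's measurement code:

```python
import numpy as np

def routing_kl(text_probs, image_probs, eps=1e-9):
    """Per-layer KL(text || image) between mean routing distributions.

    text_probs, image_probs: (num_layers, num_experts) arrays of average
    routing probabilities (each row sums to 1). Large middle-layer values
    would quantify the divergence the paper reports qualitatively.
    """
    t = text_probs + eps
    i = image_probs + eps
    return np.sum(t * np.log(t / i), axis=1)  # shape: (num_layers,)

text = np.array([[0.25, 0.25, 0.25, 0.25],    # early layer: similar routing
                 [0.70, 0.10, 0.10, 0.10]])   # middle layer: text favors expert 0
image = np.array([[0.25, 0.25, 0.25, 0.25],
                  [0.10, 0.10, 0.10, 0.70]])  # middle layer: image favors expert 3
kl = routing_kl(text, image)
```

Here the early layer contributes essentially zero divergence while the middle layer contributes a large value, the pattern the Routing Distraction hypothesis predicts.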

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the rigor of our experimental validation. We address the major concerns point by point below.

read point-by-point responses
  1. Referee: [§4] §4 (Routing-Guided Intervention): The validation experiments apply a targeted intervention to enhance domain-expert activation but omit control conditions that apply comparable routing perturbations (e.g., forcing an equal number of non-domain or random experts in the same middle layers) while preserving the original visual inputs. Without these controls, the reported gains cannot be unambiguously attributed to reversal of the specific layer-wise separation rather than general rebalancing, increased capacity, or optimization artifacts.

    Authors: We agree that control experiments are necessary to establish the specificity of the routing-guided intervention. In the revised version of the manuscript, we will add experiments that apply comparable perturbations by forcing activation of non-domain experts or randomly selected experts in the middle layers, while keeping the visual inputs unchanged. These controls will allow us to attribute the performance gains more confidently to the targeted activation of domain experts, addressing potential confounds from general rebalancing or capacity increases. revision: yes

  2. Referee: [§5] §5 (Experimental Results): The performance improvements (up to 3.17%) are presented without statistical significance tests, standard deviations from multiple random seeds, or ablations that isolate the intervention from baseline routing variance. This weakens the claim of consistent, robust gains supporting the hypothesis, especially given that intervention parameters are selected based on the same routing observations used to define the problem.

    Authors: We acknowledge the importance of statistical rigor and ablations for validating the robustness of our results. We will revise the experimental section to include multiple runs with different random seeds, reporting mean performance with standard deviations. We will also conduct statistical significance tests, such as paired t-tests, to assess the improvements. Furthermore, we will perform additional ablations on the intervention parameters to isolate their effects from baseline routing variance and to demonstrate that the gains are not artifacts of parameter selection based on the routing analysis. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's chain moves from routing measurements and layer-wise expert separation analysis to a proposed Routing Distraction hypothesis, then to a routing-guided intervention tested for performance gains. This uses three distinct models and six external benchmarks for validation, with no equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claim to its own inputs by construction. The intervention is presented as a test of the hypothesis rather than a definitional re-expression of the initial observations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard MoE routing assumptions plus the empirical observation that middle-layer routing divergence explains reasoning gaps; the main addition is the post-hoc hypothesis and intervention rather than new axioms or entities.

axioms (2)
  • domain assumption MoE models route tokens to experts via a learned gating function that can be measured per layer and modality.
    Standard architectural property invoked to interpret routing divergence.
  • domain assumption Semantic content is shared between vision and language pathways in the tested models.
    Stated after verification to rule out alignment failure.
invented entities (1)
  • Routing Distraction no independent evidence
    purpose: Hypothesized mechanism explaining why visual inputs fail to activate reasoning experts.
    New explanatory construct introduced to account for the measured routing divergence.

pith-pipeline@v0.9.0 · 5524 in / 1477 out tokens · 93763 ms · 2026-05-10T16:58:05.175018+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

10 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Kimi-VL Technical Report

    Kimi-VL technical report. arXiv preprint arXiv:2504.07491.

  2. [2]

    Mengru Wang, Xingyu Chen, Yue Wang, Zhiwei He, Jiahao Xu, Tian Liang, Qiuzhi Liu, Yunzhi Yao, Wenxuan Wang, Ruotian Ma, and 1 others

    Two experts are all you need for steering think­ing: Reinforcing cognitive ef… 2025a.

  3. [3]

    OpenMoE: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739, 2024

    OpenMoE: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739.
