WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
4 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background (1)
Citation-polarity summary: background (1)
Representative citing papers
The paper introduces a three-source decomposition of answer flips in multi-agent LLM debate: 37% are spontaneous instability and 29% are harmful conformity. Even vacuous reasoning persuades 20-39% of resistant agents, and interventions reduce harmful conformity by 13.6 points.
Four axioms (Causality, Minimality, Separability, Stability) are formalized for latent thought representations; audits of open LLMs on 23 tasks show none satisfy all four and representations add little beyond input embeddings.
Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?