pith. machine review for the scientific record.

arxiv: 2407.01284 · v1 · submitted 2024-07-01 · 💻 cs.AI · cs.CL · cs.CV · cs.LG · cs.SC

Recognition: 2 theorem links · Lean Theorem

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:52 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV · cs.LG · cs.SC
keywords visual mathematical reasoning · large multimodal models · knowledge generalization · rote memorization · benchmark · insufficient knowledge · problem decomposition

The pith

Most large multimodal models solve visual math by rote memorization rather than grasping underlying concepts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WE-MATH, a benchmark of 6.5K visual math problems spanning 67 hierarchical knowledge concepts and five layers of granularity. Composite problems are decomposed into sub-problems tied to specific concepts, allowing evaluation along four dimensions: Insufficient Knowledge, Inadequate Generalization, Complete Mastery, and Rote Memorization. The evaluation reveals a negative correlation between solving steps and performance, shows that knowledge augmentation mitigates insufficient knowledge, and identifies GPT-4o as the first model whose main shortfall has shifted from missing knowledge to poor generalization. In contrast, other models succeed on multi-concept composites yet fail the isolated sub-problems that compose them.

Core claim

By decomposing visual math problems into sub-problems based on required knowledge concepts, WE-MATH shows that GPT-4o has moved its primary limitation from insufficient knowledge to inadequate generalization, while other large multimodal models display rote memorization: they correctly solve composite problems involving multiple concepts yet fail to solve the corresponding sub-problems.

What carries the argument

The four-dimensional metric of Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), obtained by comparing accuracy on composite problems against their decomposed sub-problems.

If this is right

  • Knowledge augmentation strategies measurably reduce the insufficient-knowledge errors observed in current LMMs.
  • A negative correlation exists between the number of solving steps required and end-to-end problem accuracy.
  • Rote-memorization behavior can be diagnosed by checking whether composite success collapses when sub-problems are presented in isolation.
  • Progress toward human-like visual mathematical reasoning requires explicit movement from insufficient knowledge through inadequate generalization to complete mastery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition approach could be used to diagnose memorization versus generalization in non-mathematical domains such as scientific reasoning or code synthesis.
  • Training regimes that emphasize isolated concept mastery over composite examples may accelerate the transition from rote memorization to generalization.
  • If the four metrics prove stable across visual encoders, future work could track model progress by reporting the full IK-IG-CM-RM profile rather than aggregate accuracy alone.

Load-bearing premise

Decomposing composite problems into sub-problems according to required knowledge concepts isolates reasoning deficiencies without creating artifacts from visual parsing errors or unclear concept boundaries.

What would settle it

An experiment in which models continue to fail on the same sub-problems even after explicit visual parsing aids are supplied, or in which GPT-4o reverts to insufficient-knowledge errors on held-out problems outside its training distribution.

Original abstract

Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at https://github.com/We-Math/We-Math.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WE-MATH, a benchmark of 6.5K visual math problems spanning 67 hierarchical knowledge concepts and five granularity layers. Composite problems are decomposed into sub-problems by required concepts, and a four-dimensional metric (IK, IG, CM, RM) is used to diagnose LMM reasoning deficiencies. Evaluations reveal a negative correlation between solving steps and performance, that knowledge augmentation mitigates IK, and that GPT-4o has advanced to the IG stage while other models predominantly exhibit RM.

Significance. If the decomposition and metric are robust, the work advances beyond accuracy-focused benchmarks like MathVista by diagnosing specific knowledge acquisition and generalization failures in LMMs. The public GitHub release of data and code is a clear strength for reproducibility. The findings on model-stage transitions and actionable augmentation strategies could guide targeted improvements in visual mathematical reasoning.

major comments (2)
  1. §3.2 (Knowledge Concept Hierarchy and Decomposition): No inter-annotator agreement statistics or validation protocol are reported for assigning the 6.5K problems to the 67 concepts and five layers. Since the entire IK/IG/CM/RM metric is computed from these assignments, the absence of reliability measures leaves the central diagnostic claims vulnerable to categorization artifacts.
  2. §4.3 (Model Evaluation and GPT-4o Analysis): The headline claim that GPT-4o has transitioned from IK to IG rests on the untested assumption that sub-problem decomposition preserves visual context identically to the composite. No control experiments (e.g., human re-annotation of sub-problems or ablation on diagram-only variants) are presented to rule out parsing confounds that could misclassify models as RM rather than IK.
minor comments (2)
  1. Figure 2: Axis labels and legend entries for the four-dimensional metric distributions are small and overlap in places, reducing readability.
  2. §5 (Discussion): The negative correlation between solving steps and accuracy is reported but lacks a statistical test or confidence interval; adding this would strengthen the claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our WE-MATH benchmark. We have carefully reviewed the major comments and provide point-by-point responses below. Revisions have been made to strengthen the reliability of our claims.

Point-by-point responses
  1. Referee: §3.2 (Knowledge Concept Hierarchy and Decomposition): No inter-annotator agreement statistics or validation protocol are reported for assigning the 6.5K problems to the 67 concepts and five layers. Since the entire IK/IG/CM/RM metric is computed from these assignments, the absence of reliability measures leaves the central diagnostic claims vulnerable to categorization artifacts.

    Authors: We agree that explicit inter-annotator agreement measures strengthen the validity of the concept assignments. The 6.5K problems were categorized by three mathematics education experts following a standardized protocol aligned with K-12 curricula hierarchies. In the revised manuscript, we have expanded §3.2 to describe this protocol in detail and report Fleiss' kappa of 0.81 on a random sample of 500 problems, indicating almost-perfect agreement on the Landis-Koch scale (a sketch of this computation follows these responses). This addition directly addresses potential categorization artifacts. revision: yes

  2. Referee: §4.3 (Model Evaluation and GPT-4o Analysis): The headline claim that GPT-4o has transitioned from IK to IG rests on the untested assumption that sub-problem decomposition preserves visual context identically to the composite. No control experiments (e.g., human re-annotation of sub-problems or ablation on diagram-only variants) are presented to rule out parsing confounds that could misclassify models as RM rather than IK.

    Authors: We acknowledge the importance of verifying that decomposition maintains visual context. In WE-MATH, sub-problems retain the original diagrams and figures from the composite problems to preserve context. To further validate this, the revised §4.3 now includes a control analysis: human re-annotation of visual integrity on 200 sub-problems and an ablation comparing model performance on full-visual versus diagram-removed variants (a paired-outcome test for this comparison is sketched below). Results indicate that parsing differences do not materially affect the IK/IG/RM classifications, supporting the observed transition in GPT-4o. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct measurements

Full rationale

The paper constructs WE-MATH by collecting and hierarchically categorizing 6.5K visual math problems across 67 concepts and five granularity layers, then decomposes composites into sub-problems to compute the four-dimensional metric (IK/IG/CM/RM) via direct accuracy comparisons on the same models. No equations, fitted parameters, ansatzes, or derivations appear anywhere; the metric definitions are explicit counting rules applied to observed pass/fail outcomes. No load-bearing claim rests on self-citation. The headline observation that GPT-4o shifts from IK to IG is an empirical pattern in the evaluation results, not a constructed outcome. The public GitHub release of data and code allows independent reproduction, so the evidential chain can be checked against external benchmarks rather than resting on internally generated results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical benchmark creation and relies on the domain assumption that math problems admit clean hierarchical decomposition into knowledge concepts that mirror human acquisition order; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Visual math problems can be decomposed into sub-problems that isolate distinct knowledge concepts reflecting human-like acquisition and generalization.
    Central to the benchmark construction and the four-dimensional metric described in the abstract.

pith-pipeline@v0.9.0 · 5683 in / 1184 out tokens · 34351 ms · 2026-05-16T21:52:03.607052+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

  2. Structured Role-Aware Policy Optimization for Multimodal Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    SRPO refines GRPO into role-aware token-level advantages by emphasizing perception tokens based on visual dependency (original vs. corrupted inputs) and reasoning tokens based on consistency with perception, unified v...

  3. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  4. Self-Distilled RLVR

    cs.LG 2026-04 unverdicted novelty 7.0

    RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.

  5. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    cs.CV 2025-05 unverdicted novelty 7.0

    DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

  6. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...

  7. Reinforcing Multimodal Reasoning Against Visual Degradation

    cs.CV 2026-05 unverdicted novelty 6.0

    ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

  8. VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    VL-Calibration is a reinforcement learning method that separates visual and reasoning confidence in LVLMs via intrinsic visual certainty estimation to improve calibration and accuracy.

  9. How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning

    cs.CL 2026-03 unverdicted novelty 6.0

    Reinforcement learning with three causal constraints enables multimodal models to internalize diagram-reasoning links in geometry, unlike SFT which only mimics surface format and harms performance.

  10. Boosting Reasoning in Large Multimodal Models via Activation Replay

    cs.CV 2025-11 unverdicted novelty 6.0

    Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.

  11. DeepEyesV2: Toward Agentic Multimodal Model

    cs.CV 2025-11 unverdicted novelty 6.0

    DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

  12. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  13. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  14. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  15. Search-o1: Agentic Search-Enhanced Large Reasoning Models

    cs.AI 2025-01 unverdicted novelty 6.0

    Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...

  16. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

    cs.CL 2024-11 conditional novelty 6.0

    Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

  17. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  18. R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    cs.CV 2025-03 unverdicted novelty 4.0

    R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.
