Recognition: 2 theorem links
· Lean TheoremHow RL Unlocks the Aha Moment in Geometric Interleaved Reasoning
Pith reviewed 2026-05-15 18:11 UTC · model grok-4.3
The pith
Reinforcement learning with three causal constraints makes models internalize generated diagrams as functional parts of geometric reasoning instead of mere format mimicry.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Naive SFT on interleaved plot-solution data produces distributional alignment that reproduces plotting format but leaves the causal dependency between the generated diagram and subsequent reasoning steps unlearned, causing measurable drops relative to text-only baselines. Faire, a reinforcement-learning method, imposes three explicit causal constraints during training to enforce functional alignment instead. This produces a qualitative change in model behavior where the plotting step is effectively internalized and contributes to correct deductions, restoring competitive accuracy on challenging geometric reasoning benchmarks.
What carries the argument
Faire, the reinforcement learning framework that applies three causal constraints to enforce functional rather than distributional alignment between generated plots and reasoning steps.
If this is right
- Interleaved geometric reasoning can reach competitive levels without sacrificing the benefits of visual generation.
- The same RL constraints that restore causal use of plots can be applied to other tasks requiring tight generation-reasoning coupling.
- Text-only baselines remain strong until functional alignment is added, showing that format imitation alone is insufficient for diagram-dependent deduction.
- Qualitative internalization of plotting emerges as a distinct training outcome from distributional copying.
Where Pith is reading between the lines
- The result suggests that many multimodal generation tasks may need explicit causal supervision beyond imitation to avoid format-only learning.
- Similar RL constraints could be tested on interleaved reasoning in non-geometric domains such as code with visualizations or scientific diagrams.
- If the causal constraints generalize, they might reduce the need for hand-crafted prompts that force diagram use after generation.
Load-bearing premise
The performance drop after SFT occurs specifically because the model fails to internalize the causal dependency between its own generated plots and the reasoning steps that use them, and that enforcing three causal constraints via RL will produce that internalization rather than new superficial behavior.
What would settle it
Measure whether models trained with Faire actually reference or depend on the plots they generate in their subsequent reasoning traces, while SFT models do not; if the causal references stay absent even after Faire yet benchmark scores still rise, the claim is falsified.
read the original abstract
Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counter-intuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot-solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces distributional alignment: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three casual constraints to move beyond superficial imitation toward functional alignment. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that supervised fine-tuning (SFT) on interleaved plot-solution data for geometric reasoning causes substantial performance degradation relative to text-only baselines, because SFT achieves only distributional alignment and fails to internalize the causal dependency between generated plots and subsequent reasoning steps. It introduces Faire, a reinforcement-learning framework that enforces three causal constraints to achieve functional alignment instead of superficial imitation, producing a qualitative shift in model behavior and competitive results on challenging geometric reasoning benchmarks.
Significance. If the empirical findings hold, the work would be significant for highlighting a fundamental limitation of SFT on interleaved multimodal generation-reasoning tasks and for showing how targeted RL constraints can promote deeper functional integration of visual and logical steps in MLLMs. This could inform more effective training strategies for complex geometric and diagrammatic reasoning.
major comments (3)
- [Abstract] Abstract: the claim that SFT produces only distributional alignment while failing to internalize causal plot-reasoning dependencies is presented without any quantitative results, benchmark names, performance deltas, or error analysis, so the data support for the central claim cannot be evaluated.
- [Method] Method (description of Faire): the three causal constraints are asserted to move the model from superficial imitation to functional alignment, yet no formulation, reward implementation, or enforcement mechanism is supplied; without these details the claim that the constraints produce the observed qualitative shift remains untestable.
- [Experiments] Experiments: no ablation isolating the effect of the three constraints, no intervention studies (e.g., plot-content perturbation or attention tracing), and no comparison of RL exploration versus constraint-specific gains are reported, leaving open the possibility that benchmark improvements arise from generic RL optimization rather than internalized causal alignment.
minor comments (1)
- [Abstract] The acronym 'Faire' is introduced without expansion or motivation for the name.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity, completeness, and testability of the claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that SFT produces only distributional alignment while failing to internalize causal plot-reasoning dependencies is presented without any quantitative results, benchmark names, performance deltas, or error analysis, so the data support for the central claim cannot be evaluated.
Authors: We agree that the abstract should be more self-contained to allow immediate evaluation of the central claim. The full manuscript reports quantitative results on benchmarks including Geometry3K and GeoQA, where SFT leads to 12-18% degradation relative to text-only baselines, accompanied by error analysis showing increased failures in plot-conditioned reasoning steps. We will revise the abstract to include specific benchmark names, key performance deltas, and a concise reference to the error analysis. revision: yes
-
Referee: [Method] Method (description of Faire): the three causal constraints are asserted to move the model from superficial imitation to functional alignment, yet no formulation, reward implementation, or enforcement mechanism is supplied; without these details the claim that the constraints produce the observed qualitative shift remains untestable.
Authors: The three causal constraints are defined via causal intervention scores that penalize non-functional plot-reasoning links, implemented as additive terms in the RL reward and enforced through a constrained policy gradient update. To make this fully explicit and testable, we will expand the main method section with the precise mathematical formulations, reward equations, and pseudocode for the enforcement mechanism, moving supporting details from the appendix into the primary text. revision: yes
-
Referee: [Experiments] Experiments: no ablation isolating the effect of the three constraints, no intervention studies (e.g., plot-content perturbation or attention tracing), and no comparison of RL exploration versus constraint-specific gains are reported, leaving open the possibility that benchmark improvements arise from generic RL optimization rather than internalized causal alignment.
Authors: We have performed the requested analyses: per-constraint ablations, plot-perturbation interventions that demonstrate causal dependency breakdowns, attention-tracing examples, and direct comparisons against vanilla RL without the causal constraints. These results currently reside in the supplementary material. We will add a dedicated ablation subsection to the main experiments, incorporating the intervention studies and attention visualizations to isolate the contribution of functional alignment over generic RL gains. revision: yes
Circularity Check
No significant circularity; empirical comparison without self-referential derivation
full rationale
The paper advances an empirical hypothesis that SFT induces only distributional alignment while RL enforces functional alignment via three causal constraints, supported by benchmark comparisons rather than any closed mathematical chain. No equations, fitted parameters, or predictions reduce to prior definitions by construction, and no load-bearing self-citations or uniqueness theorems are invoked to justify the core claims. The derivation is self-contained as an experimental demonstration of performance differences, with the internalization argument serving as an interpretive framing of observed results rather than a tautological restatement of inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption SFT on interleaved data induces only distributional alignment and fails to internalize causal plot-reasoning dependency
- ad hoc to paper Enforcing three causal constraints via RL produces functional alignment beyond superficial imitation
invented entities (1)
-
Faire
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
SFT primarily induces distributional alignment... fails to internalize the causal dependency between the generated plot and reasoning steps... Faire... enforces three causal constraints... Geometric Consistency (Cgeo), Perceptual Admissibility (Cperc), and Semantic Alignment (Csem)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
RL optimizes a policy that instantiates V as a latent causal mediator... tri-perspective verification system
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Beyond lines and circles: Unveiling the geometric reasoning gap in large language models
Spyridon Mouselinos, Henryk Michalewski, and Mateusz Malinowski. Beyond lines and circles: Unveiling the geometric reasoning gap in large language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 6192–6222, 2024
work page 2024
-
[2]
Shihao Xu, Yiyang Luo, and Wei Shi. Geo-llava: A large multi-modal model for solving geometry math problems with meta in-context learning. InProceedings of the 2nd Workshop on Large Generative Models Meet Multimodal Applications, pages 11–15, 2024
work page 2024
-
[3]
Self-imagine: Effective unimodal reasoning with multimodal models using self-imagination
Syeda Nahida Akter, Aman Madaan, Sangwu Lee, Yiming Yang, and Eric Nyberg. Self-imagine: Effective unimodal reasoning with multimodal models using self-imagination. InICLR 2024 Workshop on Large Language Model (LLM) Agents
work page 2024
-
[4]
Gns: Solving plane geometry problems by neural-symbolic reasoning with multi-modal llms
Maizhen Ning, Zihao Zhou, Qiufeng Wang, Xiaowei Huang, and Kaizhu Huang. Gns: Solving plane geometry problems by neural-symbolic reasoning with multi-modal llms. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24957–24965, 2025
work page 2025
-
[5]
Cogcom: A visual language model with chain-of-manipulations reasoning
Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, et al. Cogcom: A visual language model with chain-of-manipulations reasoning. InThe Thirteenth International Conference on Learning Representations
-
[6]
Vishal Kumar, Shubhra Mishra, Rebecca Hao, Rizwaan Malik, David Broman, and Dorottya Demszky. Diagramir: An automatic pipeline for educational math diagram evaluation.arXiv preprint arXiv:2511.08283, 2025
-
[7]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Math-puma: Progressive upward multimodal alignment to enhance mathematical reasoning
Wenwen Zhuang, Xin Huang, Xiantao Zhang, and Jin Zeng. Math-puma: Progressive upward multimodal alignment to enhance mathematical reasoning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 26183–26191, 2025
work page 2025
-
[9]
Large language- geometry model: When llm meets equivariance
Zongzhao Li, Jiacheng Cen, Bing Su, Tingyang Xu, Yu Rong, Deli Zhao, and Wenbing Huang. Large language- geometry model: When llm meets equivariance. InForty-second International Conference on Machine Learning
-
[10]
Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Systems, 37:95095–95169, 2024
work page 2024
-
[11]
Multimodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research
Zhuosheng Zhang, Aston Zhang, Mu Li, George Karypis, Alex Smola, et al. Multimodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research
-
[12]
Lei Wang, Yi Hu, Jiabang He, Xing Xu, Ning Liu, Hui Liu, and Heng Tao Shen. T-sciq: Teaching multimodal chain-of-thought reasoning via large language model signals for science question answering. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19162–19170, 2024
work page 2024
-
[13]
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024
work page 2024
-
[14]
Interleaved-modal chain-of-thought
Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-modal chain-of-thought. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19520–19529, 2025
work page 2025
-
[15]
Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning.arXiv preprint arXiv:2510.27492, 2025
-
[16]
Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models.Advances in Neural Information Processing Systems, 36:5168–5191, 2023
work page 2023
-
[17]
A multi-modal neural geometric solver with textual clauses parsed from diagram
Ming-Liang Zhang, Fei yin, and Cheng-Lin Liu. A multi-modal neural geometric solver with textual clauses parsed from diagram. In Edith Elkind, editor,Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pages 3374–3382. International Joint Conferences on Artificial Intelligence Organization, 8 2023. Main Track. 13
work page 2023
-
[18]
G-llava: Solving geometric problem with multi-modal large language model
Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing HONG, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-llava: Solving geometric problem with multi-modal large language model. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Representation Learning, volume 2025, pages 3490–3511, 2025
work page 2025
-
[19]
Conic10K: A challenging math problem understanding and reasoning dataset
Haoyi Wu, Wenyang Hui, Yezeng Chen, Weiqi Wu, Kewei Tu, and Yi Zhou. Conic10K: A challenging math problem understanding and reasoning dataset. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6444–6458, Singapore, December 2023. Association for Computational Linguistics
work page 2023
-
[20]
Advancing multimodal llms: A focus on geometry problem solving reasoning and sequential scoring
Raj Jaiswal, Avinash Anand, and Rajiv Ratn Shah. Advancing multimodal llms: A focus on geometry problem solving reasoning and sequential scoring. InProceedings of the 6th ACM International Conference on Multimedia in Asia, MMAsia ’24, New York, NY, USA, 2024. Association for Computing Machinery
work page 2024
-
[21]
Zihan Huang, Tao Wu, Wang Lin, Shengyu Zhang, Jingyuan Chen, and Fei Wu. Autogeo: Automating geometric image dataset creation for enhanced geometry understanding.IEEE Transactions on Multimedia, 27:3105–3116, 2025
work page 2025
-
[22]
A symbolic characters aware model for solving geometry problems
Maizhen Ning, Qiu-Feng Wang, Kaizhu Huang, and Xiaowei Huang. A symbolic characters aware model for solving geometry problems. InProceedings of the 31st ACM International Conference on Multimedia, MM ’23, page 7767–7775, New York, NY, USA, 2023. Association for Computing Machinery
work page 2023
-
[23]
Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024
Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024
work page 2024
-
[24]
Euclid-omni: A unified neuro-symbolic framework for geometry problem solving
Anonymous. Euclid-omni: A unified neuro-symbolic framework for geometry problem solving. InSubmitted to The Fourteenth International Conference on Learning Representations, 2025. under review
work page 2025
-
[25]
Formal representation and solution of plane geometric problems
Xiaokai Zhang, Na Zhu, Cheng Qin, Yang Li, Zhenbing Zeng, and Tuo Leng. Formal representation and solution of plane geometric problems. InThe 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24, 2024
work page 2024
-
[26]
Jingxuan Wei, Caijun Jia, Qi Chen, Honghao He, Linzhuang Sun, Conghui He, Lijun Wu, Bihui Yu, and Cheng Tan. Geoint-r1: Formalizing multimodal geometric reasoning with dynamic auxiliary constructions.arXiv preprint arXiv:2508.03173, 2025
-
[27]
GeoCoder: Solving geometry problems by generating modular code through vision-language models
Aditya Sharma, Aman Dalmia, Mehran Kazemi, Amal Zouaq, and Christopher Pal. GeoCoder: Solving geometry problems by generating modular code through vision-language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 7340–7356, Albuquerque, New Mexico, April 2025. Associati...
work page 2025
-
[28]
Daocheng Fu, Zijun Chen, Renqiu Xia, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Junchi Yan, et al. Trustgeogen: Scalable and formal-verified data engine for trustworthy multi-modal geometric problem solving.arXiv preprint arXiv:2504.15780, 2025
-
[29]
Weiming Wu, Jin Ye, Zi-kang Wang, Zhi Zhou, Yu-Feng Li, and Lan-Zhe Guo. Nesygeo: A neuro-symbolic framework for multimodal geometric reasoning data generation.arXiv preprint arXiv:2505.17121, 2025
-
[30]
Jun Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shuning Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.ArXiv, abs/2506.09965, 2025
-
[31]
From easy to hard: The mir benchmark for progressive interleaved multi-image reasoning
Hang Du, Jiayang Zhang, Guoshun Nan, Wendi Deng, Zhenyan Chen, Chenyang Zhang, Wang Xiao, Shan Huang, Yuqi Pan, Tao Qi, et al. From easy to hard: The mir benchmark for progressive interleaved multi-image reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 859–869, 2025
work page 2025
-
[32]
Interleaving reasoning for better text-to-image generation.arXiv preprint arXiv:2509.06945, 2025
Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, et al. Interleaving reasoning for better text-to-image generation.arXiv preprint arXiv:2509.06945, 2025
-
[33]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action.arXiv preprint arXiv:2303.11381, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Vipergpt: Visual inference via python execution for reasoning
Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11888–11898, 2023. 14
work page 2023
-
[35]
Mars2 2025 challenge on multimodal reasoning: Datasets, methods, results, discussion, and outlook
Peng Xu, Shengwu Xiong, Jiajun Zhang, Yaxiong Chen, Bowen Zhou, Chen Change Loy, David Clifton, Kyoung Mu Lee, Luc Van Gool, Ruiming He, et al. Mars2 2025 challenge on multimodal reasoning: Datasets, methods, results, discussion, and outlook. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6517–6546, 2025
work page 2025
-
[36]
VisuoThink: Empowering LVLM reasoning with multimodal tree search
Yikun Wang, Siyin Wang, Qinyuan Cheng, Zhaoye Fei, Liang Ding, Qipeng Guo, Dacheng Tao, and Xipeng Qiu. VisuoThink: Empowering LVLM reasoning with multimodal tree search. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1:...
-
[37]
Association for Computational Linguistics
-
[38]
Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, et al. Arm-thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning.arXiv preprint arXiv:2512.05111, 2025
-
[39]
Generating images with multimodal language models
Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. Advances in Neural Information Processing Systems, 36:21487–21506, 2023
work page 2023
-
[40]
Making llama see and draw with seed tokenizer
Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. InICLR, 2024
work page 2024
-
[41]
Show-o: One single transformer to unify multimodal understanding and generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. InThe Thirteenth International Conference on Learning Representations
-
[42]
Orthus: Autoregressive interleaved image-text generation with modality-specific heads
Siqi Kou, Jiachun Jin, Zhihong Liu, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, and Zhijie Deng. Orthus: Autoregressive interleaved image-text generation with modality-specific heads. InForty-second International Conference on Machine Learning
-
[43]
Wegen: A unified model for interactive multimodal generation as we chat
Zhipeng Huang, Shaobin Zhuang, Canmiao Fu, Binxin Yang, Ying Zhang, Chong Sun, Zhizheng Zhang, Yali Wang, Chen Li, and Zheng-Jun Zha. Wegen: A unified model for interactive multimodal generation as we chat. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23679–23689, 2025
work page 2025
-
[44]
Opening: A comprehensive benchmark for judging open-ended interleaved image-text generation
Pengfei Zhou, Xiaopeng Peng, Jiajun Song, Chuanhao Li, Zhaopan Xu, Yue Yang, Ziyao Guo, Hao Zhang, Yuqi Lin, Yefei He, et al. Opening: A comprehensive benchmark for judging open-ended interleaved image-text generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 56–66, 2025
work page 2025
-
[45]
Holistic evaluation for interleaved text-and-image generation
Minqian Liu, Zhiyang Xu, Zihao Lin, Trevor Ashby, Joy Rimchala, Jiaxin Zhang, and Lifu Huang. Holistic evaluation for interleaved text-and-image generation. InEMNLP, 2024
work page 2024
-
[46]
Towards unified multimodal interleaved generation via group relative policy optimization
Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, and Li Zhang. Towards unified multimodal interleaved generation via group relative policy optimization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems
-
[47]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[48]
Gemma 3: Open models technical report
Gemma Team. Gemma 3: Open models technical report. Technical report, 2025
work page 2025
-
[49]
Moonshot AI Team. Kimi-vl technical report. Technical report, Moonshot AI, 2025
work page 2025
-
[50]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang et al. Internvl 3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
Qwen Team. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
V Team, Wenyi Hong, et al. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2026
work page 2026
-
[53]
Qwen3-vl technical report, 2025
Shuai Bai, Yuxuan Cai, et al. Qwen3-vl technical report, 2025
work page 2025
-
[54]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
OpenAI.GPT-5.1: System Card and Safety Analysis. OpenAI, November 2025. 15
work page 2025
-
[56]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [57]
-
[58]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
Qwen-image technical report.arXiv e-prints, pages arXiv–2508, 2025
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv e-prints, pages arXiv–2508, 2025
work page 2025
-
[61]
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.NeurIPS, 35:36479–36494, 2022
work page 2022
-
[62]
GenExam: A Multidisciplinary Text-to-Image Exam
Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, and Gen Luo. Genexam: A multidisciplinary text-to-image exam.arXiv preprint arXiv:2509.14232, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? InECCV, pages 169–186. Springer, 2024
work page 2024
-
[64]
Kai Sun, Yushi Bai, Ji Qi, Lei Hou, and Juanzi Li. Mm-math: Advancing multimodal math evaluation with process evaluation and fine-grained classification.arXiv preprint arXiv:2404.05091, 2024
-
[65]
Minxuan Zhou, Hao Liang, Tianpeng Li, Zhiyu Wu, Mingan Lin, Linzhuang Sun, Yaqi Zhou, Yan Zhang, Xiaoqin Huang, Yicong Chen, et al. Mathscape: Evaluating mllms in multimodal math scenarios through a hierarchical benchmark.arXiv preprint arXiv:2408.07543, 2024
-
[66]
Geoeval: Benchmark for evaluating llms and multi-modal models on geometry problem-solving
Jiaxin Zhang, Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Cheng-Lin Liu, and Yashar Moshfeghi. Geoeval: Benchmark for evaluating llms and multi-modal models on geometry problem-solving. InFindings of the Association for Computational Linguistics ACL 2024, pages 1258–1276, 2024
work page 2024
-
[67]
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284, 2024
-
[68]
Xinwu Ye, Chengfan Li, Siming Chen, Wei Wei, and Xiangru Tang. Mmscibench: Benchmarking language models on chinese multimodal scientific problems.arXiv preprint arXiv:2503.01891, 2025
-
[69]
Junling Wang, Anna Rutkiewicz, April Yi Wang, and Mrinmaya Sachan. Generating pedagogically meaningful visuals for math word problems: A new benchmark and analysis of text-to-image models.arXiv preprint arXiv:2506.03735, 2025
-
[70]
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
Himanshu Gupta, Shreyas Verma, Ujjwala Anantheswaran, Kevin Scaria, Mihir Parmar, Swaroop Mishra, and Chitta Baral. Polymath: A challenging multi-modal mathematical reasoning benchmark.arXiv preprint arXiv:2410.14702, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[71]
Peijie Wang, Chao Yang, Zhong-Zhi Li, Fei Yin, Dekang Ran, Mi Tian, Zhilong Ji, Jinfeng Bai, and Cheng-Lin Liu. Solidgeo: Measuring multimodal spatial math reasoning in solid geometry.arXiv preprint arXiv:2505.21177, 2025
-
[72]
Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, and Cheng Tan. Ggbench: A geometric generative reasoning benchmark for unified multimodal models.arXiv preprint arXiv:2511.11134, 2025
-
[73]
Janus: Decoupling visual encoding for unified multimodal understanding and generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. InCVPR, pages 12966–12977, 2025. 16
work page 2025
-
[74]
Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025
work page 2025
-
[75]
Qwen3 technical report.arXiv e-prints, pages arXiv–2505, 2025
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv e-prints, pages arXiv–2505, 2025
work page 2025
-
[76]
Deepseek-v3 technical report.CoRR, 2024
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.CoRR, 2024
work page 2024
-
[77]
Gpt-4 technical report.arXiv e-prints, pages arXiv–2303, 2023
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv e-prints, pages arXiv–2303, 2023
work page 2023
-
[78]
Anthropic. Claude sonnet 4.5 system card. Technical report, Anthropic PBC, 2025. Official system card describing Claude Sonnet 4.5 capabilities and safety evaluation. Available at:https://assets.anthropic.com/ m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf
work page 2025
-
[79]
looks right but violates constraints
OpenAI. Gpt-5 system card. Technical report, OpenAI, 2025. Official system card document for GPT-5; available at: https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb52f/gpt5-system-card-aug7.pdf. 17 Appendix A Evaluation Metrics and Protocols A.1 Evaluation metrics We evaluate multimodal geometry solving along two axes:solution rigor(answer corre...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.