MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning
Pith reviewed 2026-06-29 21:10 UTC · model grok-4.3
The pith
MuCRASP prunes vision-language models by targeting CoT pivot tokens and modality-specific activations to preserve reasoning accuracy under compression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MuCRASP is a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget; experiments on four VLMs across three reasoning benchmarks show it consistently preserves reasoning quality under increasing compression, maintains high reasoning consistency up to 50 percent pruning, and exhibits lower perplexity degradation than prior methods.
What carries the argument
MuCRASP, the structured pruning framework that identifies and protects reasoning-critical components by locating pivot tokens in CoT trajectories and respecting activation-distribution differences between visual and textual modalities under a single global budget.
If this is right
- Reasoning consistency remains high even after 50 percent of parameters are removed.
- Perplexity rises more slowly than under CoT-agnostic pruning.
- The same global budget can be allocated across layers according to measured sensitivity rather than uniform ratios.
- Cross-modal alignment is retained because pruning decisions explicitly consider both visual and textual activation statistics.
Where Pith is reading between the lines
- The same pivot-token logic could be applied to prune models that generate long chains of tool calls or multi-step plans in non-visual domains.
- If pivot tokens prove stable across tasks, a one-time identification pass might suffice for repeated pruning at different compression targets.
- Layer-wise sensitivity maps computed once on a calibration set could guide pruning for families of related VLMs without re-running the full search.
Load-bearing premise
CoT consistency in VLMs depends on sparse pivot tokens whose removal by existing pruning methods destroys accuracy, and visual versus textual activations differ enough that unimodal pruning rules cannot handle them.
What would settle it
A side-by-side evaluation on the same physical-reasoning benchmark and the same Qwen2.5-VL-7B model at 30 percent pruning in which MuCRASP does not produce a higher LLM-as-a-Judge score than the strongest baseline.
Figures
read the original abstract
Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MuCRASP, a structured pruning framework for vision-language models that incorporates awareness of chain-of-thought (CoT) reasoning. It identifies two failure modes in prior pruning methods—ignorance of sparse pivot tokens critical for CoT consistency and failure to handle cross-modal activation differences—and designs pruning to target reasoning-critical components while preserving cross-modal alignment under a global parameter budget with layer-wise sensitivity. Experiments across four VLMs and three reasoning benchmarks claim that MuCRASP preserves reasoning quality better than baselines under compression, including an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline at 30% pruning on Qwen2.5-VL-7B physical reasoning tasks, with maintained consistency up to 50% pruning and lower perplexity degradation.
Significance. If the results hold under rigorous evaluation, the work addresses a practical barrier to deploying reasoning-capable VLMs by offering a pruning approach that better maintains CoT performance than existing methods, potentially enabling more efficient multimodal systems.
major comments (2)
- [Abstract] Abstract: The sole quantitative evidence for the central claim that MuCRASP 'preserves reasoning quality' and 'significantly outperforming prior pruning approaches' is the LLM-as-a-Judge score (8.87 vs. 7.32 at 30% pruning). No correlation with human judgments, exact-match task accuracy, or inter-judge agreement is reported, so it is unclear whether the gap demonstrates true CoT preservation or merely stylistic preference by the judge model.
- [Abstract] Abstract: The motivation rests on the assertions that (1) CoT consistency depends on sparse pivot tokens and (2) unimodal pruning methods ignore visual/textual activation differences. These are presented as the reasons existing methods fail, yet the abstract supplies no supporting analysis, ablation, or measurement showing these factors are load-bearing for the observed performance gaps.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address each major comment below and will make revisions to strengthen the presentation of evidence and motivation while remaining faithful to the manuscript's content.
read point-by-point responses
-
Referee: [Abstract] Abstract: The sole quantitative evidence for the central claim that MuCRASP 'preserves reasoning quality' and 'significantly outperforming prior pruning approaches' is the LLM-as-a-Judge score (8.87 vs. 7.32 at 30% pruning). No correlation with human judgments, exact-match task accuracy, or inter-judge agreement is reported, so it is unclear whether the gap demonstrates true CoT preservation or merely stylistic preference by the judge model.
Authors: We acknowledge the referee's point that the abstract centers the LLM-as-a-Judge score for the reasoning preservation claim. The manuscript does report supplementary metrics including lower perplexity degradation and maintained reasoning consistency up to 50% pruning. However, we agree that the absence of reported human correlation, exact-match accuracy, or inter-judge agreement leaves the interpretation of the judge score open to the concern raised. We will revise the abstract to reference these additional metrics explicitly and add a brief discussion in the experiments section on the LLM judge's use as a proxy, noting the lack of human validation as a limitation to be addressed in future work. revision: yes
-
Referee: [Abstract] Abstract: The motivation rests on the assertions that (1) CoT consistency depends on sparse pivot tokens and (2) unimodal pruning methods ignore visual/textual activation differences. These are presented as the reasons existing methods fail, yet the abstract supplies no supporting analysis, ablation, or measurement showing these factors are load-bearing for the observed performance gaps.
Authors: The abstract's brevity limits detailed support, but the full manuscript provides empirical analysis of activation patterns identifying pivot tokens and cross-modal differences in Sections 3 and 4. We agree the abstract would benefit from tighter linkage. We will revise the abstract to include a concise phrase referencing these observations from our preliminary analysis, thereby better grounding the motivation without altering the manuscript's core claims. revision: yes
Circularity Check
No circularity: empirical claims rest on benchmark comparisons without self-referential derivations or fitted predictions.
full rationale
The paper proposes MuCRASP as a structured pruning method motivated by two observations about CoT pivot tokens and cross-modal activations, then reports experimental results on VLMs and benchmarks using LLM-as-a-Judge scores. No equations, derivations, or parameter-fitting steps appear in the abstract or described content that reduce the performance claims to inputs by construction. Claims of superiority are supported by direct comparisons (e.g., 8.87 vs 7.32 scores) rather than any self-definitional, self-citation load-bearing, or ansatz-smuggling mechanisms. The derivation chain is self-contained as an empirical framework without the enumerated circular patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Explicit reasoning over end-to-end neural architectures for visual question answering, 2018
Somak Aditya, Yezhou Yang, and Chitta Baral. Explicit reasoning over end-to-end neural architectures for visual question answering, 2018. URL https://arxiv.org/abs/1803. 08896
2018
-
[2]
Spatial knowledge distillation to aid visual reasoning
Somak Aditya, Rudra Saha, Yezhou Yang, and Chitta Baral. Spatial knowledge distillation to aid visual reasoning. In2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 227–235. IEEE, 2019
2019
-
[3]
GQA: Training generalized multi-query transformer models from multi-head checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yinfei Yang, Siddhartha Jayanti, and Santiago Ontañón. GQA: Training generalized multi-query transformer models from multi-head checkpoints. InEMNLP, 2023
2023
-
[4]
Fluctuation-based adaptive structured pruning for large language models
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Fluctuation-based adaptive structured pruning for large language models. InAAAI, 2024
2024
-
[5]
Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman
Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. In ICLR, 2024
2024
-
[6]
Humans or llms as the judge? a study on judgement biases, 2024
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement biases, 2024. URL https://arxiv.org/abs/2402. 10669
2024
-
[7]
Mint-cot: Enabling interleaved visual tokens in mathematical chain-of-thought reasoning
Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, and Hongsheng Li. Mint-cot: Enabling interleaved visual tokens in mathematical chain-of-thought reasoning.arXiv preprint arXiv:2506.05331, 2025
-
[8]
Nlki: A lightweight natural language knowledge integration framework for improving small vlms in commonsense vqa tasks
Aritra Dutta, Swapnanil Mukherjee, Deepanway Ghosal, and Somak Aditya. Nlki: A lightweight natural language knowledge integration framework for improving small vlms in commonsense vqa tasks. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 10543–10563, 2025
2025
-
[9]
SparseGPT: Massive language models can be accurately pruned in one-shot
Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. InICML, 2023
2023
-
[10]
Gemma 3: Open models for responsible AI
Gemma Team, Google DeepMind. Gemma 3: Open models for responsible AI. Technical report, 2025
2025
-
[11]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ah- mad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander J
Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander J. Smola. A kernel method for the two-sample problem. InNeurIPS, 2006
2006
-
[13]
Borgwardt, Malte J
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test.Journal of Machine Learning Research, 13:723–773, 2012
2012
-
[14]
Song Han, Jeff Pool, John Tung, and William J. Dally. Learning both weights and connections for efficient neural networks. InNeurIPS, 2015
2015
-
[15]
Stork, and Gregory J
Babak Hassibi, David G. Stork, and Gregory J. Wolff. Optimal brain surgeon and general network pruning. InIEEE International Conference on Neural Networks, 1993
1993
-
[16]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[17]
Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperform- ing larger language models with less training data and smaller model sizes.arXiv preprint arXiv:2305.02301, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
IVTP: Instruction-guided visual token pruning for large vision-language models
Kai Huang, Hao Zou, Ye Xi, BoChen Wang, Zhen Xie, and Liang Yu. IVTP: Instruction-guided visual token pruning for large vision-language models. InECCV, 2024
2024
-
[19]
An analysis of visual question answering algorithms
Kushal Kafle and Christopher Kanan. An analysis of visual question answering algorithms. In ICCV, 2017
2017
-
[20]
Shortened llama: Depth pruning for large language models with comparison of retraining methods, 2024
Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song. Shortened llama: Depth pruning for large language models with comparison of retraining methods, 2024. URLhttps://arxiv.org/abs/2402.02834
-
[21]
Minsu Kim, Jiwan Chung, et al. TAMP: Token-adaptive layerwise pruning in multimodal large language models.arXiv preprint arXiv:2504.09897, 2025
-
[22]
Large language models are zero-shot reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InNeurIPS, 2022
2022
-
[23]
Measuring Faithfulness in Chain-of-Thought Reasoning
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Lar- son, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, 12 Timot...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Denker, and Sara A
Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. InNeurIPS, 1989
1989
-
[25]
Layer-adaptive sparsity for the magnitude-based pruning.arXiv preprint arXiv:2010.07611, 2021
Jaeho Lee, Sejun Park, Sangwoo Mo, Sungsoo Ahn, and Jinwoo Shin. Layer-adaptive sparsity for the magnitude-based pruning.arXiv preprint arXiv:2010.07611, 2021
-
[26]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, et al. LLaV A-OneVision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, et al. Learn to explain: Multimodal reasoning via thought chains for science question answering. InNeurIPS, 2022
2022
-
[28]
Faithful chain-of-thought reasoning
Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apid- ianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. In Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu Purwarianti, and Adila Alfa Krisnadhi, editors,Proceedings of the 13th International Joint Conference on Natural Language Pro- cessin...
-
[29]
doi: 10.18653/v1/2023.ijcnlp-main.20
Association for Computational Linguistics. doi: 10.18653/v1/2023.ijcnlp-main.20. URL https://aclanthology.org/2023.ijcnlp-main.20/
-
[30]
LLM-Pruner: On the structural pruning of large language models
Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-Pruner: On the structural pruning of large language models. InNeurIPS, 2023
2023
-
[31]
LLM-Pruner: Support for GQA and Llama-3
Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-Pruner: Support for GQA and Llama-3. https://github.com/horseee/LLM-Pruner, 2024
2024
-
[32]
OK-VQA: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. InCVPR, 2019
2019
-
[33]
Knapsack problems: Algorithms and computer implementa- tions.Wiley, 1990
Silvano Martello and Paolo Toth. Knapsack problems: Algorithms and computer implementa- tions.Wiley, 1990
1990
-
[34]
Are sixteen heads really better than one? In NeurIPS, 2019
Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In NeurIPS, 2019
2019
-
[35]
Pruning convolutional neural networks for resource efficient inference
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. InICLR, 2017
2017
-
[36]
Importance estimation for neural network pruning
Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. InCVPR, 2019
2019
-
[37]
Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution
Qwen Team. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. Technical report, 2024
2024
-
[38]
Qwen2.5-VL: Technical report
Qwen Team. Qwen2.5-VL: Technical report. Technical report, 2025
2025
-
[39]
A-OKVQA: A benchmark for visual question answering using world knowledge
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In ECCV, 2022
2022
-
[40]
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024
2024
-
[41]
GLU Variants Improve Transformer
Noam Shazeer. Glu variants improve transformer, 2020. URL https://arxiv.org/abs/ 2002.05202
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[42]
Zico Kolter
Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. InICLR, 2024. 13
2024
-
[43]
ECoFLaP: Efficient coarse-to-fine layer-wise pruning for vision-language models
Yi-Lin Sung, Jaehong Yoon, and Mohit Bansal. ECoFLaP: Efficient coarse-to-fine layer-wise pruning for vision-language models. InICLR, 2024
2024
-
[44]
The information bottleneck method
Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000
work page internal anchor Pith review Pith/arXiv arXiv 2000
-
[45]
Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned
Elena V oita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. InACL, 2019
2019
-
[46]
Exploring intrinsic dimension for vision- language model pruning
Hanzhang Wang, Jiawen Zhang, and Qingyuan Ma. Exploring intrinsic dimension for vision- language model pruning. InICML, 2024
2024
-
[47]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022
2022
-
[48]
Task-specific compression for multi-task language models using attribution-based pruning
Nakyeong Yang, Yunah Jang, Hwanhee Lee, Seohyeong Jeong, and Kyomin Jung. Task-specific compression for multi-task language models using attribution-based pruning. InFindings of EACL, 2023
2023
-
[49]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action.arXiv preprint arXiv:2303.11381, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
ATP-LLaV A: Adaptive token pruning for large vision language models
Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. ATP-LLaV A: Adaptive token pruning for large vision language models. InarXiv preprint arXiv:2412.00447, 2024
-
[51]
Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity
Lu Yin, You Wu, Zhenyu Zhang, et al. Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity. InICML, 2024
2024
-
[52]
Chain of preference optimization: Improving chain-of-thought reasoning in LLMs
Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, and Min Lin. Chain of preference optimization: Improving chain-of-thought reasoning in LLMs. InNeurIPS, 2024
2024
-
[53]
Multimodal Chain-of-Thought Reasoning in Language Models
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Mul- timodal chain-of-thought reasoning in language models.arXiv preprint arXiv:2302.00923, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
Multimodal chain-of-thought reasoning in language models
Zhuosheng Zhang, Aston Zhang, Mu Li, et al. Multimodal chain-of-thought reasoning in language models. InACL, 2023
2023
-
[55]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/ abs/2306.05685
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[56]
To prune, or not to prune: exploring the efficacy of pruning for model compression
Michael Zhu and Suyog Gupta. To prune, or not to prune: Exploring the efficacy of pruning for model compression.arXiv preprint arXiv:1710.01878, 2017. 14 A Ablation studies A.1µCRASP: Is each component necessary? We systematically ablate each component of µCRASP on Qwen2.5-VL-7B at 50% pruning, an aggressive regime that exposes the contribution of each me...
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[57]
I observe
Answer (if the answer is incorrect, ignore it and generate the correct answer based on the image and question) Please generate a reasoning chain with 2–3 concise steps where needed, referencing specific visual cues (“I observe...”, “Because I see...”), then provide the final answer. Format: Step-by-Step Thought:
-
[58]
one”, “two
Therefore ... deduced_answer: {final_answer} Constraints: • Since this is a quantitative reasoning question, write numbers in words (e.g., “one”, “two”) We generate CoT traces using greedy decoding (temperature = 0) to ensure reproducibility. To ensure annotation quality, all generated CoT traces were manually reviewed by two human annotators, including o...
-
[59]
A question about the image
-
[60]
A generated chain-of-thought (CoT) reasoning trace
-
[61]
How many cars are there?
A final predicted answer Your task is to evaluate the quality of the reasoning trace along two dimensions: • Grounding Score (0–10):Evaluate how well the reasoning is supported by observ- able visual evidence in the image. High scores indicate that the reasoning correctly references visible objects, attributes, spatial relations, and scene details. • Hall...
-
[62]
I observe the player in the foreground wearing a white shirt with the number “15” on it
-
[63]
The player appears to be leading the action because they are closest to the ball and actively engaged in play
-
[64]
Final Answer: white
Therefore, the color of the lead player’s shirt is white. Final Answer: white. µCRASP (Ours): Step-by-Step Thought:
-
[65]
Identify the lead player in the scene
-
[66]
The lead player appears to be the individual closest to the ball
-
[67]
Final Answer: The color of the lead player’s shirt is white
The player is wearing a white shirt. Final Answer: The color of the lead player’s shirt is white. FLAP: The generated response becomes fragmented and semantically incoherent, failing to correctly identify the player’s shirt color. ...[truncated] LLM-Pruner: The image shows a football player in action, likely within a soccer match scenario. However, the ge...
-
[68]
New York City Bus
I observe the text “New York City Bus” displayed on the rear of the bus, indicating that the bus operates in New York City
-
[69]
The architecture of the surrounding buildings and the design of the bus are also consistent with a typical New York City street environment
-
[70]
New York City Bus,
Therefore, the most likely setting for this street is New York City. Final Answer: New York. µCRASP (Ours): The image contains a bus with signage indicating “New York City Bus, ” which strongly suggests that the scene takes place in New York City. The surrounding urban architecture and overall street layout are also visually consistent with a dense metrop...
-
[71]
I observe a bright yellow-green object in the air above the person, which appears to be a frisbee
-
[72]
The frisbee’s colour is distinctly greenish-yellow, matching the typical appearance of a frisbee
-
[73]
Limitations
Therefore, the colour of the Frisbee is green. Final Answer: green. µCRASP (Ours): The frisbee in the image appears to be yellow with some green markings, which is typical of a standard frisbee used in beach settings or similar environments. Final colour: Yellow with green markings. FLAP: The light brown object being held suggests the person’s clothing, w...
-
[74]
I observe four individuals seated in a ski lift chair
-
[75]
Each person is wearing winter clothing and appears to be enjoying the ride
-
[76]
Final Answer: four
Therefore, the total number of people in the picture is four. Final Answer: four. µCRASP (Ours): There are four people in the picture. The image shows a group of individuals sitting together on a ski lift, which is a common mode of transport used to carry skiers or snow riders. Each person appears to be wearing winter or snow gear, and they are all facing...
-
[77]
I observe three forks in the image: one placed on the left side of the pizza plate, another on the right side of the pizza plate, and a third one near the bottom right corner of the image
-
[78]
I infer that these are all distinct forks visible in the photo, as they are positioned separately and there are no overlapping or hidden forks
-
[79]
Final Answer: three
Therefore, the total count of forks in the photo is three. Final Answer: three. µCRASP (Ours): There are three forks visible in the photo. One fork is placed on the plate to the left side of the pizza, another fork is on the top right plate near the edge of the table, and a third fork is partially visible on the bottom right plate, extending from the edge...
-
[80]
I observe a man sitting on a red vehicle with large wheels and handlebars, which is characteristic of an all-terrain vehicle (ATV)
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.